OpenGL

Rendering Volume Filling Triangles in OpenGL (with no buffers)

This is the promised follow-up to Rendering a Screen Covering Triangle in OpenGL (with no buffers), except this time the goal is to write a shader that accesses every location in a 3D texture (volume).  We use the same screen-covering trick as before to draw a triangle covering a viewport matched to the X and Y dimensions of the volume, and we use instanced rendering to draw a repeated triangle for each layer in the Z dimension.

The vertex shader looks the same as before, with the addition of the instanceID output.

flat out int instanceID;

void main()
{
	float x = -1.0 + float((gl_VertexID & 1) << 2);
	float y = -1.0 + float((gl_VertexID & 2) << 1);
	instanceID  = gl_InstanceID;
	gl_Position = vec4(x, y, 0, 1);
}

The fragment shader can then recover the voxel coordinates from gl_FragCoord and the instanceID.

flat in int instanceID;

void main()
{
	ivec3 voxelCoord = ivec3(gl_FragCoord.xy, instanceID);
	voxelOperation(voxelCoord);
}

Very similar to drawing the single screen-covering triangle, we set the viewport to the XY dimensions of the volume, bind a junk VAO to make certain graphics drivers happy (an empty VAO, sketched just after the draw call below), and call glDrawArraysInstanced with the Z dimension of the volume as the instance count, so that we draw one triangle per slice of the volume.

glViewport(0, 0, width, height);
glBindVertexArray(junkVAO);
glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, 3, depth);

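For reference, the "junk" VAO can simply be an empty vertex array object created once at startup; no attributes or buffers need to be attached (junkVAO here is just the placeholder name used in the draw call above):

// An empty VAO with no attached buffers; some drivers require one to be bound
// even though the vertex shader generates everything from gl_VertexID
GLuint junkVAO = 0;
glGenVertexArrays(1, &junkVAO);
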
The rendered result would look sort of like the following:

[Figure: volume filling triangles]

This can be useful for quickly processing a volume. Initially, I used this as an OpenGL 4.2 fallback (instead of compute shaders) so that I could still use the NSight debugger, until I realized this approach was actually outperforming the compute shader. Of course, when to use compute shaders, and how to use them effectively deserves a post of its own.

Rendering a Screen Covering Triangle in OpenGL (with no buffers)

This one has been on the backlog for ages now.  Anyway, this is an OpenGL adaptation of a clever trick that’s been around for quite a while, described in DirectX terms by Cort Stratton (@postgoodism) in the “An interesting vertex shader trick” post on #AltDevBlogADay.

It describes a method for rendering a triangle that covers the screen with no buffer inputs.  All vertex and texture coordinate information is generated solely from the vertexID.  Unfortunately, because OpenGL uses a right-handed coordinate system while DirectX uses a left-handed one, the same vertexID transformation used for DirectX won’t work in OpenGL.  Basically, we need to reverse the order of the triangle vertices so that they are traversed counter-clockwise, as opposed to clockwise in the original implementation. So, after a bit of experimentation I came up with the following adaptation for OpenGL:

void main()
{
	float x = -1.0 + float((gl_VertexID & 1) << 2);
	float y = -1.0 + float((gl_VertexID & 2) << 1);
	gl_Position = vec4(x, y, 0, 1);
}

This transforms the gl_VertexID as follows:

gl_VertexID=0 -> (-1,-1)
gl_VertexID=1 -> ( 3,-1)
gl_VertexID=2 -> (-1, 3)

We can easily add texture coordinates to this as well:

out vec2 texCoord;

void main()
{
	float x = -1.0 + float((gl_VertexID & 1) << 2);
	float y = -1.0 + float((gl_VertexID & 2) << 1);
	texCoord.x = (x+1.0)*0.5;
	texCoord.y = (y+1.0)*0.5;
	gl_Position = vec4(x, y, 0, 1);
}

This provides, across that homogeneous clip-space region, a position varying from -1 to 1 and texture coordinates varying from 0 to 1, exactly as OpenGL would expect, all without needing to create any buffers. All you have to do is make a single call to glDrawArrays and tell it to render 3 vertices:

glDrawArrays( GL_TRIANGLES, 0, 3 );

This draws a triangle that looks like the following:

[Figure: screen covering triangle]

It’s surprising how often this comes in handy; in a later post I’ll describe how to adapt this trick to efficiently access the elements of a 3D texture.  It also amuses me greatly that Iñigo Quilez’s amazing demo/presentation “Rendering Worlds With Two Triangles” could actually be renamed “Rendering Worlds With One Triangle.”

Bindless textures can “store”

I don’t know how I missed this when Nvidia released NV_bindless_texture; I guess it’s because all the samples I saw used bindless textures to demonstrate a ridiculous number of texture reads. But I realized when reading the recently released ARB_bindless_texture extension that they can also be used to “store,” or write to, a very large number of textures (using ARB_shader_image_load_store functionality). That finally gets rid of the extremely pesky MAX_IMAGE_UNITS limitation I’ve been complaining about. The only downside is that I can no longer run my program at home on my GTX 480.
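As a rough sketch of what the host side might look like (textureID, program, and the uniform name voxels are hypothetical placeholders, not from the extension samples), the bindless image handle replaces the usual glBindImageTexture call:

// Get a bindless image handle for level 0 of the texture
// (layered = GL_TRUE so every layer of a 3D texture is writable)
GLuint64 imageHandle = glGetImageHandleARB(textureID, 0, GL_TRUE, 0, GL_R32UI);

// The handle must be made resident before it can be used for image stores
glMakeImageHandleResidentARB(imageHandle, GL_WRITE_ONLY);

// Pass the handle to the shader as a 64-bit uniform (no image unit is consumed)
glUniformHandleui64ARB(glGetUniformLocation(program, "voxels"), imageHandle);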

AtomicCounters & IndirectBufferCommands

I’ve made use of Atomic Counters and Indirect Buffers in the past, but always in the most straightforward manner: create a dedicated buffer for the atomic counter and another for the Indirect Command Buffer, increment the counter in a shader, then write the Atomic Counter value into the Indirect Command Buffer using the Image API, ending up with a shader that looks something like the one below.

#version 420

layout(location = 0) in ivec3 inputBuffer;

layout(r32ui, binding = 0) uniform uimageBuffer outputBuffer;
layout(r32ui, binding = 1) uniform uimageBuffer indirectArrayCommand;
layout(       binding = 0) uniform atomic_uint  atomicCounter;

void main()
{
	// ...
	// do some stuff
	// ...

	if(someCondition == true)
	{
		//increment counter
		int index = int(atomicCounterIncrement(atomicCounter));

		//store stuff in output buffer
		imageStore(outputBuffer, index, uvec4(someStuff));
	}

	memoryBarrier();

	//Store the atomicCounter value to the count (the first element) of the DrawArraysIndirect command
	imageStore(indirectArrayCommand, 0, uvec4(atomicCounter(atomicCounter)));
}

This works fine, but one annoying thing about this approach is that it consumes an extra image unit (of the max 8 available). Fortunately, it turns out to be unnecessary to create a separate atomic counter and copy its value into the indirect draw command. It is possible to simply bind the appropriate element of the indirect draw buffer directly as the backing store for the atomic counter.

// This binds the count element of the Indirect Array Command Buffer directly as an atomic counter in the shader
// (no need to copy from a dedicated atomic counter)
glBindBufferRange(GL_ATOMIC_COUNTER_BUFFER,        // Target is the atomic counter buffer binding point
                  0,                               // Binding index, must match the shader
                  IndirectArrayCommandBuffer_id,   // Source buffer is the Indirect Draw Command Buffer
                  0,                               // Byte offset: 0 for count, sizeof(GLuint) for primCount (instances), etc...
                  sizeof(GLuint));                 // Size: a single GLuint
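For clarity about those offsets: the indirect buffer holds a DrawArraysIndirect command, so the byte offset passed to glBindBufferRange selects which field of the command the atomic counter aliases:

// Layout of the command stored in IndirectArrayCommandBuffer_id (per the OpenGL spec)
typedef struct {
	GLuint count;        // vertex count, written by the shader through the atomic counter
	GLuint primCount;    // instance count
	GLuint first;
	GLuint baseInstance;
} DrawArraysIndirectCommand;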

This allows us to get rid of the indirect buffer's image unit binding, which simplifies the shader as shown below. The main reason I’ve found to do this is to reduce the number of image units required by the shader, as it’s very easy to hit the limit of 8.

#version 420

layout(location = 0) in ivec3 inputBuffer;

layout(r32ui, binding = 0) uniform uimageBuffer outputBuffer;
layout(       binding = 0) uniform atomic_uint  atomicCounter;

void main()
{
	// ...
	// do some stuff
	// ...

	if(someCondition == true)
	{
		//increment counter
		int index = int(atomicCounterIncrement(atomicCounter));

		//store stuff in output buffer
		imageStore(outputBuffer, index, uvec4(someStuff));
	}
}
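Once the shader has filled in the count, the very same buffer is then bound as the source of the indirect draw. A minimal sketch (the memory barrier placement and the use of GL_POINTS are assumptions for illustration, not from the original setup):

// Make the atomic counter writes visible to the indirect draw command
glMemoryBarrier(GL_ATOMIC_COUNTER_BARRIER_BIT | GL_COMMAND_BARRIER_BIT);

// Issue the draw; the vertex count comes straight from the value the shader incremented
glBindBuffer(GL_DRAW_INDIRECT_BUFFER, IndirectArrayCommandBuffer_id);
glDrawArraysIndirect(GL_POINTS, 0);   // GL_POINTS is just an example primitive type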

CUDA 5 and OpenGL Interop and Dynamic Parallelism

I seem to revisit this every time Nvidia releases a new version of CUDA.

The good news…

The old methods still work: the whole register, map, bind, etc… process I described in my now two-year-old post Writing to 3D OpenGL textures in CUDA 4.1 with 3D Surface writes still works.  Ideally, the new version number shouldn’t introduce any new problems…

The bad news…

Unfortunately, if you try to write to a globally scoped CUDA surface from a device-side launched kernel (i.e. a dynamic kernel), nothing will happen.  You’ll scratch your head and wonder why code that works perfectly fine when launched from the host side fails silently when launched from the device side.

I only discovered the reason when I decided to read, word for word, the CUDA Dynamic Parallelism Programming Guide. On page 14, in the “Textures & Surfaces” section is this note:

NOTE: The device runtime does not support legacy module-scope (i.e. Fermi-style)
textures and surfaces within a kernel launched from the device. Module-scope (legacy)
textures may be created from the host and used in device code as for any kernel, but
may only be used by a top-level kernel (i.e. the one which is launched from the host).

So now the old way of dealing with textures is considered “legacy,” but apparently not quite deprecated yet.  Don’t use them if you plan on using dynamic parallelism.  Additional note: if you so much as call a function that attempts to perform a “Fermi-style” surface write, your kernel will fail silently, so I highly recommend removing all “Fermi-style” textures and surfaces if you plan on using dynamic parallelism.

So what’s the “new style” of textures and surfaces? Well, also on page 14 is a footnote saying:

Dynamically created texture and surface objects are an addition to the CUDA memory model
introduced with CUDA 5.0. Please see the CUDA Programming Guide for details.

So I guess they’re called “dynamically created textures and surfaces,” which is a mouthful, so I’m going to refer to them as “Kepler-style” textures and surfaces.  In the actual API they are cudaTextureObject_t and cudaSurfaceObject_t, and you can pass them around as parameters instead of having to declare them at file scope.

OpenGL Interop

So now we have two distinct methods for dealing with textures and surfaces, “Fermi-style” and “Kepler-style”, but we only know how graphics interoperability works with the old, might-as-well-be-deprecated, “Fermi-style” textures and surfaces.

And while there are some samples showing how the new “Kepler-style” textures and surfaces work (see the Bindless Texture sample), all the interop information still seems to target the old “Fermi-style” textures and surfaces.  Fortunately, there is some common ground between “Kepler-style” and “Fermi-style” textures and surfaces, and that common ground is the cudaArray.

Really, all we have to do is replace Step 6 (binding a cudaArray to a globally scoped surface) from the previous tutorial with the creation of a cudaSurfaceObject_t. That entails creating a CUDA resource description (cudaResourceDesc), setting the array portion of the cudaResourceDesc to our cudaArray, and then using that cudaResourceDesc to create our cudaSurfaceObject_t, which we can then pass to our kernels and use to write to our registered and mapped OpenGL textures.

// Create the CUDA resource description
struct cudaResourceDesc resourceDescription;
memset(&resourceDescription, 0, sizeof(resourceDescription));
resourceDescription.resType = cudaResourceTypeArray;	// be sure to set the resource type to cudaResourceTypeArray
resourceDescription.res.array.array = yourCudaArray;	// this is the important bit

// Create the surface object
cudaSurfaceObject_t writableSurfaceObject = 0;
cudaCreateSurfaceObject(&writableSurfaceObject, &resourceDescription);
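The surface object can then be passed straight into a kernel (including device-side launched kernels) and written with the surface functions. A minimal sketch, assuming the registered OpenGL texture is a 3D RGBA8 texture (the kernel name and its launch configuration are hypothetical):

// Hypothetical kernel writing to the mapped OpenGL texture through the surface object
__global__ void fillVolume(cudaSurfaceObject_t volume, int width, int height, int depth)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;
	int z = blockIdx.z * blockDim.z + threadIdx.z;

	if (x < width && y < height && z < depth)
	{
		uchar4 value = make_uchar4(255, 0, 0, 255);
		// Note: the x coordinate of surf3Dwrite is specified in bytes
		surf3Dwrite(value, volume, x * sizeof(uchar4), y, z);
	}
}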

And that’s it! Here’s hoping the API doesn’t change again anytime soon.

GLSL Snippet: emulating running atomic average of colors using imageAtomicCompSwap

This is basically straight out of the [Crassin & Greene] chapter from the excellent OpenGL Insights book, which calculates a running average of an RGB voxel color and stores it into an RGBA8 texture (using the alpha component as an access count).  But for whatever reason, when I dropped their GLSL snippet into my code I couldn’t get it to work correctly.  So I attempted to rewrite it as simply as possible, and basically ended up with almost the same thing, except I used the built-in GLSL functions packUnorm4x8 and unpackUnorm4x8 instead of rolling my own, so it’s ever so slightly simpler.

Anyway, I’ve verified that this works on a GTX 480 and on a GTX Titan. (An earlier version of this snippet produced a small bit of flickering on a few voxels; that has since been fixed, as described in the update below.)

void imageAtomicAverageRGBA8(layout(r32ui) coherent volatile uimage3D voxels, ivec3 coord, vec3 nextVec3)
{
	uint nextUint = packUnorm4x8(vec4(nextVec3,1.0f/255.0f));
	uint prevUint = 0;
	uint currUint;

	vec4 currVec4;

	vec3 average;
	uint count;

	//"Spin" while threads are trying to change the voxel
	while((currUint = imageAtomicCompSwap(voxels, coord, prevUint, nextUint)) != prevUint)
	{
		prevUint = currUint;					//store packed rgb average and count
		currVec4 = unpackUnorm4x8(currUint);	//unpack stored rgb average and count

		average =      currVec4.rgb;		//extract rgb average
		count   = uint(currVec4.a*255.0f);	//extract count

		//Compute the running average
		average = (average*count + nextVec3) / (count+1);

		//Pack new average and incremented count back into a uint
		nextUint = packUnorm4x8(vec4(average, (count+1)/255.0f));
	}
}

This works by using the imageAtomicCompSwap function to effectively implement a spinlock: each thread “spins” until its compare-and-swap succeeds, i.e. until no other thread has modified the voxel since it last read it.

Apparently, the compiler can be quite picky about how things like this are written (don’t use “break” statements); see the thread GLSL loop ‘break’ instruction not executed for more information. I’ve verified that it works fine on both Fermi and Kepler architectures; if anyone can let me know how it works on an AMD architecture, I’ll add that information here.

Edit/Update: So I had a few mistakes in my previous implementation which weren’t very noticeable in a sparsely tessellated model (like the Dwarf), but became much more noticeable as triangle density increased (like in the curtains and plants of the Sponza model).  Anyway, it turned out I hadn’t considered the effects of the packUnorm4x8 and unpackUnorm4x8 functions correctly. The packUnorm4x8 function clamps input components from 0 to 1, so the count updates were getting discarded, and obviously the average was coming out wrong.  Anyway, the solution was to divide by 255 when “packing” the count, and multiply by 255 when unpacking.  This method should work with up to 255 threads attempting to write to the same voxel location.

References
[Crassin & Greene] Octree-Based Sparse Voxelization Using the GPU Hardware Rasterizer http://www.seas.upenn.edu/%7Epcozzi/OpenGLInsights/OpenGLInsights-SparseVoxelization.pdf