OpenGL

Rendering Volume Filling Triangles in OpenGL (with no buffers)

This is the promised follow-up to Rendering a Screen Covering Triangle in OpenGL (with no buffers), except this time the goal is to write a shader that accesses every location in a 3D texture (volume).  We use the same screen-covering trick as before to draw a triangle covering a viewport matched to the X and Y dimensions of the volume, and we use instanced rendering to draw a repeated triangle for each layer in the Z dimension.

The vertex shader looks the same as before with the addition of the instanceID.

flat out int instanceID;

void main()
{
	float x = -1.0 + float((gl_VertexID & 1) << 2);
	float y = -1.0 + float((gl_VertexID & 2) << 1);
	instanceID  = gl_InstanceID;
	gl_Position = vec4(x, y, 0, 1);
}

The fragment shader can then recover the voxel coordinates from gl_FragCoord and the instanceID.

flat in int instanceID;

void main()
{
	ivec3 voxelCoord = ivec3(gl_FragCoord.xy, instanceID);
	voxelOperation(voxelCoord);
}
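The voxelOperation above is just a placeholder for whatever per-voxel work you want to do. As a hypothetical example (the image uniform name and r32ui format are my own assumptions, not part of the technique), it might write into the volume through the image API:

layout(r32ui, binding = 0) coherent uniform uimage3D volume;

void voxelOperation(ivec3 voxelCoord)
{
	//hypothetical example: zero out the voxel
	imageStore(volume, voxelCoord, uvec4(0u));
}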

Very similar to drawing the single screen-covering triangle, we set our viewport to the XY dimensions of the volume, bind a junk VAO to make certain graphics drivers happy, and call glDrawArraysInstanced with an instance count equal to the Z dimension of the volume, so that we draw one triangle per slice of the volume.

glViewport(0, 0, width, height);
glBindVertexArray(junkVAO);
glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, 3, depth);

Which would look sort of like the following:

[Image: VolumeFillingTriangles]

This can be useful for quickly processing a volume. Initially, I used this as an OpenGL 4.2 fallback (instead of compute shaders) so that I could still use the Nsight debugger, until I realized this approach was actually outperforming the compute shader. Of course, when to use compute shaders, and how to use them effectively, deserves a post of its own.

Rendering a Screen Covering Triangle in OpenGL (with no buffers)

This one has been on the backlog for ages now.  Anyway, this is an OpenGL adaptation of a clever trick that’s been around for quite a while, described in DirectX terms by Cort Stratton (@postgoodism) in the “An interesting vertex shader trick” post on #AltDevBlogADay.

It describes a method for rendering a triangle that covers the screen with no buffer inputs: all vertex and texture coordinate information is generated solely from the vertex ID.  Unfortunately, because OpenGL uses a right-handed coordinate system while DirectX uses a left-handed one, the same vertex ID transformation used for DirectX won’t work in OpenGL.  Basically, we need to reverse the order of the triangle vertices so that they are traversed counter-clockwise, as opposed to clockwise in the original implementation. So, after a bit of experimentation I came up with the following adaptation for OpenGL:

void main()
{
	float x = -1.0 + float((gl_VertexID & 1) << 2);
	float y = -1.0 + float((gl_VertexID & 2) << 1);
	gl_Position = vec4(x, y, 0, 1);
}

This transforms the gl_VertexID as follows:

gl_VertexID=0 -> (-1,-1)
gl_VertexID=1 -> ( 3,-1)
gl_VertexID=2 -> (-1, 3)

We can easily add texture coordinates to this as well:

out vec2 texCoord;

void main()
{
	float x = -1.0 + float((gl_VertexID & 1) << 2);
	float y = -1.0 + float((gl_VertexID & 2) << 1);
	texCoord.x = (x+1.0)*0.5;
	texCoord.y = (y+1.0)*0.5;
	gl_Position = vec4(x, y, 0, 1);
}

In that homogeneous clip-space region this provides a position varying from -1 to 1 and texture coordinates varying from 0 to 1, exactly as OpenGL would expect, all without needing to create any buffers. All you have to do is make a single call to glDrawArrays and tell it to render 3 vertices:

glDrawArrays( GL_TRIANGLES, 0, 3 );

This draws a triangle that looks like the following:

[Image: glScreenSpaceTriangle]
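As a quick example of how those texture coordinates typically get used, a minimal fragment shader for a full-screen textured pass might look like the following (the sampler name and binding are my own, not part of the trick):

#version 420

layout(binding = 0) uniform sampler2D screenTexture;	// assumed input texture

in  vec2 texCoord;
out vec4 fragColor;

void main()
{
	//simply copy the input texture to the screen
	fragColor = texture(screenTexture, texCoord);
}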

It’s surprising how often this comes in handy; in a later post I’ll describe how to adapt this trick to efficiently access the elements of a 3D texture.  It also amuses me greatly that Iñigo Quilez’s amazing demo/presentation “Rendering Worlds With Two Triangles” could actually be renamed “Rendering Worlds With One Triangle.”

Bindless textures can “store”

I don’t know how I missed this when Nvidia released NV_bindless_texture; I guess it’s because all the samples I saw used bindless textures to demonstrate a ridiculous number of texture reads. But reading the recently released ARB_bindless_texture extension, I realized they can also be used to “store,” or write, to a very large number of textures (using ARB_shader_image_load_store functionality), which finally gets rid of that extremely pesky MAX_IMAGE_UNITS limitation I’ve been complaining about. The only downside is that I can no longer run my program at home on my GTX 480.
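As a rough sketch of what that looks like (the texture, format, and uniform location below are placeholders of mine, not from any sample), the host side boils down to getting a resident image handle and passing it to the shader as a 64-bit uniform, with no image unit bound at all:

// Sketch: make an image handle for an existing texture and hand it to the shader.
// texture_id, uniformLocation, and the GL_R32UI format are placeholders.
GLuint64 handle = glGetImageHandleARB(texture_id, 0, GL_TRUE, 0, GL_R32UI);
glMakeImageHandleResidentARB(handle, GL_WRITE_ONLY);
glUniformHandleui64ARB(uniformLocation, handle);

// GLSL side (requires #extension GL_ARB_bindless_texture : require):
//   layout(bindless_image, r32ui) uniform writeonly uimage3D volume;
// which can then be written with imageStore as usual.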

AtomicCounters & IndirectBufferCommands

I’ve made use of atomic counters and indirect buffers in the past, but always in the most straightforward manner: create a dedicated buffer for the atomic counter and another for the indirect command buffer, increment the counter in a shader, then write the atomic counter value into the indirect command buffer using the image API, ending up with a shader that looks something like the one below.

#version 420

layout(location = 0) in ivec3 inputBuffer;

layout(r32ui, binding = 0) uniform uimageBuffer outputBuffer;
layout(r32ui, binding = 1) uniform uimageBuffer indirectArrayCommand;
layout(       binding = 0) uniform atomic_uint  atomicCounter;

void main()
{
	// ...
	// do some stuff
	// ...

	if(someCondition == true)
	{
		//increment counter
		int index = int(atomicCounterIncrement(atomicCounter));

		//store stuff in output buffer
		imageStore(outputBuffer, index, uvec4(someStuff));
	}

	memoryBarrier();

	//Store the atomicCounter value to the count (the first element) of the DrawArraysIndirect command
	imageStore(indirectArrayCommand, 0, uvec4(atomicCounter(atomicCounter)));
}

This works fine, but one annoying thing about this approach is that it consumes an extra image unit (of the max 8 available). Fortunately, it turns out to be unnecessary to create an extra atomic counter and perform the synchronization with the indirect draw command. It is possible to simply bind the appropriate element of the indirect draw buffer directly as the atomic counter.

// This binds the count element of the Indirect Array Command Buffer directly as an atomic counter in the shader
// (no need for copy from dedicated atomic counter)
glBindBufferRange(GL_ATOMIC_COUNTER_BUFFER,        // Target buffer is the atomic counter
                  0,                               // Binding point, must match the shader
                  IndirectArrayCommandBuffer_id,   // Source buffer is the Indirect Draw Command Buffer
                  0,                               // Byte offset: 0 for count, 4 for primCount (instanceCount), etc...
                  sizeof(GLuint));
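
For reference, that offset is a byte offset into the indirect command structure, which the OpenGL spec lays out as follows:

// Layout of the DrawArraysIndirect command; the byte offset passed to
// glBindBufferRange selects which field the atomic counter aliases
// (0 -> count, 4 -> instanceCount, 8 -> first, 12 -> baseInstance).
typedef struct {
    GLuint count;
    GLuint instanceCount;   // called primCount in older versions of the spec
    GLuint first;
    GLuint baseInstance;    // reservedMustBeZero prior to OpenGL 4.2
} DrawArraysIndirectCommand;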

This gets rid of the indirect buffer's image unit binding, which simplifies the shader as shown below. The main reason I’ve found to do this is to reduce the number of image units required by the shader, as it’s very easy to hit the limit of 8.

#version 420

layout(location = 0) in ivec3 inputBuffer;

layout(r32ui, binding = 0) uniform uimageBuffer outputBuffer;
layout(       binding = 0) uniform atomic_uint  atomicCounter;

void main()
{
	// ...
	// do some stuff
	// ...

	if(someCondition == true)
	{
		//increment counter
		int index = int(atomicCounterIncrement(atomicCounter));

		//store stuff in output buffer
		imageStore(outputBuffer, index, uvec4(someStuff));
	}
}
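
After the shader has run, the indirect draw can then be issued straight from that same buffer. A minimal sketch of the host side (the GL_POINTS mode is just an example):

// Make the shader's counter/image writes visible to the indirect draw, then draw.
glMemoryBarrier(GL_ATOMIC_COUNTER_BARRIER_BIT | GL_COMMAND_BARRIER_BIT);
glBindBuffer(GL_DRAW_INDIRECT_BUFFER, IndirectArrayCommandBuffer_id);
glDrawArraysIndirect(GL_POINTS, 0);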

CUDA 5 and OpenGL Interop and Dynamic Parallelism

I seem to revisit this every time Nvidia releases a new version of CUDA.

The good news…

The old methods still work: the whole register, map, bind, etc… process I described in my now two-year-old post Writing to 3D OpenGL textures in CUDA 4.1 with 3D Surface writes carries over unchanged.  Ideally, the new version number shouldn’t introduce any new problems…

The bad news…

Unfortunately, if you try to write to a globally scoped CUDA surface from a device-side launched kernel (i.e. a dynamic kernel), nothing will happen.  You’ll scratch your head and wonder why code that works perfectly fine when launched from the host-side, fails silently when launched device-side.

I only discovered the reason when I decided to read, word for word, the CUDA Dynamic Parallelism Programming Guide. On page 14, in the “Textures & Surfaces” section is this note:

NOTE: The device runtime does not support legacy module-scope (i.e. Fermi-style)
textures and surfaces within a kernel launched from the device. Module-scope (legacy)
textures may be created from the host and used in device code as for any kernel, but
may only be used by a top-level kernel (i.e. the one which is launched from the host).

So now the old way of dealing with textures is considered “legacy,” but apparently not quite deprecated yet.  So don’t use them if you plan on using dynamic parallelism.  Additional note: if you so much as call a function that attempts to perform a “Fermi-style” surface write, your kernel will fail silently, so I highly recommend removing all “Fermi-style” textures and surfaces if you plan on using dynamic parallelism.

So what’s the “new style” of textures and surfaces? Well, also on page 14 there is a footnote saying:

Dynamically created texture and surface objects are an addition to the CUDA memory model
introduced with CUDA 5.0. Please see the CUDA Programming Guide for details.

So I guess they’re called “Dynamically created textures and surfaces”, which is a mouthful so I’m going to refer to them as “Kepler-style” textures and surfaces.  In the actual API they are cudaTextureObject_t and cudaSurfaceObject_t, and you can pass them around as parameters instead of having to declare them at file scope.

OpenGL Interop

So now we have two distinct methods for dealing with textures and surfaces, “Fermi-style” and “Kepler-style”, but we only know how graphics interoperability works with the old, might-as-well-be-deprecated, “Fermi-style” textures and surfaces.

And while there are some samples showing how the new “Kepler-style” textures and surfaces work (see the Bindless Texture sample), all the interop information still seems to target the old “Fermi-style” textures and surfaces.  Fortunately, there is some common ground between “Kepler-style” and “Fermi-style” textures and surfaces, and that common ground is the cudaArray.

Really, all we have to do is replace Step 6 (binding a cudaArray to a globally scoped surface) from the previous tutorial with the creation of a cudaSurfaceObject_t. That entails creating a CUDA resource description (cudaResourceDesc), setting the array portion of the cudaResourceDesc to our cudaArray, and then using that cudaResourceDesc to create our cudaSurfaceObject_t, which we can then pass to our kernels and use to write to our registered and mapped OpenGL textures.

// Create the cuda resource description
struct cudaResourceDesc resourceDesc;
memset(&resourceDesc, 0, sizeof(resourceDesc));
resourceDesc.resType = cudaResourceTypeArray;	// be sure to set the resource type to cudaResourceTypeArray
resourceDesc.res.array.array = yourCudaArray;	// this is the important bit

// Create the surface object
cudaSurfaceObject_t writableSurfaceObject = 0;
cudaCreateSurfaceObject(&writableSurfaceObject, &resourceDesc);
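
And since the surface object is just a value, using it from a kernel (including a device-launched one) is straightforward. Here’s a sketch of a kernel writing to the mapped texture, assuming a single-channel float format (the kernel and parameter names are mine):

// Sketch: write a constant into every voxel of the surface backing the OpenGL texture.
// Note that surf3Dwrite takes the x coordinate in bytes.
__global__ void writeToVolume(cudaSurfaceObject_t surface, int width, int height, int depth)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;
	int z = blockIdx.z * blockDim.z + threadIdx.z;

	if (x < width && y < height && z < depth)
	{
		surf3Dwrite(1.0f, surface, x * sizeof(float), y, z);
	}
}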

And that’s it! Here’s hoping the API doesn’t change again anytime soon.

GLSL Snippet: emulating running atomic average of colors using imageAtomicCompSwap

This is basically straight out of the [Crassin & Greene] chapter from the excellent OpenGL Insights book, which calculates a running average of an RGB voxel color and stores it into an RGBA8 texture (using the alpha component as an access count).  But for whatever reason, when I dropped their GLSL snippet into my code I couldn’t get it to work correctly.  So I attempted to rewrite it as simply as possible, and basically ended up with almost the same thing, except I used the built-in GLSL functions packUnorm4x8 and unpackUnorm4x8 instead of rolling my own, so it’s ever so slightly simpler.

Anyway, I’ve verified that this works on a GTX 480 and a GTX Titan (an earlier version had a small bit of flickering on a few voxels, which has since been fixed; see the update below).

void imageAtomicAverageRGBA8(layout(r32ui) coherent volatile uimage3D voxels, ivec3 coord, vec3 nextVec3)
{
	uint nextUint = packUnorm4x8(vec4(nextVec3,1.0f/255.0f));
	uint prevUint = 0;
	uint currUint;

	vec4 currVec4;

	vec3 average;
	uint count;

	//"Spin" while threads are trying to change the voxel
	while((currUint = imageAtomicCompSwap(voxels, coord, prevUint, nextUint)) != prevUint)
	{
		prevUint = currUint;					//store packed rgb average and count
		currVec4 = unpackUnorm4x8(currUint);	//unpack stored rgb average and count

		average =      currVec4.rgb;		//extract rgb average
		count   = uint(currVec4.a*255.0f);	//extract count

		//Compute the running average
		average = (average*count + nextVec3) / (count+1);

		//Pack new average and incremented count back into a uint
		nextUint = packUnorm4x8(vec4(average, (count+1)/255.0f));
	}
}
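
A hypothetical call site, assuming the function above lives in a voxelizing fragment shader (the uniform and variable names here are mine, not from the book chapter):

layout(r32ui, binding = 0) coherent volatile uniform uimage3D voxelColors;

flat in int  instanceID;
in      vec3 fragmentColor;

void main()
{
	ivec3 voxelCoord = ivec3(gl_FragCoord.xy, instanceID);
	imageAtomicAverageRGBA8(voxelColors, voxelCoord, fragmentColor);
}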

This works by using the imageAtomicCompSwap function to effectively implement a spinlock, which “spins” until all threads trying to access the voxel are done.

Apparently, the compiler can be quite picky about how things like this are written (don’t use “break” statements); see the thread GLSL loop ‘break’ instruction not executed for more information. I’ve verified it works fine for both Fermi and Kepler architectures, but I can’t make any guarantees beyond that; if anyone can let me know how it behaves on an AMD architecture, I’ll add that information here.

Edit/Update: So I had a few mistakes in my previous implementation which weren’t very noticeable in a sparsely tessellated model (like the Dwarf), but became much more noticeable as triangle density increased (like in the curtains and plants of the Sponza model).  It turned out I hadn’t considered the effects of the packUnorm4x8 and unpackUnorm4x8 functions correctly: packUnorm4x8 clamps input components to the 0 to 1 range, so the count updates were getting discarded and the average was coming out wrong.  The solution was to divide by 255 when packing the count and multiply by 255 when unpacking.  This method should work with up to 255 threads attempting to write to the same voxel location.

References
[Crassin & Greene] Octree-Based Sparse Voxelization Using the GPU Hardware Rasterizer http://www.seas.upenn.edu/%7Epcozzi/OpenGLInsights/OpenGLInsights-SparseVoxelization.pdf

Writing to 3-components buffers using the image API in OpenGL

As I’ve described in detail in another blog post, atomic counters used in conjunction with the image API and indirect draw buffers can be an excellent and highly performant alternative/replacement for the transform feedback mechanism (oh wait, I still haven’t published that previous blog post… and performant is not actually a real word).

Anyway, one place where this atomic counter + image API + indirect buffers approach becomes a little cumbersome is its slightly less than elegant handling of 3-component buffer texture formats.

In the OpenGL 4.2 spec the list of supported buffer texture formats is given in table 3.15, while the list of supported image unit formats is given in table 3.21.  The takeaway from comparing these tables is that the supported image unit formats generally omit 3-component formats (other than GL_R11F_G11F_B10F).  So how do you deal with this if you have, say, a GL_RGB32F or GL_RGB32UI internal format? Well, it’s actually pretty easy: just bind the proxy texture as the one-component version of the internal format (GL_R32F or GL_R32UI).

glBindImageTexture(0, buffer_proxy_tex, 0, GL_TRUE, 0, GL_WRITE_ONLY, GL_R32F);
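
For context, the buffer texture behind buffer_proxy_tex is still created with the full 3-component internal format; only the image binding above pretends it is one-component. A rough setup sketch (the names and capacity are placeholders of mine):

GLuint buffer_id, buffer_proxy_tex;
GLsizeiptr maxElements = 1000000;	// placeholder capacity

glGenBuffers(1, &buffer_id);
glBindBuffer(GL_TEXTURE_BUFFER, buffer_id);
glBufferData(GL_TEXTURE_BUFFER, maxElements * 3 * sizeof(GLfloat), NULL, GL_DYNAMIC_COPY);

glGenTextures(1, &buffer_proxy_tex);
glBindTexture(GL_TEXTURE_BUFFER, buffer_proxy_tex);
glTexBuffer(GL_TEXTURE_BUFFER, GL_RGB32F, buffer_id);	// the data really is 3-component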

Then, in the shader, put a 3-component stride on the atomic counter index and store each component with its own imageStore operation.

layout(binding = 0)         uniform atomic_uint atomicCount;
layout(r32f, binding = 0)   uniform imageBuffer positionBuffer;	// r32f to match the one-component image binding

void main()
{
  //Some other code...

  int index = 3*int(atomicCounterIncrement(atomicCount));

  imageStore(positionBuffer, index+0, vec4(x));
  imageStore(positionBuffer, index+1, vec4(y));
  imageStore(positionBuffer, index+2, vec4(z));

  //Some more code...
}

And that actually works great; in my experience, replacing transform feedback with this approach has been as fast or faster despite the multiple imageStore calls.

OpenGL 4.3 released

OpenGL 4.3 has just been released, and almost instantly G-Truc (Christophe Riccio) posted another of his excellent OpenGL reviews.  Additionally, Mike Bailey has already made available some slides on the new compute shaders.  And as usual, Nvidia already has beta drivers available.

With over 20 new extensions this is a rather large update for a point release, and while I’m extremely grateful that people have had a chance to preview these extensions and provide an alternative to the pain of reading through the extensions themselves (though I will probably do that anyway), I can’t help but wish that this sort of access weren’t so exclusive.  I find the current method of dumping a bunch of new extensions out each SIGGRAPH and shouting “Surprise!” a bit jarring, and more tragically, there is no mechanism to allow the OpenGL community at large to provide feedback (unless you are a member of Khronos, I guess).

If you look at the extensions, I think they lend themselves perfectly to publication on some sort of official wiki: revision history would be managed automatically, the OpenGL community could provide feedback on the discussion page, and Khronos members would have permission to make the actual edits to the extension.  I guess what I am saying, particularly in regard to the publication of new extensions, is that I wish OpenGL were a little bit more “open.”

OpenGL should support loading shader files

OpenGL’s shader system is purely string based. Just pass it a couple of strings worth of shader code, compile, link, and go.

It’s not actually that bad, but it gets progressively more annoying the more advanced your shader code gets. It precludes the convenient use of #include, because OpenGL has no idea where that string came from (which directory/file). All of a sudden you find yourself terribly missing the ability to factor out some useful utility code into a header file and just #include it wherever you need it.

Why am I griping about this now? Because I just wrote some code that runs through my shader files line by line looking for #include directives, loading and substituting the correct included source into the original source. Honestly, it wasn’t that bad, but it still *feels* like a hack, and something I really shouldn’t have to do.
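
Roughly, the substitution boils down to something like this (a simplified sketch, not my exact code, and without the error-log bookkeeping described below):

#include <fstream>
#include <sstream>
#include <string>

// Recursively inline '#include "file"' directives into the shader source,
// resolving them relative to the including file's directory.
std::string loadShaderSource(const std::string& path)
{
	std::ifstream file(path);
	std::ostringstream out;
	std::string line;
	const std::string directory = path.substr(0, path.find_last_of("/\\") + 1);

	while (std::getline(file, line))
	{
		if (line.compare(0, 8, "#include") == 0)
		{
			//pull out the file name between the quotes and inline it recursively
			const std::size_t first = line.find('"') + 1;
			const std::size_t last  = line.find('"', first);
			out << loadShaderSource(directory + line.substr(first, last - first)) << '\n';
		}
		else
		{
			out << line << '\n';
		}
	}
	return out.str();
}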

In reality I was only half done. I had rendered the shader error log meaningless, since the source it had compiled no longer matched the file I was working on. This meant I still had to read the error log generated by OpenGL whenever shader compilation failed, parse it, extract line numbers, and then figure out the correct line and file associated with the error message so that it would actually be meaningful. It works, but again, it’s annoying, and it doesn’t seem like something OpenGL programmers should have to concern themselves with.

But what about ARB_shading_language_include?

Yes, I am aware there is an extension allowing shader includes, but it’s all wrong. It is again string based, it introduces 6 OpenGL functions, and it requires its own compilation step. A #include should be a 100% preprocessor operation; I don’t want to have to recompile my project just to include a file in my shader. And it’s not the way OpenGL is headed: increasingly, more and more is being defined in the shader code itself rather than in the calling OpenGL program (which I think is great).

If it’s not that hard for me to hack in real #include support, I imagine the OpenGL driver writers should be able to handle it as well, and probably do a much better job of it.

So, instead of glShaderSource, I propose glShaderFile, which instead of taking a string of shader source takes a string containing a shader file name, from which it extracts the directory so that the shader compiler knows where to look every time #include is used.  Optionally, it could take another string explicitly defining the shader include directory.  Alternatively, another version of glShaderSource, say glShaderSourceDir, could take a shader string and have a parameter to explicitly define the shader include directory. Something along the lines of the sketch below.
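
Purely hypothetical signatures, of course; neither of these functions actually exists in OpenGL:

// Hypothetical: compile directly from files, using their directory for #include lookup
void glShaderFile(GLuint shader, GLsizei count, const GLchar* const* fileNames);

// Hypothetical alternative: keep string input but explicitly supply an include directory
void glShaderSourceDir(GLuint shader, GLsizei count,
                       const GLchar* const* strings, const GLint* lengths,
                       const GLchar* includeDir);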

Anyway, that’s my rant.  It’s not a huge deal, but I actually think this simple addition would have a fairly large impact on the usability of GLSL shaders.