GLSL

Atomic Counters & Indirect Command Buffers


I’ve made use of Atomic Counters and Indirect Buffers in the past, but always in the most straightforward manner: create a dedicated buffer for the atomic counter and another for the Indirect Command Buffer, increment the counter in a shader, then write the atomic counter value into the Indirect Command Buffer using the image API, ending up with a shader that looks something like the one below.

#version 420

layout(location = 0) in ivec3 inputBuffer;

layout(r32ui, binding = 0) uniform uimageBuffer outputBuffer;
layout(r32ui, binding = 1) uniform uimageBuffer indirectArrayCommand;
layout(       binding = 0) uniform atomic_uint  atomicCounter;

void main()
{
	// ...
	// do some stuff
	// ...

	if(someCondition == true)
	{
		//increment counter
		int index = int(atomicCounterIncrement(atomicCounter));

		//store stuff in output buffer
		imageStore(outputBuffer, index, uvec4(someStuff));
	}

	memoryBarrier();

	//Store the atomicCounter value to the count (the first element) of the DrawArraysIndirect command
	imageStore(indirectArrayCommand, 0, uvec4(atomicCounter(atomicCounter)));
}
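For context, the host-side setup for this traditional approach might look something like the sketch below (buffer and texture names are placeholders and error handling is omitted): a dedicated atomic counter buffer, plus the indirect command buffer exposed to the shader through an r32ui buffer texture bound to image unit 1.

// Hypothetical host-side setup for the shader above (names are placeholders).

// Dedicated atomic counter buffer, zeroed before each pass.
GLuint atomicCounterBuffer_id;
GLuint zero = 0;
glGenBuffers(1, &atomicCounterBuffer_id);
glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, atomicCounterBuffer_id);
glBufferData(GL_ATOMIC_COUNTER_BUFFER, sizeof(GLuint), &zero, GL_DYNAMIC_DRAW);
glBindBufferBase(GL_ATOMIC_COUNTER_BUFFER, 0, atomicCounterBuffer_id); // binding = 0 in the shader

// Indirect draw command: {count, primCount, first, baseInstance}.
GLuint indirectCommand[4] = { 0, 1, 0, 0 };
GLuint IndirectArrayCommandBuffer_id;
glGenBuffers(1, &IndirectArrayCommandBuffer_id);
glBindBuffer(GL_DRAW_INDIRECT_BUFFER, IndirectArrayCommandBuffer_id);
glBufferData(GL_DRAW_INDIRECT_BUFFER, sizeof(indirectCommand), indirectCommand, GL_DYNAMIC_DRAW);

// Expose the indirect command buffer to the shader as an r32ui buffer texture on image unit 1
// (the outputBuffer on image unit 0 would be set up the same way).
GLuint indirectCommandTex_id;
glGenTextures(1, &indirectCommandTex_id);
glBindTexture(GL_TEXTURE_BUFFER, indirectCommandTex_id);
glTexBuffer(GL_TEXTURE_BUFFER, GL_R32UI, IndirectArrayCommandBuffer_id);
glBindImageTexture(1, indirectCommandTex_id, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_R32UI);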

This works fine, but one annoying thing about this approach is that it consumes an extra image unit (of the max 8 available). Fortunately, it turns out to be unnecessary to create an extra atomic counter and perform this synchronization with the indirect draw command: you can simply bind the appropriate element of the indirect draw buffer directly as the atomic counter.

// This binds the count element of the Indirect Array Command Buffer directly as an atomic counter in the shader
// (no need for copy from dedicated atomic counter)
glBindBufferRange(GL_ATOMIC_COUNTER_BUFFER,        // Target buffer is the atomic counter
                  0,                               // Binding point, must match the shader
                  IndirectArrayCommandBuffer_id,   // Source buffer is the Indirect Draw Command Buffer
                  0,                               // Byte offset: 0 for count, sizeof(GLuint) for primCount (instance count), etc...
                  sizeof(GLuint));

This allows us to get rid of the indirect buffer’s image unit binding, which simplifies the shader as shown below. The main reason I’ve found to do this is to reduce the number of image units required by the shader, since it’s very easy to hit the limit of 8.

#version 420

layout(location = 0) in ivec3 inputBuffer;

layout(r32ui, binding = 0) uniform uimageBuffer outputBuffer;
layout(       binding = 0) uniform atomic_uint  atomicCounter;

void main()
{
	// ...
	// do some stuff
	// ...

	if(someCondition == true)
	{
		//increment counter
		int index = int(atomicCounterIncrement(atomicCounter));

		//store stuff in output buffer
		imageStore(outputBuffer, index, uvec4(someStuff));
	}
}
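For completeness, here is a hedged sketch of how I imagine driving this streamlined version from the host side (the program names and element count are placeholders): reset the count element of the indirect command buffer, bind that same element as the atomic counter, run the append pass, then issue the indirect draw.

// Hypothetical usage sketch for the simplified shader above (names are placeholders).

// Reset the count element (the first GLuint) of the indirect command buffer.
GLuint zero = 0;
glBindBuffer(GL_DRAW_INDIRECT_BUFFER, IndirectArrayCommandBuffer_id);
glBufferSubData(GL_DRAW_INDIRECT_BUFFER, 0, sizeof(GLuint), &zero);

// Bind the count element directly as the shader's atomic counter (binding = 0),
// exactly as shown above.
glBindBufferRange(GL_ATOMIC_COUNTER_BUFFER, 0, IndirectArrayCommandBuffer_id, 0, sizeof(GLuint));

// Run the append pass; only its side effects (atomic counter + image stores) matter.
glUseProgram(appendProgram_id);
glEnable(GL_RASTERIZER_DISCARD);
glDrawArrays(GL_POINTS, 0, inputElementCount);
glDisable(GL_RASTERIZER_DISCARD);

// Make the counter value and image stores visible to the commands that follow.
glMemoryBarrier(GL_COMMAND_BARRIER_BIT | GL_TEXTURE_FETCH_BARRIER_BIT);

// Draw using the count that the GPU wrote, without any readback.
glUseProgram(drawProgram_id);
glBindBuffer(GL_DRAW_INDIRECT_BUFFER, IndirectArrayCommandBuffer_id);
glDrawArraysIndirect(GL_POINTS, nullptr); // offset 0 into the bound indirect buffer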

Fixed imageAtomicAverageRGBA8


So I fixed some issues I had in my previous implementation of imageAtomicAverageRGBA8; see the previous post for an explanation of what I got wrong. Reposting the corrected code here, and sorry to anyone who was trying to use the broken version.

void imageAtomicAverageRGBA8(layout(r32ui) coherent volatile uimage3D voxels, ivec3 coord, vec3 nextVec3)
{
	uint nextUint = packUnorm4x8(vec4(nextVec3,1.0f/255.0f));
	uint prevUint = 0;
	uint currUint;

	vec4 currVec4;

	vec3 average;
	uint count;

	//"Spin" while threads are trying to change the voxel
	while((currUint = imageAtomicCompSwap(voxels, coord, prevUint, nextUint)) != prevUint)
	{
		prevUint = currUint;					//store packed rgb average and count
		currVec4 = unpackUnorm4x8(currUint);	//unpack stored rgb average and count

		average =      currVec4.rgb;			//extract rgb average
		count   = uint(currVec4.a*255.0f);		//extract count

		//Compute the running average
		average = (average*count + nextVec3) / (count+1);

		//Pack new average and incremented count back into a uint
		nextUint = packUnorm4x8(vec4(average, (count+1)/255.0f));
	}
}

Anyway, original credit for this technique should go to Cyril Crassin, whose implementation in [Crassin & Greene] deftly avoided the mistakes I made by using his own pack/unpack functions. I’m still not sure why his implementation doesn’t work for me, though. Note: I tried to debug these in the Nsight shader debugger and got the message “Not a debuggable shader”, so either it doesn’t support atomics (unverified) or these “spinlock”-style shaders are, for now, too clever for the debugger.

References
[Crassin & Greene] Octree-Based Sparse Voxelization Using the GPU Hardware Rasterizer http://www.seas.upenn.edu/%7Epcozzi/OpenGLInsights/OpenGLInsights-SparseVoxelization.pdf

GLSL Snippet: emulating running atomic average of colors using imageAtomicCompSwap


This is basically straight out of the [Crassin & Greene] chapter from the excellent OpenGL Insights book, which calculates a running average of the RGB voxel color and stores it into an RGBA8 texture (using the alpha component as an access count). But for whatever reason, when I dropped their GLSL snippet into my code I couldn’t get it to work correctly. So I attempted to rewrite it as simply as possible, and basically ended up with almost the same thing, except that I used the built-in GLSL functions packUnorm4x8 and unpackUnorm4x8 instead of rolling my own, so it’s ever so slightly simpler.

Anyway, I’ve verified that this works on a GTX 480 (the small bit of flickering I originally saw on a few voxels has since been fixed), and it also works on a GTX Titan.

void imageAtomicAverageRGBA8(layout(r32ui) coherent volatile uimage3D voxels, ivec3 coord, vec3 nextVec3)
{
	uint nextUint = packUnorm4x8(vec4(nextVec3,1.0f/255.0f));
	uint prevUint = 0;
	uint currUint;

	vec4 currVec4;

	vec3 average;
	uint count;

	//"Spin" while threads are trying to change the voxel
	while((currUint = imageAtomicCompSwap(voxels, coord, prevUint, nextUint)) != prevUint)
	{
		prevUint = currUint;					//store packed rgb average and count
		currVec4 = unpackUnorm4x8(currUint);	//unpack stored rgb average and count

		average =      currVec4.rgb;		//extract rgb average
		count   = uint(currVec4.a*255.0f);	//extract count

		//Compute the running average
		average = (average*count + nextVec3) / (count+1);

		//Pack new average and incremented count back into a uint
		nextUint = packUnorm4x8(vec4(average, (count+1)/255.0f));
	}
}

This works by using the imageAtomicCompSwap function to effectively implement a spinlock, which “spins” until all threads trying to access the voxel are done.
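For reference, here is a hedged sketch of how the voxel texture might be set up on the host side so that imageAtomicCompSwap can be used on it (the names and volume size are placeholders): the storage is allocated as RGBA8 so the result can later be sampled as color, but it is bound to the image unit reinterpreted as R32UI, since the atomic image functions require an r32ui image and the two formats share the same 32-bit size class.

// Hypothetical setup of the voxel volume passed to imageAtomicAverageRGBA8 (names/sizes are placeholders).
const GLsizei voxelDim = 256;

GLuint voxelTex_id;
glGenTextures(1, &voxelTex_id);
glBindTexture(GL_TEXTURE_3D, voxelTex_id);
glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);

// Storage is RGBA8 so the averaged colors can be sampled normally afterwards...
glTexStorage3D(GL_TEXTURE_3D, 1, GL_RGBA8, voxelDim, voxelDim, voxelDim);

// ...but the image binding reinterprets it as R32UI for imageAtomicCompSwap
// (bound to whichever image unit the 'voxels' uniform uses; 0 here).
glBindImageTexture(0, voxelTex_id, 0, GL_TRUE, 0, GL_READ_WRITE, GL_R32UI);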

Apparently, the compiler can be quite picky about how things like this are written (don’t use “break” statements); see the thread GLSL loop ‘break’ instruction not executed for more information. I’ve verified that it works fine on both Fermi and Kepler architectures, but I can’t guarantee it will work on others; if anyone can let me know how it behaves on an AMD architecture, I’ll add that information here.

Edit/Update: So I had a few mistakes in my previous implementation which weren’t very noticeable on a sparsely tessellated model (like the Dwarf), but became much more noticeable as triangle density increased (like in the curtains and plants of the Sponza model). It turned out I hadn’t correctly accounted for the behavior of the packUnorm4x8 and unpackUnorm4x8 functions: packUnorm4x8 clamps input components to the range 0 to 1, so the count updates were getting discarded and the average was obviously coming out wrong. The solution was to divide the count by 255 when packing it and multiply by 255 when unpacking it. This method should work with up to 255 threads attempting to write to the same voxel location.

References
[Crassin & Greene] Octree-Based Sparse Voxelization Using the GPU Hardware Rasterizer http://www.seas.upenn.edu/%7Epcozzi/OpenGLInsights/OpenGLInsights-SparseVoxelization.pdf

Writing to 3-component buffers using the image API in OpenGL


As I’ve described in detail in another blog post, atomic counters used in conjunction with the image API and indirect draw buffers can be an excellent and highly performant alternative/replacement for the transformFeedback mechanism (oh wait, I still haven’t published that previous blog post… and performant is not actually a real word).

Anyway, one place where this atomic counter + image API + indirect buffers approach becomes a little cumbersome is its slightly less than elegant handling of 3-component buffer texture formats.

In the OpenGL 4.2 spec, the supported buffer texture formats are listed in table 3.15, while the supported image unit formats are listed in table 3.21. The takeaway from comparing these tables is that the supported image unit formats generally omit the 3-component formats (other than GL_R11F_G11F_B10F). So how do you deal with this if you have, say, a GL_RGB32F or GL_RGB32UI internal format? Well, it’s actually pretty easy: just bind the proxy texture as the one-component version of the internal format (GL_R32F or GL_R32UI).

glBindImageTexture(0, buffer_proxy_tex, 0, GL_TRUE, 0, GL_WRITE_ONLY, GL_R32F);
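For completeness, here is a hedged sketch of how buffer_proxy_tex might be created in the first place (the buffer name and capacity are placeholders): the proxy is just a buffer texture built over the same buffer object, but with the one-component GL_R32F format, which is what makes the glBindImageTexture call above legal.

// Hypothetical creation of the "proxy" buffer texture used above (names/sizes are placeholders).
const GLsizeiptr maxElements = 1 << 20; // placeholder capacity

// The underlying buffer, e.g. tightly packed vec3 positions consumed elsewhere.
GLuint positionBuffer_id;
glGenBuffers(1, &positionBuffer_id);
glBindBuffer(GL_TEXTURE_BUFFER, positionBuffer_id);
glBufferData(GL_TEXTURE_BUFFER, maxElements * 3 * sizeof(GLfloat), nullptr, GL_DYNAMIC_COPY);

// Proxy buffer texture over the same data store, but with the one-component format GL_R32F.
GLuint buffer_proxy_tex;
glGenTextures(1, &buffer_proxy_tex);
glBindTexture(GL_TEXTURE_BUFFER, buffer_proxy_tex);
glTexBuffer(GL_TEXTURE_BUFFER, GL_R32F, positionBuffer_id);

// Bind the proxy as a single-component image for writing (same call as above).
glBindImageTexture(0, buffer_proxy_tex, 0, GL_TRUE, 0, GL_WRITE_ONLY, GL_R32F);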

Then, in the shader, apply a 3-element stride to the index obtained from the atomic counter and store each component with its own imageStore operation.

layout(binding = 0)         uniform atomic_uint atomicCount;
layout(r32f,   binding = 0) uniform imageBuffer positionBuffer;

void main()
{
  //Some other code...

  int index = 3*int(atomicCounterIncrement(atomicCount));

  //only the x component is actually stored for an r32f image
  imageStore(positionBuffer, index+0, vec4(x));
  imageStore(positionBuffer, index+1, vec4(y));
  imageStore(positionBuffer, index+2, vec4(z));

  //Some more code...
}

And that actually works great; in my experience, replacing transformFeedback with this approach has been as fast or faster, despite the multiple imageStore calls.

OpenGL should support loading shader files


OpenGL’s shader system is purely string based. Just pass it a couple of strings worth of shader code, compile, link, and go.

It’s not actually that bad, but it gets progressively more annoying the more advanced your shader code gets. It precludes the convenient use of #include, because OpenGL has no idea where that string came from (which directory or file). All of a sudden you find yourself terribly missing the ability to factor some useful utility code out into a header file and just #include it wherever you need it.

Why am I griping about this now? Because I just wrote some code that runs through my shader files line by line looking for #include directives, loading the included files and substituting their source into the original source. Honestly, it wasn’t that bad, but it still *feels* like a hack, and like something I really shouldn’t have to do.

In reality I was only half done: I had rendered the shader error log meaningless, since the source it had compiled no longer matched the file I was working on. This meant I still had to read the error log generated by OpenGL whenever shader compilation failed, parse it, extract the line numbers, and then figure out the correct line and file associated with each error so that the message would actually be meaningful. It works, but again, it’s annoying, and it doesn’t seem like something OpenGL programmers should have to concern themselves with.
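For what it’s worth, the preprocessing code ends up being roughly the shape of the sketch below (a minimal version with names of my choosing: no include guards, no cycle detection, and none of the error-line remapping described above).

#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: recursively expand #include "file" directives relative to a base directory.
std::string loadShaderWithIncludes(const std::string& dir, const std::string& fileName)
{
    std::ifstream file(dir + "/" + fileName);
    std::ostringstream out;

    std::string line;
    while (std::getline(file, line))
    {
        std::size_t pos = line.find("#include");
        if (pos != std::string::npos)
        {
            // Pull the file name out from between the quotes and splice its contents in.
            std::size_t first = line.find('"', pos);
            std::size_t last  = line.find('"', first + 1);
            std::string includeName = line.substr(first + 1, last - first - 1);
            out << loadShaderWithIncludes(dir, includeName) << "\n";
        }
        else
        {
            out << line << "\n";
        }
    }
    return out.str();
}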

But what about ARB_shading_language_include?

Yes, I am aware there is an extension allowing shader includes, but it’s all wrong. It is again string based, it introduces six new OpenGL functions, and it requires its own compilation step. A #include should be a 100% preprocessor operation; I don’t want to have to recompile my project just to include a file in my shader. And it’s not the way OpenGL is headed: increasingly, more and more is being defined in the shader code itself rather than in the calling OpenGL program (which I think is great).

If it’s not that hard for me to hack in real #include support, I imagine the OpenGL driver writers should be able to handle it as well, and probably do a much better job of it.

So, instead of glShaderSource, I propose glShaderFile, which, instead of taking a string of shader source, would take a string containing a shader file name, from which it could extract the directory so that the shader compiler knows where to look every time #include is used. Optionally, it could take another string explicitly defining the shader include directory. Alternatively, another version of glShaderSource, say glShaderSourceDir, could take a shader string plus a parameter explicitly defining the shader include directory.

Anyway, that’s my rant. It’s not a huge deal, but I actually think this simple addition would have a fairly large impact on the usability of GLSL shaders.

GLSL sign function


The GLSL sign function always seems like a great way to remove some unnecessary if statements from my shaders, but I never seem to get to use it, because I always need to consider zero as either positive or negative, and not as its own special value.

Anyway, I just realized you can accomplish the same thing with the step function.

step(0, x)*2 - 1;

This will return -1.0 if x < 0, and 1.0 if x >= 0.

Which is not terribly readable, hence this overly verbose function:

//returns -1.0 if x < 0, and 1.0 if x >= 0
float signGreaterEqualZero(float x)
{
	return step(0, x)*2 - 1;
}