Rendering Volume Filling Triangles in OpenGL (with no buffers)

This is the promised follow-up to Rendering a Screen Covering Triangle in OpenGL (with no buffers), except this time the goal is to write a shader that accesses every location in a 3D texture (volume).  We use the same screen covering trick as before to draw a triangle covering a viewport matched to the X and Y dimensions of the volume, and we use instanced rendering to draw a repeated triangle for each layer in the Z dimension.

The vertex shader looks the same as before with the addition of the instanceID.

flat out int instanceID;

void main()
{
	float x = -1.0 + float((gl_VertexID & 1) << 2);
	float y = -1.0 + float((gl_VertexID & 2) << 1);
	instanceID  = gl_InstanceID;
	gl_Position = vec4(x, y, 0, 1);
}

The fragment shader can then recover the voxel coordinates from gl_FragCoord and the instanceID.

flat in int instanceID;

void main()
{
	ivec3 voxelCoord = ivec3(gl_FragCoord.xy, instanceID);
	voxelOperation(voxelCoord);
}

Very similar to drawing the single screen covering triangle, we set our viewport to the XY dimensions of the volume, bind a junk VAO to make certain graphics drivers happy, and call glDrawArraysInstanced with an instance count equal to the Z dimension of the volume, so that we draw one triangle per slice of the volume.

glViewport(0, 0, width, height);
glBindVertexArray(junkVAO);
glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, 3, depth);
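
For context, here is a minimal host-side sketch of the whole dispatch, assuming the volume is bound as a writable image so the fragment shader's voxelOperation can imageStore into it (volumeFillProgram_id, volumeTexture_id, and the GL_R32UI format are illustrative names of mine, not from the original code):

glUseProgram(volumeFillProgram_id);

// Bind the whole 3D texture (layered = GL_TRUE) as a writable image
glBindImageTexture(0, volumeTexture_id, 0, GL_TRUE, 0, GL_WRITE_ONLY, GL_R32UI);

// Viewport matches the XY dimensions, one instanced triangle per Z slice
glViewport(0, 0, width, height);
glBindVertexArray(junkVAO);
glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, 3, depth);

// Make the image writes visible (pick the barrier bit to match the next access)
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);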

Which would look sort of like the following:

[Figure: VolumeFillingTriangles]

This can be useful for quickly processing a volume. Initially, I used this as an OpenGL 4.2 fallback (instead of compute shaders) so that I could still use the Nsight debugger, until I realized this approach was actually outperforming the compute shader. Of course, when to use compute shaders, and how to use them effectively, deserves a post of its own.

Rendering a Screen Covering Triangle in OpenGL (with no buffers)

This one has been on the backlog for ages now.  Anyway, this is an OpenGL adaptation of a clever trick that’s been around for quite a while, described in DirectX terms by Cort Stratton (@postgoodism) in “An interesting vertex shader trick” on #AltDevBlogADay.

It describes a method for rendering a triangle that covers the screen with no buffer inputs.  All vertex and texture coordinate information is generated solely from the vertexID.  Unfortunately, because OpenGL uses a right-handed coordinate system while DirectX uses a left-handed one, the same vertexID transformation used for DirectX won’t work in OpenGL.  Basically, we need to reverse the order of the triangle vertices so that they are traversed counter-clockwise, as opposed to clockwise in the original implementation. So, after a bit of experimentation I came up with the following adaptation for OpenGL:

void main()
{
	float x = -1.0 + float((gl_VertexID & 1) << 2);
	float y = -1.0 + float((gl_VertexID & 2) << 1);
	gl_Position = vec4(x, y, 0, 1);
}

This transforms the gl_VertexID as follows:

gl_VertexID=0 -> (-1,-1)
gl_VertexID=1 -> ( 3,-1)
gl_VertexID=2 -> (-1, 3)

We can easily add texture coordinates to this as well:

out vec2 texCoord;

void main()
{
	float x = -1.0 + float((gl_VertexID & 1) << 2);
	float y = -1.0 + float((gl_VertexID & 2) << 1);
	texCoord.x = (x+1.0)*0.5;
	texCoord.y = (y+1.0)*0.5;
	gl_Position = vec4(x, y, 0, 1);
}

Within that homogeneous clip space region, this provides a position varying from -1 to 1 and texture coordinates varying from 0 to 1, exactly as OpenGL expects, all without the need to create any buffers. All you have to do is make a single call to glDrawArrays and tell it to render 3 vertices:

glDrawArrays( GL_TRIANGLES, 0, 3 );
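
One practical caveat, sketched below with illustrative names: on a core profile context a vertex array object must still be bound even though no attributes are read, so the minimal host-side usage looks something like this (fullscreenTriangleProgram_id is an assumed handle to a program containing the vertex shader above):

// An empty VAO created once at startup; no attributes or buffers attached
GLuint emptyVAO = 0;
glGenVertexArrays(1, &emptyVAO);

// At draw time: bind the program and draw the single triangle
glUseProgram(fullscreenTriangleProgram_id);
glBindVertexArray(emptyVAO);
glDrawArrays(GL_TRIANGLES, 0, 3);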

This draws a triangle that looks like the following:

[Figure: glScreenSpaceTriangle]

It’s surprising how often this comes in handy. In a later post I’ll describe how to adapt this trick to efficiently access the elements of a 3D texture.  It also amuses me greatly that Iñigo Quilez’s amazing demo/presentation “Rendering Worlds With Two Triangles” could actually be renamed “Rendering Worlds With One Triangle.”

Readings on physically-based rendering

Another nice collection of links and papers too valuable to lose among all my bookmarks.  This time on physically-based rendering, put together by Kostas Anagnostou (@thinkinggamer).

List maintained and updated over at his blog Interplay of Light:

http://interplayoflight.wordpress.com/2013/12/30/readings-on-physically-based-rendering/

Layered Reflective Shadow Maps for Voxel-based Indirect Illumination

So, a lot has happened. I completed my Doctorate, almost moved to Norway, but then ended up moving to Canada instead (Victoria, BC). I now work for the Advanced Technology Group at Intel, where I was fortunate enough to have the opportunity to assist a new colleague of mine, Masamichi Sugihara (@masasugihara), with his publication “Layered Reflective Shadow Maps for Voxel-based Indirect Illumination,” which has been accepted to HPG 2014.


Check out the preprint here

We introduce a novel voxel-based algorithm that interactively simulates both diffuse and glossy single-bounce indirect illumination. Our algorithm generates high quality images similar to the reference solution while using only a fraction of the memory of previous methods. The key idea in our work is to decouple occlusion data, stored in voxels, from lighting and geometric data, encoded in a new per-light data structure called layered reflective shadow maps (LRSMs). We use voxel cone tracing for visibility determination and integrate outgoing radiance by performing lookups in a pre-filtered LRSM. Finally we demonstrate that our simple data structures are easy to implement and can be rebuilt every frame to support both dynamic lights and scenes.

Hire Masamichi!

Due to some rather shortsighted reorganization, Masamichi is currently pursuing employment opportunities that will either allow him to stay in Canada or return to Japan. If you are interested in hiring a top-notch graphics coder, please get in touch.

Realtime Global Illumination techniques collection

Martin Thomas (@0martint) has put together a very nice collection of links and papers for realtime global illumination techniques.

Check it out over at his blog:
http://extremeistan.wordpress.com/2014/05/11/realtime-global-illumination-techniques-collection/

Bindless textures can “store”

I don’t know how I missed this when Nvidia released NV_bindless_texture; I guess it’s because all the samples I saw used bindless textures to demonstrate a ridiculous number of texture reads. But while reading the recently released ARB_bindless_texture extension, I realized that they can also be used to “store,” or write, to a very large number of textures (using ARB_shader_image_load_store functionality). This finally gets rid of that extremely pesky MAX_IMAGE_UNITS limitation I’ve been complaining about. The only downside is that I can no longer run my program at home on my GTX 480.
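
As a rough host-side illustration (a hedged sketch assuming ARB_bindless_texture plus image load/store support; texture_id, program_id, the uniform name, and the GL_R32UI format are my own placeholders): instead of binding each texture to one of the few image units, you fetch a 64-bit image handle per texture, make it resident, and hand the handles to the shader.

// Get a bindless image handle for the texture and make it resident for writing
GLuint64 imageHandle = glGetImageHandleARB(texture_id,  // texture to write to
                                           0,           // mip level
                                           GL_FALSE,    // not layered
                                           0,           // layer (ignored)
                                           GL_R32UI);   // format used for imageStore in the shader
glMakeImageHandleResidentARB(imageHandle, GL_WRITE_ONLY);

// Pass the handle as a 64-bit uniform (or pack many handles into a UBO/SSBO
// to write to far more textures than the image unit limit would ever allow)
glUniformHandleui64ARB(glGetUniformLocation(program_id, "writableImage"), imageHandle);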

AtomicCounters & IndirectBufferCommands

I’ve made use of atomic counters and indirect buffers in the past, but always in the most straightforward manner: create a dedicated buffer for the atomic counter and another for the indirect command buffer, increment the counter in a shader, then write the atomic counter value into the indirect command buffer using the image API, ending up with a shader that looks something like the one below.

#version 420

layout(location = 0) in ivec3 inputBuffer;

layout(r32ui, binding = 0) uniform uimageBuffer outputBuffer;
layout(r32ui, binding = 1) uniform uimageBuffer indirectArrayCommand;
layout(       binding = 0) uniform atomic_uint  atomicCounter;

void main()
{
	// ...
	// do some stuff
	// ...

	if(someCondition == true)
	{
		//increment counter
		int index = int(atomicCounterIncrement(atomicCounter));

		//store stuff in output buffer
		imageStore(outputBuffer, index, uvec4(someStuff));
	}

	memoryBarrier();

	//Store the atomicCounter value to the count (the first element) of the DrawArraysIndirect command
	imageStore(indirectArrayCommand, 0, uvec4(atomicCounter(atomicCounter)));
}

This works fine, but one annoying thing about this approach is that it consumes an extra image unit (of the maximum of 8 available). Fortunately, it turns out to be unnecessary to create a dedicated atomic counter and synchronize it with the indirect draw command: we can simply bind the appropriate element of the indirect draw command buffer directly as the shader's atomic counter.

// This binds the count element of the Indirect Array Command Buffer directly as an atomic counter in the shader
// (no need for copy from dedicated atomic counter)
glBindBufferRange(GL_ATOMIC_COUNTER_BUFFER,        // Target buffer is the atomic counter
                  0,                               // Binding point, must match the shader
                  IndirectArrayCommandBuffer_id,   // Source buffer is the Indirect Draw Command Buffer
                  0,                               // Offset in bytes: 0 for count, sizeof(GLuint) for primCount (instances), etc...
                  sizeof(GLuint));

This allows us to get rid of the indirect buffer's image unit binding, which simplifies the shader as shown below. The main reason I've found to do this is to reduce the number of image units required by the shader, as it's very easy to hit the limit of 8.

#version 420

layout(location = 0) in ivec3 inputBuffer;

layout(r32ui, binding = 0) uniform uimageBuffer outputBuffer;
layout(       binding = 0) uniform atomic_uint  atomicCounter;

void main()
{
	// ...
	// do some stuff
	// ...

	if(someCondition == true)
	{
		//increment counter
		int index = int(atomicCounterIncrement(atomicCounter));

		//store stuff in output buffer
		imageStore(outputBuffer, index, uvec4(someStuff));
	}
}
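
For completeness, here is a hedged sketch of the host-side flow around this shader (the buffer name, initial command values, and GL_POINTS mode are illustrative assumptions of mine): the count element the shader increments is consumed directly by glDrawArraysIndirect, with no CPU read-back.

// Layout of a DrawArraysIndirect command
typedef struct {
    GLuint count;
    GLuint primCount;
    GLuint first;
    GLuint baseInstance;
} DrawArraysIndirectCommand;

// Create the indirect command buffer with count = 0 and one instance
DrawArraysIndirectCommand initialCommand = { 0, 1, 0, 0 };
GLuint IndirectArrayCommandBuffer_id = 0;
glGenBuffers(1, &IndirectArrayCommandBuffer_id);
glBindBuffer(GL_DRAW_INDIRECT_BUFFER, IndirectArrayCommandBuffer_id);
glBufferData(GL_DRAW_INDIRECT_BUFFER, sizeof(initialCommand), &initialCommand, GL_DYNAMIC_DRAW);

// Bind the count element as the shader's atomic counter (as shown above),
// run the shader, then make its writes visible to the indirect draw
glMemoryBarrier(GL_ATOMIC_COUNTER_BARRIER_BIT | GL_COMMAND_BARRIER_BIT);

// Draw whatever count the shader produced
glBindBuffer(GL_DRAW_INDIRECT_BUFFER, IndirectArrayCommandBuffer_id);
glDrawArraysIndirect(GL_POINTS, 0);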

Fixed imageAtomicAverageRGBA8

So I fixed some issues I had in my previous implementation of imageAtomicAverageRGBA8; see the previous post for an explanation of what I got wrong.  Reposting the corrected code here, and sorry to anyone who was trying to use the broken version.

void imageAtomicAverageRGBA8(layout(r32ui) coherent volatile uimage3D voxels, ivec3 coord, vec3 nextVec3)
{
	uint nextUint = packUnorm4x8(vec4(nextVec3,1.0f/255.0f));
	uint prevUint = 0;
	uint currUint;

	vec4 currVec4;

	vec3 average;
	uint count;

	//"Spin" while threads are trying to change the voxel
	while((currUint = imageAtomicCompSwap(voxels, coord, prevUint, nextUint)) != prevUint)
	{
		prevUint = currUint;					//store packed rgb average and count
		currVec4 = unpackUnorm4x8(currUint);	//unpack stored rgb average and count

		average =      currVec4.rgb;			//extract rgb average
		count   = uint(currVec4.a*255.0f);		//extract count

		//Compute the running average
		average = (average*count + nextVec3) / (count+1);

		//Pack new average and incremented count back into a uint
		nextUint = packUnorm4x8(vec4(average, (count+1)/255.0f));
	}
}
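
As a usage note (a sketch using my own names, not from the original post): the voxel volume this function spins on has to be bound as a single-channel 32-bit unsigned integer image with read/write access, since that is what imageAtomicCompSwap requires.

// Host-side binding for the uimage3D used above (voxelTexture_id is illustrative)
glBindImageTexture(0,               // image unit, must match the shader binding
                   voxelTexture_id, // 3D texture holding the packed RGB average + count
                   0,               // mip level
                   GL_TRUE,         // layered, so every slice is writable
                   0,               // layer (ignored when layered is GL_TRUE)
                   GL_READ_WRITE,   // atomics need read/write access
                   GL_R32UI);       // image atomics require a 32-bit integer format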

Anyway, original credit for this technique should go to Cyril Crassin, whose implementation in [Crassin & Greene] deftly avoided the mistakes I made by implementing his own pack/unpack functions. Still not sure why his implementation doesn’t work for me though. Note: I tried to debug these in the Nsight shader debugger and got the message “Not a debuggable shader”, so either it doesn’t support atomics (unverified), or these “spinlock” type shaders are too clever for the debugger somehow (for now).

References
[Crassin & Greene] Octree-Based Sparse Voxelization Using the GPU Hardware Rasterizer http://www.seas.upenn.edu/%7Epcozzi/OpenGLInsights/OpenGLInsights-SparseVoxelization.pdf

CUDA 5 and OpenGL Interop and Dynamic Parallelism

I seem to revisit this every time Nvidia releases a new version of CUDA.

The good news…

The old methods still work: the whole register, map, bind, etc. process I described in my now two-year-old post Writing to 3D OpenGL textures in CUDA 4.1 with 3D Surface writes behaves exactly as before.  Ideally, the new version number shouldn’t introduce any new problems…

The bad news…

Unfortunately, if you try to write to a globally scoped CUDA surface from a device-side launched kernel (i.e. a dynamic kernel), nothing will happen.  You’ll scratch your head and wonder why code that works perfectly fine when launched from the host side fails silently when launched device-side.

I only discovered the reason when I decided to read, word for word, the CUDA Dynamic Parallelism Programming Guide. On page 14, in the “Textures & Surfaces” section is this note:

NOTE: The device runtime does not support legacy module-scope (i.e. Fermi-style)
textures and surfaces within a kernel launched from the device. Module-scope (legacy)
textures may be created from the host and used in device code as for any kernel, but
may only be used by a top-level kernel (i.e. the one which is launched from the host).

So now the old way of dealing with textures is considered “legacy,” but apparently not quite deprecated yet.  Don’t use them if you plan on using dynamic parallelism.  Additional note: if you so much as call a function that attempts to perform a “Fermi-style” surface write, your kernel will fail silently, so I highly recommend removing all “Fermi-style” textures and surfaces if you plan on using dynamic parallelism.

So what’s the “new style” of textures and surfaces? Well, also on page 14 is a footnote saying:

Dynamically created texture and surface objects are an addition to the CUDA memory model
introduced with CUDA 5.0. Please see the CUDA Programming Guide for details.

So I guess they’re called “Dynamically created textures and surfaces”, which is a mouthful so I’m going to refer to them as “Kepler-style” textures and surfaces.  In the actual API they are cudaTextureObject_t and cudaSurfaceObject_t, and you can pass them around as parameters instead of having to declare them at file scope.

OpenGL Interop

So now we have two distinct methods for dealing with textures and surfaces, “Fermi-style” and “Kepler-style”, but we only know how graphics interoperability works with the old, might-as-well-be-deprecated, “Fermi-style” textures and surfaces.

And while there are some samples showing how the new “Kepler-style” textures and surfaces work (see the Bindless Texture sample), all the interop information still seems to target the old “Fermi-style” textures and surfaces.  Fortunately, there is some common ground between “Kepler-style” and “Fermi-style” textures and surfaces, and that common ground is the cudaArray.

Really, all we have to do is replace Step 6 (binding a cudaArray to a globally scoped surface) from the previous tutorial with the creation of a cudaSurfaceObject_t. That entails creating a CUDA resource description (cudaResourceDesc), setting the array portion of the cudaResourceDesc to our cudaArray, and then using that cudaResourceDesc to create our cudaSurfaceObject_t, which we can then pass to our kernels and use to write to our registered and mapped OpenGL textures.

// Create the cuda resource description
struct cudaResourceDesc resourceDescription;
memset(&resourceDescription, 0, sizeof(resourceDescription));
resourceDescription.resType = cudaResourceTypeArray;	// be sure to set the resource type to cudaResourceTypeArray
resourceDescription.res.array.array = yourCudaArray;	// this is the important bit

// Create the surface object
cudaSurfaceObject_t writableSurfaceObject = 0;
cudaCreateSurfaceObject(&writableSurfaceObject, &resourceDescription);
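
To illustrate the “pass it as a parameter” part, here is a minimal kernel sketch (the kernel name, uchar4 texel type, and bounds are my own assumptions, not from the original post):

// A kernel that receives the surface object as an argument; no module-scope
// ("Fermi-style") surface declaration is needed, so it is safe to launch
// from the device as well as the host.
__global__ void writeToVolume(cudaSurfaceObject_t surface, int width, int height, int depth)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;
	int z = blockIdx.z * blockDim.z + threadIdx.z;

	if (x < width && y < height && z < depth)
	{
		uchar4 value = make_uchar4((unsigned char)x, (unsigned char)y, (unsigned char)z, 255);
		// Note: the x coordinate of surface writes is specified in bytes
		surf3Dwrite(value, surface, x * sizeof(uchar4), y, z);
	}
}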

And that’s it! Here’s hoping the API doesn’t change again anytime soon.

CUDA 5: Enabling Dynamic Parallelism

I finally got a GPU capable of dynamic parallelism, so I decided to mess around with CUDA 5.  In the process I discovered a couple of configuration options that are required if you want to enable dynamic parallelism.  You’ll know you haven’t configured things correctly if you attempt to call a kernel from the device and get the following error message:

ptxas : fatal error : Unresolved extern function ‘cudaGetParameterBuffer’

Note: this assumes you have already selected the appropriate CUDA 5 build customizations for your project.

Open the project properties and make the following changes (a minimal example that exercises a device-side launch follows the list):

  1. Make sure “Generate Relocatable Device Code” is set to “Yes (-rdc=true)”
  2. Set “Code Generation” to “compute_35,sm_35”
  3. Finally, add “cudadevrt.lib” to the CUDA Linker’s “Additional Dependencies”
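
As a sanity check, here is a trivial sketch of my own (not from the original post) that exercises a device-side launch; without the three settings above it fails to link with exactly the “Unresolved extern function ‘cudaGetParameterBuffer’” error:

// Child kernel launched from the device
__global__ void childKernel()
{
	// ...
}

// Parent kernel performing a device-side launch; this is what requires
// compute_35/sm_35, -rdc=true, and linking against cudadevrt.lib
__global__ void parentKernel()
{
	childKernel<<<1, 32>>>();
}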