Hybrid Computational Voxelization Using the Graphics Pipeline

Posted on Updated on


Got a paper published in the Journal of Computer Graphics Techniques, see it here

This paper presents an efficient computational voxelization approach that utilizes the graphics pipeline. Our approach is hybrid in that it performs a precise gap-free computational voxelization, employs fixed-function components of the GPU, and utilizes the stages of the graphics pipeline to improve parallelism. This approach makes use of the latest features of OpenGL and fully supports both conservative and thin-surface voxelization. In contrast to other computational voxelization approaches, our approach is implemented entirely in OpenGL and achieves both triangle and fragment parallelism through its use of geometry and fragment shaders. By exploiting features of the existing graphics pipeline, we are able to rapidly compute accurate scene voxelizations in a manner that integrates well with existing OpenGL applications, is robust across many different models, and eschews the need for complex work/load-balancing schemes.

GLSL Snippet: emulating running atomic average of colors using imageAtomicCompSwap

Posted on Updated on

This is basically straight out of the [Crassin & Greene] chapter from the excellent OpenGL Insights book, which calculates a running average for a RGB voxel color and stores it into a RGBA8 texture (using the alpha component as an access count).  But for whatever reason when I dropped their GLSL snippet into my code I couldn’t get it to work correctly.  So, I attempted to rewrite it as simply as possible, and basically ended up with almost the same thing except I used the provided GLSL functions packUnorm4x8 and the unpackUnorm4x8 instead of rolling my own, so it’s ever so slightly simpler.

Anyway, I’ve verified that this (mostly) works on a GTX 480, I still get a small bit of flickering on a few voxels. Flickering has been fixed, and also works on a GTX Titan.

void imageAtomicAverageRGBA8(layout(r32ui) coherent volatile uimage3D voxels, ivec3 coord, vec3 nextVec3)
	uint nextUint = packUnorm4x8(vec4(nextVec3,1.0f/255.0f));
	uint prevUint = 0;
	uint currUint;

	vec4 currVec4;

	vec3 average;
	uint count;

	//"Spin" while threads are trying to change the voxel
	while((currUint = imageAtomicCompSwap(voxels, coord, prevUint, nextUint)) != prevUint)
		prevUint = currUint;					//store packed rgb average and count
		currVec4 = unpackUnorm4x8(currUint);	//unpack stored rgb average and count

		average =      currVec4.rgb;		//extract rgb average
		count   = uint(currVec4.a*255.0f);	//extract count

		//Compute the running average
		average = (average*count + nextVec3) / (count+1);

		//Pack new average and incremented count back into a uint
		nextUint = packUnorm4x8(vec4(average, (count+1)/255.0f));

This works by using the imageAtomicCompSwap function to effectively implement a spinlock, which “spins” until all threads trying to access the voxel are done.

Apparently, the compiler can be quite picky about how things like this are written (don’t use “break” statements), see this thread GLSL loop ‘break’ instruction not executed for more information, and I can’t guarantee this will work on Kepler or any other architectures, and it definitely works fine for both Fermi and Kepler architectures, if anyone can let me know how it works on an AMD architecture I’ll add that information here.

Edit/Update: So I had a few mistakes in my previous implementation which weren’t very noticeable in a sparsely tessellated model (like the Dwarf), but became much more noticeable as triangle density increased (like in the curtains and plants of the Sponza model).  Anyway, it turned out I hadn’t considered the effects of the packUnorm4x8 and unpackUnorm4x8 functions correctly. The packUnorm4x8 function clamps input components from 0 to 1, so the count updates were getting discarded, and obviously the average was coming out wrong.  Anyway, the solution was to divide by 255 when “packing” the count, and multiply by 255 when unpacking.  This method should work with up to 255 threads attempting to write to the same voxel location.

[Crassin & Greene] Octree-Based Sparse Voxelization Using the GPU Hardware Rasterizer http://www.seas.upenn.edu/%7Epcozzi/OpenGLInsights/OpenGLInsights-SparseVoxelization.pdf

Writing to 3-components buffers using the image API in OpenGL

Posted on Updated on

As I’ve describe in detail in another blogpost, atomic counters used in conjunction with the image API and indirect draw buffers can be an excellent and highly performant alternative/replacement to the transformFeedback mechanism (oh wait, I still haven’t published that previous blogpost… and performant is not actually a real word).

Anyway, one place where this atomic counter + image API + indirect buffers approach becomes a little cumbersome, is its slightly less than elegant handling of 3-components buffer texture formats.

In the OpenGL 4.2 spec the list of supported buffer texture formats is listed in table 3.15, while the list of supported image unit formats is listed in table 3.21.  The takeaway from comparing these tables is that the supported image unit formats generally omit 3 components formats (other than the GL_R11F_G11F_B10F format).  So how to deal with this if you have a say a GL_RGB32F, or GL_RGB32UI internal format? Well, its actually pretty easy; just bind the proxy texture as the one component version of the internal format (GL_R32F, or GL_R32UI).

glBindImageTexture(0, buffer_proxy_tex, 0, GL_TRUE, 0, GL_WRITE_ONLY, GL_R32F);

Then in the shader put a 3-component stride on the atomic counter, then store each component with its own imageStore operation.

layout(binding = 0)         uniform atomic_uint atomicCount;
layout(rgb32f, binding = 0) uniform imageBuffer positionBuffer;

void main()
  //Some other code...

  int index = 3*int(atomicCounterIncrement(atomicCount));

  imageStore(positionBuffer, index+0, vec4(x));
  imageStore(positionBuffer, index+1, vec4(y));
  imageStore(positionBuffer, index+2, vec4(z));

  //Some more code...

And that actually works great, in my experience, replacing transformFeedback with this approach has been as fast or faster despite the multiple imageStore calls.

OpenGL 4.3 released

Posted on Updated on

OpenGL 4.3 has just been released and almost instantly G-Truc (Christophe Riccio) posted another excellent of his excellent OpenGL reviews.  Additionally Mike Bailey has already made available some slides on the new Compute shaders.  And as usual nvidia already has beta drivers available.

With over 20 new extensions this is a rather large update for a point release, and while I’m extremely grateful that people have had a chance preview these extensions and provide an alternative to the pain of reading through the extensions themselves (though I will probably do that anyway), I can’t help but wish that this sort of access wasn’t so exclusive.  I find the current method of dumping a bunch of new extensions out each SIGGRAPH and shouting “Surprise!” a bit jarring, and more tragically there is no mechanism to allow the OpenGL community at large to provide feedback (unless you are a member of Khronos, I guess).

If you look at the extensions I think they lend themselves perfectly to publication on some sort of official Wiki, revision history would be managed automatically, the OpenGL community could provide feedback on the discussion page, and Khronos members would have permission to make the actual edits to the extension.  I guess what I am saying, in particular regards to the publication of new extensions, is that I wish OpenGL were a little bit more “open.”


OpenGL should support loading shader files

Posted on Updated on

OpenGL’s shader system is purely string based. Just pass it a couple of strings worth of shader code, compile, link, and go.

Its not actually that bad, but it gets progressively more annoying the more advanced your shader code gets. It precludes the convenient use of #include, because OpenGL has no idea where that string came from (which directory/file). All the sudden you find yourself terribly missing the ability to factor out some useful utility code into a header file, and just #include it wherever you need it.

Why am I griping about this now? Because I just wrote some code that runs through my files line by line looking for #include‘s, loading and substituting the correct included source into the original source. Honestly, it wasn’t that bad, but it still *feels* like a hack, and something I really shouldn’t have to do.

In reality I was only half done. I had rendered the shader error log meaningless, since the source it had compiled didn’t match the file I was working on. This meant I still had to read the error log generated by OpenGL whenever shader compilation failed, parse that, extract line numbers, and then figure out the correct line and file associated with the error message so that it would actually be meaningful. It works, but again, its annoying, and doesn’t seem like something OpenGL programmers should have to concern themselves with.

But what about ARB_shading_language_include?

Yes, I am aware there an extension allowing shader includes, but its all wrong. It is again string based, it introduces 6 OpenGL functions, and requires its own compilation step. A #include should be a 100% preprocessor operation. I don’t want to have to recompile my project just to include a file in my shader. And its not the way OpenGL is headed, prevailingly, more and more is being defined in the shader code itself rather than in the calling OpenGL program (which I think is great).

If its not that hard for me to hack into real #include support I imagine the OpenGL driver writers should be able to able handle it as well, and probably do a much better job of it.

So, instead of glShaderSource, I propose glShaderFile, which instead of taking in a string of shader source, it takes in a string of a shader file name, from which it extracts the directory such that the shader compiler knows where to look every time #include is used.  Optionally, it could take another string explicitly defining the shader include directory.  Alternately, another version of glShaderSource, say glShaderSourceDir could take a shader string and have a parameter to explicitly define the shader include directory.

Anyway, that’s my rant.  Its not a huge deal, but I actually think this simple addition would have a fairly large impact on the usability of glsl shaders.

GLSL sign function

Posted on Updated on

The GLSL sign function always seems a great way to remove some unnecessary if statements from my shaders, but I never seem to get to use it because I always need to consider zero as either positive or negative, and not its own special value.

Anyway, I just realized you can accomplish the same thing with the step function.

step(0, x)*2 - 1;

This will return -1.0 if x < 0, and 1.0 if x >= 0.

Which is not terribly readable, hence this overly verbose function

//returns -1.0 if x < 0, and 1.0 if x >= 0
float signGreaterEqualZero(float x)
	return step(0, x)*2 - 1;

Writing to 3D OpenGL textures in CUDA 4.1 with 3D surface writes

Posted on Updated on

Edit: For how this works in CUDA 5 see my new post CUDA 5 and OpenGL Interop and Dynamic Parallelism.

CUDA 4.1 has been released, and with it, and they’ve added support for writing to 3D surfaces. And thanks to some pointers from some very helpful Nvidia engineers (thanks Gernot!), I was able to write to a 3D OpenGL texture with a CUDA kernel, without having to copy any data between the host and the device.

The new toolkit has an excellent volumeFiltering sample that shows how to write to 3D surfaces, which was very helpful, but there are still a couple of gotchas to watch out for.

OpenGL interop

The sample uses cudaMalloc3DArray to directly allocate data for the 3D surfaces, so it doesn’t show the process for 3D surface writes in which the allocation has occurred by creating an OpenGL texture. Fortunately, that takes just a few extra steps.

The Steps

  1. Create an OpenGL 3D Texture
  2. Register the texture as an “image” with CUDA
  3. Map the “image” to a CUDA graphics resource
  4. Get a cudaArray pointer from the resource
  5. Pass the cudaArray pointer to the device
  6. Bind the cudaArray to a globally scoped CUDA surface
  7. Call a CUDA kernel
  8. Write to the surface using surf3Dwrite
  9. Unmap the resource
  10. Unregister the texture

Step 1: Create an OpenGL 3D texture

Hopefully most people know how to do this, just watch out that you are using a texture format that is CUDA compatible, I’m not entirely sure all which textures are supported, but this forum post shows a couple that definitely work.

glGenTextures(1, &texID);
glBindTexture(GL_TEXTURE_3D, texID);

	glTexImage3D(GL_TEXTURE_3D, 0, GL_RGBA32F, textureDim.x, textureDim.y, textureDim.z, 0, GL_RGBA, GL_FLOAT, NULL);
glBindTexture(GL_TEXTURE_3D, 0);

Step 2: Register the texture as an “image” with CUDA

This is done with cudaGraphicsGLRegisterImage, just make sure you specify the cudaGraphicsRegisterFlagsSurfaceLoadStore flag as this tell CUDA that you want to bind this image/texture to a surface reference. If you wrap this in a cutilSafeCall and you used an unsupported texture format, you’ll probably get an error message.

cutilSafeCall(cudaGraphicsGLRegisterImage(&cuda_image_resource, texID, GL_TEXTURE_3D, cudaGraphicsRegisterFlagsSurfaceLoadStore));

Step 3: Map the “image” to a CUDA graphics resource

You must map the resource with cudaGraphicsMapResources before you can get a cudaArray from it.

cutilSafeCall(cudaGraphicsMapResources(1, &cuda_image_resource, 0));

Step 4: Get a cudaArray pointer from the resource

Unlike with buffers, we won’t get a raw pointer from CUDA, instead we get a mapped cudaArray type by calling cudaGraphicsSubResourceGetMappedArray. The cudaArray pointer is only guaranteed valid while “mapped”.

cutilSafeCall(cudaGraphicsSubResourceGetMappedArray(&cuda_array, cuda_image_resource, 0, 0));

Step 5: Pass the cudaArray pointer to the device

Getting the cudaArray pointer is pretty much the last thing we do on the host side. Once we have the pointer we pass it over to the device side code (in the .cu file)

launch_kernel(cuda_image_array, textureDim);

Step 6: Bind the cudaArray to a globally scoped CUDA surface

Once we have the cudaArray pointer on the device side we bind it to the surface reference. For some reason the surface reference must be declared in the global scope. There is no cudaUnbindSurface, so don’t worry about that.

cutilSafeCall(cudaBindSurfaceToArray(surfaceWrite, cuda_array));

Step 7: Call a CUDA kernel

Now that we have a surface reference to work with we can call our CUDA kernel. Make sure not to use too large of block for your kernel launch, which is pretty easy to do if your specifying the dimension in 3D. I believe the limit is 1024 on current gen hardware. If you exceed the limit the kernel will fail to launch, you can catch this with cutilCheckMsg.

dim3 block_dim(8, 8, 8);
dim3 grid_dim(texture_dim.x/block_dim.x, texture_dim.y/block_dim.y, texture_dim.z/block_dim.z);

kernel<<< grid_dim, block_dim >>>(texture_dim);

cutilCheckMsg("kernel failed");

Step 8: Write to the surface using surf3Dwrite

Now that we’ve launched our CUDA kernel we can write to the globally scoped surface with surf3Dwrite. I got tripped up at this point because I didn’t realize that surface memory uses byte addressing. This means that the x-coordinate used to access a surface element needs to be multiplied by the byte size of the element. This is easy to miss if you’re going by the SDK sample, since it uses a 1-byte surface of unsigned char‘s.

void kernel(dim3 texture_dim)
	int x = blockIdx.x*blockDim.x + threadIdx.x;
	int y = blockIdx.y*blockDim.y + threadIdx.y;
	int z = blockIdx.z*blockDim.z + threadIdx.z;

	if(x >= texture_dim.x || y >= texture_dim.y || z >= texture_dim.z)

	float4 element = make_float4(1.0f, 1.0f, 1.0f, 1.0f);
	surf3Dwrite(element, surfaceWrite, x*sizeof(float4), y, z);

Step 9: Unmap the resource

Make sure to unmap the resource with cudaGraphicsUnmapResources before you try to do anything else with the texture, like use it in OpenGL. If you surface writes were misaligned you’ll probably get an “unknown error” when trying to unmap the resource, if it was called with cutilSafeCall.

cutilSafeCall(cudaGraphicsUnmapResources(1, &cuda_image_resource, 0));

Step 10: Unregister the texture

This is just more cleanup, be sure to unregister the texture/image resource with cudaGraphicsUnregisterResource, you probably don’t want to do this until you are done with the texture.


Conclusion & Source

This is a feature I’ve been looking forward to for quite awhile, and I’m very glad to see it implemented in the newest CUDA release. Hopefully I’ve managed to describe to process clearly enough that other people can avoid the mistakes I made. If you still having trouble make sure you’ve called cudaGLSetGLDevice. I created a very simple source example from an SDK sample, so hopefully it will work/compile if you extract it in your SDK sample directory (C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\src\).

How to use CUDA 3.0’s new Graphics Interoperability API with OpenGL

Posted on Updated on

It always bothered me that whenever I took a look at using CUDA in my graphics applications there didn’t seem to be an elegant way to use textures from OpenGL with CUDA without doing potentially expensive copies. But that is finally no longer necessary with CUDA 3.0’s new graphics interoperability API.

The only real documentation is the online doxygen generated stuff, the best place to start is at the Graphics Interoperability page. Unfortunately there is no documentation for the cudaGraphicsResource struct that all these new functions seem to use. And while there is a API agnostic cudaGraphicsUnregisterResource function, there is no function to actually register a resource unless you look in the API specific modules, which you might first assume, as I did, are deprecated, but it’s only the modules that say [DEPRECATED] real big across the top that are actually deprecated, the new non-deprecated modules simply have a link to the deprecated modules. So for OpenGL you simply have to look at the OpenGL Interoperability page to find the rest of the functions you’ll need, there are similar pages for whatever other API you would like to use.

So basically the process is to register a resource, generally a texture or a buffer via the cudaGraphicsGLRegisterImage and cudaGraphicsGLRegisterBuffer functions respectively. These functions assign a valid pointer to your cudaGraphicsResource pointer. Then create a CUDA stream with cudaStreamCreate, map your graphics resource to the CUDA stream with cudaGraphicsMapResources, and at this pointer you can recover a pointer to your texture or buffer data in your CUDA code using the cudaGraphicsSubResourceGetMappedArray and cudaGraphicsResourceGetMappedPointer functions respectively.

However, if you map a texture to a resource you can can only get a pointer to a cudaArray, which is read-only, whereas with a buffer, you can get a pointer to actual data and write to it as well, and since my entire goal in this endeavor was to use CUDA kernels to write to textures as a replacements for my clunky GLSL shaders, thats what I needed to use.

Fortunately there is a workaround called Texture Buffer Objects, which I like to thing of as simply an API to map a Pixel Buffer Object as the data of a Texture. You simply have to remember to create a CUDA stream and map your resources to the CUDA stream before calling any CUDA function that use that resource. So anyway, I’ll just post the most relevant bits of code and hopefully it’ll help someone.


//CUDA graphics resource
cudaGraphicsResource *resources[1];

GLuint pbo;
GLuint tbo_tex;

static GLuint width  = 512;
static GLuint height = 512;

void init_cuda()
//Create your Pixel Buffer Object
glGenBuffers(1, &pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER_ARB, width*height*sizeof(float4), NULL, GL_DYNAMIC_DRAW);

//Create your Texture
glGenTextures(1, &_tbo_tex);
glBindTexture(GL_TEXTURE_BUFFER_EXT, tbo_tex); //bind Texture

//Attach Pixel Buffer Object to the Texture

glBindTexture(GL_TEXTURE_BUFFER_EXT, 0); //unbind Texture

//Setup CUDA

//Register Pixel Buffer Object as CUDA graphics resource
cudaGraphicsGLRegisterBuffer(resources, pbo, cudaGraphicsMapFlagsNone);

cudaStream_t cuda_stream;

//Create CUDA stream

//Map the graphics resource to the CUDA stream
cudaGraphicsMapResources(1, resources, cuda_stream);

//Call CUDA function
map_texture(resources[0], width, height);

//Unmap the CUDA stream
cudaGraphicsUnmapResources(1, resources, cuda_stream);

//Destroy the CUDA stream

void cuda_test()    //Call this in your draw loop to animate
dim3 blockSize(16, 16);
dim3 gridSize(width / blockSize.x, height / blockSize.y);

cudaStream_t cuda_stream;

//Create CUDA stream

//Map the graphics resource to the CUDA stream
cudaGraphicsMapResources(1, resources, cuda_stream);

//Call CUDA function
test_cuda(width, height, blockSize, gridSize, cuda_stream);

//Unmap the CUDA stream
cudaGraphicsUnmapResources(1, resources, cuda_stream);

//Destroy the CUDA stream


#ifndef _TEST_CU_
#define _TEST_CU_


float4 *cuda_data = NULL;

extern "C" void map_texture(cudaGraphicsResource *resource, int w, int h)
size_t size;
cudaGraphicsResourceGetMappedPointer((void **)(&cuda_data), &size, resource);

__global__ void test_kernel(float4 *cuda_data, int width, int height, int frame_counter)
uint x = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
uint y = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
uint i = __umul24(y, width) + x;

if((x < width) && (y < height))
//Create a checkerboard pattern with 32x32 pixel squares
cuda_data[i] = ((((x+frame_counter)/32 + (y+frame_counter)/32 ) & (int)(0x1)) == 0) ? make_float4(1.0, 1.0, 1.0, 1.0) : make_float4(0.0, 0.0, 0.0, 1.0);

static int frame_counter = 0;

extern "C" void test_cuda(int width, int height, dim3 blockSize, dim3 gridSize, cudaStream_t &cuda_stream)
test_kernel<<>>(cuda_data, width, height, frame_counter);


Since there is no fixed function functionality for drawing texture buffer objects you must write a shader for displaying your buffer, which is pretty easy to do as seen below.



varying vec2 st;

void main()
st = gl_MultiTexCoord0.xy;
gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;


#version 120
#extension GL_EXT_gpu_shader4 : enable

varying vec2 st;

uniform samplerBuffer buffer;
uniform ivec2 dim;

void main()
int i = int(st.x * float(dim.x));
int j = int(st.y * float(dim.y));

gl_FragData[0] = texelFetch(buffer, i+dim.x*j);

Of course, there is no reason to display the buffer if your just doing computations on it, and there is no reason you can’t use this technique on Vertex or other buffers. And finally, I don’t have much CUDA experience so I can’t guarantee that I’m not doing anything suboptimal in the above code. I would also recommend wrapping all the CUDA functions cutilSafeCall functions.

OpenCV and OpenGL

Posted on Updated on

So I started using OpenCV for my Computer Vision class, but I didn’t want to give up my OpenGL based framework, and since I had such a hard time finding any hints on how to convert OpenCV Images to OpenGL textures, I’m going to post the technique I used here.  What I did eventually find was this, which didn’t immediately work for me as written.

So OpenCV images are stored in these IplImage structs, and they’re actually pretty great because they load just about anything

IplImage *image = cvLoadImage("filename");

So after you create you OpenCV Image, how do you get an OpenGL texture.  Well, OpenCV images are stored as unsigned bytes so so you’re going to want your texturetype to be GL_UNSIGNED_BYTE, and most of the other parameters  to pass to glTexImage2D come right out of the IplImage struct, the only thing to be wary of is swapping the RGB colors, if you don’t, red will look blue, and blue will look red.  So be sure to set internalFormat to GL_RGB, and format to GL_BGR like so

glTexImage2D(GL_TEXTURE_2D,        //target
             0,                    //level
             GL_RGB,               //internalFormat
             image->width,         //width
             image->height,        //height
             0,                    //border
             GL_BGR,               //format
             GL_UNSIGNED_BYTE,     //type
             image->imageData);    //pointer to image data

Of course, this only works if your Image is color, if your Image is grayscale your going to want to change GL_BGR to GL_LUMINANCE

glTexImage2D(GL_TEXTURE_2D,        //target
             0,                    //level
             GL_RGB,               //internalFormat
             image->width,         //width
             image->height,        //height
             0,                    //border
             GL_LUMINANCE,         //format
             GL_UNSIGNED_BYTE,     //type
             image->imageData);    //pointer to image data

And you could probably change the internal format of the OpenGL texture as well, but I don’t presume to know what you want to do with this.  And one more snippet for good measure, this time loading a color image and converting it to a gray scale image all in OpenCV.

IplImage *color_image = cvLoadImage("filename");
IplImage *grayscale = cvCreateImage(cvGetSize(color_image), 8, 1);
cvCvtColor(color_image, grayscale, CV_BGR2GRAY);