Month: December 2011

Writing to 3D OpenGL textures in CUDA 4.1 with 3D surface writes

Posted on Updated on

Edit: For how this works in CUDA 5 see my new post CUDA 5 and OpenGL Interop and Dynamic Parallelism.

CUDA 4.1 has been released, and with it, and they’ve added support for writing to 3D surfaces. And thanks to some pointers from some very helpful Nvidia engineers (thanks Gernot!), I was able to write to a 3D OpenGL texture with a CUDA kernel, without having to copy any data between the host and the device.

The new toolkit has an excellent volumeFiltering sample that shows how to write to 3D surfaces, which was very helpful, but there are still a couple of gotchas to watch out for.

OpenGL interop

The sample uses cudaMalloc3DArray to directly allocate data for the 3D surfaces, so it doesn’t show the process for 3D surface writes in which the allocation has occurred by creating an OpenGL texture. Fortunately, that takes just a few extra steps.

The Steps

  1. Create an OpenGL 3D Texture
  2. Register the texture as an “image” with CUDA
  3. Map the “image” to a CUDA graphics resource
  4. Get a cudaArray pointer from the resource
  5. Pass the cudaArray pointer to the device
  6. Bind the cudaArray to a globally scoped CUDA surface
  7. Call a CUDA kernel
  8. Write to the surface using surf3Dwrite
  9. Unmap the resource
  10. Unregister the texture

Step 1: Create an OpenGL 3D texture

Hopefully most people know how to do this, just watch out that you are using a texture format that is CUDA compatible, I’m not entirely sure all which textures are supported, but this forum post shows a couple that definitely work.

glGenTextures(1, &texID);
glBindTexture(GL_TEXTURE_3D, texID);
{
	glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_MIN_FILTER, GL_NEAREST        );
	glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_MAG_FILTER, GL_NEAREST        );
	glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_WRAP_S,     GL_CLAMP_TO_BORDER);
	glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_WRAP_T,     GL_CLAMP_TO_BORDER);
	glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_WRAP_T,     GL_CLAMP_TO_BORDER);

	glTexImage3D(GL_TEXTURE_3D, 0, GL_RGBA32F, textureDim.x, textureDim.y, textureDim.z, 0, GL_RGBA, GL_FLOAT, NULL);
}
glBindTexture(GL_TEXTURE_3D, 0);

Step 2: Register the texture as an “image” with CUDA

This is done with cudaGraphicsGLRegisterImage, just make sure you specify the cudaGraphicsRegisterFlagsSurfaceLoadStore flag as this tell CUDA that you want to bind this image/texture to a surface reference. If you wrap this in a cutilSafeCall and you used an unsupported texture format, you’ll probably get an error message.

cutilSafeCall(cudaGraphicsGLRegisterImage(&cuda_image_resource, texID, GL_TEXTURE_3D, cudaGraphicsRegisterFlagsSurfaceLoadStore));

Step 3: Map the “image” to a CUDA graphics resource

You must map the resource with cudaGraphicsMapResources before you can get a cudaArray from it.

cutilSafeCall(cudaGraphicsMapResources(1, &cuda_image_resource, 0));

Step 4: Get a cudaArray pointer from the resource

Unlike with buffers, we won’t get a raw pointer from CUDA, instead we get a mapped cudaArray type by calling cudaGraphicsSubResourceGetMappedArray. The cudaArray pointer is only guaranteed valid while “mapped”.

cutilSafeCall(cudaGraphicsSubResourceGetMappedArray(&cuda_array, cuda_image_resource, 0, 0));

Step 5: Pass the cudaArray pointer to the device

Getting the cudaArray pointer is pretty much the last thing we do on the host side. Once we have the pointer we pass it over to the device side code (in the .cu file)

launch_kernel(cuda_image_array, textureDim);

Step 6: Bind the cudaArray to a globally scoped CUDA surface

Once we have the cudaArray pointer on the device side we bind it to the surface reference. For some reason the surface reference must be declared in the global scope. There is no cudaUnbindSurface, so don’t worry about that.

cutilSafeCall(cudaBindSurfaceToArray(surfaceWrite, cuda_array));

Step 7: Call a CUDA kernel

Now that we have a surface reference to work with we can call our CUDA kernel. Make sure not to use too large of block for your kernel launch, which is pretty easy to do if your specifying the dimension in 3D. I believe the limit is 1024 on current gen hardware. If you exceed the limit the kernel will fail to launch, you can catch this with cutilCheckMsg.

dim3 block_dim(8, 8, 8);
dim3 grid_dim(texture_dim.x/block_dim.x, texture_dim.y/block_dim.y, texture_dim.z/block_dim.z);

kernel<<< grid_dim, block_dim >>>(texture_dim);

cutilCheckMsg("kernel failed");

Step 8: Write to the surface using surf3Dwrite

Now that we’ve launched our CUDA kernel we can write to the globally scoped surface with surf3Dwrite. I got tripped up at this point because I didn’t realize that surface memory uses byte addressing. This means that the x-coordinate used to access a surface element needs to be multiplied by the byte size of the element. This is easy to miss if you’re going by the SDK sample, since it uses a 1-byte surface of unsigned char‘s.

__global__
void kernel(dim3 texture_dim)
{
	int x = blockIdx.x*blockDim.x + threadIdx.x;
	int y = blockIdx.y*blockDim.y + threadIdx.y;
	int z = blockIdx.z*blockDim.z + threadIdx.z;

	if(x >= texture_dim.x || y >= texture_dim.y || z >= texture_dim.z)
	{
		return;
	}

	float4 element = make_float4(1.0f, 1.0f, 1.0f, 1.0f);
	surf3Dwrite(element, surfaceWrite, x*sizeof(float4), y, z);
}

Step 9: Unmap the resource

Make sure to unmap the resource with cudaGraphicsUnmapResources before you try to do anything else with the texture, like use it in OpenGL. If you surface writes were misaligned you’ll probably get an “unknown error” when trying to unmap the resource, if it was called with cutilSafeCall.

cutilSafeCall(cudaGraphicsUnmapResources(1, &cuda_image_resource, 0));

Step 10: Unregister the texture

This is just more cleanup, be sure to unregister the texture/image resource with cudaGraphicsUnregisterResource, you probably don’t want to do this until you are done with the texture.

cutilSafeCall(cudaGraphicsUnregisterResource(cuda_image_resource));

Conclusion & Source

This is a feature I’ve been looking forward to for quite awhile, and I’m very glad to see it implemented in the newest CUDA release. Hopefully I’ve managed to describe to process clearly enough that other people can avoid the mistakes I made. If you still having trouble make sure you’ve called cudaGLSetGLDevice. I created a very simple source example from an SDK sample, so hopefully it will work/compile if you extract it in your SDK sample directory (C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\src\).