CUDA 5 and OpenGL Interop and Dynamic Parallelism

I seem to revisit this every time Nvidia releases a new version of CUDA.

The good news…

The old methods still work: the whole register, map, bind, etc. process I described in my now two-year-old post Writing to 3D OpenGL textures in CUDA 4.1 with 3D Surface writes still works.  Ideally, the new version number shouldn’t introduce any new problems…

The bad news…

Unfortunately, if you try to write to a globally scoped CUDA surface from a device-side launched kernel (i.e. a dynamic kernel), nothing will happen.  You’ll scratch your head and wonder why code that works perfectly fine when launched from the host side fails silently when launched device-side.

I only discovered the reason when I decided to read, word for word, the CUDA Dynamic Parallelism Programming Guide. On page 14, in the “Textures & Surfaces” section, is this note:

NOTE: The device runtime does not support legacy module-scope (i.e. Fermi-style)
textures and surfaces within a kernel launched from the device. Module-scope (legacy)
textures may be created from the host and used in device code as for any kernel, but
may only be used by a top-level kernel (i.e. the one which is launched from the host).

So now the old way of dealing with textures is considered “legacy” but apparently not quite deprecated yet.  Don’t use them if you plan on using dynamic parallelism.  Additional note: if you so much as call a function that attempts to perform a “Fermi-style” surface write, your kernel will fail silently, so I highly recommend removing all “Fermi-style” textures and surfaces from any code that uses dynamic parallelism.

So what’s the “new style” of textures and surfaces? Also on page 14 is a footnote saying:

Dynamically created texture and surface objects are an addition to the CUDA memory model
introduced with CUDA 5.0. Please see the CUDA Programming Guide for details.

So I guess they’re called “dynamically created textures and surfaces”, which is a mouthful, so I’m going to refer to them as “Kepler-style” textures and surfaces.  In the actual API they are cudaTextureObject_t and cudaSurfaceObject_t, and you can pass them around as parameters instead of having to declare them at file scope.

OpenGL Interop

So now we have two distinct methods for dealing with textures and surfaces, “Fermi-style” and “Kepler-style”, but we only know how graphics interoperability works with the old, might-as-well-be-deprecated, “Fermi-style” textures and surfaces.

And while there are some samples showing how the new “Kepler-style” textures and surfaces work (see the Bindless Texture sample), all the interop information still seems to target the old “Fermi-style” textures and surfaces.  Fortunately, there is some common ground between “Kepler-style” and “Fermi-style” textures and surfaces, and that common ground is the cudaArray.

Really, all we have to do is replace Step 6 (binding a cudaArray to a globally scoped surface) from the previous tutorial with the creation of a cudaSurfaceObject_t. That entails creating a CUDA resource description (cudaResourceDesc), setting its array member to our cudaArray, and then using the cudaResourceDesc to create our cudaSurfaceObject_t, which we can pass to our kernels and use to write to our registered and mapped OpenGL textures.

// Create the CUDA resource description
struct cudaResourceDesc resourceDesc;
memset(&resourceDesc, 0, sizeof(resourceDesc));
resourceDesc.resType = cudaResourceTypeArray;	// be sure to set the resource type to cudaResourceTypeArray
resourceDesc.res.array.array = yourCudaArray;	// this is the important bit

// Create the surface object
cudaSurfaceObject_t writableSurfaceObject = 0;
cudaCreateSurfaceObject(&writableSurfaceObject, &resourceDesc);
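
The surface object is just another kernel parameter, which is exactly what makes it usable from device-side launches. Here’s a minimal sketch of the kernel side (the names are mine, not from any SDK sample), assuming the same float4 3D texture as in the earlier post; when you’re finished with the object, clean it up with cudaDestroySurfaceObject.

__global__ void writeToTexture(cudaSurfaceObject_t surface, dim3 texture_dim)
{
	int x = blockIdx.x*blockDim.x + threadIdx.x;
	int y = blockIdx.y*blockDim.y + threadIdx.y;
	int z = blockIdx.z*blockDim.z + threadIdx.z;

	if(x >= texture_dim.x || y >= texture_dim.y || z >= texture_dim.z)
	{
		return;
	}

	// surface memory is byte addressed, so scale x by the element size
	float4 element = make_float4(1.0f, 1.0f, 1.0f, 1.0f);
	surf3Dwrite(element, surface, x*sizeof(float4), y, z);
}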

And that’s it! Here’s hoping the API doesn’t change again anytime soon.

CUDA 5: Enabling Dynamic Parallelism

I finally got a GPU capable of dynamic parallelism, so I decided to mess around with CUDA 5.  But I discovered a couple of configuration options that are required if you want to enable dynamic parallelism.  You’ll know you haven’t configured things correctly if you attempt to call a kernel from the device and get the following error message:

ptxas : fatal error : Unresolved extern function ‘cudaGetParameterBuffer’

Note: this assumes you have already selected the appropriate CUDA 5 build customizations for your project.

Open the project properties, then:

  1. Make sure to set “Generate Relocatable Device Code” to “Yes (-rdc=true)”
  2. Set “Code Generation” to “compute_35,sm_35”
  3. Finally, add “cudadevrt.lib” to the CUDA Linker’s “Additional Dependencies”
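
With those settings in place, a trivial device-side launch should now compile and link. Here’s a minimal sketch of my own (not from an SDK sample) for sanity-checking the configuration:

// Child kernel, launched from the device rather than the host
__global__ void childKernel()
{
	// per-thread work goes here
}

// Parent kernel, launched from the host as usual
__global__ void parentKernel()
{
	// this device-side launch is what requires -rdc=true, compute_35/sm_35,
	// and cudadevrt.lib; without them you get the ptxas error above
	childKernel<<< 1, 32 >>>();
	cudaDeviceSynchronize();	// wait for the child grid to finish
}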

Writing to 3D OpenGL textures in CUDA 4.1 with 3D surface writes

Edit: For how this works in CUDA 5 see my new post CUDA 5 and OpenGL Interop and Dynamic Parallelism.

CUDA 4.1 has been released, and with it comes support for writing to 3D surfaces. Thanks to some pointers from some very helpful Nvidia engineers (thanks Gernot!), I was able to write to a 3D OpenGL texture with a CUDA kernel, without having to copy any data between the host and the device.

The new toolkit has an excellent volumeFiltering sample that shows how to write to 3D surfaces, which was very helpful, but there are still a couple of gotchas to watch out for.

OpenGL interop

The sample uses cudaMalloc3DArray to directly allocate data for the 3D surfaces, so it doesn’t show the process for 3D surface writes in which the allocation has occurred by creating an OpenGL texture. Fortunately, that takes just a few extra steps.

The Steps

  1. Create an OpenGL 3D Texture
  2. Register the texture as an “image” with CUDA
  3. Map the “image” to a CUDA graphics resource
  4. Get a cudaArray pointer from the resource
  5. Pass the cudaArray pointer to the device
  6. Bind the cudaArray to a globally scoped CUDA surface
  7. Call a CUDA kernel
  8. Write to the surface using surf3Dwrite
  9. Unmap the resource
  10. Unregister the texture

Step 1: Create an OpenGL 3D texture

Hopefully most people know how to do this. Just watch out that you are using a texture format that is CUDA compatible; I’m not entirely sure which texture formats are supported, but this forum post shows a couple that definitely work.

glGenTextures(1, &texID);
glBindTexture(GL_TEXTURE_3D, texID);
{
	glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_MIN_FILTER, GL_NEAREST        );
	glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_MAG_FILTER, GL_NEAREST        );
	glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_WRAP_S,     GL_CLAMP_TO_BORDER);
	glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_WRAP_T,     GL_CLAMP_TO_BORDER);
	glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_WRAP_R,     GL_CLAMP_TO_BORDER);

	glTexImage3D(GL_TEXTURE_3D, 0, GL_RGBA32F, textureDim.x, textureDim.y, textureDim.z, 0, GL_RGBA, GL_FLOAT, NULL);
}
glBindTexture(GL_TEXTURE_3D, 0);

Step 2: Register the texture as an “image” with CUDA

This is done with cudaGraphicsGLRegisterImage; just make sure you specify the cudaGraphicsRegisterFlagsSurfaceLoadStore flag, as this tells CUDA that you want to bind this image/texture to a surface reference. If you wrap this in a cutilSafeCall and you used an unsupported texture format, you’ll probably get an error message.

cutilSafeCall(cudaGraphicsGLRegisterImage(&cuda_image_resource, texID, GL_TEXTURE_3D, cudaGraphicsRegisterFlagsSurfaceLoadStore));

Step 3: Map the “image” to a CUDA graphics resource

You must map the resource with cudaGraphicsMapResources before you can get a cudaArray from it.

cutilSafeCall(cudaGraphicsMapResources(1, &cuda_image_resource, 0));

Step 4: Get a cudaArray pointer from the resource

Unlike with buffers, we won’t get a raw pointer from CUDA, instead we get a mapped cudaArray type by calling cudaGraphicsSubResourceGetMappedArray. The cudaArray pointer is only guaranteed valid while “mapped”.

cutilSafeCall(cudaGraphicsSubResourceGetMappedArray(&cuda_array, cuda_image_resource, 0, 0));

Step 5: Pass the cudaArray pointer to the device

Getting the cudaArray pointer is pretty much the last thing we do on the host side. Once we have the pointer, we pass it over to the device-side code (in the .cu file):

launch_kernel(cuda_array, textureDim);

Step 6: Bind the cudaArray to a globally scoped CUDA surface

Once we have the cudaArray pointer on the device side, we bind it to the surface reference. For some reason the surface reference must be declared at global scope. There is no cudaUnbindSurface, so don’t worry about that.

cutilSafeCall(cudaBindSurfaceToArray(surfaceWrite, cuda_array));
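
For reference, the surfaceWrite reference being bound here lives at file scope in the .cu file; its declaration (not shown in the snippet above) looks something like this:

// globally scoped surface reference for 3D surface writes
surface<void, cudaSurfaceType3D> surfaceWrite;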

Step 7: Call a CUDA kernel

Now that we have a surface reference to work with, we can call our CUDA kernel. Make sure not to use too large of a block for your kernel launch, which is pretty easy to do if you’re specifying the dimensions in 3D. I believe the limit is 1024 threads per block on current gen hardware. If you exceed the limit the kernel will fail to launch, which you can catch with cutilCheckMsg.

dim3 block_dim(8, 8, 8);
dim3 grid_dim(texture_dim.x/block_dim.x, texture_dim.y/block_dim.y, texture_dim.z/block_dim.z);

kernel<<< grid_dim, block_dim >>>(texture_dim);

cutilCheckMsg("kernel failed");

Step 8: Write to the surface using surf3Dwrite

Now that we’ve launched our CUDA kernel we can write to the globally scoped surface with surf3Dwrite. I got tripped up at this point because I didn’t realize that surface memory uses byte addressing. This means that the x-coordinate used to access a surface element needs to be multiplied by the byte size of the element. This is easy to miss if you’re going by the SDK sample, since it uses a 1-byte surface of unsigned chars.

__global__
void kernel(dim3 texture_dim)
{
	int x = blockIdx.x*blockDim.x + threadIdx.x;
	int y = blockIdx.y*blockDim.y + threadIdx.y;
	int z = blockIdx.z*blockDim.z + threadIdx.z;

	if(x >= texture_dim.x || y >= texture_dim.y || z >= texture_dim.z)
	{
		return;
	}

	float4 element = make_float4(1.0f, 1.0f, 1.0f, 1.0f);
	surf3Dwrite(element, surfaceWrite, x*sizeof(float4), y, z);
}

Step 9: Unmap the resource

Make sure to unmap the resource with cudaGraphicsUnmapResources before you try to do anything else with the texture, like use it in OpenGL. If your surface writes were misaligned, you’ll probably get an “unknown error” when trying to unmap the resource, if it was wrapped in a cutilSafeCall.

cutilSafeCall(cudaGraphicsUnmapResources(1, &cuda_image_resource, 0));

Step 10: Unregister the texture

This is just more cleanup: be sure to unregister the texture/image resource with cudaGraphicsUnregisterResource. You probably don’t want to do this until you are done with the texture.

cutilSafeCall(cudaGraphicsUnregisterResource(cuda_image_resource));

Conclusion & Source

This is a feature I’ve been looking forward to for quite awhile, and I’m very glad to see it implemented in the newest CUDA release. Hopefully I’ve managed to describe the process clearly enough that other people can avoid the mistakes I made. If you’re still having trouble, make sure you’ve called cudaGLSetGLDevice. I created a very simple source example from an SDK sample, so hopefully it will work/compile if you extract it in your SDK sample directory (C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\src\).

How to use CUDA 3.0’s new Graphics Interoperability API with OpenGL

It always bothered me that whenever I took a look at using CUDA in my graphics applications there didn’t seem to be an elegant way to use textures from OpenGL with CUDA without doing potentially expensive copies. But that is finally no longer necessary with CUDA 3.0’s new graphics interoperability API.

The only real documentation is the online doxygen-generated stuff; the best place to start is the Graphics Interoperability page. Unfortunately, there is no documentation for the cudaGraphicsResource struct that all these new functions use. And while there is an API-agnostic cudaGraphicsUnregisterResource function, there is no function to actually register a resource unless you look in the API-specific modules, which you might first assume, as I did, are deprecated. But it’s only the modules with [DEPRECATED] written in big letters across the top that are actually deprecated; the new non-deprecated modules simply link to the deprecated ones. So for OpenGL you simply have to look at the OpenGL Interoperability page to find the rest of the functions you’ll need; there are similar pages for whatever other API you would like to use.

So basically the process is to register a resource, generally a texture or a buffer, via the cudaGraphicsGLRegisterImage and cudaGraphicsGLRegisterBuffer functions respectively. These functions assign a valid pointer to your cudaGraphicsResource pointer. Then create a CUDA stream with cudaStreamCreate and map your graphics resource to the CUDA stream with cudaGraphicsMapResources; at this point you can recover a pointer to your texture or buffer data in your CUDA code using the cudaGraphicsSubResourceGetMappedArray and cudaGraphicsResourceGetMappedPointer functions respectively.

However, if you map a texture to a resource, you can only get a pointer to a cudaArray, which is read-only, whereas with a buffer you can get a pointer to the actual data and write to it as well. And since my entire goal in this endeavor was to use CUDA kernels to write to textures as a replacement for my clunky GLSL shaders, that’s what I needed to use.

Fortunately, there is a workaround called Texture Buffer Objects, which I like to think of as simply an API to map a Pixel Buffer Object as the data of a texture. You simply have to remember to create a CUDA stream and map your resources to it before calling any CUDA function that uses those resources. So anyway, I’ll just post the most relevant bits of code and hopefully it’ll help someone.

Test.cpp


//CUDA graphics resource
cudaGraphicsResource *resources[1];

GLuint pbo;
GLuint tbo_tex;

static GLuint width  = 512;
static GLuint height = 512;

void init_cuda()
{
	//Create your Pixel Buffer Object
	glGenBuffers(1, &pbo);
	glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
	glBufferData(GL_PIXEL_UNPACK_BUFFER_ARB, width*height*sizeof(float4), NULL, GL_DYNAMIC_DRAW);
	glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);

	//Create your Texture
	glGenTextures(1, &tbo_tex);
	glBindTexture(GL_TEXTURE_BUFFER_EXT, tbo_tex); //bind Texture

	//Attach Pixel Buffer Object to the Texture
	glTexBufferEXT(GL_TEXTURE_BUFFER_EXT, GL_RGBA32F_ARB, pbo);

	glBindTexture(GL_TEXTURE_BUFFER_EXT, 0); //unbind Texture

	//Setup CUDA
	cudaSetDevice(cutGetMaxGflopsDeviceId());
	cudaGLSetGLDevice(cutGetMaxGflopsDeviceId());

	//Register Pixel Buffer Object as CUDA graphics resource
	cudaGraphicsGLRegisterBuffer(resources, pbo, cudaGraphicsMapFlagsNone);

	cudaStream_t cuda_stream;

	//Create CUDA stream
	cudaStreamCreate(&cuda_stream);

	//Map the graphics resource to the CUDA stream
	cudaGraphicsMapResources(1, resources, cuda_stream);

	//Call CUDA function
	map_texture(resources[0], width, height);

	//Unmap the graphics resource
	cudaGraphicsUnmapResources(1, resources, cuda_stream);

	//Destroy the CUDA stream
	cudaStreamDestroy(cuda_stream);
}

void cuda_test()    //Call this in your draw loop to animate
{
	dim3 blockSize(16, 16);
	dim3 gridSize(width / blockSize.x, height / blockSize.y);

	cudaStream_t cuda_stream;

	//Create CUDA stream
	cudaStreamCreate(&cuda_stream);

	//Map the graphics resource to the CUDA stream
	cudaGraphicsMapResources(1, resources, cuda_stream);

	//Call CUDA function
	test_cuda(width, height, blockSize, gridSize, cuda_stream);

	//Unmap the graphics resource
	cudaGraphicsUnmapResources(1, resources, cuda_stream);

	//Destroy the CUDA stream
	cudaStreamDestroy(cuda_stream);
}

Test.cu

#ifndef _TEST_CU_
#define _TEST_CU_

#include <cuda_runtime.h>
#include <vector_types.h>
#include <vector_functions.h>

float4 *cuda_data = NULL;

extern "C" void map_texture(cudaGraphicsResource *resource, int w, int h)
{
	size_t size;
	cudaGraphicsResourceGetMappedPointer((void **)(&cuda_data), &size, resource);
}

__global__ void test_kernel(float4 *cuda_data, int width, int height, int frame_counter)
{
	unsigned int x = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
	unsigned int y = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
	unsigned int i = __umul24(y, width) + x;

	if((x < width) && (y < height))
	{
		//Create a checkerboard pattern with 32x32 pixel squares
		cuda_data[i] = ((((x+frame_counter)/32 + (y+frame_counter)/32 ) & (int)(0x1)) == 0) ? make_float4(1.0, 1.0, 1.0, 1.0) : make_float4(0.0, 0.0, 0.0, 1.0);
	}
}

static int frame_counter = 0;

extern "C" void test_cuda(int width, int height, dim3 blockSize, dim3 gridSize, cudaStream_t &cuda_stream)
{
	test_kernel<<< gridSize, blockSize, 0, cuda_stream >>>(cuda_data, width, height, frame_counter);
	frame_counter++;
}

#endif

Since there is no fixed-function way to draw texture buffer objects, you must write a shader for displaying your buffer, which is pretty easy to do, as seen below.

tbo_shader.glsl

///////////////////////////////////////////////////////////////////////////////
VERTEX
///////////////////////////////////////////////////////////////////////////////

varying vec2 st;

void main()
{
	st = gl_MultiTexCoord0.xy;
	gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;
}

///////////////////////////////////////////////////////////////////////////////
FRAGMENT
///////////////////////////////////////////////////////////////////////////////

#version 120
#extension GL_EXT_gpu_shader4 : enable

varying vec2 st;

uniform samplerBuffer buffer;
uniform ivec2 dim;

void main()
{
	int i = int(st.x * float(dim.x));
	int j = int(st.y * float(dim.y));

	gl_FragData[0] = texelFetch(buffer, i+dim.x*j);
}

Of course, there is no reason to display the buffer if you’re just doing computations on it, and there is no reason you can’t use this technique on vertex or other buffers. And finally, I don’t have much CUDA experience, so I can’t guarantee that I’m not doing anything suboptimal in the above code. I would also recommend wrapping all the CUDA functions in cutilSafeCall.