Week 6

CUDA and OpenCL




We are going to be using OpenCL for the second project in the class, but CUDA may be a good alternative for project 3.


Getting Started with CUDA


The best way to get started is to download CUDA from http://www.nvidia.com/object/cuda_get.html

You will probably want to create your own projects within the SDK hierarchy. Inside the projects directory you will find a variety of examples. You can 'cp -r' one of the existing projects, say simpleGL, to create a new project. In the new directory you should then rename the .cu and _kernel.cu files, update the kernel #include line in the .cu file, and update the EXECUTABLE and CUFILES lines in the Makefile. When you compile, the executable will appear in the bin/linux/release directory (or the equivalent for your platform).
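
For example, assuming the SDK is installed in ~/NVIDIA_CUDA_SDK and you are cloning simpleGL into a new project called myproject (both names are just placeholders), the steps look roughly like this:

cd ~/NVIDIA_CUDA_SDK/projects
cp -r simpleGL myproject
cd myproject
mv simpleGL.cu myproject.cu
mv simpleGL_kernel.cu myproject_kernel.cu
# edit myproject.cu so it does #include "myproject_kernel.cu"
# edit the Makefile: EXECUTABLE := myproject and CUFILES := myproject.cu
make
../../bin/linux/release/myproject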


Here are some notes from James: http://www.evl.uic.edu/sjames/cs525/cuda.html

If you want to build this on Windows with Visual Studio, here is a web page that describes the process: http://llpanorama.wordpress.com/2008/05/21/my-first-cuda-program/




You can use the deviceQuery program that comes with the SDK to see the values for your card:

./deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)

                                                2008 MacBook Pro    2009 Mac Mini    2010 MacBook Pro

Device 0:                                       "GeForce 8600M GT"  "GeForce 9400"   "GeForce GT 330M"
  CUDA Driver Version / Runtime Version:         2.30 / 2.30        2.30 / 2.30      4.10 / 4.0
  CUDA Capability Major/Minor revision number:   1.1                1.1              1.2
  Total amount of global memory:                 128M bytes         256M bytes       512M bytes
  Number of multiprocessors:                     4                  2                6
  Number of cores:                               32                 16               48
  Total amount of constant memory:               64K bytes          64K bytes        64K bytes
  Total amount of shared memory per block:       16K bytes          16K bytes        16K bytes
  Total # of registers available per block:      8192               8192             16384
  Warp size:                                     32                 32               32
  Maximum number of threads per block:           512                512              512
  Maximum sizes of each dimension of a block:    512 x 512 x 64     512 x 512 x 64   512 x 512 x 64
  Maximum sizes of each dimension of a grid:     64K x 64K x 1      64K x 64K x 1    64K x 64K x 1
  Maximum memory pitch:                          256K bytes         256K bytes       2G bytes
  Texture alignment:                             256 bytes          256 bytes        256 bytes
  Clock rate:                                    0.94 GHz           1.10 GHz         1.10 GHz
  Concurrent copy and execution:                 Yes                No               Yes
  Run time limit on kernels:                     Yes                Yes              Yes
  Integrated:                                    No                 Yes              No
  Support host page-locked memory mapping:       No                 Yes              Yes
  Compute mode:                                  Default            Default          Default

Test PASSED




If you don't have a compatible card you will see:

Device 0: "Device Emulation (CPU)"


OpenCL has a similar device query program; here is what it reported on my 2010 MacBook Pro:


  CL_DEVICE_NAME:                     Intel(R) Core(TM) i7 CPU M 620  @ 2.67GHz        GeForce GT 330M

  CL_DEVICE_VENDOR:                   Intel                                            NVIDIA
  CL_DRIVER_VERSION:                  1.1                                              CLH 1.0
  CL_DEVICE_VERSION:                  OpenCL 1.1                                       OpenCL 1.0
  CL_DEVICE_TYPE:                     CL_DEVICE_TYPE_CPU                               CL_DEVICE_TYPE_GPU
  CL_DEVICE_MAX_COMPUTE_UNITS:        4                                                6
  CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3                                                3
  CL_DEVICE_MAX_WORK_ITEM_SIZES:      1024 / 1 / 1                                     512 / 512 / 64

  CL_DEVICE_MAX_WORK_GROUP_SIZE:      1024                                             512
  CL_DEVICE_MAX_CLOCK_FREQUENCY:      2660 MHz                                         1100 MHz

  CL_DEVICE_ADDRESS_BITS:             64                                               32
  CL_DEVICE_MAX_MEM_ALLOC_SIZE:       2048 MByte                                       128 MByte

  CL_DEVICE_GLOBAL_MEM_SIZE:          8192 MByte                                       512 MByte
  CL_DEVICE_ERROR_CORRECTION_SUPPORT: no                                               no
  CL_DEVICE_LOCAL_MEM_TYPE:           global                                           local
  CL_DEVICE_LOCAL_MEM_SIZE:           32 KByte                                         16 KByte

  CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte                                         64 KByte
  CL_DEVICE_QUEUE_PROPERTIES:         CL_QUEUE_PROFILING_ENABLE                        CL_QUEUE_PROFILING_ENABLE
  CL_DEVICE_IMAGE_SUPPORT:            1                                                1
  CL_DEVICE_MAX_READ_IMAGE_ARGS:      128                                              128
  CL_DEVICE_MAX_WRITE_IMAGE_ARGS:     8                                                8
  CL_DEVICE_SINGLE_FP_CONFIG:         denorms INF-quietNaNs round-to-nearest           INF-quietNaNs round-to-nearest
                                      round-to-zero round-to-inf fma                   round-to-zero round-to-inf
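
The same kind of query can be done programmatically in OpenCL with clGetDeviceInfo. A minimal sketch (error checking omitted; on the Mac the header is <OpenCL/opencl.h> rather than <CL/cl.h>):

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id devices[8];
    cl_uint numDevices;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &numDevices);

    for (cl_uint i = 0; i < numDevices; ++i)
    {
        char name[256];
        cl_uint units;
        clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
        clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(units), &units, NULL);
        printf("%s: %u compute units\n", name, units);
    }
    return 0;
}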




A very nice introduction to CUDA can be found in the following series of articles from Dr. Dobb's Journal: http://www.ddj.com/hpc-high-performance-computing/207200659




Two good examples to start with are the matrix transpose example and the simpleGL example.

the older transpose.cu, or the newer transpose.cu and transpose_kernel.cu

the older, simpler simpleGL.cu, or the newer simpleGL.cxx and simpleGL_kernel.cu



You will notice that these simple programs typically have two parts: program.cu (which runs on the host) and program_kernel.cu (which runs on the device).

Here is the main part of transpose.cu:

void runTest( int argc, char** argv)
{
    // size of the matrix
    const unsigned int size_x = 256;
    const unsigned int size_y = 4096;

    // size of memory required to store the matrix of floats
    const unsigned int mem_size = sizeof(float) * size_x * size_y;
   

    // cutil routines are in the CUDA Utility Library
    // it has gone through a couple of iterations so you may see different versions of the functions

    // use command-line specified CUDA device, otherwise use device with highest Gflops/s
    if( cutCheckCmdLineFlag(argc, (const char**)argv, "device") )
        cutilDeviceInit(argc, argv);
    else
        cudaSetDevice( cutGetMaxGflopsDeviceId() );


    // allocate host memory
    float* h_idata = (float*) malloc(mem_size);

    // initialize the memory
    srand(15235911);
    for( unsigned int i = 0; i < (size_x * size_y); ++i)
    {
        h_idata[i] = (float) i;    // rand();
    }

    // allocate device memory (global memory of the GPU)
    float* d_idata;
    float* d_odata;
    cutilSafeCall( cudaMalloc( (void**) &d_idata, mem_size));
    cutilSafeCall( cudaMalloc( (void**) &d_odata, mem_size));

    // copy host memory to device
    cutilSafeCall( cudaMemcpy( d_idata, h_idata, mem_size,
                                cudaMemcpyHostToDevice) );

    // setup execution parameters
    dim3 grid(size_x / BLOCK_DIM, size_y / BLOCK_DIM, 1);
    dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);

    // here is one way to solve the problem
    transpose_naive<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
 
    cudaThreadSynchronize();
 
    // here is a faster, more complicated way
    transpose<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
 
    cudaThreadSynchronize();

 
    // check if kernel execution generated an error
    cutilCheckMsg("Kernel execution failed");


    // copy result from device to host
    float* h_odata = (float*) malloc(mem_size);
    cutilSafeCall( cudaMemcpy( h_odata, d_odata, mem_size,
                                cudaMemcpyDeviceToHost) );


    // cleanup memory
    free(h_idata);
    free(h_odata);
    // (the full SDK example also allocates and frees a 'reference' array used to verify the results)

    cutilSafeCall(cudaFree(d_idata));
    cutilSafeCall(cudaFree(d_odata));

    cudaThreadExit();

}


int main( int argc, char** argv)
{
    runTest( argc, argv);

    cutilExit(argc, argv);
}



and then the kernel file, which has both a naive and an optimized version:

#define BLOCK_DIM 16


// This naive transpose kernel suffers from completely non-coalesced writes.
// It also accesses data in global memory on the card rather than shared memory in the block.

__global__ void transpose_naive(float *odata, float* idata, int width, int height)
{
    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;

    if (xIndex < width && yIndex < height)
    {
        unsigned int index_in  = xIndex + width  * yIndex;
        unsigned int index_out = yIndex + height * xIndex;

        odata[index_out] = idata[index_in];
    }
}



// This kernel is optimized to ensure all global reads and writes are coalesced,
// and to avoid bank conflicts in shared memory.  This kernel is up to 11x faster
// than the naive kernel.  Note that the shared memory array is sized to
// (BLOCK_DIM+1)*BLOCK_DIM.  This pads each row of the 2D block in shared memory
// so that bank conflicts do not occur when threads address the array column-wise.

__global__ void transpose(float *odata, float *idata, int width, int height)
{
    __shared__ float block[BLOCK_DIM][BLOCK_DIM+1];

    // read the matrix tile into shared memory
    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
    if((xIndex < width) && (yIndex < height))
    {
        unsigned int index_in  = xIndex + width  * yIndex;
        block[threadIdx.y][threadIdx.x] = idata[index_in];
    }

    __syncthreads(); // synchronizes all threads within a block
    // everyone needs to have computed their part of shared memory before we can go on

    // write the transposed matrix tile to global memory
    xIndex = blockIdx.y * BLOCK_DIM + threadIdx.x;
    yIndex = blockIdx.x * BLOCK_DIM + threadIdx.y;
    if((xIndex < height) && (yIndex < width))
    {
        unsigned int index_out = xIndex + height * yIndex;
        odata[index_out] = block[threadIdx.x][threadIdx.y];
    }
}



Let's also take a look at the same two pieces of code written in OpenCL, which look almost identical.

// This naive transpose kernel suffers from completely non-coalesced writes.
// It can be up to 10x slower than the optimized kernel below for large matrices.
__kernel void transpose_naive(__global float *odata, __global float* idata, int offset, int width, int height)
{
    unsigned int xIndex = get_global_id(0);
    unsigned int yIndex = get_global_id(1);
   
    if (xIndex + offset < width && yIndex < height)
    {
        unsigned int index_in  = xIndex + offset + width * yIndex;
        unsigned int index_out = yIndex + height * xIndex;

        odata[index_out] = idata[index_in];
    }
}

and then the optimized one in OpenCL:

// This kernel is optimized to ensure all global reads and writes are coalesced,
// and to avoid bank conflicts in shared memory.  This kernel is up to 11x faster
// than the naive kernel above.  Note that the shared memory array is sized to
// (BLOCK_DIM+1)*BLOCK_DIM.  This pads each row of the 2D block in shared memory
// so that bank conflicts do not occur when threads address the array column-wise.
__kernel void transpose(__global float *odata, __global float *idata, int offset, int width, int height, __local float* block)
{
    // read the matrix tile into shared memory
    unsigned int xIndex = get_global_id(0);
    unsigned int yIndex = get_global_id(1);

    if((xIndex + offset < width) && (yIndex < height))
    {
        unsigned int index_in = yIndex * width + xIndex + offset;
        block[get_local_id(1)*(BLOCK_DIM+1)+get_local_id(0)] = idata[index_in];
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    // write the transposed matrix tile to global memory
    xIndex = get_group_id(1) * BLOCK_DIM + get_local_id(0);
    yIndex = get_group_id(0) * BLOCK_DIM + get_local_id(1);
    if((xIndex < height) && (yIndex + offset < width))
    {
        unsigned int index_out = yIndex * height + xIndex;
        odata[index_out] = block[get_local_id(0)*(BLOCK_DIM+1)+get_local_id(1)];
    }
}
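
One difference worth noticing: in CUDA the shared memory tile is declared inside the kernel with __shared__, but in OpenCL it arrives as a __local kernel argument whose size the host sets with clSetKernelArg, passing a null data pointer (you can see this below in transposeGPU):

    // host side: argument 5 is the __local tile, (BLOCK_DIM+1) * BLOCK_DIM floats
    clSetKernelArg(ckKernel[i], 5, (BLOCK_DIM + 1) * BLOCK_DIM * sizeof(float), 0);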




and for completeness here are the relevant parts of oclTranspose.cxx:

int runTest( const int argc, const char** argv)
{
    cl_int ciErrNum;
    cl_uint ciDeviceCount;
    unsigned int size_x = 256;
    unsigned int size_y = 4096;

    // size of memory required to store the matrix
    const size_t mem_size = sizeof(float) * size_x * size_y;

    //Get the NVIDIA platform
    ciErrNum = oclGetPlatformID(&cpPlatform);
    oclCheckError(ciErrNum, CL_SUCCESS);

    //Get the devices
    ciErrNum = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, 0, NULL, &uiNumDevices);
    oclCheckError(ciErrNum, CL_SUCCESS);
    cdDevices = (cl_device_id *)malloc(uiNumDevices * sizeof(cl_device_id) );
    ciErrNum = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, uiNumDevices, cdDevices, NULL);
    oclCheckError(ciErrNum, CL_SUCCESS);

    //Create the context
    cxGPUContext = clCreateContext(0, uiNumDevices, cdDevices, NULL, NULL, &ciErrNum);
    oclCheckError(ciErrNum, CL_SUCCESS);
 
   
    // Find out how many GPUs are available in the context
    size_t nDeviceBytes;
    ciErrNum |= clGetContextInfo(cxGPUContext, CL_CONTEXT_DEVICES, 0, NULL, &nDeviceBytes);
    ciDeviceCount = (cl_uint)nDeviceBytes/sizeof(cl_device_id);

    if (ciErrNum != CL_SUCCESS)
    {
        shrLog(" Error %i in clGetDeviceIDs call !!!\n\n", ciErrNum);
        return ciErrNum;
    }
    else if (ciDeviceCount == 0)
    {
        shrLog(" There are no devices supporting OpenCL (return code %i)\n\n", ciErrNum);
        return -1;
    }

    // create command-queues
    for(unsigned int i = 0; i < ciDeviceCount; ++i)
    {
        // get and print the device for this queue
        cl_device_id device = oclGetDev(cxGPUContext, i);
        shrLog("Device %d: ", i);
        oclPrintDevName(LOGBOTH, device);           
        shrLog("\n");

        // create command queue
        commandQueue[i] = clCreateCommandQueue(cxGPUContext, device, CL_QUEUE_PROFILING_ENABLE, &ciErrNum);
        if (ciErrNum != CL_SUCCESS)
        {
            shrLog(" Error %i in clCreateCommandQueue call !!!\n\n", ciErrNum);
            return ciErrNum;
        }
    }
   
 
    // allocate and initialize host memory
    float* h_idata = (float*) malloc(mem_size);
    float* h_odata = (float*) malloc(mem_size);
    srand(15235911);
    shrFillArray(h_idata, (size_x * size_y));

    // Program Setup
    size_t program_length;
    char* source_path = shrFindFilePath("transpose.cl", argv[0]);
    oclCheckError(source_path != NULL, shrTRUE);
    char *source = oclLoadProgSource(source_path, "", &program_length);
    oclCheckError(source != NULL, shrTRUE);

    // create the program
    cpProgram = clCreateProgramWithSource(cxGPUContext, 1,
                      (const char **)&source, &program_length, &ciErrNum);
    oclCheckError(ciErrNum, CL_SUCCESS);
   
    // build the program
    ciErrNum = clBuildProgram(cpProgram, 0, NULL, "-cl-fast-relaxed-math", NULL, NULL);
    if (ciErrNum != CL_SUCCESS)
    {
        // write out standard error, Build Log and PTX, then return error
        shrLogEx(LOGBOTH | ERRORMSG, ciErrNum, STDERROR);
        oclLogBuildInfo(cpProgram, oclGetFirstDev(cxGPUContext));
        oclLogPtx(cpProgram, oclGetFirstDev(cxGPUContext), "oclTranspose.ptx");
        return(EXIT_FAILURE);
    }
   
    // Run both naive and optimized kernels

    double naiveTime     = transposeGPU("transpose_naive", false, ciDeviceCount, h_idata, h_odata, size_x, size_y);
    double optimizedTime = transposeGPU("transpose",       true,  ciDeviceCount, h_idata, h_odata, size_x, size_y);

 
    // cleanup memory
    free(h_idata);
    free(h_odata);
    // (the full SDK example also allocates and frees a 'reference' array used to verify the results)
    free(source);
    free(source_path);

    // cleanup OpenCL
    ciErrNum = clReleaseProgram(cpProgram);   
    for(unsigned int i = 0; i < ciDeviceCount; ++i)
    {
        ciErrNum |= clReleaseCommandQueue(commandQueue[i]);
    }   
    ciErrNum |= clReleaseContext(cxGPUContext);
    oclCheckError(ciErrNum, CL_SUCCESS);

    return 0;
}



where transposeGPU is:


double transposeGPU(const char* kernelName, bool useLocalMem, cl_uint ciDeviceCount,
    float* h_idata, float* h_odata, unsigned int size_x, unsigned int size_y)

{
    cl_mem d_odata[MAX_GPU_COUNT];
    cl_mem d_idata[MAX_GPU_COUNT];
    cl_kernel ckKernel[MAX_GPU_COUNT];

    size_t szGlobalWorkSize[2];
    size_t szLocalWorkSize[2];
    cl_int ciErrNum;
 
    // Create buffers for each GPU
    // Each GPU will compute sizePerGPU rows of the result
    size_t sizePerGPU = shrRoundUp(BLOCK_DIM, (size_x+ciDeviceCount-1) / ciDeviceCount);
   
    // size of memory required to store the matrix
    const size_t mem_size = sizeof(float) * size_x * size_y;

    for(unsigned int i = 0; i < ciDeviceCount; ++i){
        // allocate device memory and copy host to device memory
        d_idata[i] = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                    mem_size, h_idata, &ciErrNum);
        oclCheckError(ciErrNum, CL_SUCCESS);

        // create buffer to store output
        d_odata[i] = clCreateBuffer(cxGPUContext, CL_MEM_WRITE_ONLY ,
                                    sizePerGPU*size_y*sizeof(float), NULL, &ciErrNum);
        oclCheckError(ciErrNum, CL_SUCCESS);

        // create the transpose kernel (naive or optimized, depending on kernelName)
        ckKernel[i] = clCreateKernel(cpProgram, kernelName, &ciErrNum);
        oclCheckError(ciErrNum, CL_SUCCESS);
       
        // set the kernel argument values
        size_t offset = i * sizePerGPU;
        ciErrNum  = clSetKernelArg(ckKernel[i], 0, sizeof(cl_mem), (void *) &d_odata[i]);
        ciErrNum |= clSetKernelArg(ckKernel[i], 1, sizeof(cl_mem), (void *) &d_idata[0]);
        ciErrNum |= clSetKernelArg(ckKernel[i], 2, sizeof(int), &offset);
        ciErrNum |= clSetKernelArg(ckKernel[i], 3, sizeof(int), &size_x);
        ciErrNum |= clSetKernelArg(ckKernel[i], 4, sizeof(int), &size_y);
        if(useLocalMem)
        {
            ciErrNum |= clSetKernelArg(ckKernel[i], 5, (BLOCK_DIM + 1) * BLOCK_DIM * sizeof(float), 0 );
        }
    }
    oclCheckError(ciErrNum, CL_SUCCESS);

    // set up execution configuration
    szLocalWorkSize[0] = BLOCK_DIM;
    szLocalWorkSize[1] = BLOCK_DIM;
    szGlobalWorkSize[0] = sizePerGPU;
    szGlobalWorkSize[1] = shrRoundUp(BLOCK_DIM, size_y);
   

    shrLog("\nProcessing a %d by %d matrix of floats...\n\n", size_x, size_y);

    for(unsigned int k=0; k < ciDeviceCount; ++k){
      ciErrNum |= clEnqueueNDRangeKernel(commandQueue[k], ckKernel[k], 2, NULL,
            szGlobalWorkSize, szLocalWorkSize, 0, NULL, NULL);
    }
    oclCheckError(ciErrNum, CL_SUCCESS);
  

    // Block CPU till GPU is done
    for(unsigned int k=0; k < ciDeviceCount; ++k){
        ciErrNum |= clFinish(commandQueue[k]);
    }

    oclCheckError(ciErrNum, CL_SUCCESS);

    // Copy back to host
    for(unsigned int i = 0; i < ciDeviceCount; ++i){
        size_t offset = i * sizePerGPU;
        size_t size = MIN(size_x - i * sizePerGPU, sizePerGPU);

        ciErrNum |= clEnqueueReadBuffer(commandQueue[i], d_odata[i], CL_TRUE, 0,
                                size * size_y * sizeof(float), &h_odata[offset * size_y],
                                0, NULL, NULL);
    }
    oclCheckError(ciErrNum, CL_SUCCESS);

    for(unsigned int i = 0; i < ciDeviceCount; ++i){
        ciErrNum |= clReleaseMemObject(d_idata[i]);
        ciErrNum |= clReleaseMemObject(d_odata[i]);
        ciErrNum |= clReleaseKernel(ckKernel[i]);
    }
    oclCheckError(ciErrNum, CL_SUCCESS);

    // (the timing code that computes 'time' has been elided from this excerpt)
    return time;
}




This is going to take a 256 x 4096 matrix and create a 4096 x 256 matrix by turning the rows of the input matrix into the columns of the output matrix. There are 1,048,576 elements in the matrix. The grid has 16 x 256 blocks and each block has 16 x 16 threads.
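
As a quick sanity check of that launch configuration (just restating the numbers in code):

    // 256 x 4096 elements, BLOCK_DIM = 16
    dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);            // 16 x 16 = 256 threads per block
    dim3 grid(256 / BLOCK_DIM, 4096 / BLOCK_DIM, 1);  // 16 x 256 = 4096 blocks
    // 4096 blocks x 256 threads/block = 1,048,576 threads, one per matrix element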

In the naive version the first 16 x 16 block reads from consecutive locations in global memory but writes to locations 4096 elements apart, so while the reads coalesce nicely, the writes are completely non-coalesced.



In the first phase the optimized version reads the 16 x 16 tile from global memory into shared memory. It looks like the threads will later be addressing the same column (bank) of shared memory, which would be bad, but the shared memory is declared to be of size [BLOCK_DIM][BLOCK_DIM+1], which avoids the bank conflict problem: the extra column in each row offsets each subsequent row by 1.






Shared memory is divided into 16 memory banks. Successive 32-bit words fall into successive banks:
 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

The warp size is 32, giving us a half-warp size of 16. If multiple threads in the same half-warp try to access different words in the same memory bank we get bank conflicts: all 16 threads run at the same time, but a bank can only service one request at a time.

If we have a half-warp of 16 threads accessing shared memory words 0 through 15 then things are great. If the threads in the half-warp are trying to access words 0, 16, 32, etc. then we are in trouble: instead of grabbing all 16 32-bit words at once we need 16 separate accesses. That's bad, and that's what would happen here if we didn't pad the array to [BLOCK_DIM][BLOCK_DIM+1].

Given that we have used the padding, the first 'column' of block uses words 0, 17, 34, etc., one in each successive bank, which is nice.
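
To make the arithmetic concrete (an illustration, not SDK code):

    // a 32-bit word at index i of shared memory lives in bank i % 16
    int bank(int wordIndex) { return wordIndex % 16; }

    // without padding (row pitch 16), column j of the tile occupies words
    //   j, j+16, j+32, ... -> all in bank j: a 16-way conflict
    // with padding (row pitch 17), column j occupies words
    //   j, j+17, j+34, ... -> banks j, j+1, j+2, ... (mod 16): no conflict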


In the second phase the threads read the tile back out of shared memory column-wise (each hitting a different bank, thanks to the padding) and write it to consecutive locations in global memory, so the writes coalesce.



Note that we have increased the amount of memory we are using and doubled the number of memory accesses, yet we have sped up the execution by up to 11x. In GPU coding you can get dramatic improvements by taking advantage of the way the hardware is organized; similarly, you can lose a lot of speed by being ignorant of how the hardware is organized. As you might expect, there is a lot of work underway on compilers that try to help people optimize naive code.


Why is this so much faster?
Shared memory can be as fast as registers. Local memory and global memory can be 150 times slower.


There is also an improvement in the way global memory is accessed. There is a nice overview of this in part 6 of the Dr. Dobb's series:

Global memory delivers the highest memory bandwidth only when the global memory accesses can be coalesced within a half-warp so the hardware can then fetch (or store) the data in the fewest number of transactions. CUDA devices can fetch data in a single 64-byte or 128-byte transaction. If the memory transaction cannot be coalesced, then a separate memory transaction will be issued for each thread in the half-warp, which is undesirable. The performance penalty for non-coalesced memory operations varies according to the size of the data type.  In particular in our case 32-bit data types will be roughly 10x slower.

Global memory access by all threads in the half-warp of a block can be coalesced into efficient memory transactions on a G80 architecture when (per the CUDA programming guide of that era):

   - the threads access 32-bit, 64-bit, or 128-bit words,
   - all 16 words lie in the same aligned segment (16 times the word size), and
   - the threads access the words in sequence (the kth thread in the half-warp accesses the kth word).
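
To see the difference, compare these two toy copy kernels (my own illustration, not from the SDK):

// consecutive threads touch consecutive words: coalesced
__global__ void copy_coalesced(float *out, const float *in)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// consecutive threads touch words 'stride' apart: one transaction per thread
__global__ void copy_strided(float *out, const float *in, int stride)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i * stride] = in[i];
}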




Some useful built-in variables:

dim3 gridDim - dimensions of the grid in blocks (only 2-dimensional for now, no z yet)
dim3 blockDim - dimensions of the block in threads
uint3 blockIdx - block index within the grid
uint3 threadIdx - thread index within the block
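
A minimal sketch of how these fit together to compute a thread's global position in a 2D launch:

__global__ void whereAmI(unsigned int *out)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;  // global column
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;  // global row
    unsigned int width = gridDim.x * blockDim.x;             // total threads per row
    out[y * width + x] = y * width + x;                      // row-major linear index
}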



If you want to see how the code runs in CUDA emulation mode you can set the emu environment variable to 1 (e.g. in tcsh: setenv emu 1) and then run make in the example's project directory. You will find the executable in bin/linux/emurelease. You will also notice that things run a lot slower.

Device emulation mode is good when you don't have a compatible machine to do your coding on. It is also useful if you want to put some printfs into the kernel to see what is happening; you can't use printf to debug your code when the kernel is running on the GPU.

Some things to keep in mind when writing a kernel: no recursion, no static variables, and no variable number of arguments.

When dealing with arrays, CUDA follows the C model with memory allocated in row-major order (first row, then second row, then third row, etc.). It's usually better to think about 2D arrays as being linearized into 1D.
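
So element (row, col) of a matrix with 'width' columns stored in a flat array is found at:

    // row-major indexing of a flattened 2D array
    float at(const float *m, int width, int row, int col)
    {
        return m[row * width + col];
    }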



If there is time we will also go over the simpleGL example: simpleGL.cu and simpleGL_kernel.cu



Here are some notes on mapping computational problems to image problems - i.e. the 'old' way we used to do these things a couple of years ago. The main example is similar to the convolution examples we did last week, but a bit more dynamic:

old Lecture 5

There is a fairly nice AMD webinar on optimizing convolution using OpenCL:
   - the webinar is at http://developer.amd.com/zones/OpenCLZone/Events/pages/OnDemandWebinars.aspx#/Dev_OpenCL_Webinar_8
   - the slides are at http://developer.amd.com/zones/OpenCLZone/Events/assets/Optimizations-ImageConvolution1.pdf

Here is another nice related PDF that gets at the reality of what is going on and how that affects optimization:
http://www.hotchips.org/archives/hc21/1_sun/HC21.23.2.OpenCLTutorial-Epub/HC21.23.230.Houston-AMD-AMD-and-OpenCL.pdf


Coming Next Time

 Case Studies


last revision 2/21/2012