Week 6

CUDA




We are going to be using OpenCL for the send project in the class but CUDA may be a good alternative for project 3 and we are going to start with it since it has a more focused syntax

Getting Started with CUDA


The best way to get started is to download CUDA from http://www.nvidia.com/object/cuda_get.html

You will probably want to create your own projects within the SDK hierarchy. Inside the projects directory you will find a variety of examples. You can 'cp -r' one of the existing projects, say the simpleGL one, to create a new project. In this new directory you should then rename the cu and _kernel.cu files, update the kernel #include line in the .cu file and update the EXECUTABLE and CUFILES lines in the Makefile. When you compile, the executable will be found in the bin/linux (or whatever) / release directory.


Here are some notes from James: http://www.evl.uic.edu/sjames/cs525/cuda.html

If you want to build this on windows with visual studio here is a web page that describes the process: http://llpanorama.wordpress.com/2008/05/21/my-first-cuda-program/




You can use the deviceQuery program that comes with the SDK to see the values for your card:

./deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)

                                                2008 MacBook pro    2009 Mac Mini    2010 MacBook Pro

Device 0:                                       "GeForce 8600M GT"  "GeForce 9400"   "GeForce GT 330M"
  CUDA Driver Version / Runtime Version:         2.30 / 2.30        2.30 / 2.3       4.10 / 4.0
  CUDA Capability Major/Minor revision number:   1.1                1.1              1.2
  Total amount of global memory:                 128M bytes         256M bytes       512M bytes
  Number of multiprocessors:                     4                  2                6
  Number of cores:                               32                 16               48
  Total amount of constant memory:               64K bytes          64K bytes        64K bytes
  Total amount of shared memory per block:       16K bytes          16K bytes        16K bytes
  Total # of registers available per block:      8192               8192             16384
  Warp size:                                     32                 32               32
  Maximum number of threads per block:           512                512              512
  Maximum sizes of each dimension of a block:    512 x 512 x 64     512 x 512 x 64   512 x 512 x 64
  Maximum sizes of each dimension of a grid:     64K x 64K x 1      64K x 64K x 1    64K x 64K x 1
  Maximum memory pitch:                          256K bytes         256K bytes       2G bytes
  Texture alignment:                             256 bytes          256 bytes        256 bytes
  Clock rate:                                    0.94 GHz           1.10 GHz         1.10 GHz
  Concurrent copy and execution:                 Yes                No               Yes
  Run time limit on kernels:                     Yes                Yes              Yes
  Integrated:                                    No                 Yes              No
  Support host page-locked memory mapping:       No                 Yes              Yes
  Compute mode:                                  Default            Default          Default

Test PASSED




If you don't have a compatible card you will see:

Device 0: "Device Emulation (CPU)"



A very nice introduction to CUDA can be found in the following series of articles from Dr Dobb's Journal: http://www.ddj.com/hpc-high-performance-computing/207200659




Two good examples to start with are the matrix transpose example and the simpleGL example.

older transpose.cu  or newer transpose.cu and transpose_kernel.cu

older simpler simpleGL.cu or the newer simpleGL.cxx and simpleGL_kernel.cu



You will notice that these simple programs typically have two parts - program.cu (which runs on the host) and program_kernel.cu (which runs on the device)

Here is the main part of transpose.cu:

void runTest( int argc, char** argv)
{
    // size of the matrix
    const unsigned int size_x = 256;
    const unsigned int size_y = 4096;

    // size of memory required to store the matrix of floats
    const unsigned int mem_size = sizeof(float) * size_x * size_y;
   

    // cutil routines are in the CUDA Utiltity Library
    // it has gone through a couple iterations so you may see different versions of the functions

    // use command-line specified CUDA device, otherwise use device with highest Gflops/s
    if( cutCheckCmdLineFlag(argc, (const char**)argv, "device") )
        cutilDeviceInit(argc, argv);
    else
        cudaSetDevice( cutGetMaxGflopsDeviceId() );


    // allocate host memory
    float* h_idata = (float*) malloc(mem_size);

    // initalize the memory
    srand(15235911);
    for( unsigned int i = 0; i < (size_x * size_y); ++i)
    {
        h_idata[i] = (float) i;    // rand();
    }

    // allocate device memory (global memory of the GPU)
    float* d_idata;
    float* d_odata;
    cutilSafeCall( cudaMalloc( (void**) &d_idata, mem_size));
    cutilSafeCall( cudaMalloc( (void**) &d_odata, mem_size));

    // copy host memory to device
    cutilSafeCall( cudaMemcpy( d_idata, h_idata, mem_size,
                                cudaMemcpyHostToDevice) );

    // setup execution parameters
    dim3 grid(size_x / BLOCK_DIM, size_y / BLOCK_DIM, 1);
    dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);

    // here is one way to solve the problem
    transpose_naive<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
 
    cudaThreadSynchronize();
 
    // here is a faster more complicated way
    transpose<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
 
    cudaThreadSynchronize();

 
    // check if kernel execution generated and error
    cutilCheckMsg("Kernel execution failed");


    // copy result from device to host
    float* h_odata = (float*) malloc(mem_size);
    cutilSafeCall( cudaMemcpy( h_odata, d_odata, mem_size,
                                cudaMemcpyDeviceToHost) );


    // cleanup memory
    free(h_idata);
    free(h_odata);
    free( reference);

    cutilSafeCall(cudaFree(d_idata));
    cutilSafeCall(cudaFree(d_odata));

    cudaThreadExit();

}


int main( int argc, char** argv)
{
    runTest( argc, argv);

    cutilExit(argc, argv);
}

and then the kernel which has both a naive and optimized version

#define BLOCK_DIM 16


// This naive transpose kernel suffers from completely non-coalesced writes.
// it is also accessing data in global memory on the card not shared memory in the block

__global__ void transpose_naive(float *odata, float* idata, int width, int height)
{
    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
  
   if (xIndex < width && yIndex < height)
   {
       unsigned int index_in  = xIndex + width  * yIndex;
       unsigned int index_out = yIndex + height * xIndex;

       odata[index_out] = idata[index_in];
   }
}



// This kernel is optimized to ensure all global reads and writes are coalesced,
// and to avoid bank conflicts in shared memory.  This kernel is up to 11x faster
// than the naive kernel.  Note that the shared memory array is sized to
// (BLOCK_DIM+1)*BLOCK_DIM.  This pads each row of the 2D block in shared memory
// so that bank conflicts do not occur when threads address the array column-wise.

__global__ void transpose(float *odata, float *idata, int width, int height)

{
    __shared__ float block[BLOCK_DIM][BLOCK_DIM+1];

   
    // read the matrix tile into shared memory

    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;

    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;

    if((xIndex < width) && (yIndex < height))

    {
        unsigned int index_in  = xIndex + width  * yIndex;

        block[threadIdx.y][threadIdx.x] = idata[index_in];
    }

    __syncthreads(); // synchronizes all threads within a block
    // everyone needs to have computed their part of shared memory before we can go on

    // write the transposed matrix tile to global memory

    xIndex = blockIdx.y * BLOCK_DIM + threadIdx.x;

    yIndex = blockIdx.x * BLOCK_DIM + threadIdx.y;
    if((xIndex < height) && (yIndex < width))
    {
        unsigned int index_out = xIndex + height * yIndex;
        odata[index_out] = block[threadIdx.x][threadIdx.y];
    }
}


This is going to take a 256 x 4096 matrix and create a 4096 x 256 matrix by turning the rows of the input matrix into columns in the output matrix. There are 1,048,576 elements in the matrix. The grid has 16 x 256 blocks and each block has 16 x 16 threads.

In the naive version the first 16 x 16 block does the following reading and writing to / from global memory.



The optimized version of the first 16 x 16 block does the following in the first phase. It looks like the kernels are writing into the same column (bank) of shared memory which would be bad, but the shared memory is declared to be of size [BLOCK_DIM][BLOCK_DIM+1] which avoids the bank conflict problem by adding an extra column to each row which offsets subsequent rows by 1.






shared memory is divided into 16 memory banks. Successive 32bit words fall into successive banks.
 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

warp size is 32 giving us a half-warp size of 16. If multiple threads in the same half-warp try to access the same memory bank then we get bank conflicts, since all 16 threads run at the same time and they need to access different parts of the same memory bank.

if we have a half warp of 16 threads trying to access shared memory 0 through 15 then things are great. If some of the threads in the half warp are trying to access 0 16 32 etc then we are in trouble, and instead of grabbing all 16 32-bit words at once we need to use 16 separate calls. That's bad, and that's what would happen if we didn't add in the offset [BLOCK_DIM][BLOCK_DIM+1]

Given that we have used the offset then the first 'column' of block makes use of 0, 17, 34, etc, 1 in each successive bank, which is nice.


and this in the second phase
where the kernels read from sequential locations in shared memory which is good for avoiding bank conflicts.



note that we have increased the amount of memory we are using, and doubled the number of memory accesses, but yet we have sped up the execution by a factor of 100. In GPU coding you can get dramatic improvements by taking advantage of the way the hardware is orgaznized. Similarly you can lose a lot of speed by being ignorant of how the hardware is organized. As you might expect there is a lot of work for people writing compilers to try and help people optimize naive code.


why is this so much faster?
Shared Memory can be as fast as registers. Local Memory and Global Memory can be 150 times slower.

Page 89 of the latest programming guide goes into this in more detail



There is also an improvement in the way global memory is accessed. There is a nice overview of this in part 6 of the Dr Dobbs article:

Global memory delivers the highest memory bandwidth only when the global memory accesses can be coalesced within a half-warp so the hardware can then fetch (or store) the data in the fewest number of transactions. CUDA devices can fetch data in a single 64-byte or 128-byte transaction. If the memory transaction cannot be coalesced, then a separate memory transaction will be issued for each thread in the half-warp, which is undesirable. The performance penalty for non-coalesced memory operations varies according to the size of the data type.  In particular in our case 32-bit data types will be roughly 10x slower.

Global memory access by all threads in the half-warp of a block can be coalesced into efficient memory transactions on a G80 architecture when:




some useful built in variables:
dim3 gridDim - dimension of the grid in blocks (only 2 dimensional for now, no z yet)

dim3 blockDim - dimensions of the block in threads
dim3 blockIdx - block index within grid
dim3 threadIdx - thread index within block



If you want to see how the code runs in emulation mode then you can set the emu environment variable to 1 (e.g. in tcsh: setenv emu 1) and then make in the example's project directory. You will find the executable in bin/linux/emurelease. You will also notice that things run a lot slower.

Device emulation mode is good when you don't have a compatible machine to do your coding on. It is also useful if you want to put some printfs into the kernel to see what is happening. You can't use printf to debug your code when you are running the kernel on the GPU.

Some things to keep in mind when writing a kernel: no recursion, no static variables, no variable number of arguments.

When dealing with arrays CUDA follows the C model with memory allocated in row-major order (first row then second row then third row, etc.) Its usually better to think about arrays as being linearized.



If there is time we will also go over the simpleGL example: simpleGL.cu and simpleGL_kernel.cu



Here are some notes on mapping computational problems to image problems - ie the 'old' way we used to do these things a couple years ago. The main example is similar to the convolution examples we did last week, but a bit more dymanic

old Lecture 5


Coming Next Time

 Case Studies


last revision 12/22/2011