Week 6
CUDA
We are going to be using OpenCL for
the send project in the class but CUDA may be a good alternative
for project 3 and we are going to start with it since it has a
more focused syntax
Getting
Started
with
CUDA
The best way
to
get started is to download CUDA from
http://www.nvidia.com/object/cuda_get.html
You will
probably
want to create your own projects within the SDK hierarchy. Inside
the
projects directory you will find a variety of examples. You can
'cp -r'
one of the existing projects, say the simpleGL one, to create a
new
project. In this new directory you should then rename the cu and
_kernel.cu files, update the kernel #include line in the .cu file
and
update the EXECUTABLE and CUFILES lines in the Makefile. When you
compile, the executable will be found in the bin/linux (or
whatever) /
release directory.
Here are some notes from James: http://www.evl.uic.edu/sjames/cs525/cuda.html
If you want to build this on windows with visual studio here is a
web
page that describes the process: http://llpanorama.wordpress.com/2008/05/21/my-first-cuda-program/
You can use
the deviceQuery program that comes with the SDK to see the values
for your card:
./deviceQuery
CUDA Device Query (Runtime
API) version (CUDART static linking)
2008 MacBook pro 2009 Mac Mini
2010 MacBook Pro
Device
0:
"GeForce
8600M
GT"
"GeForce
9400"
"GeForce
GT 330M"
CUDA Driver
Version / Runtime
Version:
2.30 / 2.30
2.30 / 2.3 4.10 / 4.0
CUDA Capability
Major/Minor revision number: 1.1
1.1
1.2
Total
amount of global
memory:
128M bytes 256M
bytes 512M bytes
Number
of
multiprocessors:
4
2
6
Number
of
cores:
32
16
48
Total amount of
constant
memory:
64K bytes
64K bytes 64K bytes
Total amount of shared
memory per block: 16K
bytes 16K
bytes 16K bytes
Total # of registers
available per block: 8192
8192
16384
Warp
size:
32
32
32
Maximum number of
threads per
block:
512
512
512
Maximum sizes of each
dimension of a block: 512 x 512 x
64 512 x 512 x 64 512 x 512 x
64
Maximum sizes of each
dimension of a grid: 64K x 64K x
1 64K x 64K x 1
64K x 64K x 1
Maximum memory
pitch:
256K bytes 256K
bytes 2G bytes
Texture
alignment:
256 bytes
256 bytes 256 bytes
Clock
rate:
0.94
GHz
1.10 GHz
1.10 GHz
Concurrent copy and
execution:
Yes
No
Yes
Run time limit on
kernels:
Yes
Yes
Yes
Integrated:
No
Yes
No
Support host
page-locked memory mapping:
No
Yes
Yes
Compute
mode:
Default
Default
Default
Test PASSED
If you
don't have a compatible card you will see:
Device 0: "Device Emulation (CPU)"
A very nice
introduction to CUDA can be found in the following series of
articles from Dr Dobb's Journal: http://www.ddj.com/hpc-high-performance-computing/207200659
Two good
examples
to start with are the matrix transpose example and the
simpleGL example.
older transpose.cu or
newer transpose.cu and transpose_kernel.cu
older simpler simpleGL.cu or the newer simpleGL.cxx and simpleGL_kernel.cu
You will
notice that these simple programs typically have two parts -
program.cu (which runs on the host) and program_kernel.cu
(which runs on the device)
Here is the main part of transpose.cu:
void runTest( int argc,
char** argv)
{
//
size of the matrix
const
unsigned int size_x = 256;
const
unsigned int size_y = 4096;
//
size of memory required to store the matrix of floats
const
unsigned int mem_size = sizeof(float) * size_x * size_y;
// cutil routines are in the CUDA
Utiltity Library
// it has gone through a couple
iterations so you may see different versions of the
functions
//
use command-line specified CUDA device, otherwise use device
with highest Gflops/s
if( cutCheckCmdLineFlag(argc, (const
char**)argv, "device") )
cutilDeviceInit(argc,
argv);
else
cudaSetDevice(
cutGetMaxGflopsDeviceId() );
//
allocate host memory
float* h_idata = (float*) malloc(mem_size);
// initalize the memory
srand(15235911);
for(
unsigned int i = 0; i < (size_x * size_y); ++i)
{
h_idata[i] = (float) i; // rand();
}
//
allocate device memory (global memory of the GPU)
float* d_idata;
float* d_odata;
cutilSafeCall( cudaMalloc( (void**) &d_idata,
mem_size));
cutilSafeCall( cudaMalloc( (void**)
&d_odata, mem_size));
//
copy host memory to device
cutilSafeCall( cudaMemcpy( d_idata, h_idata, mem_size,
cudaMemcpyHostToDevice) );
//
setup execution parameters
dim3
grid(size_x / BLOCK_DIM, size_y / BLOCK_DIM, 1);
dim3
threads(BLOCK_DIM, BLOCK_DIM, 1);
// here is one way to solve the problem
transpose_naive<<< grid, threads
>>>(d_odata, d_idata, size_x, size_y);
cudaThreadSynchronize();
// here is a faster more complicated way
transpose<<< grid, threads >>>(d_odata,
d_idata, size_x, size_y);
cudaThreadSynchronize();
//
check if kernel execution generated and error
cutilCheckMsg("Kernel execution failed");
//
copy result from device to host
float* h_odata = (float*) malloc(mem_size);
cutilSafeCall( cudaMemcpy( h_odata,
d_odata, mem_size,
cudaMemcpyDeviceToHost) );
//
cleanup memory
free(h_idata);
free(h_odata);
free(
reference);
cutilSafeCall(cudaFree(d_idata));
cutilSafeCall(cudaFree(d_odata));
cudaThreadExit();
}
int main( int argc, char** argv)
{
runTest( argc, argv);
cutilExit(argc, argv);
}
and then the kernel which has both a naive and optimized
version
#define
BLOCK_DIM 16
// This naive
transpose kernel suffers from completely non-coalesced writes.
// it is also accessing data in global memory on the card not
shared memory in the block
__global__ void
transpose_naive(float *odata, float* idata, int width, int
height)
{
unsigned
int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
if (xIndex
< width && yIndex < height)
{
unsigned int index_in = xIndex + width * yIndex;
unsigned int index_out = yIndex + height * xIndex;
odata[index_out] = idata[index_in];
}
}
// This kernel is
optimized to ensure all global reads and writes are coalesced,
// and to avoid bank
conflicts in shared memory. This kernel is up to 11x
faster
// than the naive
kernel. Note that the shared memory array is sized to
//
(BLOCK_DIM+1)*BLOCK_DIM. This pads each row of the 2D
block in shared memory
// so that bank conflicts
do not occur when threads address the array column-wise.
__global__ void transpose(float *odata, float *idata, int
width, int height)
{
__shared__ float
block[BLOCK_DIM][BLOCK_DIM+1];
// read
the matrix tile into shared memory
unsigned int xIndex = blockIdx.x *
BLOCK_DIM + threadIdx.x;
unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
if((xIndex < width) && (yIndex
< height))
{
unsigned int
index_in = xIndex + width * yIndex;
block[threadIdx.y][threadIdx.x] =
idata[index_in];
}
__syncthreads(); // synchronizes all threads within a block
//
everyone needs to have computed their part of shared memory
before we can go on
//
write the transposed matrix tile to global memory
xIndex = blockIdx.y * BLOCK_DIM +
threadIdx.x;
yIndex
= blockIdx.x * BLOCK_DIM + threadIdx.y;
if((xIndex < height) && (yIndex < width))
{
unsigned int index_out = xIndex + height *
yIndex;
odata[index_out] =
block[threadIdx.x][threadIdx.y];
}
}
This is going to take a 256 x 4096 matrix and create a 4096
x 256 matrix by turning the rows of the input matrix into
columns in the output matrix. There are 1,048,576 elements
in the matrix. The grid has 16 x 256 blocks and each block
has 16 x 16 threads.
In the naive version the first 16 x 16 block does the
following reading and writing to / from global memory.
The
optimized version of the first 16 x 16 block does the
following in the first phase. It looks like the kernels are
writing into the same column (bank) of shared memory which
would be bad, but the shared memory is declared to be of size
[BLOCK_DIM][BLOCK_DIM+1] which avoids the bank conflict
problem by adding an extra column to each row which offsets
subsequent rows by 1.
shared
memory
is divided into 16 memory banks. Successive 32bit words fall
into successive banks.
0
1 2 3 4 5 6 7
8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31
32 33 34 35 36 37 38 39 40
41 42 43 44 45 46 47
warp
size is 32 giving us a half-warp size of 16. If multiple
threads in the same half-warp try to access the same memory
bank then we get bank conflicts, since all 16 threads run at
the same time and they need to access different parts of the
same memory bank.
if
we have a half warp of 16 threads trying to access shared
memory 0 through 15 then things are great. If some of the
threads in the half warp are trying to access 0 16 32 etc then
we are in trouble, and instead of grabbing all 16 32-bit words
at once we need to use 16 separate calls. That's bad, and
that's what would happen if we didn't add in the offset [BLOCK_DIM][BLOCK_DIM+1]
Given that we have used the offset then the first 'column' of
block makes use of 0, 17, 34, etc, 1 in each successive bank,
which is nice.
and this in the second phase where the kernels read from
sequential locations in shared memory which is good for
avoiding bank conflicts.
note that
we have increased the amount of memory we are using, and
doubled the number of memory accesses, but yet we have sped up
the execution by a factor of 100. In GPU coding you can get
dramatic improvements by taking advantage of the way the
hardware is orgaznized. Similarly you can lose a lot of speed
by being ignorant of how the hardware is organized. As you
might expect there is a lot of work for people writing
compilers to try and help people optimize naive code.
why is this so much faster?
Shared Memory can be as fast as registers. Local Memory and
Global Memory can be 150 times slower.
Page 89 of the latest programming guide goes into this in more
detail
There is also an improvement in the way global memory is
accessed. There is a nice overview of this in part 6 of the Dr
Dobbs article:
Global memory delivers the
highest memory bandwidth only when the global memory
accesses can be coalesced within a half-warp so the hardware
can then fetch (or store) the data in the fewest number of
transactions. CUDA devices can fetch data in a single
64-byte or 128-byte transaction. If the memory transaction
cannot be coalesced, then a separate memory transaction will
be issued for each thread in the half-warp, which is
undesirable. The performance penalty for non-coalesced
memory operations varies according to the size of the data
type. In particular in our case 32-bit data types will
be roughly 10x slower.
Global memory access by all
threads in the half-warp of a block can be coalesced into
efficient memory transactions on a G80 architecture when:
- The threads access 32-bit,
64-bit or 128-bit data types.
- All 16 words of the
transaction lie in the same segment of size equal to the
memory transaction size.
- Threads must access the
words in sequence: the kth thread in the half-warp must
access the kth word.
some useful built in variables:
dim3 gridDim - dimension of the grid in blocks (only 2
dimensional for now, no z yet)
dim3
blockDim - dimensions of the block in threads
dim3
blockIdx - block index within grid
dim3
threadIdx - thread index within block
If you want
to see how the code runs in emulation mode then you can set
the emu environment variable to 1 (e.g. in tcsh: setenv emu
1) and then make in the example's project directory. You
will find the executable in bin/linux/emurelease. You will
also notice that things run a lot slower.
Device
emulation mode is good when you don't have a compatible
machine to do your coding on. It is also useful if you want
to put some printfs into the kernel to see what is
happening. You can't use printf to debug your code when you
are running the kernel on the GPU.
Some things to keep in mind when writing a kernel: no
recursion, no static variables, no variable number of
arguments.
When dealing with arrays CUDA follows the C model with
memory allocated in row-major order (first row then second
row then third row, etc.) Its usually better to think about
arrays as being linearized.
If there
is time we will also go over the simpleGL example: simpleGL.cu and simpleGL_kernel.cu
Here are some
notes on mapping computational problems to image problems - ie the
'old' way we used to do these things a couple years ago. The main
example is similar to the convolution examples we did last week,
but a bit more dymanic
old Lecture 5
Coming Next
Time
Case
Studies
last revision
12/22/2011