In this final exam we are going to take another look at
optimizations under CUDA.

This final exam is a 'take home' exam. You are expected to work
on it completely by yourself. It is due Monday December 8th,
2008 at 11:59pm. By that time you should have set up a web page
with your solution, and emailed the location of that web page to
Andy. During the scheduled final exam time we will meet in class
and you will have 10 minutes to briefly describe your work to
the class. We won't have time for a question / answer period;
this way we can get through everyone within two hours.

The code we are going to look at is the convolution code from CUDA.
There is a very nice paper on optimizing this code included with
the CUDA examples and available on the web here:
http://developer.download.nvidia.com/compute/cuda/sdk/website/projects/convolutionSeparable/doc/convolutionSeparable.pdf
The CUDA SDK comes with three convolution examples. The two
important ones here are convolutionTexture and
convolutionSeparable. convolutionTexture has some optimizations;
convolutionSeparable has more. Your job is to show how much of
an effect those different optimizations have compared to a naive
CUDA version of the algorithm with no optimizations.

Your code should be designed to run from the command line
(though launching it from a bat file which contains the command
line is fine). The code should read in a single image in raw
format, apply a filter, and write out the new image in raw
format. If it makes things easier, you can use existing
libraries to read and write a standard format (JPEG, TIFF,
PNG, PPM, etc.). The time used to compare optimizations
should be based on the time taken to do the image conversion,
not on the time taken to read in and write out the image. In
order to be able to get large enough values for comparison you
may need to run the convolution kernels multiple times.

Your grade will be based on the number of optimizations that you
evaluate and the quality of the web-based documentation of your
testing. You should start with a naive, non-separated version
and then apply the changes in the paper (separating the
horizontal and vertical work, using shared memory, reducing idle
threads, coalescing memory accesses, unrolling the loops), or
start from the optimized version and back off those
optimizations in turn. Someone new to CUDA reading your web site
should come away with a better idea of where to focus their
energy when optimizing their own code.