Final Exam - Spring 2008

In this final exam we are going to take another look at optimizations under CUDA.

This final exam is a 'take home' exam. You are expected to work on it completely by yourself. It is due Monday December 8th, 2008 at 11:59pm. By that time you should have set up a web page with your solution, and emailed the location of that web page to Andy. During the scheduled final exam time we will meet in class and you will have 10 minutes to briefly describe your work to the class. We won't have time for a question / answer period; this way we can get through everyone within two hours.

The code we are going to look at is the convolution code from CUDA. There is a very nice paper on optimizing this code included with the CUDA examples and available on the web here:

The CUDA examples come with 3 convolution examples. The two important ones here are convolutionTexture and convolutionSeparable. convolutionTexture has some optimizations; convolutionSeparable has more. Your job is to show how much of an effect each of those optimizations has compared to a naive CUDA version of the algorithm with no optimizations.

Your code should be designed to run from the command line (though launching it from a bat file that contains the command line is fine). The code should read in a single image in raw format, apply a filter, and write out the new image in raw format. If it makes things easier you can make use of existing libraries to read in and write out a standard format (jpeg, tiff, png, ppm, etc.). The time used to compare optimizations should be based on the time taken to do the image conversion, not on the time taken to read in and write out the image. To get timings large enough to compare, you may need to run the convolution kernels multiple times.
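As a rough illustration of the timing rule above, the following Python sketch keeps only the filter calls between the two clock reads and averages over several runs. The names time_filter, apply_filter, and repeats are hypothetical and are not part of the assignment or of CUDA. The same structure carries over to the GPU with one caveat: kernel launches return before the kernel has finished, so the device has to be synchronized (or CUDA events used) before the end time is read.

```python
import time

def time_filter(apply_filter, image, repeats=10):
    """Return (mean seconds per run, last result) for apply_filter(image).

    Only the filter calls sit between the two clock reads, so any file
    reading or writing done by the caller is excluded from the timing.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    apply_filter(image)                 # warm-up run, not timed
    start = time.perf_counter()
    for _ in range(repeats):
        result = apply_filter(image)    # the work being measured
    elapsed = time.perf_counter() - start
    return elapsed / repeats, result
```

Reporting the mean per run rather than the total makes results taken with different repeat counts directly comparable.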

Your grade will be based on the number of optimizations that you evaluate and the quality of the web-based documentation for your testing. You should start with a naive non-separated version, and then apply the changes in the paper (separating the horizontal and vertical work, using shared memory, reducing idle threads, coalescing memory access, unrolling the loops) or start from the optimized version and back off those optimizations in turn. When reading through your web site someone new to CUDA should get a better idea of where they should focus their energy in optimizing their code.
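The first change in that list, separating the horizontal and vertical work, can be sketched on the CPU as below. This is only a small Python reference under stated assumptions, not the code from the paper or the CUDA examples: the function names are hypothetical, the 2D filter is assumed to be the outer product of a single odd-length 1D kernel, and pixels outside the image are handled by clamping to the nearest edge.

```python
def clamp(i, n):
    """Clamp index i into the valid range 0..n-1 (edge replication, assumed)."""
    return min(max(i, 0), n - 1)

def convolve_naive(img, k):
    """Non-separated 2D convolution with the outer product of k with itself."""
    h, w = len(img), len(img[0])
    r = len(k) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky in range(-r, r + 1):
                for kx in range(-r, r + 1):
                    pixel = img[clamp(y - ky, h)][clamp(x - kx, w)]
                    acc += pixel * k[ky + r] * k[kx + r]
            out[y][x] = acc
    return out

def convolve_rows(img, k):
    """Horizontal pass: each output pixel reads only its own row."""
    h, w = len(img), len(img[0])
    r = len(k) // 2
    return [[sum(img[y][clamp(x - i, w)] * k[i + r] for i in range(-r, r + 1))
             for x in range(w)] for y in range(h)]

def convolve_cols(img, k):
    """Vertical pass: each output pixel reads only its own column."""
    h, w = len(img), len(img[0])
    r = len(k) // 2
    return [[sum(img[clamp(y - i, h)][x] * k[i + r] for i in range(-r, r + 1))
             for x in range(w)] for y in range(h)]

def convolve_separable(img, k):
    """Row pass followed by column pass; matches convolve_naive up to rounding."""
    return convolve_cols(convolve_rows(img, k), k)
```

For a kernel of width K the naive version does K*K multiply-adds per pixel while the two passes together do 2K, which is where the saving comes from. A slow reference like this can also be used to check that each optimized GPU version still produces the same image.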

Be sure to detail which graphics card you are using.

I've supplied a test image in multiple resolutions, in both tif and raw rgb format; the download is about 200 MB.

This includes images in the following sizes to allow you to test the performance of the code on a variety of image sizes:
    1024 x 1024
    2048 x 2048
    4096 x 4096
    8192 x 8192
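A raw file carries no header, so the reader has to be told the width and height. The Python sketch below shows one way to do that and to catch a mismatch early. It assumes 8-bit samples stored interleaved as RGBRGB..., which this page does not state, and all function names are hypothetical.

```python
def expected_raw_size(width, height, channels=3):
    """Bytes in a headerless image with one byte per sample (assumed)."""
    return width * height * channels

def read_raw_rgb(path, width, height):
    """Read a headerless RGB file, failing early if the size is wrong."""
    with open(path, "rb") as f:
        data = f.read()
    expected = expected_raw_size(width, height)
    if len(data) != expected:
        raise ValueError(f"{path}: expected {expected} bytes, got {len(data)}")
    return data

def split_channels(data, width, height):
    """Split interleaved RGB bytes into three planes of width*height bytes."""
    if len(data) != expected_raw_size(width, height):
        raise ValueError("data length does not match width * height * 3")
    return data[0::3], data[1::3], data[2::3]
```

Under these assumptions the 1024 x 1024 image is 3,145,728 bytes and the 8192 x 8192 image is 201,326,592 bytes, so the largest size is worth checking against the memory available on the graphics card before testing.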

The original image is from


The tutorial is due Monday December 8th at 11:59pm.

At the time of the final we will have two hours, and each person will give a brief 10 minute talk on their work. 

last revision 12/12/11