Lecture 5 - old version pre CUDA

GPGPU Concepts and Examples

Here are some notes on mapping computational problems to image problems, i.e. the 'old' way we used to do these things a couple of years ago.

We will start by looking at this paper:

A Survey of General-Purpose Computation on Graphics Hardware by John Owens et al
Proceedings of Eurographics 2005, Dublin, Ireland, Aug 29 - Sep 02, 2005, pp 21-51.
available in pdf format from here: http://graphics.idav.ucdavis.edu/publications/print_pub?pub_id=844

and course notes from SIGGRAPH 2005: http://www.gpgpu.org/s2005/#outline
and here are some related course notes from IEEE Visualization 2005:
http://www.gpgpu.org/vis2005/

and here is an introductory paper
http://numod.ins.uni-bonn.de/research/papers/public/StDoKo05sim.pdf

a really good site for this sort of thing is
http://www.gpgpu.org

Let's start with:
- http://www.gpgpu.org/s2005/slides/luebke.Introduction.ppt
then
- http://www.gpgpu.org/s2005/slides/harris.Mapping.ppt
and then
- http://www.gpgpu.org/s2005/slides/purcell.SortingAndSearching.ppt
and
- http://www.gpgpu.org/s2005/slides/woolley.GPUProgramOptimization.ppt

The main issue here is that you need to map the computational tasks into a graphical form

SIGGRAPH notes on how to map computational concepts to the GPU
http://www.gpgpu.org/s2005/slides/harris.Mapping.ppt

Do the work in the fragment processors (GPUs tend to have many more of them than vertex processors).

Works well for highly parallel tasks.
Works well for large data sets (but they must fit into texture memory)

Multiple passes may be necessary

So what's not there right now:

no real integer data type
no bitwise logical operations
no 64-bit support

Branching issues

A stream is an ordered set of data elements, all of the same type. A stream can be of any length.
A kernel takes one or more streams as input and produces one or more streams as output.

Stream Operations

Map (apply)

Given a stream of data elements and a function, apply the function to each data element (e.g. a convolution operation, or a simple addition of elements from multiple textures).
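As a rough CPU-side sketch of map (plain Python lists standing in for textures, and an ordinary function standing in for a fragment program; the names `map_kernel` and `add_textures` are made up for this illustration), each output element depends only on the input elements at the same position:

```python
# Sketch only: lists of rows stand in for textures, and map_kernel stands in
# for running a fragment program once per output pixel.

def map_kernel(func, *textures):
    """Apply func element-wise across one or more equally sized 2D 'textures'."""
    height = len(textures[0])
    width = len(textures[0][0])
    return [[func(*(t[y][x] for t in textures)) for x in range(width)]
            for y in range(height)]

def add_textures(a, b):
    """Simple addition of elements from two textures."""
    return map_kernel(lambda p, q: p + q, a, b)

tex_a = [[1.0, 2.0], [3.0, 4.0]]
tex_b = [[10.0, 20.0], [30.0, 40.0]]
result = add_textures(tex_a, tex_b)
print(result)
```

Because no output element depends on any other output element, all of them could be computed at the same time, which is what makes map such a good fit for the fragment processors.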

Reduce

Given a stream of data, compute a smaller stream (e.g. a sum or a maximum). For example, given a 512x512 block of data, a 256x256 pixel frame buffer would be created, and each of its 256x256 elements would apply the operation to the (x, y), (x+256, y), (x, y+256) and (x+256, y+256) elements of the source. In the next iteration a 128x128 pixel frame buffer would be used, and so on until a single element remains.
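The multi-pass reduction can be sketched on the CPU as follows. This is only an illustration in plain Python, assuming a square block whose side is a power of two; `reduce_pass` and `reduce_all` are hypothetical names, and each call to `reduce_pass` plays the role of one rendering pass into a frame buffer half the size.

```python
# Sketch only: each pass halves the width and height, combining four
# source elements per output element, as in the 512 -> 256 -> 128 example.

def reduce_pass(src, op):
    """One pass: output (x, y) combines (x, y), (x+half, y), (x, y+half), (x+half, y+half)."""
    half = len(src) // 2
    return [[op(op(src[y][x], src[y][x + half]),
                op(src[y + half][x], src[y + half][x + half]))
             for x in range(half)]
            for y in range(half)]

def reduce_all(src, op):
    """Repeat passes until a 1x1 result remains; also report the pass count."""
    passes = 0
    while len(src) > 1:
        src = reduce_pass(src, op)
        passes += 1
    return src[0][0], passes

data = [[y * 8 + x for x in range(8)] for y in range(8)]  # the values 0..63
total, passes = reduce_all(data, lambda a, b: a + b)
largest, _ = reduce_all(data, max)
print(total, largest, passes)
```

An 8x8 block takes three passes (8 to 4 to 2 to 1), so a 512x512 block would take nine.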

Scatter and Gather

Scatter is indirect write: d[a] = v ... very hard in a fragment shader
Gather is indirect read: v = d[a] ... pretty easy in a fragment shader
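The difference between the two can be sketched in plain Python (the helper names `gather` and `scatter` are just for this illustration). In a fragment program the output position is fixed before the program runs, so each fragment can read from any computed address but cannot choose where to write.

```python
# Sketch only: 1D lists stand in for textures.

def gather(d, addresses):
    """Indirect read: output element i fetches d[addresses[i]]."""
    return [d[a] for a in addresses]

def scatter(values, addresses, size, fill=0):
    """Indirect write: element i is written to d[addresses[i]]."""
    d = [fill] * size
    for v, a in zip(values, addresses):
        d[a] = v
    return d

data = [10, 20, 30, 40]
perm = [2, 0, 3, 1]
gathered = gather(data, perm)
scattered = scatter(data, perm, len(data))
print(gathered, scattered)
```

Note that gathering from the scattered result with the same addresses returns the original data, so the two operations are inverses of each other when the addresses form a permutation.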

Stream Filtering (non-uniform reduction)

Given a stream of data, select a subset of the elements. The size of the output is not known in advance, which is why this is a non-uniform reduction.
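One common way to structure filtering, which is an assumption here and not something taken from these notes, is to flag the elements to keep, compute a running count of the flags, and use that count as each kept element's output position. A minimal Python sketch with hypothetical names:

```python
# Sketch only: flag, exclusive prefix sum, then compact.

def exclusive_scan(flags):
    """For each position, count the kept elements that come before it."""
    out, total = [], 0
    for f in flags:
        out.append(total)
        total += f
    return out, total

def filter_stream(stream, keep):
    """Select the elements for which keep(v) is true, preserving their order."""
    flags = [1 if keep(v) else 0 for v in stream]   # this part is a map
    positions, count = exclusive_scan(flags)
    out = [None] * count
    for v, f, p in zip(stream, flags, positions):
        if f:
            out[p] = v                              # scatter-style write
    return out

stream = [5, -2, 7, 0, -9, 3]
kept = filter_stream(stream, lambda v: v > 0)
print(kept)
```

The final step is written as a scatter for clarity. Since scatter is hard in a fragment shader, a GPU version would need to recast that step as a gather.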

Sort

Given a stream of data, reorder the stream into an ordered set of data. This is hard to do without scatter.
GPU-based sorting takes a fixed number of steps no matter what the input data is, i.e. sorting an already sorted stream takes the same amount of time as sorting an unsorted stream.
Odd Even MergeSort is a simple one: http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/networks/oemen.htm
A faster one is Bitonic Mergesort on a GPU (with some odd grammar) - http://www.cis.upenn.edu/~suvenkat/700/lectures/19/sorting-kider.pdf
Another Bitonic Mergesort paper - http://www.cs.mu.oz.au/498/notes/node38.html
And a more mathematical one - http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/bitonic/bitonicen.htm
And some nice animations of how it works - http://www.tools-of-computing.com/tc/CS/Sorts/bitonic_sort.htm
And some Cg code on page 71 of http://171.64.77.146/papers/tpurcell_thesis/tpurcell_thesis.pdf
But basically:

Given a bitonic sequence with 2^n data elements (that is, a sequence with only one local minimum or maximum, so its magnitudes are either V shaped or A shaped):
Perform a Binary Split to break the sequence in half.
Perform a Bitonic Merge on the sequence, swapping partner elements in the two halves where necessary so that the smaller elements end up on the left and the larger ones on the right.
Recurse on each half of the sequence.
But what if my data doesn't start out as a bitonic sequence? Individual elements are bitonic sequences of length 1. From these elements you can build bitonic sequences of length 2, 4, 8, etc. using Bitonic Merge, typically sorting the left half ascending and the right half descending.
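The steps above can be sketched as a recursive CPU version in Python. This is only an illustration, not the Cg code from the thesis linked above; it assumes the input length is a power of two, and the function names are made up. A GPU version would run each round of compare-and-swap operations as one rendering pass instead of recursing.

```python
# Sketch only: recursive bitonic sort for lists whose length is a power of two.

def bitonic_merge(seq, ascending):
    """Sort a bitonic sequence by comparing partners in the two halves."""
    n = len(seq)
    if n == 1:
        return seq
    half = n // 2
    left, right = seq[:half], seq[half:]          # binary split
    for i in range(half):
        # Swap partners so the smaller (or larger, if descending) goes left.
        if (left[i] > right[i]) == ascending:
            left[i], right[i] = right[i], left[i]
    return bitonic_merge(left, ascending) + bitonic_merge(right, ascending)

def bitonic_sort(seq, ascending=True):
    """Build a bitonic sequence from two sorted halves, then merge it."""
    n = len(seq)
    if n == 1:
        return list(seq)
    half = n // 2
    left = bitonic_sort(seq[:half], True)         # ascending left half
    right = bitonic_sort(seq[half:], False)       # descending right half
    return bitonic_merge(left + right, ascending)

data = [3, 7, 4, 8, 6, 2, 1, 5]
result = bitonic_sort(data)
print(result)
```

Which pairs of elements get compared depends only on the length of the input, never on its values, which is why the sort takes the same number of steps for sorted and unsorted data.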

binary search
nearest neighbour search (kNN-grid)
searching notes from SIGGRAPH: http://www.gpgpu.org/s2005/slides/purcell.SortingAndSearching.ppt
Instead of making each individual search faster, we use the GPU to do many searches simultaneously.
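A minimal Python sketch of that idea, assuming sorted data and a batch of independent queries (`batched_lower_bound` is a hypothetical name): every query advances by one binary-search step per outer iteration, so on a GPU each outer iteration could be one pass over a texture of queries.

```python
# Sketch only: many binary searches advanced in lockstep.

def batched_lower_bound(sorted_data, queries):
    """For each query, find the first index whose value is >= the query."""
    n = len(sorted_data)
    lo = [0] * len(queries)
    hi = [n] * len(queries)
    steps = n.bit_length()      # enough halvings to empty any [lo, hi) range
    for _ in range(steps):
        # The inner loop is the part that would run in parallel, one query each.
        for q, key in enumerate(queries):
            if lo[q] < hi[q]:
                mid = (lo[q] + hi[q]) // 2
                if sorted_data[mid] < key:
                    lo[q] = mid + 1
                else:
                    hi[q] = mid
    return lo

sorted_data = [1, 3, 5, 7, 9, 11, 13]
queries = [7, 0, 14, 6]
found = batched_lower_bound(sorted_data, queries)
print(found)
```

No single search finishes any sooner than it would on the CPU, but the whole batch completes in the same fixed number of steps.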

Now how about some code ...

There is a good example as Tutorial 0 from the GPGPU site:
http://cg.in.tu-clausthal.de/publications.shtml#shader_maker

This is a good starting point, though for some reason the code doesn't call glutInit, which makes my PowerBook unhappy. After adding that call it works fine (aside from not allowing me to use the escape key to exit).

The project page is http://sourceforge.net/projects/gpgpu/
There is code here http://sourceforge.net/project/showfiles.php?group_id=104004&package_id=117303&release_id=245080

Here are some nice Optimization notes from SIGGRAPH 2005:
http://www.gpgpu.org/s2005/slides/woolley.GPUProgramOptimization.ppt

So what else can you do?

GPUSort (Windows, Linux with an Nvidia card)
http://gamma.cs.unc.edu/GPUSORT/index.html

FFT on a GPU
http://www.cs.unm.edu/~kmorel/documents/fftgpu/

Coming Next Time

Case Studies

last revision 5/16/08