Week 7
    
    Case Studies
    
Let's look through some more examples that come with the CUDA SDK.
    
Here is a nice 30-minute webinar on global memory optimization in CUDA:
http://developer.download.nvidia.com/CUDA/training/globalmemoryusage_june2011.mp4
    
And 30 minutes on local memory / register optimization:
http://developer.download.nvidia.com/CUDA/training/CUDA_LocalMemoryOptimization.mp4
    
    
And then back to OpenCL with some videos from the AMD Fusion Developer Summit from the summer of 2011.
    
OpenCL and the 13 Dwarfs - a 50-minute talk covering a lot of different uses for GPUs:

    - video -
      http://developer.amd.com/afds/pages/OLD/video.aspx#/Dev_AFDS_Reb_2155
    - slides -
      http://developer.amd.com/afds/assets/presentations/2155_final.pdf
    
OpenCL with graphics using multiple graphics cards:

    - video -
      http://developer.amd.com/afds/pages/OLD/video.aspx#/Dev_AFDS_Reb_2115
    - slides -
      http://developer.amd.com/afds/assets/presentations/2115_final.pdf
    
An interesting talk on molecular docking that looks at the energy issues of using GPUs:

    - video -
      http://developer.amd.com/afds/pages/OLD/video.aspx#/Dev_AFDS_Reb_2130
    - slides -
      http://developer.amd.com/afds/assets/presentations/2130_final.pdf
    
    
And here are some notes from a couple of years ago looking at some more particle examples in CUDA.

As an example, let's take a look at using the GPU to render a particle system. This is something that can be done with GLSL or CUDA. We saw a GLSL example a couple of weeks ago.
    
This kind of thing is easier in CUDA. Here is an example derived from the CUDA OpenGL integration example which shows a simple particle system:

    andySwarm.cu
    andySwarm_kernel.cu
    
In this example each particle is generated and acts independently of the other particles.
    
    
    
__global__ void
kernel(float4* pos, float4 * pdata, unsigned int width,
       unsigned int height, int max_age, float time, float randy)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int arrayLoc = y*width + x;

    float4 localData;
    float4 localP;
    float4 newOne;
    float  cScale;

    // if we are doing real work ...
    if ((x < width) && (y < height))
      {
      // move the data we need into a register
      localData = pdata[arrayLoc];
      localP    = pos[arrayLoc];

      // if the particle needs to re-spawn ...
      if (localData.x >= max_age)
        {
        localData = make_float4(0.0,                                // age
                                0.02 * (x / (float) width - 0.5),   // horz velocity
                                0.015 + 0.01 * randy,               // vertical velocity
                                0.02 * (y / (float) height - 0.5)); // horz velocity

        // move the generation point around
        localP = make_float4(__sinf(time)*0.2,
                             0.0,
                             __cosf(time)*0.2,
                             0.0);
        }

      // take the current position and add on the velocity values
      newOne = make_float4(localP.x + localData.y,
                           localP.y + localData.z,
                           localP.z + localData.w,
                           1.0f);

      localData.x += 1.0;    // increase age
      localData.z -= 0.0001; // feel the effects of gravity on vertical velocity

      // does the particle hit the tabletop surface?
      if ((newOne.y <= 0.0) && (localP.x*localP.x + localP.z*localP.z < 25.0))
         localData.z *= -0.2;

      // now need to modify the color info in the array based on age of particle
      cScale = localData.x / max_age;

      // move the particle data back to global memory
      pdata[arrayLoc] = localData;

      // write out color (r, g, b, alpha) into the vbo
      // pos[width*height] is where the colours start
      pos[width*height + arrayLoc] = make_float4(1.0,
                                                 1.0 - 0.5 * cScale,
                                                 1.0 - cScale,
                                                 0.0);

      // write output vertex x, y, z, w
      pos[arrayLoc] = newOne;
      }
}
    
    
    
void createVBO(GLuint* vbo)
{
    // create buffer object
    glGenBuffers(1, vbo);
    glBindBuffer(GL_ARRAY_BUFFER, *vbo);

    // initialize buffer object
    unsigned int size = mesh_width * mesh_height * 8 * sizeof(float);
    glBufferData(GL_ARRAY_BUFFER, size, 0, GL_DYNAMIC_DRAW);

    glBindBuffer(GL_ARRAY_BUFFER, 0);

    // register buffer object with CUDA
    CUDA_SAFE_CALL(cudaGLRegisterBufferObject(*vbo));

    CUT_CHECK_ERROR_GL();
}
    
    
void runCuda(GLuint vbo)
{
    // map OpenGL buffer object for writing from CUDA
    float4 *dptr;
    CUDA_SAFE_CALL(cudaGLMapBufferObject((void**)&dptr, vbo));

    // execute the kernel
    dim3 block(16, 16, 1);
    dim3 grid(mesh_width / block.x, mesh_height / block.y, 1);
    kernel<<< grid, block >>>(dptr, d_particleData, mesh_width, mesh_height,
                              max_age, anim, (rand()%1000)/1000.0);

    // here we have a mesh_width and mesh_height of 256
    // each block contains a 16x16 array of kernels so the grid contains 16 x 16 blocks

    // unmap buffer object
    CUDA_SAFE_CALL(cudaGLUnmapBufferObject(vbo));
}
    
    
void display()
{
    // run CUDA kernel to generate vertex positions
    runCuda(vbo);

    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

    // set view matrix
    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();
    glTranslatef(0.0, 0.0, translate_z);
    glRotatef(rotate_x, 1.0, 0.0, 0.0);
    glRotatef(rotate_y, 0.0, 1.0, 0.0);

    // render from the vbo
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glVertexPointer(4, GL_FLOAT, 0, 0); // size, type, stride, pointer
    glColorPointer(4, GL_FLOAT, 0,
                   (GLvoid *)(mesh_width * mesh_height * sizeof(float)*4));

    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_COLOR_ARRAY);

    glDrawArrays(GL_POINTS, 0, mesh_width * mesh_height);

    glDisableClientState(GL_VERTEX_ARRAY);
    glDisableClientState(GL_COLOR_ARRAY);

    glutSwapBuffers();
    glutPostRedisplay();

    anim += 0.01;
}
      
    
    
// constants
const unsigned int mesh_width  = 256;
const unsigned int mesh_height = 256;

const unsigned int max_age     = 600;

// vbo variables
GLuint vbo;

float anim = 0.0;

float4 * d_particleData;
float  * h_particleData; // host copy is indexed as raw floats (age, vx, vy, vz per particle)
    
void runTest(int argc, char** argv)
{
    CUT_DEVICE_INIT(argc, argv);

    // Create GL context
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE);
    glutInitWindowSize(window_width, window_height);
    glutCreateWindow("Cuda GL interop");

    // initialize GL
    if (CUTFalse == initGL()) {
        return;
    }

    // register callbacks
    glutDisplayFunc(display);
    glutKeyboardFunc(keyboard);
    glutMouseFunc(mouse);
    glutMotionFunc(motion);

    // create VBO - each element has 8 floats (x, y, z, w, r, g, b, alpha)
    createVBO(&vbo);

    // create a new array to hold the data on the particles on the host
    // array has mesh_width x mesh_height elements
    // age, x-velocity, y-velocity, z-velocity (each of them a float)
    h_particleData = (float *) malloc(4 * mesh_width * mesh_height * sizeof(float));

    // and a similar array on the device
    CUDA_SAFE_CALL(cudaMalloc((void**) &d_particleData,
                              4 * mesh_width * mesh_height * sizeof(float)));

    // initialize the particles
    int pCounter;
    for (pCounter = 0; pCounter < mesh_width * mesh_height; pCounter++)
        {
        // set age to a random age so particles will spawn over time
        h_particleData[4*pCounter] = rand() % max_age; // age

        // set all the velocities to get them off the screen
        h_particleData[4*pCounter+1] = -10000;
        h_particleData[4*pCounter+2] = -10000;
        h_particleData[4*pCounter+3] = -10000;
        }

    // copy the particle data from the host over to the card
    CUDA_SAFE_CALL(cudaMemcpy(d_particleData, h_particleData,
                              4 * mesh_width * mesh_height * sizeof(float),
                              cudaMemcpyHostToDevice));

    // run the cuda part
    runCuda(vbo);

    // start rendering mainloop
    glutMainLoop();
}
    
In terms of how speed is affected by the kernel block layout (block dimensions -> frames per second):

    256,  1 -> 75 fps
     16, 16 -> 75 fps
     16,  8 -> 75 fps
     32, 16 -> 75 fps
     32,  8 -> 75 fps
     32,  4 -> 75 fps
      8,  8 -> 73 fps
      4,  4 -> 73 fps
      8, 16 -> 72 fps
      2,  2 -> 69 fps
      1,256 -> 61 fps
      1,  1 -> 49 fps
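These numbers come from varying the block dimensions in runCuda() above; a minimal sketch of the change being timed (the BLOCK_X / BLOCK_Y names are mine, not from the course code):

// hypothetical BLOCK_X / BLOCK_Y: rebuild with different values
// and watch the frame rate; grid dims follow from the 256 x 256 mesh
#define BLOCK_X 16
#define BLOCK_Y 16

dim3 block(BLOCK_X, BLOCK_Y, 1);
dim3 grid(mesh_width / BLOCK_X, mesh_height / BLOCK_Y, 1);
kernel<<< grid, block >>>(dptr, d_particleData, mesh_width, mesh_height,
                          max_age, anim, (rand()%1000)/1000.0);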
    
With a few modifications we can create a simple swarm where the location of the center of the swarm is passed into the kernel and all of the particles try to head towards the center.
    
    
    
// for a change of pace this kernel has pdata as a float * rather than a float4 *
__global__ void
kernel(float4* pos, float * pdata, unsigned int width,
       unsigned int height, int max_age, float time,
       float randy1, float randy2, float randy3,
       float tx, float ty, float tz)
{
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

    unsigned int arrayLoc = y*width*4 + x*4;
    unsigned int posLoc = y*width+x;

    float rx, ry, rz;
    float vx, vy, vz;
    float dx, dy, dz, sum;

    if (pdata[arrayLoc] >= max_age)
        {
        rx = (randy1 - 0.5);
        ry = (randy2 - 0.5);
        rz = (randy3 - 0.5);

        pdata[arrayLoc]   = 0;
        pdata[arrayLoc+1] = 0.001 * rx * rx * rx;
        pdata[arrayLoc+2] = 0.001 * ry * ry * ry;
        pdata[arrayLoc+3] = 0.001 * rz * rz * rz;

        // any new ones spawn near the target
        pos[posLoc].x = tx + 2.0 * rx * rx * rx;
        pos[posLoc].y = ty + 2.0 * ry * ry * ry;
        pos[posLoc].z = tz + 2.0 * rz * rz * rz;
        }

    pdata[arrayLoc] += 1;        // increase age

    dx = (tx - pos[posLoc].x);
    dy = (ty - pos[posLoc].y);
    dz = (tz - pos[posLoc].z);
    sum = sqrt(dx*dx + dy*dy + dz*dz);

    // update the velocity
    vx = 0.000005 * dx/sum;
    vy = 0.000005 * dy/sum;
    vz = 0.000005 * dz/sum;

    pdata[arrayLoc+1] = pdata[arrayLoc+1] + vx;
    pdata[arrayLoc+2] = pdata[arrayLoc+2] + vy;
    pdata[arrayLoc+3] = pdata[arrayLoc+3] + vz;

    float newX = pos[posLoc].x + pdata[arrayLoc+1];
    float newY = pos[posLoc].y + pdata[arrayLoc+2];
    float newZ = pos[posLoc].z + pdata[arrayLoc+3];

    // now need to modify the color info in the array
    pos[width*height + posLoc].x = 1.0;
    pos[width*height + posLoc].y = 1.0 - 0.5 * pdata[arrayLoc]/max_age;
    pos[width*height + posLoc].z = 1.0 - pdata[arrayLoc]/max_age;

    // write output vertex
    pos[posLoc] = make_float4(newX, newY, newZ, 1.0f);
}
      
      
Or the particles can move away from the center and fall like a sparkler.
      
      
    
Another option is to have each particle try to move towards the center of the swarm in a very simplified version of gravity. Here each particle computes the distance and direction to all of the other particles to work out the sum of those 'forces.' In this case, since the computation is much greater, there are many fewer particles.
    
    
    
    
Note that this code is most definitely not optimized.

__global__ void
kernel(float4* pos, float4 * pdata, unsigned int width, unsigned int height)
{
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

    if ((x >= width) || (y >= height))
        return;

    unsigned int posLoc = y*width+x;

    int i;
    float3 affect = {0.0, 0.0, 0.0};
    float3 dif, p;
    float3 newOne;

    // in larger problems float4 might be better

    for (i=0; i < width * height; i++)
        {
        dif.x = 1000 * (pos[i].x - pos[posLoc].x);
        dif.y = 1000 * (pos[i].y - pos[posLoc].y);
        dif.z = 1000 * (pos[i].z - pos[posLoc].z);

        if (fabs(dif.x) > 0.1)
            affect.x += (1/dif.x);

        if (fabs(dif.y) > 0.1)
            affect.y += (1/dif.y);

        if (fabs(dif.z) > 0.1)
            affect.z += (1/dif.z);
        }

    p = affect;
    p.x += pdata[posLoc].y;
    p.y += pdata[posLoc].z;
    p.z += pdata[posLoc].w;

    newOne.x = pos[posLoc].x + 0.0000001 * p.x;
    newOne.y = pos[posLoc].y + 0.0000001 * p.y;
    newOne.z = pos[posLoc].z + 0.0000001 * p.z;

    // write out the current velocity information
    pdata[posLoc] = make_float4(0, p.x, p.y, p.z);

    // map r, g, b colour to x, y, z velocity
    if (p.x > 0)
        pos[width*height + posLoc].x = 1.0;
    else
        pos[width*height + posLoc].x = 0.4;

    if (p.y > 0)
        pos[width*height + posLoc].y = 1.0;
    else
        pos[width*height + posLoc].y = 0.4;

    if (p.z > 0)
        pos[width*height + posLoc].z = 1.0;
    else
        pos[width*height + posLoc].z = 0.4;

    // write output vertex
    pos[posLoc] = make_float4(newOne.x, newOne.y, newOne.z, 1.0f);
}
    
      
And if we take that further we can get the basis of the galaxy collision simulation the class did as their second project last time.
    
    
    
      
 
    Optimization
    
We currently have a maximum of 512 threads per block.

Threads are assigned to streaming multiprocessors in blocks. Up to 8 blocks can be assigned to a single streaming multiprocessor ... BUT ... on the G80 card each streaming multiprocessor can run a maximum of 768 threads at once. To max out a streaming multiprocessor you could have 3 blocks each with 256 threads or 6 blocks each with 128 threads.

A 'warp' is the scheduling unit for the streaming multiprocessor. On the G80 each block is executed in 32-thread warps - so the first 32 threads in the block are warp 1, the next 32 threads are warp 2, etc. If you have 2 blocks, each with 256 threads, then there will be 8 warps generated per block (256 / 32 = 8) and so there will be 16 warps total.

Warps are not visible in the language - they may change radically in the future.

The CUDA occupancy calculator Excel spreadsheet might be able to help:
http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
          
         
     
    
     
 
        
There are lots of common functions available (but no rand) and there are also less accurate, faster device-only versions, e.g. __sinf, __powf, __logf, __expf. You can use -use_fast_math to force all the math functions to use the faster, low-precision versions.
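A tiny sketch of the difference (my example, not from the SDK):

__global__ void fastMathDemo(float * out, float t)
{
    // standard single-precision function - more accurate
    out[0] = sinf(t);

    // double-underscore intrinsic - faster, less accurate, device only
    out[1] = __sinf(t);

    // compiling with  nvcc -use_fast_math  turns the sinf() above
    // into __sinf() automatically (and likewise for the others)
}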
      
    
    
CUDA and graphics threads take turns on the GPU, switching between kernels on the CUDA side and between primitives on the graphics side.
      
    
The more threads that are running, the fewer registers are available to each of those threads; e.g. the 8800 has 8192 registers per streaming multiprocessor, shared among all the threads running on it.
    
Variables declared without any qualifiers (shared, constant) will be allocated in a register, unless there are no more registers left, in which case they will be allocated in local memory. Arrays will be allocated in local memory.
    
    
Global variables are stored in DRAM, which is much slower than shared memory - so a good idea is to partition data into tiles that can fit into shared memory, have one thread block do the computation with that tile, and write the results back out to global memory, as sketched below.
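A minimal sketch of that tiling pattern (the kernel and names here are illustrative, not from the particle code):

#define TILE 16

__global__ void tiledScale(float * in, float * out, int width, int height)
{
    // one tile of the input per thread block, staged in fast shared memory
    __shared__ float tile[TILE][TILE];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    bool live = (x < width) && (y < height);

    // one read per thread from slow global memory
    if (live)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    // do the computation on the shared copy (trivial here - this is
    // where neighboring threads would reuse each other's loads)
    if (live)
        out[y * width + x] = 2.0f * tile[threadIdx.y][threadIdx.x];
}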
    
    
Constant variables are stored in DRAM, so they should be slow, but they are cached, so they are very efficient for read-only data. Constant variables can only be assigned a value from the host.
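For example (a sketch - the coefficient array is made up):

// read-only data placed in the cached constant space
__constant__ float c_coeffs[4];

__global__ void useCoeffs(float * out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // every thread reads the same cached values - very cheap
        out[i] = c_coeffs[0] * i + c_coeffs[1];
}

// on the host - the only place constant memory can be written
void setCoeffs()
{
    float h_coeffs[4] = { 1.0f, 0.5f, 0.0f, 0.0f };
    cudaMemcpyToSymbol(c_coeffs, h_coeffs, sizeof(h_coeffs));
}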
    
    
Shared variables can not be initialized as part of their declaration, for example:
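__global__ void sharedInitDemo()
{
    // __shared__ float s = 0.0f;   // not allowed - no initializer
    __shared__ float s;             // declare it ...

    if (threadIdx.x == 0)
        s = 0.0f;                   // ... then assign at run time
    __syncthreads();                // make the value visible to all threads
}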
    
    
So here are some simple rules:

    data is read-only                    -> constant
    data is read/write within a block    -> shared
    data is read/write within a thread   -> registers
    data is read/write inputs/results    -> global
      
New cards will have double precision, and doubles will be used by default, which will be slower, and you may not need the extra precision - so you should add an 'f' onto the end of literals, e.g. 3.14f vs 3.14, and onto the functions, e.g. sinf() instead of sin().
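For instance (a sketch):

__global__ void precisionDemo(float * out, float t)
{
    // 3.14 and sin() are double precision - on cards with doubles the
    // whole expression gets promoted and runs slower than it needs to
    // out[0] = 3.14 * sin(t);

    // the f-suffixed forms keep everything in single precision
    out[0] = 3.14f * sinf(t);
}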
          
          
It's important to avoid bank conflicts in shared memory, as described in Section 5.1.2.5 on page 60 in version 2 of the CUDA Programming Guide.
          
          Shared memory is divided into equally-sized memory modules,
          called banks, which can be accessed simultaneously. So, any
          memory read or write request made of n addresses that fall in
          n distinct memory banks can be serviced simultaneously,
          yielding an effective bandwidth that is n times as high as the
          bandwidth of a single module.
          
          However, if two addresses of a memory request fall in the same
          memory bank, there is a bank conflict and the access has to be
          serialized. The hardware splits a memory request with bank
          conflicts into as many separate conflict-free requests as
          necessary, decreasing the effective bandwidth by a factor
          equal to the number of separate memory requests.
          
          In particular, warps are currently of size 32, and shared
          memory is divided into 16 banks where successive 32-bit words
          are assigned to successive banks. A shared memory request for
          a warp is split into one request
          for the first half of the warp and one request for the second
          half of the warp so there can't be conflicts between those two
          half-warps. However within a half-warp all 16 threads could
          try to access shared memory within the same bank.
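A sketch of the two access patterns (assuming the 16-bank layout described above; launch with up to 256 threads per block):

__global__ void bankDemo(float * out)
{
    __shared__ float data[512];

    // conflict-free: consecutive threads in a half-warp touch
    // consecutive 32-bit words, i.e. 16 distinct banks
    data[threadIdx.x] = threadIdx.x;
    __syncthreads();

    // 2-way bank conflict: with a stride of 2 words, threads 0 and 8
    // both hit bank 0, threads 1 and 9 both hit bank 1, and so on,
    // so each half-warp access is serviced in two passes
    data[threadIdx.x * 2] = threadIdx.x;
    __syncthreads();

    out[threadIdx.x] = data[threadIdx.x];
}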
          
Pinned memory may also help: use cudaMallocHost() rather than malloc() if you have regular communication between the device and the host.
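A sketch (d_buf and n are placeholders):

void pinnedTransfer(float * d_buf, int n)
{
    float * h_buf;

    // page-locked allocation in place of malloc() - lets the copy
    // engine DMA directly out of this buffer
    cudaMallocHost((void**) &h_buf, n * sizeof(float));

    // ... fill h_buf ...

    cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);

    cudaFreeHost(h_buf);   // in place of free()
}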
          
          You should also be careful to minimize the movement of data
          between the host and the device.
          
          
The CUDA Programming Guide (version 2.0):
http://developer.download.nvidia.com/compute/cuda/2_0/docs/NVIDIA_CUDA_Programming_Guide_2.0.pdf
          
Parts 4 and 5 of the Dr. Dobb's article from last week discussed more about optimization:
http://www.ddj.com/hpc-high-performance-computing/207200659
      
and the CUDA C Best Practices Guide:
http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_BestPracticesGuide_2.3.pdf
        
      
       
    
    
My original particle code looked like this:

__global__ void kernel(float4* pos, float * pdata, unsigned int width,
                       unsigned int height, int max_age, float time, float randy)
{
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

    unsigned int arrayLoc = y*width*4 + x*4;
    unsigned int posLoc = y*width+x;

    if (pdata[arrayLoc] >= max_age)
       {
       pdata[arrayLoc]   = 0;
       pdata[arrayLoc+1] = 0.02 * (x / (float) width - 0.5);
       pdata[arrayLoc+2] = 0.015 + 0.01 * randy;
       pdata[arrayLoc+3] = 0.02 * (y / (float) height - 0.5);

       // maybe move the generation point around?
       pos[posLoc].x = sin(time)/5.0;
       pos[posLoc].y = 0;
       pos[posLoc].z = cos(time)/5.0;
       }

    float newX = pos[posLoc].x + pdata[arrayLoc+1];
    float newY = pos[posLoc].y + pdata[arrayLoc+2];
    float newZ = pos[posLoc].z + pdata[arrayLoc+3];

    pdata[arrayLoc] += 1;        // increase age
    pdata[arrayLoc+2] -= 0.0001; // gravity

    // tabletop surface
    if ((newY <= 0) && fabs(pos[posLoc].x)<5 && fabs(pos[posLoc].z)<5)
       {
       pdata[arrayLoc+2] = -0.2 * pdata[arrayLoc+2];
       }

    // now need to modify the color info in the array
    pos[width*height + posLoc].x = 1.0;
    pos[width*height + posLoc].y = 1.0 - 0.5 * pdata[arrayLoc]/max_age;
    pos[width*height + posLoc].z = 1.0 - pdata[arrayLoc]/max_age;

    // write output vertex
    pos[posLoc] = make_float4(newX, newY, newZ, 1.0f);
}
        
        
I took that code and tried to do some optimizing on it. First I did some performance testing using CUT_SAFE_CALL(cutCreateTimer(&timer)), CUT_SAFE_CALL(cutStartTimer(timer)), CUT_SAFE_CALL(cutStopTimer(timer)), and cutGetAverageTimerValue(timer) in the main program.
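Roughly like this (a sketch - the loop bounds and the synchronize call are my reconstruction):

unsigned int timer = 0;
CUT_SAFE_CALL(cutCreateTimer(&timer));

for (int frame = 0; frame < 1000; frame++)
    {
    CUT_SAFE_CALL(cutStartTimer(timer));
    runCuda(vbo);                 // the work being measured
    cudaThreadSynchronize();      // wait for the kernel to finish
    CUT_SAFE_CALL(cutStopTimer(timer));
    }

printf("average ms per frame: %f\n", cutGetAverageTimerValue(timer));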
      
With a 256 x 256 VBO of particles and the code above, I varied the arrangement of kernels in each block, which influenced the arrangement of blocks in the grid, and timed the application on both the 8600M GT and the 9600 GT.
      
Then I tried to improve the kernel. The biggest improvement came from using an array of float4s instead of just floats for pdata and adjusting how the kernels were assigned to the blocks. Lesser improvements involved reducing the number of reads from global memory by doing one read to copy those values into shared memory or local registers and doing the computation there. I also tried adding some __syncthreads calls, but those only had the effect of reducing performance by 1%. The faster sin and cos functions take essentially no time. I think I gained a little by replacing the fabs calls with multiplication and eliminating one if statement.
    
    
    
      
        
Blocks  | Grid      | 8600M GT fps | 8600M GT * fps | 9600 GT fps | 9600 GT * fps
--------|-----------|--------------|----------------|-------------|---------------
1 x 1   | 256 x 256 | 40           | 55             | 245         | 430
2 x 2   | 128 x 128 | 71           | 84             | 545         | 920
8 x 8   | 32 x 32   | 80           | 89             | 730         | 1100
12 x 12 | 22 x 22   | 80           | 88             | 710         | 1080
16 x 16 | 16 x 16   | 82           | 94             | 830         | 1350
20 x 20 | 13 x 13   | 78           | 90             | 690         | 1130
16 x 1  | 16 x 256  | -            | -              | -           | 1360
1 x 16  | 256 x 16  | -            | -              | -           | 1060
256 x 1 | 1 x 256   | -            | -              | -           | 1330
1 x 256 | 256 x 1   | -            | -              | -           | 625

(columns marked * are the optimized version of the code)
    
    
In graphical form we can show the relative fps gains as follows, where the blue bars show the naive version of the code and the green bars show the optimized version of the code. There is not a large improvement on the laptop card, but there is a significant improvement on the newer desktop card.
      
    
    
     
   
 
      
I can push the frame rate for the 16 x 16 case on the 9600 GT up to 1515 by declaring a BLOCK_DIM by BLOCK_DIM region of float4 shared memory and copying either pdata or pos (but not both) into that shared memory at the beginning of the kernel. This increases to 1530 by updating that shared memory at the end of the kernel and then copying from shared memory back to global memory. Since we aren't doing that much work in shared memory, registers should have a similar effect here.

From my original starting point at 730 fps I was able to boost performance to 1530 fps without too much work.
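A sketch of that staging pattern on the first kernel (BLOCK_DIM is the 16 x 16 block size; the details of my actual version differed):

#define BLOCK_DIM 16

__global__ void kernel(float4 * pos, float4 * pdata, unsigned int width,
                       unsigned int height, int max_age, float time, float randy)
{
    __shared__ float4 sdata[BLOCK_DIM][BLOCK_DIM];

    unsigned int x = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int y = blockIdx.y * BLOCK_DIM + threadIdx.y;
    unsigned int arrayLoc = y * width + x;

    // one global read per thread into the shared region
    // (the 256 x 256 mesh divides evenly, so no bounds check here)
    sdata[threadIdx.y][threadIdx.x] = pdata[arrayLoc];

    // ... all the per-particle work from the kernel above,
    //     operating on sdata instead of pdata ...

    // one global write per thread back out at the end
    pdata[arrayLoc] = sdata[threadIdx.y][threadIdx.x];
}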
      
    
    
Another tool worth looking at is the Visual Profiler.
    
    
    
Coming Next Time

Project 2 presentations from the class
    
    
 last revision 2/29/12