I See Dead Code

… as sounding brass, or a tinkling cymbal.

I See Dead Code header image 2

Oh my god—it’s full of cores!

May 24th, 2009 · 7 Comments

Beginnings

After the bleak, joyless work of releasing software, plugging holes of preventing users from clicking buttons they should not have clicked in the first place, writing unctuous documentation and release notes and wrapping code into neat little installers with tiny bows and bells that gently tingle when touched, there are few things as profoundly satisfying as taking code that looked like line noise in the first place, applying some non-trivial transformation to it and still have it look like line noise, only now it’s twice as fast, or half as long or of some other inaccessible quality that can, on some technical level, be referred to as “cool”.

Quite some time ago, I decided that this time, “cool” will mean “runs on graphics hardware, possibly faster”. For this end, I took my trusty T61 to work, downloaded the latest CUDA toolkit, found out that nearly all examples crashed when compiled, installed the latest beta drivers from NVIDIA, noticed that everything worked now and started playing around with the examples.

After some time marveling at ball pit simulations and similarly telling examples, I printed out the CUDA programming guide and spent a considerable amount of time marveling at how well the NVidia green looked when printed with our institute’s color laser printer, before reading its more important parts.

Technical Details

Roughly said, modern graphics hardware from NVIDIA consists of an array of multiprocessors. Each multiprocessor has eight cores, some shared memory, a controller unit and units for transcendental functions. Kernels (methods that run on the device, as opposed to host code) are automatically distributed to the available multiprocessors (their number depending on the hardware, with as many as 120 on the Tesla cards).

Kernel invocations are organized in 3-dimensional thread blocks and 2-dimensional grids. A single thread block can have at most 512 threads (i.e. x × y × z ⩽ 512) and is always executed by a single multiprocessor. Thread blocks are organized in a grid (with at most 65,536 elements on each dimension), which are distributed to the available multiprocessors, and thus inherently scalable.

A multiprocessor divides the threads of a block into warps and all threads of a single warp are executed in parallel. Optimal performance is reached when the control flow in threads of the same warp does not diverge based on the input data, the cores are optimized to execute the same code in several threads on different data (therefore the name SIMT—Single instruction, multiple thread—for this architecture).

There is also a memory hierarchy, from registers to on-die shared memory to global device memory. Data has to be explicitly copied from the system RAM to device memory, which is costly and should be minimized.

In other words, if you have a computation that has to be executed many, many times on different input data, move it the GPU instead and use the freed CPU cycles for shuffling around the data. In other words, ZOOOOOOM.

Some Experiments

Back in the day, we had to write code for training Gaussian Mixture Models. At the core of the algorithm, an auxiliary vector containing each point in the training set has to be created for each Gaussian. The creation of the auxiliar vector includes exponentiation, which proved to be quite costly, since it accounted for 50% of the running time of the program1.

Instead of doing comparedly boring optimization work for running that algorithm on a single processor, and even more boring work to have it run on two processors, I partially ported it to have the auxiliary vectors be computed on the GPU, with the result that by moving only this one part of the computation leads to the three-fold speed increase for the whole program.

On my machine (2 GiB RAM, 2.2 GHz Core 2 Duo, NVS 140M with 2 multiprocessors), running the C++-only version2 takes 33s for 90,000 points and three Gaussians. Running the computation with the same data set on the GPU takes 11s.

How is it done

In order to invoke a kernel running on the GPU, the code has to be compiled with nvcc, NVIDIA’s compiler for CUDA. It separates device, which is compiled to some binary code, from host code which is transformed and handed over to the platform compiler. Therefore, it is needed to write some intermediate methods which can be called from “normal” C++ and that take care of shuffling data to the device and invoking the kernel methods3:

typedef float PROB;
 
#define BLOCK_SIZE 16
 
Matrix d_in;
Matrix d_out;
 
__device__ __constant__ KernelGaussian d_params[20];
 
extern "C"
void LoadPoints(Matrix in, size_t params) 
{
    d_in.height = in.height;
    d_in.width = in.width;    
    size_t size_in = in.width * in.height * sizeof(PROB);
 
    cutilSafeCall(cudaMalloc((void**) &d_in.elements, size_in));
    cutilSafeCall(cudaMemcpy(d_in.elements, in.elements, size_in, cudaMemcpyHostToDevice));
 
    d_out.width = in.width;
    d_out.height = params;
 
    size_t size_out = d_out.width * d_out.height * sizeof(PROB);
    cudaMalloc((void**) &d_out.elements, size_out);
}
 
__global__ void computeAuxVectors(Matrix in, Matrix out) 
{
    int p = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    KernelGaussian g = d_params[blockIdx.x];
    PROB d = (in.elements[p] - g.mean) / g.stddev;
    out.elements[blockIdx.x * out.width + p] = g.factor * expf(-0.5*d*d);
}
extern "C"
void RunComputeAuxVectors(
        KernelGaussian* g, 
        std::vector<std::vector<PROB> >& out, 
        size_t points, size_t params)
{
    cutilSafeCall(
        cudaMemcpyToSymbol(
          d_params, g, 
          params*sizeof(KernelGaussian), 
          0, cudaMemcpyHostToDevice));
 
    dim3 dimGrid(params, points / BLOCK_SIZE);
    dim3 dimBlock(1, BLOCK_SIZE);
    computeAuxVectors<<<dimGrid, dimBlock>>>(d_in, d_out);
    cutilCheckMsg("Kernel execution failed");
 
    for(size_t i = 0; i < params; ++i) {
        cutilSafeCall(
            cudaMemcpy(
              &out[i][0], &d_out.elements[i * points], 
              sizeof(PROB) * points, 
              cudaMemcpyDeviceToHost));
    }
}

The matrix d_in holds the points, which are constant over the whole computation and therefore copied over to the device exactly once. d_out is the matrix that holds the auxiliary values computed for each point and each Gaussians. For faster retrieval, the Gaussians are stored in constant device memory (d_params), which is set to contain 20 elements at most, i.e. the algorithm is limited to learn mixture models with at most 20 Gaussians.

In the method LoadPoints, device memory is allocated for the points and the output vectors and the points are copied from host to device memory using cudaMemcpy.

The actual invocation of the kernel is done in the host method RunComputeAuxVectors, which first copies the parameters into the constant storage and invokes the kernel computeAuxVectors using the new <<>> syntax. grid holds the dimensions of the grid, which is number of gaussians× points / BLOCK_SIZE. The block size is the width of a thread block and somewhat arbitrarily chosen to be 164, currently this also limits the number of points to be a multiple of 165. After the blocking kernel invocation, the results are copied back into host memory.

Important:

  • My card only supports single-precision floats, newer ones also have double-precision
  • blockIdx and threadIdx are global variables provided by CUDA compiler
  • __global__ methods are run on the device and can be called from both host and device code
  • Currently, the device memory for d_in and d_out is never freed, the program simply exits
Results

Working with CUDA is definitely fun (if rewarding), the system itself is quite unforgiving though. Wrapping each invocation of a CUDA method into cutilSafeCall therefore is strongly recommended, in case of an error it will print a (not always helpful) error message and exit. Though in this case the code running on the GPU itself is quite trivial, it can be hard to figure out what is going wrong, because the GPU itself is a black box, and it’s not possible to simply print out a message6. It’s possible to emulate the code on the CPU, and there are other tools available which I haven’t tried out.

I’ve also started toying around with my original idea, having the constraint checks for the TIGER query evaluation run on the graphics hardware and initial results are mixed. Still, if nothing at all, this gives me justification to spent a lot of money on expensive graphics hardware for my next computer—it’s all for science, this harsh and demanding mistress.

  1. There might be a smart way of getting rid of the exp() call, though I haven’t found it yet []
  2. compiled using g++ 4.4.0 -O3 []
  3. looking at the SDK examples was quite helpful here []
  4. I didn’t try out other block sizes []
  5. this is a restriction of the implementation, one could simply pad the points with however many zeros as needed and compute a little more []
  6. or maybe it is, and I don’t know how to do it… []

Tags: hardware · lang:en · programming

7 responses so far ↓

  • 1 t3d // Oct 9, 2009 at 22:36

    Regarding footnote #6:
    Did you hear of cuda-gdb? I think using it could give even better results than printf() ;)

    Unfortunately i cannot use it, because i invoice my cuda code from FORTRAN, with a C wrapper added.

    I came here looking for a way to use “cutilSafeCall()” Still no idea what to do to make nvcc stop yelling “error: identifier “cutilSafeCall” is undefined” at me…

  • 2 Morten Madsen // Feb 10, 2010 at 11:27

    #1: I guess my answer comes a bit late but remember to include additional libraries. (cutil32.lib)

  • 3 #include // Feb 16, 2010 at 13:40

    Include me ^

  • 4 No, include me! // Feb 16, 2010 at 13:41

    #include

  • 5 No, include me! // Feb 16, 2010 at 13:41

    cutil_inline.h

  • 6 extern _shader_ hyperfloat computeAuxVectors ( ) // Feb 27, 2010 at 19:01

    good…. i like that..

  • 7 Javier Ruiz // Jul 14, 2010 at 16:07

    Hi, if you are lucky to get one of the new Fermi architecture cards (like the GTX470), you will enjoy all the bells and whistles of Compute Capability 2.0, which includes use of printf() inside a device function.
    Also concurrent kernels and other funny things.
    Saludos,
    Javier Ruiz

Leave a Comment