Saturday, April 4, 2015

A review of parallelism in the CUDA architecture (Part 1)

In this tutorial I will show how to exploit features of the GPU architecture with a vector-sum example. I will start with a fully sequential version and then make several modifications to the code, explaining the GPU's limits along the way. Let's go.

Let's get acquainted with a simple example that sums two vectors:

#include <stdlib.h>
#include <time.h>
#include <iostream>
#define N 1024
using namespace std;

void vectorAdd(int *a, int *b, int *c) {
    int i = 0;
    for (; i < N; ++i)
        c[i] = a[i] + b[i];
}
// fill a vector with random numbers between 1 and 1000
void randomInts(int *vector) {
    int i = 0;
    for (; i < N; i++)
        vector[i] = rand() % 1000 + 1;
}
int main() {
    int *a, *b, *c, i = 0;
    int size = N * sizeof(int);

    // seed the random number generator once, so a and b get different values
    srand((unsigned) time(NULL));

    a = (int *) malloc(size); randomInts(a);
    b = (int *) malloc(size); randomInts(b);
    c = (int *) malloc(size);

    vectorAdd(a, b, c);

    // print the first ten elements of the result vector
    for (; i < 10; ++i)
        cout << "c[" << i << "] = " << c[i] << endl;

    free(a); free(b); free(c);

    return 0;
}


Let's start talking about blocks. Blocks can execute in parallel. So let's change the vectorAdd function, transforming it into a kernel that will execute on the device (GPU).

__global__ void vectorAdd(int *a, int *b, int *c) {
    int i = blockIdx.x;
    c[i] = a[i] + b[i];
}


The __global__ qualifier tells the compiler that vectorAdd will execute on the device. With blockIdx.x as the index, each value of i is computed by a different block.
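
To make the mapping concrete, here is a tiny standalone sketch (my own illustration, not part of the original example; it assumes a GPU with compute capability 2.0 or higher, which is required for device-side printf) where each of four blocks prints its own blockIdx.x:

#include <cstdio>

// Each block runs one copy of the kernel and sees a different blockIdx.x.
__global__ void whoAmI() {
    printf("Hello from block %d\n", blockIdx.x);
}

int main() {
    whoAmI<<<4, 1>>>();        // 4 blocks, 1 thread per block
    cudaDeviceSynchronize();   // wait for the kernel and flush its output
    return 0;
}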

Some changes are also required in the main function. We need to declare device pointers and allocate memory in the GPU address space using cudaMalloc:


int *d_a, *d_b, *d_c;   // device (GPU) copies of a, b, c
...

// allocate space for each vector in GPU memory
cudaMalloc(&d_a, size);
cudaMalloc(&d_b, size);
cudaMalloc(&d_c, size);
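
cudaMalloc, like most CUDA runtime calls, returns a cudaError_t, and it is good practice to check it. A minimal sketch of such a check (my addition, not part of the original code):

cudaError_t err = cudaMalloc(&d_a, size);
if (err != cudaSuccess) {
    cerr << "cudaMalloc failed: " << cudaGetErrorString(err) << endl;
    return 1;
}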


After filling the arrays on the host (CPU), we copy them to the device using cudaMemcpy:

cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

Now we can launch the vectorAdd kernel to run on N blocks on the GPU (the launch configuration <<<N,1>>> means N blocks with one thread per block):

vectorAdd<<<N,1>>>(d_a, d_b, d_c);
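
Note that kernel launches are asynchronous; the later cudaMemcpy back to the host synchronizes implicitly, but if you want to catch launch errors explicitly, a minimal sketch looks like this (my addition, not in the original post):

// right after the kernel launch:
cudaError_t launchErr = cudaGetLastError();  // reports launch/configuration errors
cudaDeviceSynchronize();                     // wait for the kernel to finish
if (launchErr != cudaSuccess)
    cerr << "kernel launch failed: " << cudaGetErrorString(launchErr) << endl;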

Finally, we copy the result from GPU to CPU memory and free the memory used on the GPU:

cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
...
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);

The complete source code can be downloaded here: http://pastebin.com/qc7gUiy1
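
For reference, here is a minimal sketch of the block-based version assembled from the snippets above (my reconstruction, with error checking omitted for brevity; it should match the linked code in spirit, though not necessarily line for line):

#include <stdlib.h>
#include <time.h>
#include <iostream>
#define N 1024
using namespace std;

__global__ void vectorAdd(int *a, int *b, int *c) {
    int i = blockIdx.x;              // one element per block
    c[i] = a[i] + b[i];
}

// fill a vector with random numbers between 1 and 1000
void randomInts(int *vector) {
    for (int i = 0; i < N; i++)
        vector[i] = rand() % 1000 + 1;
}

int main() {
    int *a, *b, *c;                  // host (CPU) copies
    int *d_a, *d_b, *d_c;            // device (GPU) copies
    int size = N * sizeof(int);

    srand((unsigned) time(NULL));

    a = (int *) malloc(size); randomInts(a);
    b = (int *) malloc(size); randomInts(b);
    c = (int *) malloc(size);

    cudaMalloc(&d_a, size);
    cudaMalloc(&d_b, size);
    cudaMalloc(&d_c, size);

    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    vectorAdd<<<N, 1>>>(d_a, d_b, d_c);   // N blocks, 1 thread each

    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // print the first ten elements of the result vector
    for (int i = 0; i < 10; ++i)
        cout << "c[" << i << "] = " << c[i] << endl;

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(a); free(b); free(c);
    return 0;
}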

We are executing N copies of vectorAdd in N blocks. With a few modifications, we can instead execute the N copies of vectorAdd in N threads of a single block.

The index in vectorAdd changes from blockIdx.x to threadIdx.x:


int i = threadIdx.x;

We also modify the vectorAdd kernel launch:


vectorAdd<<<1,N>>>(d_a, d_b, d_c);

The complete source code can be downloaded here: http://pastebin.com/zRufa9aV
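
One thing to keep in mind with the <<<1,N>>> launch: the number of threads in a single block is limited by the hardware (typically 1024 on recent GPUs, so N = 1024 just fits). A minimal sketch of how you might query this limit at runtime (my addition, not in the original post):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);   // properties of device 0
cout << "max threads per block: " << prop.maxThreadsPerBlock << endl;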

In the next parts of this tutorial, I'll show how to combine blocks and threads in the same application and present other mechanisms required for programming in CUDA.
