Sunday, April 5, 2015

Configuring the CUDA toolkit on my workstations

I have access to two computers with NVIDIA video cards: my notebook with hybrid graphics (Intel and NVIDIA GeForce GT 540M) and a desktop with a single GTX 680.

Installing everything I needed to program in CUDA on the desktop was smooth; on the notebook, however, I had some trouble because of the two video cards. I will describe what I did in this post.

Remove any driver version you have installed

Press Ctrl+Alt+F1, log in, and stop the X session:

~$ sudo /etc/init.d/mdm stop
~$ sudo apt-get purge xserver-xorg-video-nouveau libdrm-nouveau2 nvidia*
~$ sudo update-initramfs -u -k all
~$ sudo reboot

Installing requirements

Press Ctrl+Alt+F1, log in, and run:

~$ sudo /etc/init.d/mdm stop
~$ sudo apt-get update
~$ sudo apt-get install binutils gcc make g++ build-essential

Get the latest CUDA Toolkit

~$ wget http://developer.download.nvidia.com/compute/cuda/7_0/Prod/local_installers/cuda_7.0.28_linux.run
~$ chmod a+x cuda_7.0.28_linux.run
~$ ./cuda_7.0.28_linux.run --extract=/home/rhubner/inst
~$ cd inst
~/inst$ sudo ./NVIDIA-Linux-x86_64-346.46.run
~/inst$ sudo ./cuda-linux64-rel-7.0.28-19326674.run
~/inst$ sudo ./cuda-samples-linux-7.0.28-19326674.run
~/inst$ sudo reboot

Test the CUDA installation with the samples

~$ cd /usr/local/cuda/samples/1_Utilities/deviceQuery
/usr/local/cuda/samples/1_Utilities/deviceQuery$ sudo make
/usr/local/cuda/samples/1_Utilities/deviceQuery$ ./deviceQuery

On my poor laptop, it shows this:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GT 540M"
  CUDA Driver Version / Runtime Version          7.0 / 7.0
  CUDA Capability Major/Minor version number:    2.1
  Total amount of global memory:                 2048 MBytes (2147155968 bytes)
  ( 2) Multiprocessors, ( 48) CUDA Cores/MP:     96 CUDA Cores
  GPU Max Clock rate:                            1344 MHz (1.34 GHz)
  Memory Clock rate:                             900 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 131072 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (65535, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = GeForce GT 540M
Result = PASS
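
Besides the bundled samples, a tiny hand-written program is a good sanity check that nvcc and the driver agree with each other. Here is a minimal sketch (the file name hello.cu is my own choice, nothing required):

#include <cstdio>

// trivial kernel: each thread prints the index of its block
__global__ void hello() {
    printf("Hello from block %d\n", blockIdx.x);
}

int main() {
    hello<<<2, 1>>>();                          // launch 2 blocks of 1 thread each
    cudaError_t err = cudaDeviceSynchronize();  // wait for the kernel and collect any error
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}

Compile it with nvcc hello.cu -o hello and run it; if the two greeting lines appear, the driver and the toolkit are talking to each other.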


When the nouveau driver is uninstalled, the window borders disappear (the window manager breaks). I solved the problem as follows:

~$ sudo aptitude install kde-window-manager
~$ kwin --replace

Done! My laptop is ready for programming in CUDA.

Saturday, April 4, 2015

A review of parallelism in the CUDA architecture (Part 1)

In this tutorial I will show how to exploit features of the GPU architecture using a vector-sum example. I will start with a fully sequential version and make several modifications to the code, explaining the GPU limits along the way. Let's go.

Let's get acquainted with a simple example that sums two vectors:

#include <cstdlib>
#include <ctime>
#include <iostream>
#define N 1024
using namespace std;

void vectorAdd(int *a, int *b, int *c) {
    for (int i = 0; i < N; ++i)
        c[i] = a[i] + b[i];
}
// fill the vector with random numbers between 1 and 1000
void randomInts(int *vector) {
    for (int i = 0; i < N; i++)
        vector[i] = rand() % 1000 + 1;
}
int main() {
    int *a, *b, *c;
    int size = N * sizeof(int);

    // seed once; seeding inside randomInts would give both vectors the same values
    srand((unsigned) time(NULL));

    a = (int *) malloc(size); randomInts(a);
    b = (int *) malloc(size); randomInts(b);
    c = (int *) malloc(size);

    vectorAdd(a, b, c);

    // print the first ten numbers of the result vector
    for (int i = 0; i < 10; ++i)
        cout << "c[" << i << "] = " << c[i] << endl;

    free(a); free(b); free(c);

    return 0;
}


Let's start by talking about blocks. Blocks can execute in parallel, so let's change the vectorAdd function, transforming it into a kernel that will execute on the device (GPU).

 __global__ void vectorAdd(int *a, int *b, int *c) {
    int i = blockIdx.x;
    c[i] = a[i] + b[i];
}


The __global__ qualifier tells the compiler that vectorAdd will execute on the device. With blockIdx.x as the index, each element i is computed by a different block.

Some changes are also required in the main function. We need to declare device pointers and allocate memory in GPU space using cudaMalloc:


int *d_a, *d_b, *d_c;
...

cudaMalloc(&d_a, size);
cudaMalloc(&d_b, size);
cudaMalloc(&d_c, size);


After filling the arrays on the host (CPU), we copy them to the device using cudaMemcpy:

cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

Now we can launch the vectorAdd kernel to run on N blocks on the GPU:

vectorAdd<<<N,1>>>(d_a, d_b, d_c);

Finally, we copy the result from GPU memory back to CPU memory and free the memory used on the GPU:

cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
...
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);

The complete source code can be downloaded here: http://pastebin.com/qc7gUiy1
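
One thing these snippets leave out is error checking: every CUDA runtime call returns a cudaError_t, and failures are silent unless you look at it. Here is a minimal sketch of a checking helper (the CHECK macro below is my own, not part of the CUDA API):

#include <cstdio>
#include <cstdlib>

// wrap any CUDA runtime call; abort with a readable message on failure
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",          \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// usage, replacing the bare calls above:
// CHECK(cudaMalloc(&d_a, size));
// CHECK(cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice));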

We are executing N copies of vectorAdd in N blocks. We can also run N copies of vectorAdd in N threads of a single block with a few modifications.

The index in vectorAdd changes from blockIdx.x to threadIdx.x:


int i = threadIdx.x;

We also modify the kernel launch:


vectorAdd<<<1,N>>>(d_a, d_b, d_c);

The complete source code can be downloaded here: http://pastebin.com/zRufa9aV
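
Keep in mind that a single block cannot hold arbitrarily many threads: on my GT 540M the limit is 1024 threads per block (it appears in the deviceQuery output from the previous post). If you want to check the limit at runtime instead of hard-coding it, cudaGetDeviceProperties does the job; a small sketch:

#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max grid size (x):     %d\n", prop.maxGridSize[0]);
    return 0;
}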

In the next parts of this tutorial, I'll show how to combine blocks and threads in the same application and present other mechanisms required for programming in CUDA.

Friday, April 3, 2015

Tutorial: Kernel update for full CPU exploitation


I am preparing my machine to work with heterogeneous parallel programming. A kernel update (with matching kernel headers) is required to install the latest driver for my NVIDIA video card, the CUDA toolkit, and so on. So I decided to write a step-by-step guide. Let's go!

Get the latest Linux source and unpack it

~$ wget https://www.kernel.org/pub/linux/kernel/v3.x/linux-3.19.3.tar.xz
~$ tar -xvf linux-3.19.3.tar.xz

Install required packages and build dependencies

~$ sudo aptitude install fakeroot bzip2 kernel-package libncurses-dev

Move the sources and create a symlink to the Linux source folder

~$ cd /usr/src
/usr/src$ sudo mv ~/linux-3.19.3/ .
/usr/src$ sudo ln -s linux-3.19.3 linux
/usr/src$ cd linux/

Prepare to build the Linux kernel

/usr/src/linux$ sudo make clean
/usr/src/linux$ sudo make mrproper

Copy the config file from the existing Linux kernel installed on your system

/usr/src/linux$ sudo cp /boot/config-`uname -r` .config
/usr/src/linux$ sudo make menuconfig

Working on the kernel configuration file
  • Load ".config" file
  • Go into Processor type and features
    • Go into Processor family (Generic-x86-64)
      • Select Core 2/newer Xeon
    • Go into Preemption Model (Voluntary Kernel Preemption (Desktop))
      • Select Preemptible Kernel (Low-Latency Desktop)
    • Go into Timer frequency (250 HZ)
      • Select 300 HZ
Save the configuration file and exit.

Compilation with speedup tips

If you have an i7 CPU (like me), you can set this in your shell before the make process so the build uses all 8 hardware threads (export is a shell builtin, so it is not run through sudo):

/usr/src/linux$ export CONCURRENCY_LEVEL=8

Clean the folder before starting compilation

/usr/src/linux$ sudo make-kpkg clean

Build the kernel

/usr/src/linux$ sudo fakeroot make-kpkg --append-to-version "<-suffix>" --revision "<revision#>" --us --uc --initrd kernel_image kernel_headers

On my system I typed:

/usr/src/linux$ sudo fakeroot make-kpkg --append-to-version "i7core" --revision "1" --us --uc --initrd kernel_image kernel_headers

This step takes several minutes (or hours)...

Install newly built kernel packages

/usr/src/linux$ cd ..
/usr/src$ sudo dpkg -i linux-image-*.deb linux-headers-*.deb

Update initramfs before rebooting into new kernel

/usr/src$ sudo update-initramfs -c -k all

After rebooting, check that you are running the new kernel using uname:

~$ uname -a

The new kernel is compiled for Core 2/newer with preemption; it should feel more responsive :)