![]() Declare and allocate host and device memory.Transfer data from the host to the device.Transfer results from the device to the host.Keeping this sequence of operations in mind, let’s look at a CUDA C example. ![]() In a recent post, I illustrated Six Ways to SAXPY, which includes a CUDA C version. SAXPY stands for “Single-precision A*X Plus Y”, and is a good “hello world” example for parallel computation. Int i = blockIdx.x*blockDim.x + threadIdx.x ĬudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost) Void saxpy(int n, float a, float *x, float *y) In this post I will dissect a more complete version of the CUDA C SAXPY, explaining in detail what is done and why. The information between the triple chevrons is the execution configuration, which dictates how many device threads execute the kernel in parallel. In CUDA there is a hierarchy of threads in software which mimics how thread processors are grouped on the GPU. In the CUDA programming model we speak of launching a kernel with a grid of thread blocks. The first argument in the execution configuration specifies the number of thread blocks in the grid, and the second specifies the number of threads in a thread block. Thread blocks and grids can be made one-, two- or three-dimensional by passing dim3 (a simple struct defined by CUDA with x, y, and z members) values for these arguments, but for this simple example we only need one dimension so we pass integers instead. In this case we launch the kernel with thread blocks containing 256 threads, and use integer arithmetic to determine the number of thread blocks required to process all N elements of the arrays ( (N+255)/256).įor cases where the number of elements in the arrays is not evenly divisible by the thread block size, the kernel code must check for out-of-bounds memory accesses. Cleaning UpĪfter we are finished, we should free any allocated memory. In CUDA, we define kernels such as saxpy using the _global_ declaration specifier.įor device memory allocated with cudaMalloc(), simply call cudaFree(). Variables defined within device code do not need to be specified as device variables because they are assumed to reside on the device. In this case the n, a and i variables will be stored by each thread in a register, and the pointers x and y must be pointers to the device memory address space. This is indeed true because we passed d_x and d_y to the kernel when we launched it from the host code. The first two arguments, n and a, however, were not explicitly transferred to the device in host code. Because function arguments are passed by value by default in C/C++, the CUDA runtime can automatically handle the transfer of these values to the device. This feature of the CUDA Runtime API makes launching kernels on the GPU very natural and easy-it is almost the same as calling a C function. There are only two lines in our saxpy kernel. As mentioned earlier, the kernel is executed by multiple threads in parallel.
0 Comments
Leave a Reply. |