More updates to come:
Project Proposal: To develop a program using CUDA, NVIDIA's library for processing on the GPGPU. I will implement a program that combines the usage of MPI with CUDA. I will link the .c MPI file and the .cu CUDA file so that they can call functions within each other. This can be accomplished by having both source files share a common header file. I will run two dot products on randomized vectors of data, both launched from the MPI nodes. My initial intent was to launch two CUDA kernels concurrently, each using an MPI node as a host. However, my machine would not register the other core for some reason, so I had to launch both kernels from the root node. One of the kernels runs serially (in CUDA terms), meaning one block and one thread. The other kernel is launched using many blocks to run in parallel. The number of blocks is scaled to the number of elements in the vector to preserve scalability.
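To make the linking idea concrete, here is a minimal sketch of the pattern I'm describing. The names (`common.h`, `gpu_dot`, `dot_kernel`) are hypothetical, not my actual source; the point is the shared header with an `extern "C"` wrapper so the C compiler building the MPI file and nvcc building the .cu file agree on the function's symbol:

```cuda
/* common.h -- hypothetical shared header, included by both the MPI .c
 * file and the CUDA .cu file. The extern "C" guard prevents C++ name
 * mangling by nvcc, so the C-compiled MPI code can link against it. */
#ifdef __cplusplus
extern "C" {
#endif
void gpu_dot(const float *a, const float *b, float *c, int n);
#ifdef __cplusplus
}
#endif

/* kernels.cu -- the CUDA side */
// Parallel version: one block per element, one thread per block,
// so the grid size scales directly with the vector length n.
__global__ void dot_kernel(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x;          // one block handles one element
    if (i < n)
        c[i] = a[i] * b[i];      // element-wise product; summed afterward
}

// Serial baseline: a single thread loops over the whole vector.
__global__ void dot_kernel_serial(const float *a, const float *b,
                                  float *c, int n)
{
    for (int i = 0; i < n; ++i)
        c[i] = a[i] * b[i];
}

// Wrapper callable from the MPI .c file (device memory assumed allocated).
extern "C" void gpu_dot(const float *a, const float *b, float *c, int n)
{
    dot_kernel<<<n, 1>>>(a, b, c, n);   // grid scaled to n elements
    cudaDeviceSynchronize();
}
```

The MPI side then just calls `gpu_dot(...)` like any ordinary C function, and the link step pulls in the nvcc-compiled object file.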
In class during my presentation some things went wrong; I will attempt to explain those issues. The code I executed during class was also attempting to use N threads per block, which gets complicated quickly. I was trying to scale the number of elements in the vector to both the number of blocks and the number of threads used, and that created problems whenever I wanted to change the element count. I also noticed that without using multiple grid dimensions in CUDA, it is impossible to use more than about 65,535 blocks in one dimension. I was attempting a dot product on 100k-element vectors, and that mismatch is what generated the "nan" error.
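The usual way around that limit is to put many threads in each block and compute each thread's global index, so the block count stays small. A hypothetical sketch (the 256-thread figure is just a common choice, not from my presentation code):

```cuda
#define THREADS_PER_BLOCK 256

// Each thread handles one element via its global index.
__global__ void dot_kernel(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    if (i < n)               // guard: the grid may overshoot n slightly
        c[i] = a[i] * b[i];
}

/* Host-side launch: ceiling division guarantees every element is covered.
 * For n = 100000 this needs only (100000 + 255) / 256 = 391 blocks,
 * comfortably under the 65,535 one-dimensional grid limit. */
int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
dot_kernel<<<blocks, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, n);
```

The `if (i < n)` guard is what keeps the extra threads in the last block from writing past the end of the vector, which is one easy way to end up with garbage values like the "nan" I saw.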
So what’s next?
Well, next I'd like to parallelize the summation of the element-wise product vector that the dot product produces. This could be done with a scan (or, since only a single total is needed, a reduction). I'd like to use CUDA to perform it, though I could also access the data from the MPI source file and do the final summation there instead.
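A hedged sketch of what that might look like in CUDA: a standard shared-memory tree reduction that sums each block's slice of the product vector. The names and the 256-thread size are assumptions for illustration, not code from this project:

```cuda
#define THREADS 256

// Each block reduces THREADS elements of `in` to one partial sum in `out`.
__global__ void reduce_sum(const float *in, float *out, int n)
{
    __shared__ float cache[THREADS];
    int i   = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;

    cache[tid] = (i < n) ? in[i] : 0.0f;   // pad out-of-range threads with 0
    __syncthreads();

    // Tree reduction: halve the active threads each step,
    // log2(THREADS) iterations in total.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();                   // all threads reach this barrier
    }

    if (tid == 0)                          // one partial sum per block
        out[blockIdx.x] = cache[0];
}
```

The kernel leaves one partial sum per block; those few remaining values could be summed by launching the kernel again on the partials, or simply added up on the host, which is also where an MPI reduction across nodes would naturally slot in.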
Through the course of my project my main objective was to show how much faster this code can execute when scaled well in CUDA. The images below demonstrate this. As you can see, the code parallelized with blocks executes dramatically faster than its serial counterpart. It is my belief that anyone interested in parallel computing should look into CUDA. It boasts strong performance and provides control of a massively parallel system at a low price.