This report presents SkePU, a C++ template library which provides a simple and unified interface for specifying data-parallel computations with the help of skeletons on GPUs using CUDA and OpenCL. The interface is also general enough to support other architectures, and SkePU implements both a sequential CPU and a parallel OpenMP back end. It also supports multi-GPU systems.
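To make the skeleton idea concrete, the following is a minimal sketch of a "map" skeleton with a selectable back end. It is not SkePU's actual interface: the names `Backend`, `map_skeleton` and `demo_map` are hypothetical and chosen only for illustration, and only a sequential and an OpenMP path are shown (a GPU back end would be selected the same way).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical back-end selector (illustration only, not SkePU's API).
enum class Backend { Sequential, OpenMP };

// A minimal "map" skeleton: applies the user function f to every element.
// The caller writes the same code regardless of which back end executes it.
template <typename T, typename F>
std::vector<T> map_skeleton(const std::vector<T>& in, F f,
                            Backend backend = Backend::Sequential) {
    std::vector<T> out(in.size());
    const long n = static_cast<long>(in.size());
    if (backend == Backend::OpenMP) {
        // If the compiler is not run with OpenMP enabled, the pragma is
        // ignored and this loop simply runs sequentially.
        #pragma omp parallel for
        for (long i = 0; i < n; ++i)
            out[static_cast<std::size_t>(i)] = f(in[static_cast<std::size_t>(i)]);
    } else {
        for (long i = 0; i < n; ++i)
            out[static_cast<std::size_t>(i)] = f(in[static_cast<std::size_t>(i)]);
    }
    return out;
}

// Usage: the same call works with either back end and gives the same result.
inline void demo_map() {
    const std::vector<int> v{1, 2, 3, 4};
    auto square = [](int x) { return x * x; };
    assert(map_skeleton(v, square, Backend::Sequential) ==
           map_skeleton(v, square, Backend::OpenMP));
}
```

The point of the design is that the user supplies only the per-element function, while the choice of back end is an argument rather than a rewrite of the computation.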
Benchmarks show that copying data between the host and the GPU is often a bottleneck. Therefore, a container that uses lazy memory copying has been implemented to avoid unnecessary memory transfers.
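The lazy copying idea can be sketched as follows. This is an illustration under stated assumptions, not SkePU's container: the class `LazyVector` and the function `scale_on_device` are hypothetical, and device memory is simulated by a second host buffer so that the example runs without a GPU. Each copy of the data carries a validity flag, and a transfer happens only when the side being accessed is stale.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical lazily synchronized container (illustration only).
// "Device" memory is simulated by a second host-side buffer.
class LazyVector {
public:
    explicit LazyVector(std::size_t n) : host_(n, 0.0f), device_(n, 0.0f) {}

    // Host write access: bring the host copy up to date if needed,
    // then mark the device copy as stale.
    float& at(std::size_t i) {
        assert(i < host_.size());
        sync_to_host();
        device_valid_ = false;
        return host_[i];
    }

    // Host read access: downloads only if the host copy is stale.
    float get(std::size_t i) {
        assert(i < host_.size());
        sync_to_host();
        return host_[i];
    }

    // Access for a (simulated) kernel that modifies the data on the device:
    // uploads only if the device copy is stale, then marks the host stale.
    std::vector<float>& device_data_for_write() {
        sync_to_device();
        host_valid_ = false;
        return device_;
    }

    // Number of simulated host<->device transfers performed so far.
    int transfers() const { return transfers_; }

private:
    void sync_to_device() {
        if (!device_valid_) { device_ = host_; device_valid_ = true; ++transfers_; }
    }
    void sync_to_host() {
        if (!host_valid_) { host_ = device_; host_valid_ = true; ++transfers_; }
    }

    std::vector<float> host_, device_;
    bool host_valid_ = true;     // host copy starts out current
    bool device_valid_ = false;  // nothing has been uploaded yet
    int transfers_ = 0;
};

// Simulated kernel: scales every element "on the device".
inline void scale_on_device(LazyVector& v, float k) {
    for (float& x : v.device_data_for_write()) x *= k;
}
```

With this scheme, several consecutive calls to `scale_on_device` on the same container cause a single upload, and the data is downloaded only when the host reads it again.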
SkePU was evaluated with small benchmarks and a larger application, a Runge-Kutta ODE solver. The results show that skeletal parallel programming is indeed a viable approach for GPU computing and that a generalized interface for multiple back ends is also reasonable. The best performance gains are obtained when the computational load is large compared to memory I/O, which lazy memory copying can help to achieve. SkePU also offers good performance on a more complex and realistic task such as ODE solving, with run times up to ten times faster when using SkePU with a GPU back end than with a sequential solver running on a fast CPU.
From the benchmarks we can conclude that skeletal parallel programming is a viable approach for GPU computing and that a generalized interface for multiple back ends is reasonable. SkePU does, however, have some disadvantages. Using the library incurs some overhead, which can be seen in the dot product and LibSolve benchmarks. The overhead is not large, but if performance is of the utmost importance, a hand-coded solution would be best. Nor can all calculations be expressed in terms of skeletons; for such problems, specialized routines must still be written.