Graphics processing units (GPUs) offer parallel computing power that usually requires a cluster of networked computers or a supercomputer to accomplish. While writing kernel code is fairly straightforward, achieving efficiency and performance requires very careful optimisation decisions and changes to the original serial algorithm. We introduce a parallel canonical ensemble Monte Carlo (MC) simulation that runs entirely on the GPU. In this paper, we describe two MC simulation codes of Lennard-Jones particles in the canonical ensemble, a single CPU core and a parallel GPU implementations. Using Compute Unified Device Architecture, the parallel implementation enables the simulation of systems containing over 200,000 particles in a reasonable amount of time, which allows researchers to obtain more accurate simulation results. A remapping algorithm is introduced to balance the load of the device resources and demonstrate by experimental results that the efficiency of this algorithm is bounded by available GPU resource. Our parallel implementation achieves an improvement of up to 15 times on a commodity GPU over our efficient single core implementation for a system consisting of 256k particles, with the speedup increasing with the problem size. Furthermore, we describe our methods and strategies for optimising our implementation in detail.
SUMMARYGraphics processing unit (GPU) computing is a popular approach to simulating complex models and performing massive calculations. GPUs have attracted a great deal of interest because they offer both high performance and energy efficiency. Efficient General-Purpose computation on Graphics Processing Units requires good parallelism, memory coalescing, regular memory access, small overhead on data exchange between the CPU and the GPU, and few explicit global synchronizations, which are usually gained from optimizing the algorithms. Besides these advantages, the proper use of some novel features provided on NVIDIA GPUs can offer further improvement. In this paper, we modify an existing optimized GPU application to illustrate the potential performance gains of these features and to demonstrate the performance trade offs of different implementation choices. The paper focuses on the challenges of reducing interactions between CPU and GPU and reducing the use of explicit synchronization. We explain how to achieve these objectives using two features of the Kepler architecture, warp shuffle, and dynamic parallelism. We find that a judicious use of these two techniques, eliminating repeated operations and synchronizations, results in significantly better performance. We describe various pitfalls encountered in optimizing our code to use these two features and how these were addressed. In particular, we identify a subtle data race condition when using dynamic parallelism under certain circumstances, and present our solution. Finally, we present a technique to trade off the allocation of various device resources to find the parameters that offer the best performance.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.