This paper presents the design of an efficient multi-target (CPU+GPU) implementation of the Parallel_for skeleton. Emerging massively parallel architectures promise very high performance at a low cost. However, these architectures change faster than ever, and optimizing codes for them has become a very complex and time-consuming task. We have identified data storage as the main difference between the CPU and the GPU implementations of a code. We introduce an abstract data layout in order to adapt the data storage to the target. Based on this layout, the Parallel_for skeleton allows the same program to be compiled and executed both on CPU and on GPU. Once compiled, the program runs close to the hardware limits.
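To make the data-storage difference concrete, the following sketch (type and function names are ours, not the paper's) contrasts the array-of-structures (AoS) form that suits CPU caches with the structure-of-arrays (SoA) form that yields coalesced memory accesses on GPUs; an abstract data layout lets the same user code target either.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration: the same logical collection of 2D points
// stored two ways.

// AoS: the two coordinates of point i are contiguous in memory,
// which suits per-element CPU access patterns.
struct PointAoS { double x, y; };

// SoA: all x coordinates are contiguous, so consecutive GPU threads
// reading x[i], x[i+1], ... touch consecutive addresses.
struct PointsSoA {
    std::vector<double> x, y;
};

double sumX_aos(const std::vector<PointAoS>& pts) {
    double s = 0.0;
    for (const auto& p : pts) s += p.x;
    return s;
}

double sumX_soa(const PointsSoA& pts) {
    double s = 0.0;
    for (double v : pts.x) s += v;
    return s;
}
```

Both functions compute the same result; only the memory layout behind them differs, which is exactly what an abstract layout must hide from user code.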
Keywords
C++ templates, parallel computing, Nvidia CUDA, Intel TBB, parallel skeletons, data layout
MOTIVATION AND MAIN OBJECTIVES

In many scientific applications, computation time is a strong constraint. Optimizing for rapidly changing computer hardware is a very expensive and time-consuming task, and emerging hybrid architectures tend to make this process even more complex.

The classical way to ease this optimization process is to build applications on top of High Performance Computing (HPC) libraries. Each HPC library allows the scientific developer to use a well-defined Application Programming Interface (API) tailored to its specific scientific sub-domain. Because of their limited scope, it is possible to produce specialized HPC implementations of these libraries for a large variety of target hardware architectures.

Legolas++ is a generic library developed at EDF R&D that provides building blocks for the specific domain of Highly Structured Sparse Linear Algebra (HSSLA) problems arising in many simulation codes. In particular, it makes it possible to deal with recursively blocked matrices (matrices of blocks of blocks of ...) that appear, for example, in neutron transport simulations [16]. In order to build HPC codes meeting EDF's industrial quality standards, a multi-target version of Legolas++ is presently being developed that should provide a unified interface over optimal implementations for both multi-core CPUs and Graphics Processing Units (GPUs). Not all, but a large fraction of the Legolas++ operations are embarrassingly parallel and consist in applying the same function independently to multiple data.
This kind of problem is well described by a Parallel_for algorithm, which is an instance of the parallel algorithmic skeletons introduced in [4]. In this article we propose a design for a C++ multi-target (CPU/GPU) implementation of the Parallel_for skeleton. Thi...
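The skeleton idea can be sketched as follows: a generic `parallel_for` that applies the same functor independently to every index of a range, splitting the work across hardware threads. This is a minimal CPU-only sketch under our own assumptions (names and interface are illustrative, not the paper's actual Legolas++ API, which also targets GPUs).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Minimal sketch of a Parallel_for skeleton: apply f(i) independently
// for every i in [0, n), one contiguous chunk of indices per thread.
template <typename Functor>
void parallel_for(std::size_t n, Functor f) {
    const std::size_t nThreads =
        std::max(1u, std::thread::hardware_concurrency());
    const std::size_t chunk = (n + nThreads - 1) / nThreads;

    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < nThreads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end = std::min(n, begin + chunk);
        if (begin >= end) break;  // fewer elements than threads
        workers.emplace_back([=] {
            for (std::size_t i = begin; i < end; ++i) f(i);
        });
    }
    for (auto& w : workers) w.join();
}
```

Because each index is processed independently, the same user functor could in principle be dispatched either to such a CPU thread pool (e.g. via Intel TBB) or to a CUDA kernel, which is the multi-target property the paper pursues.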