Extending smart containers for data locality‐aware skeleton programming

Ernstsson, August; Keßler, Christoph

doi:10.1002/cpe.5003

Cited by 11 publications

(15 citation statements)

References 22 publications


“…Operands to skeleton instances are to be passed in data containers, which are STL-like, generic collection abstract data types like Vector and Matrix that encapsulate C++ array-type data. We call them smart containers [9] because they transparently perform data transfer and memory management for their elements in (heterogeneous) systems with distributed memory, as well as global optimizations for data locality [14]. Using C++ iterators, skeleton instance calls may also operate on a proper subset of a container's elements only.…”

Section: SkePU 3 Overview (mentioning)

confidence: 99%

SkePU 3: Portable High-Level Programming of Heterogeneous Systems and HPC Clusters

Ernstsson

Ahlqvist

Zouzoula

et al. 2021

Int J Parallel Prog

Self Cite


We present the third generation of the C++-based open-source skeleton programming framework SkePU. Its main new features include new skeletons, new data container types, support for returning multiple objects from skeleton instances and user functions, support for specifying alternative platform-specific user functions to exploit e.g. custom SIMD instructions, generalized scheduling variants for the multicore CPU backends, and a new cluster-backend targeting the custom MPI interface provided by the StarPU task-based runtime system. We have also revised the smart data containers’ memory consistency model for automatic data sharing between main and device memory. The new features are the result of a two-year co-design effort collecting feedback from HPC application partners in the EU H2020 project EXA2PRO, and target especially the HPC application domain and HPC platforms. We evaluate the performance effects of the new features on high-end multicore CPU and GPU systems and on HPC clusters.
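The smart-container behaviour described in the citation statement above (transparent transfers, and skeleton calls restricted to a sub-range) can be illustrated with a toy model. This is a minimal sketch in Python, not SkePU's C++ API: the names `SmartVector`, `to_device`, `to_host`, and `map_on_device` are hypothetical, and the "device" is simulated by a second list.

```python
class SmartVector:
    """Toy smart container: tracks which copy (host or device) is valid.

    Hypothetical illustration only; this is not the SkePU interface.
    """

    def __init__(self, data):
        self.host = list(data)      # host-memory copy
        self.device = None          # simulated device-memory copy
        self.host_valid = True
        self.device_valid = False
        self.transfers = 0          # number of simulated host<->device copies

    def to_device(self):
        # Copy to the device only if the device copy is stale.
        if not self.device_valid:
            self.device = list(self.host)
            self.device_valid = True
            self.transfers += 1
        return self.device

    def to_host(self):
        # Copy back only if the host copy is stale.
        if not self.host_valid:
            self.host = list(self.device)
            self.host_valid = True
            self.transfers += 1
        return self.host


def map_on_device(func, vec, begin=0, end=None):
    """Apply func in place to vec[begin:end] on the simulated device."""
    dev = vec.to_device()
    end = len(dev) if end is None else end
    for i in range(begin, end):
        dev[i] = func(dev[i])
    vec.host_valid = False          # the device now holds the newest data


v = SmartVector(range(8))
map_on_device(lambda x: x * 2, v, 2, 5)   # operate on a proper subset only
map_on_device(lambda x: x + 1, v)         # second call reuses the device copy
result = v.to_host()                      # one copy back at the end
```

In this sketch the two consecutive calls cost two simulated transfers in total (one upload, one download) rather than four, which is the kind of saving that validity tracking is meant to provide.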


“…We can only speculate that this anomaly might be caused by some stateful optimization within the CUDA memory allocator, and it might also be specific to our GPU, CUDA and driver version. 2 Straightline control holds for use with lazy execution [3] and for branch-free regions in a kernel-level compiler IR. i = 0, ..., N − 1 is executed either on the CPU (device d_i = 0) or on the accelerator (d_i = 1).…”

Section: Problem Formulation (mentioning)

confidence: 99%

“…The global optimization method presented in this paper could, in principle, be likewise applied as a runtime optimization once sufficiently large kernel (sub)graphs such as lineages [3] have been identified at runtime, which in turn is done by lazy execution techniques that are also applied, e.g., in Spark and TensorFlow. However, in our case the runtime overhead for the optimization might only pay off if the computed memory placement can be reused, e.g.…”

Section: Related Work, 6.1 Transfer Fusion (mentioning)

confidence: 99%

“…The graph structure can be leveraged for global program optimizations, both statically in compilers and dynamically in runtime systems for heterogeneous systems. For example, the skeleton programming framework SkePU allows for global tiling of lineages (i.e., acyclic kernel-vector graphs) of skeleton-based kernels at runtime to improve cache hit rates [3].…”

Section: Introduction (mentioning)

confidence: 99%


Global optimization of operand transfer fusion in heterogeneous computing

Keßler

2019

Proceedings of the 22nd International Workshop on Software and Compilers for Embedded Systems

Self Cite


We consider the problem of minimizing, for a dataflow graph of kernel calls, the overall number of operand data transfers, and thus, the accumulated transfer startup overhead, in heterogeneous systems with non-shared memory. Our approach analyzes the kernel-operand dependence graph and reorders the operand arrays in memory such that transfers and memory allocations of multiple operands adjacent in memory can be merged, saving transfer startup costs and memory allocation overheads.

CCS Concepts: Computer systems organization → Heterogeneous (hybrid) systems; Software and its engineering → Communications management; Compilers.
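The idea of merging transfers of operands that are adjacent in memory can be made concrete with a small counting model. The sketch below is an assumption-laden illustration, not the method of the cited paper: it treats every kernel call as transferring all of its operands, ignores data already resident on the device, and finds a layout by brute force over permutations, which is only feasible for a handful of operands. The names `count_fused_transfers` and `best_layout` are hypothetical.

```python
from itertools import permutations


def count_fused_transfers(layout, needed):
    """Count transfers when adjacent needed operands are merged into one.

    layout: operand names in memory order; needed: set of operands to move.
    """
    transfers = 0
    in_run = False
    for name in layout:
        if name in needed:
            if not in_run:          # a new contiguous run starts here
                transfers += 1
                in_run = True
        else:
            in_run = False          # an unneeded operand breaks the run
    return transfers


def total_cost(layout, kernels):
    """Sum the fused transfer counts over all kernel calls."""
    return sum(count_fused_transfers(layout, k) for k in kernels)


def best_layout(operands, kernels):
    """Brute-force search for a memory order with the fewest transfers."""
    return min(permutations(operands), key=lambda p: total_cost(p, kernels))


kernels = [{"a", "b"}, {"b", "c"}]      # operand sets of two kernel calls
initial = ["a", "x", "b", "c"]
improved = best_layout(initial, kernels)
print(total_cost(initial, kernels), total_cost(improved, kernels))
```

With the initial order, the unrelated operand `x` separates `a` from `b`, so the first kernel needs two transfers. Any order that keeps `a`, `b`, `c` contiguous in that sequence lets each kernel use a single merged transfer.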


“…Ernstsson and Kessler propose a solution based on skeletons to the problem of managing data locality on large clusters. This solution is based on the use of lazy evaluation to record invocations and dependences of sequences of transformations, using tiling to keep chunks of container data in the same working set, thus improving cache usage.…”

Section: In This Issue (mentioning)

confidence: 99%
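The combination of lazy evaluation and tiling mentioned in the statement above can be sketched as follows. This is a simplified illustration under stated assumptions, not the SkePU implementation: `LazyVector` and its methods are hypothetical names, only element-wise maps are supported, and Python itself will not show the cache benefit; the point is the loop order.

```python
class LazyVector:
    """Toy container that records element-wise operations instead of running them."""

    def __init__(self, data):
        self.data = list(data)
        self.pending = []           # recorded lineage of element-wise functions

    def map(self, func):
        self.pending.append(func)   # record only; nothing is computed yet
        return self

    def evaluate(self, tile_size=4):
        # Run the whole recorded lineage on one tile before moving on,
        # so each tile stays in the working set across all operations.
        for start in range(0, len(self.data), tile_size):
            stop = min(start + tile_size, len(self.data))
            for func in self.pending:
                for i in range(start, stop):
                    self.data[i] = func(self.data[i])
        self.pending.clear()
        return self.data


v = LazyVector(range(6)).map(lambda x: x + 1).map(lambda x: x * 2)
out = v.evaluate(tile_size=4)
print(out)  # [2, 4, 6, 8, 10, 12]
```

An eager version would sweep the full container once per operation. The tiled order applies every recorded operation to a chunk while it is still resident, which is the locality effect the statement refers to.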