A programming model for deterministic task parallelism

Pratikakis, Polyvios; Vandierendonck, Hans; Lyberis, Spyros; Nikolopoulos, Dimitrios S.

doi:10.1145/1988915.1988918

Cited by 21 publications

(26 citation statements)

References 30 publications


“…In recursive programs, tasks can only spawn child tasks with a subset of the privileges that they hold, i.e., tasks with pushpopdep access on a hyperqueue can pass both privileges on that hyperqueue, while tasks with either pushdep or popdep access mode can pass only the named privilege on the corresponding hyperqueue. This restriction makes it safe to apply the above rules for task scheduling separately to each procedure instance [10].…”

Section: Task Scheduling (mentioning)

confidence: 99%
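
The privilege rule in the citation statement above can be read as a subset check: a child may receive only privileges its parent already holds. The following is a minimal sketch of that reading in Python. The function name `may_spawn` and the `ACCESS_MODES` table are hypothetical, introduced only for illustration; they are not part of any hyperqueue implementation or API.

```python
# Illustrative model only: the names below are assumptions, not a real hyperqueue API.
PUSH, POP = "push", "pop"

# Access modes named in the quoted passage, mapped to the privileges they carry.
ACCESS_MODES = {
    "pushdep": frozenset({PUSH}),
    "popdep": frozenset({POP}),
    "pushpopdep": frozenset({PUSH, POP}),
}


def may_spawn(parent_mode: str, child_mode: str) -> bool:
    """Return True if a parent holding parent_mode may pass child_mode to a child.

    The child's privileges must be a subset of the parent's privileges.
    """
    return ACCESS_MODES[child_mode] <= ACCESS_MODES[parent_mode]
```

Under this model, a task with pushpopdep access may spawn children with any of the three modes, while a task with pushdep access may spawn only pushdep children.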

Deterministic scale-free pipeline parallelism with hyperqueues

Vandierendonck, Chronaki, Nikolopoulos (2013)

Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Self Cite


Ubiquitous parallel computing aims to make parallel programming accessible to a wide variety of programming areas using deterministic and scale-free programming models built on a task abstraction. However, it remains hard to reconcile these attributes with pipeline parallelism, where the number of pipeline stages is typically hard-coded in the program and defines the degree of parallelism. This paper introduces hyperqueues, a programming abstraction that enables the construction of deterministic and scale-free pipeline parallel programs. Hyperqueues extend the concept of Cilk++ hyperobjects to provide thread-local views on a shared data structure. While hyperobjects are organized around private local views, hyperqueues require shared concurrent views on the underlying data structure. We define the semantics of hyperqueues and describe their implementation in a work-stealing scheduler. We demonstrate scalable performance on pipeline-parallel PARSEC benchmarks and find that hyperqueues provide comparable or up to 30% better performance than POSIX threads and Intel's Threading Building Blocks. The latter are highly tuned to the number of available processing cores, while programs using hyperqueues are scale-free.



“…This combination of enhanced work-pushing and deferred allocation is fully automatic, application-independent, portable across NUMA machines and transparently adapts to dynamic changes at run time. These techniques require detailed information about the affinities between tasks and data, but this information is either readily available or can be obtained automatically in the run-times of recent task-parallel programming models, such as StarSs (Planas et al 2009), OpenMP 4 (OpenMP Architecture Review Board 2013), SWAN (Pratikakis et al 2011) and OpenStream (Pop and Cohen 2013), which allow the programmer to make inter-task data dependences explicit. While specifying the precise task-level dataflow rather than synchronization constraints alone requires more initial work for programmers, this effort is more than offset by the resulting enhanced performance and performance portability.…”

Section: NUMA-aware Optimizations (mentioning)

confidence: 99%

“…Shared memory programming models with fine-grained concurrency have successfully harnessed the computational resources of such architectures (Blumofe et al 1995; Pratikakis et al 2011; OpenMP Architecture Review Board 2013; Planas et al …”

Section: Introduction (mentioning)

confidence: 99%

NUMA-aware scheduling and memory allocation for data-flow task-parallel applications

Drebes, Pop, Heydemann et al. (2016)

Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming


Dynamic task parallelism is a popular programming model on shared-memory systems. Compared to data parallel loop-based concurrency, it promises enhanced scalability, load balancing and locality. These promises, however, are undermined by non-uniform memory access (NUMA) systems. We show that it is possible to preserve the uniform hardware abstraction of contemporary task-parallel programming models, for both computing and memory resources, while achieving near-optimal data locality. Our run-time algorithms for NUMA-aware task and data placement are fully automatic, application-independent, performance-portable across NUMA machines, and adapt to dynamic changes. Placement decisions use information about inter-task data dependences and reuse. This information is readily available in the run-time systems of modern task-parallel programming frameworks, and from the operating system regarding the placement of previously allocated memory. Our algorithms take advantage of data-flow style task parallelism, where the privatization of task data enhances scalability through the elimination of false dependences and enables fine-grained dynamic control over the placement of application data. We demonstrate that the benefits of dynamically managing data placement outweigh the privatization cost, even when comparing with target-specific optimizations through static, NUMA-aware data interleaving. Our implementation and the experimental evaluation on a set of high-performance benchmarks executing on a 192-core system with 24 NUMA nodes show that the fraction of local memory accesses can be increased to more than 99%, resulting in a speedup of up to 5× compared to a NUMA-aware hierarchical work-stealing baseline.


“…Each dynamic procedure instance may have a task graph that restricts the execution order of its children. This restriction allows us to prove that all parallel executions compute the same value as the sequential elision of the program [13]. Note that the sequential elision of the program always respects the dependencies in the program: by deducing dependencies from input/output properties, there can never be backward dependencies in the sequential elision.…”

Section: Programming Model (mentioning)

confidence: 99%
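
The last citation statement notes that dependencies deduced from input/output properties can never point backward in the sequential elision. A minimal sketch of why, assuming tasks are listed in program order and each declares the sets of names it reads and writes: every deduced edge runs from an earlier task to a later one, so program order is always a valid execution order. The function `deduce_dependencies` and the tuple layout are hypothetical and are not taken from the cited papers.

```python
def deduce_dependencies(tasks):
    """Deduce ordering edges from declared reads and writes.

    tasks: list of (name, reads, writes) tuples in program order, where
    reads and writes are sets of data names (an assumed representation).
    Returns a set of (earlier_index, later_index) pairs.
    """
    edges = set()
    for j, (_, reads_j, writes_j) in enumerate(tasks):
        # Only earlier tasks are examined, so every edge points forward.
        for i in range(j):
            _, reads_i, writes_i = tasks[i]
            read_after_write = writes_i & reads_j
            write_after_write = writes_i & writes_j
            write_after_read = reads_i & writes_j
            if read_after_write or write_after_write or write_after_read:
                edges.add((i, j))
    return edges


# Hypothetical example program: four tasks in program order.
example_tasks = [
    ("produce", set(), {"x"}),
    ("transform", {"x"}, {"y"}),
    ("independent", set(), {"z"}),
    ("consume", {"y", "z"}, set()),
]
example_edges = deduce_dependencies(example_tasks)
```

In this example, "independent" shares no data with "produce" or "transform", so it is free to run in parallel with them, while "consume" must wait for both "transform" and "independent".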