A Transformation Framework for Optimizing Task-Parallel Programs

Nandivada, V. Krishna; Shirako, Jun; Zhao, Jisheng; Sarkar, Vivek

doi:10.1145/2450136.2450138

Cited by 31 publications

(19 citation statements)

References 45 publications


“…These HJ ports are not new to this paper; they have also been used in earlier performance evaluation, e.g. [3] and [26]. Also, the HJ versions of these benchmarks are fundamentally the same as the OpenMP versions; the primary change (in addition to translating C code to Java code) is that the OpenMP 3.0 task, taskwait and critical directives were replaced by async, finish and isolated statements in HJ, respectively.…”

Section: Methods (mentioning)

confidence: 99%

Isolation for nested task parallelism

Zhao

Lublinerman

Budimlić

et al. 2013

Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages & Applications

Self Cite


Isolation, the property that a task can access shared data without interference from other tasks, is one of the most basic concerns in parallel programming. While there is a large body of past work on isolated task-parallelism, the integration of isolation, task-parallelism, and nesting of tasks has been a difficult and unresolved challenge. In this paper, we present a programming and execution model called Otello where isolation is extended to arbitrarily nested parallel tasks with irregular accesses to heap data. At the same time, no additional burden is imposed on the programmer, who only exposes parallelism by creating and synchronizing parallel tasks, leaving the job of ensuring isolation to the underlying compiler and runtime system. Otello extends our past work on the Aida execution model and the delegated isolation mechanism [22] to the setting of nested parallelism. The basic runtime construct in Aida and Otello is an assembly: a task equipped with a region in the shared heap that it owns. When an assembly A conflicts with an assembly B, A transfers (or delegates) its code and owned region to a carefully selected assembly C in a way that will ensure isolation with B, leaving the responsibility of re-executing task A to C. The choice of C depends on the nesting relationship between A and B. We have implemented Otello on top of the Habanero Java (HJ) parallel programming language [8], and used this implementation to evaluate Otello on collections of nested task-parallel benchmarks and non-nested transactional benchmarks from past work. On the nested task-parallel bench-, and the relative overhead of Otello is lower than that of many published data-race detection algorithms that detect the isolation violations (but do not enforce isolation). For the transactional benchmarks, Otello incurs lower overhead than a state-of-the-art software transactional memory system (Deuce STM).
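The citation statement above says the OpenMP 3.0 task, taskwait and critical directives were replaced by the HJ statements async, finish and isolated. As a rough illustration of how those three roles fit together, here is a minimal sketch in plain Python rather than HJ or OpenMP. The function name `parallel_sum` and its parameters are hypothetical, and the mapping is only a loose analogy: a thread pool and a lock do not reproduce HJ's semantics (for example, `finish` also waits for transitively spawned tasks).

```python
from concurrent.futures import ThreadPoolExecutor
import threading


def parallel_sum(values, workers=4):
    """Sum `values` with one spawned task per element (illustrative only)."""
    total = 0
    lock = threading.Lock()  # stands in for `isolated` / `critical`

    def add(v):
        nonlocal total
        with lock:  # only one task at a time updates the shared total
            total += v

    # Leaving the `with` block waits for every submitted task to complete,
    # which plays the role of `finish` / `taskwait`.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for v in values:
            pool.submit(add, v)  # plays the role of `async` / `task`

    return total
```

Calling `parallel_sum(range(1000))` spawns 1000 small tasks and returns only after all of them have updated the shared total.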


“…Such clock objects are replaced with instantiations of specialized clock classes that take advantage of the above properties. Nandivada et al presented techniques to reduce the overheads of X10 clock (and HJ phaser) operations by chunking parallel loops with synchronization operations. Feautrier et al proposed a technique to transform code written using clocks‐async‐finish abstractions to code that does not use clocks.…”

Section: Related Work (mentioning)

confidence: 99%

Efficient lock‐step synchronization in task‐parallel languages

Utture

Nandivada

2019

Softw Pract Exp


Summary: Many modern task‐parallel languages allow the programmer to synchronize tasks using high‐level constructs like barriers, clocks, and phasers. While these high‐level synchronization primitives help the programmer express the program logic in a convenient manner, they also have their associated overheads. In this paper, we identify the sources of some of these overheads for task‐parallel languages like X10 that support lock‐step synchronization, and propose a mechanism to reduce these overheads. We first propose three desirable properties that an efficient runtime (for task‐parallel languages like X10, HJ, Chapel, and so on) should satisfy, to minimize the overheads during lock‐step synchronization. We use these properties to derive a scheme called uClocks to improve the efficiency of X10 clocks; uClocks consists of an extension to X10 clocks and two related runtime optimizations. We prove that uClocks satisfies the proposed desirable properties. We have implemented uClocks for the X10 language+runtime and show that the resulting system leads to a geometric mean speedup of 5.36× on a 16‐core Intel system and 11.39× on a 64‐core AMD system, for benchmarks with a significant number of synchronization operations.


“…Parallelization of place-change operations. In Figure 12 The dependencies among S1, S2, and E1 are computed using standard techniques [20]. Interestingly, say, "p" is the number of places and "k" is the size of Distribution D, then the number of remote-communications performed by the code compiled using the synchronization-elimination and place-level strip-mining techniques of Barik et al [6] are "2k" and "p + k," respectively.…”

Section: At-pruning: Reducing the Overheads Of Place Change Operations (mentioning)

confidence: 99%

Untitled

2019

TACO


X10 is a partitioned global address space programming language that supports the notion of places; a place consists of some data and some lightweight tasks called activities. Each activity runs at a place and may invoke a place-change operation (using the at-construct) to synchronously perform some computation at another place. These place-change operations can be very expensive, as they need to copy all the required data from the current place to the remote place. However, identifying the necessary number of place-change operations and the required data during each place-change operation are non-trivial tasks, especially in the context of irregular applications (like graph applications) that contain complex code with large amounts of cross-referencing objects; not all of those objects may actually be required at the remote place. In this article, we present AT-Com, a scheme to optimize X10 code with place-change operations. AT-Com consists of two interrelated new optimizations: (i) AT-Opt, which minimizes the amount of data serialized and communicated during place-change operations, and (ii) AT-Pruning, which identifies/elides redundant place-change operations and does parallel execution of place-change operations. AT-Opt uses a novel abstraction, called abstract-place-tree, to capture place-change operations in the program. For each place-change operation, AT-Opt uses a novel inter-procedural analysis to precisely identify the data required at the remote place in terms of the variables in the current scope. AT-Opt then emits the appropriate code to copy the identified data-items to the remote place. AT-Pruning introduces a set of program transformation techniques to emit optimized code such that it avoids the redundant place-change operations. We have implemented AT-Com in the x10v2.6.0 compiler and tested it over the IMSuite benchmark kernels. Compared to the current X10 compiler, the AT-Com optimized code achieved a geometric mean speedup of 18.72× and 17.83× on a four-node (32 cores per node) Intel and two-node (16 cores per node) AMD system, respectively. CCS Concepts: • Computing methodologies → Parallel programming languages; Distributed programming languages; • Software and its engineering → Compilers;
