2007 IEEE International Parallel and Distributed Processing Symposium
DOI: 10.1109/ipdps.2007.370581
Dynamic Load Balancing of Unbalanced Computations Using Message Passing

Abstract: This paper examines MPI's ability to support continuous, dynamic load balancing for unbalanced parallel applications. We use an unbalanced tree search benchmark (UTS) to compare two approaches: 1) work sharing using a centralized work queue, and 2) work stealing using explicit polling to handle steal requests. Experiments indicate that in addition to a parameter defining the granularity of load balancing, message-passing paradigms require additional parameters such as polling intervals to manage runtime overhead…
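The work-stealing approach the abstract describes can be illustrated with a minimal sketch. This is not the paper's MPI implementation: the worker, mailbox, and `polling_interval` names below are hypothetical, and plain Python queues stand in for MPI message passing. The key idea it demonstrates is the polling-interval parameter: a busy worker only checks for steal requests every N tasks, amortizing the cost of serving idle peers.

```python
import queue

# Work stealing with explicit polling: each worker owns a local work list
# and, every `polling_interval` tasks, checks a mailbox for steal requests
# from idle peers. In the paper's setting these would be MPI ranks
# exchanging messages; here queues stand in for the message layer.

class Worker:
    def __init__(self, wid, polling_interval=8):
        self.wid = wid
        self.local = []                  # local work list (deque-like)
        self.mailbox = queue.Queue()     # incoming steal requests
        self.polling_interval = polling_interval
        self.done = 0                    # tasks processed locally

    def serve_steals(self):
        # Answer every pending steal request with half of the local work
        # (an empty list if there is nothing to spare).
        while not self.mailbox.empty():
            reply_box = self.mailbox.get()
            half = len(self.local) // 2
            reply_box.put(self.local[:half])   # send a copy of the front half
            del self.local[:half]

    def run(self, process_task):
        since_poll = 0
        while self.local:
            process_task(self.local.pop())
            self.done += 1
            since_poll += 1
            if since_poll >= self.polling_interval:
                self.serve_steals()   # polling interval bounds this overhead
                since_poll = 0
        self.serve_steals()           # drain any last requests before idling

def steal(victim):
    # Receiver-initiated: an idle worker posts a request and waits for the reply.
    reply_box = queue.Queue()
    victim.mailbox.put(reply_box)
    return reply_box.get()
```

A larger `polling_interval` lowers per-task overhead but delays responses to idle workers, which is exactly the trade-off the abstract highlights.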

Cited by 55 publications (45 citation statements)
References 15 publications (13 reference statements)
“…Parallel implementation of the search requires continuous dynamic load balancing to keep all processors engaged in the search. Our implementation achieves better scaling and parallel efficiency in both shared memory and distributed memory settings than previous efforts using UPC [1] and MPI [2]. We observe parallel efficiency of 80% using 1024 processors performing over 85,000 total load balancing operations per second continuously.…”
Section: Introduction
confidence: 78%
“…The contributions are streamlined termination detection, rapid diffusion of work, and an asynchronous request-response protocol for work stealing that minimizes overheads to threads performing useful work. This last contribution was inspired by an MPI implementation of UTS [2], but exploits UPC's one-sided communication operations.…”
Section: Introduction
confidence: 99%
“…Follow-up work refines these comparisons by considering delays in the system [33] and different job scheduling policies [13]. More recently, Dinan et al. [14] compare work stealing (receiver-initiated) and work sharing (sender-initiated) implemented on top of the MPI message-passing interface, using the unbalanced tree search benchmark. These papers find that both algorithms perform quite well, with no clear winner, and that specifics such as delays, system load, and job scheduling and preemption policies can make one preferable over the other.…”
Section: Related Work
confidence: 99%
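The sender-initiated side of the contrast drawn above, work sharing through a centralized queue, can be sketched just as briefly. Again this is an illustration only, not the paper's MPI code; the `RELEASE_THRESHOLD` name and helper functions are hypothetical, with a Python queue standing in for the central queue manager.

```python
import queue

# Work sharing with a centralized queue: a worker holding more than
# RELEASE_THRESHOLD local tasks pushes the surplus to the shared queue
# (sender-initiated); an idle worker pulls from it instead of asking a
# specific peer, as a thief would in work stealing.

shared_queue = queue.Queue()
RELEASE_THRESHOLD = 4   # granularity knob, analogous to the paper's chunk size

def share_surplus(local):
    # A busy worker releases everything beyond the threshold.
    while len(local) > RELEASE_THRESHOLD:
        shared_queue.put(local.pop())

def acquire(local):
    # An idle worker refills from the central queue; returns False when
    # no shared work is available (a global termination hint).
    try:
        local.append(shared_queue.get_nowait())
        return True
    except queue.Empty:
        return False
```

The structural difference is visible even at this scale: the central queue is a single point of contention that every worker touches, whereas in work stealing communication happens only between a thief and its victim.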
“…For our scaling experiment we chose the Unbalanced Tree Search (UTS) application [99,100]. The benchmark contains a reference implementation using MPI in the publicly available version [101].…”
Section: UTS Case Study
confidence: 99%
“…The reference MPI implementation of the benchmark, used as the baseline for creating the HCMPI version, performed parallel search using multiple MPI processes, and load balancing using inter-process work-sharing or work-stealing algorithms. In our experiments we have focused on the work-stealing version due to better scalability [100]. We scale our experiment up to 16,384 cores on the Jaguar supercomputer.…”
Section: UTS Case Study
confidence: 99%