Dynamic Load Balancing Using Work-Stealing

Cederman, Daniel; Tsigas, Philippas

doi:10.1016/b978-0-12-385963-1.00035-6

Cited by 15 publications

(13 citation statements)

References 1 publication

“…Our implementation of work-stealing follows Cederman and Tsigas [5], who give an implementation for GPUs of an algorithm due to Arora et al [3]. The implementation is based around a collection of double-ended queues (deques), one per OpenCL work-item in this work.…”

Section: Adding Work-stealing (mentioning)

confidence: 99%

“…execution unit ('work-item') maintains its own task queue, but can steal from another's queue should its own become empty. We present an implementation of work-stealing that builds on an implementation for GPUs, by Cederman and Tsigas [5], of an algorithm due to Arora et al [3]. It is written in OpenCL (a multi-threaded extension of C for programming heterogeneous systems of CPUs, GPUs, and FPGAs [13]) and automatically compiled to hardware using Altera's software development kit for OpenCL (AOCL) [2].…”

Section: Introduction (mentioning)

confidence: 99%

A Case for Work-stealing on FPGAs with OpenCL Atomics

Ramanathan, Wickerson, Winterstein, et al. 2016

Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

We provide a case study of work-stealing, a popular method for run-time load balancing, on FPGAs. Following the Cederman-Tsigas implementation for GPUs, we synchronize work-items not with locks, mutexes or critical sections, but instead with the atomic operations provided by Altera's OpenCL SDK. We evaluate work-stealing for FPGAs by synthesizing a K-means clustering algorithm on an Altera P385 D5 board, both with work-stealing and with a statically-partitioned load. When block RAM utilization is maximized in both cases, we find that work-stealing leads to a 1.5× speedup. This demonstrates that the ability to do load balancing at run-time can outweigh the drawback of using 'expensive' atomics on FPGAs. We hope that our case study will stimulate further research into the high-level synthesis of fine-grained, lock-free, concurrent programs.
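The abstract above compares a statically partitioned load with run-time work-stealing. The following is a minimal single-threaded sketch of why that comparison can favor stealing when task costs are uneven. It is not the paper's K-means experiment: the helper names (`partition`, `static_makespan`, `stealing_makespan`), the task costs, and the victim-selection rule (steal from the fullest queue) are all assumptions made for illustration, and the cost of a steal itself is ignored.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical helper: split task costs into contiguous blocks, one per worker.
// Assumes workers >= 1.
std::vector<std::deque<int>> partition(const std::vector<int>& costs,
                                       std::size_t workers) {
    std::vector<std::deque<int>> queues(workers);
    std::size_t per = (costs.size() + workers - 1) / workers;  // ceiling division
    for (std::size_t i = 0; i < costs.size(); ++i)
        queues[i / per].push_back(costs[i]);
    return queues;
}

// Static partitioning: each worker runs only its own block,
// so the finish time is the largest block total.
int static_makespan(const std::vector<int>& costs, std::size_t workers) {
    int worst = 0;
    for (const std::deque<int>& q : partition(costs, workers)) {
        int total = 0;
        for (int c : q) total += c;
        worst = std::max(worst, total);
    }
    return worst;
}

// Work-stealing (simulated): the worker that becomes idle first takes the
// next task from the bottom of its own queue, or, if that queue is empty,
// from the top of the fullest other queue. Steal overhead is not modeled.
int stealing_makespan(const std::vector<int>& costs, std::size_t workers) {
    std::vector<std::deque<int>> queues = partition(costs, workers);
    std::vector<int> clock(workers, 0);  // time at which each worker is next idle
    std::size_t remaining = costs.size();
    while (remaining > 0) {
        std::size_t w = static_cast<std::size_t>(
            std::min_element(clock.begin(), clock.end()) - clock.begin());
        int task = 0;
        if (!queues[w].empty()) {
            task = queues[w].back();   // owner end ("bottom")
            queues[w].pop_back();
        } else {
            // remaining > 0 guarantees that the fullest queue is non-empty.
            auto victim = std::max_element(
                queues.begin(), queues.end(),
                [](const std::deque<int>& a, const std::deque<int>& b) {
                    return a.size() < b.size();
                });
            task = victim->front();    // thief end ("top")
            victim->pop_front();
        }
        clock[w] += task;
        --remaining;
    }
    return *std::max_element(clock.begin(), clock.end());
}
```

With the illustrative costs `{5, 5, 5, 5, 1, 1, 1, 1}` on two workers, `static_makespan` returns 20 and `stealing_makespan` returns 14 in this model. The numbers say nothing about the 1.5× speedup reported in the abstract, which was measured on real hardware where atomics have a cost.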

“…We developed an efficient LBM algorithm which uses MPI+GPU-based cluster structure model, and combining the advantages of both static and dynamic load balance in the LBM (Willebeek-LeMair and Reeves 1993; Hui and Chanson 1999; Arora et al 2001; Tzeng et al 2010; Cederman and Tsigas 2012). Therefore, we refer to (Colajanni et al 1998; Pai et al 1998; Srisuresh and Gan 1998; Bunt et al 1999; Cardellini et al 2002; Padhy and Rao 2011) research papers, and design LBM operating mechanism based on Queuing Theory, which is divided into MPI strategy mechanism and GPU strategy mechanism.…”

Section: Hybrid MPI/CUDA Program With LBM (mentioning)

confidence: 99%

SAR Image Simulations Using the LBM Algorithm on MPI-GPU

Sun, Tsai, Chiang 2016

Terr. Atmos. Ocean. Sci.

Synthetic Aperture Radar (SAR) is a powerful tool for studying natural environments under all-weather and day-and-night conditions. SAR system design and data-processing algorithm simulation is noted for its controllable parameters. The satellite SAR echo signal simulation framework has been successfully applied to target recognition based on Radarsat-2 and TerraSAR-X images and in strip map mode. However, such SAR image simulation works only on CPU or GPU (graphics processing units) and requires huge calculations. We developed a "Load-Balancing Model (LBM)" algorithm that uses Message Passing Interface GPU (MPI-GPU) to reduce the inner loop load and improve the computational performance. The LBM algorithm uses MPI-GPU technology to build a simple GPU cluster system. The LBM algorithm is used to separate the intensive computing and controlling tasks for each node, and exploit the contemporary GPU computation capability to accelerate the computing tasks. We conducted a relevant experiment on a target radar cross section (RCS) and improved the performance by a factor of > 40 compared to a 4-core CPU accelerated program.

“…Consider the tricky lock-free code in Figure 6 for stealing from a queue. This function is a part of an intricate work-stealing algorithm originally proposed by Arora et al [16] and presented in the context of GPUs by Cederman and Tsigas [17]. We have augmented the code with remote synchronization highlighted in bold italics.…”

Section: Motivating Scope Promotion (mentioning)

confidence: 99%

Synchronization Using Remote-Scope Promotion

Orr, Che, Yilmazer, et al. 2015

Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems

Heterogeneous system architecture (HSA) and OpenCL™ define scoped synchronization to facilitate low overhead communication across a subset of threads. Scoped synchronization works well for static sharing patterns, where consumer threads are known a priori. It works poorly for dynamic sharing patterns (e.g., work stealing) where programmers cannot use a faster small scope due to the rare possibility that the work is stolen by a thread in a distant slower scope. This puts programmers in a conundrum: optimize the common case by synchronizing at a faster small scope or use work stealing at a slower large scope. In this paper, we propose to extend scoped synchronization with remote-scope promotion. This allows the most frequent sharers to synchronize through a small scope. Infrequent sharers synchronize by promoting that remote small scope to a larger shared scope. Synchronization using remote-scope promotion provides performance robustness for dynamic workloads, where the benefits provided by scoped synchronization and work stealing are hard to anticipate. Compared to a naïve baseline, static scoped synchronization alone achieves a 1.07x speedup on average and dynamic work stealing alone achieves a 1.18x speedup on average. In contrast, synchronization using remote-scope promotion achieves a robust 1.25x speedup on average, across a diverse set of graph benchmarks and inputs.
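The citation statements on this page describe lock-free stealing from per-worker double-ended queues, following the algorithm of Arora et al. As a rough illustration of that general shape, here is a bounded deque sketch in host-side C++ using `std::atomic`. It is neither the GPU code of Cederman and Tsigas nor the Figure 6 code of the citing paper. The class name `WorkDeque`, the `int` task type, the 32-bit tag/top packing, and the use of default sequentially consistent atomics (with no scoped synchronization) are assumptions made for this sketch.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bounded deque in the style of Arora et al.: the owner pushes and pops at
// the bottom, thieves take from the top with a compare-and-swap.
class WorkDeque {
public:
    explicit WorkDeque(std::size_t capacity) : tasks_(capacity) {}

    // Owner only. Returns false when the fixed-size array is full.
    bool push_bottom(int task) {
        std::uint32_t b = bot_.load();
        if (b >= tasks_.size()) return false;
        tasks_[b].store(task);
        bot_.store(b + 1);
        return true;
    }

    // Any thread. Returns false if the deque looked empty or if another
    // thread won the race for the top element.
    bool steal(int& out) {
        std::uint64_t old_age = age_.load();
        std::uint32_t b = bot_.load();
        std::uint32_t top = top_of(old_age);
        if (b <= top) return false;
        int task = tasks_[top].load();
        std::uint64_t new_age = pack(tag_of(old_age), top + 1);
        if (!age_.compare_exchange_strong(old_age, new_age)) return false;
        out = task;
        return true;
    }

    // Owner only. Returns false if the deque is empty or the last element
    // was stolen concurrently.
    bool pop_bottom(int& out) {
        std::uint32_t b = bot_.load();
        if (b == 0) return false;
        --b;
        bot_.store(b);
        int task = tasks_[b].load();
        std::uint64_t old_age = age_.load();
        if (b > top_of(old_age)) {  // more than one element was left: no conflict
            out = task;
            return true;
        }
        // At most one element was left: reset to empty and bump the tag so
        // that a thief holding a stale view of `age_` fails its CAS.
        bot_.store(0);
        std::uint64_t new_age = pack(tag_of(old_age) + 1, 0);
        if (b == top_of(old_age) &&
            age_.compare_exchange_strong(old_age, new_age)) {
            out = task;
            return true;
        }
        age_.store(new_age);
        return false;
    }

private:
    // `age_` packs a tag (high 32 bits) and the top index (low 32 bits).
    static std::uint64_t pack(std::uint32_t tag, std::uint32_t top) {
        return (static_cast<std::uint64_t>(tag) << 32) | top;
    }
    static std::uint32_t top_of(std::uint64_t age) {
        return static_cast<std::uint32_t>(age);
    }
    static std::uint32_t tag_of(std::uint64_t age) {
        return static_cast<std::uint32_t>(age >> 32);
    }

    std::vector<std::atomic<int>> tasks_;
    std::atomic<std::uint32_t> bot_{0};
    std::atomic<std::uint64_t> age_{0};
};
```

In this sketch, array slots are reclaimed only when the owner empties the deque and resets both indices, so `push_bottom` can report a full deque even after thieves have removed elements from the top. A real GPU or FPGA version would also have to choose memory scopes and orderings for each atomic, which is the concern the remote-scope promotion abstract above addresses.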
