Locality-Aware Work Stealing Based on Online Profiling and Auto-Tuning for Multisocket Multicore Architectures

Chen, Quan; Guo, Minyi

doi:10.1145/2766450

Cited by 7 publications

(15 citation statements)

References 28 publications

“…Targeting scheduling systems for task-based programs, a large amount of prior work aims to improve energy-efficiency [38,41], to improve data locality [9,10], or to reduce scheduling overhead [17,29]. However, with the increasing bandwidth requirements of computing tasks, many papers have also conducted related research for efficient bandwidth usage.…”

Section: Related Work (mentioning)

confidence: 99%

“…Many task-stealing schedulers have been proposed to improve data locality by reducing shared cache misses [10,11] and increasing local memory accesses [9,25,40,46]. Based on Charm++ [19], NUMALB [32] is proposed to balance the workload while avoiding unnecessary migrations and reducing cross-core communication.…”

Section: Related Work (mentioning)

confidence: 99%

“…However, users have to give the level manually in HWS or provide a number of command line arguments for the scheduler to calculate the boundary level in CAB. To relieve the above burden, CATS [10] was proposed to divide task graph based on the online information, without extra user-provided information. These techniques assume that the data accessed by a task is known by analyzing the task graph, which is not always true in real-system applications.…”

Section: Related Work (mentioning)

confidence: 99%

“…Those efforts are orthogonal to BATS. [Table header residue: [10], HWS [34], DFA [15], LAWS [9], RELWS [25], HPT [52], Jenga [45].] There is also prior work on improving the performance of other schedulers, such as OpenMP. Olivier et al. [30] proposed a hierarchical scheduling strategy that uses one thread to steal work on behalf of all of the threads in a chip.…”

Section: Related Work (mentioning)

confidence: 99%

Bandwidth and Locality Aware Task-stealing for Manycore Architectures with Bandwidth-Asymmetric Memory

Zhao, Chen, Qiu, et al. 2018

ACM Trans. Archit. Code Optim.

Self Cite

Parallel computers are starting to adopt Bandwidth-Asymmetric Memory architectures that combine traditional DRAM with new High Bandwidth Memory (HBM) for high memory bandwidth. However, existing task schedulers suffer from low bandwidth usage and poor data locality on bandwidth-asymmetric memory architectures. To solve these two problems, we propose a Bandwidth and Locality Aware Task-stealing (BATS) system, which consists of an HBM-aware data allocator, a bandwidth-aware traffic balancer, and a hierarchical task-stealing scheduler. Leveraging compile-time code transformation and run-time data distribution, the data allocator enables HBM usage automatically without user intervention. According to data access hotness, the traffic balancer migrates data to balance memory traffic across memory nodes in proportion to their bandwidth. The hierarchical scheduler improves data locality at runtime without a priori program knowledge. Experiments on an Intel Knights Landing server that adopts bandwidth-asymmetric memory show that BATS reduces the execution time of memory-bound programs by up to 83.5% compared with traditional task-stealing schedulers.
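The idea of balancing memory traffic across nodes in proportion to their bandwidth can be illustrated with a minimal sketch. This is not the BATS implementation: the function names, the node labels, and the bandwidth and traffic figures are all hypothetical, and the hotness-based choice of which data to migrate is not modeled.

```python
def traffic_targets(total_traffic, bandwidths):
    """Split total memory traffic across nodes in proportion to bandwidth."""
    total_bw = sum(bandwidths.values())
    return {node: total_traffic * bw / total_bw for node, bw in bandwidths.items()}


def traffic_surplus(traffic, bandwidths):
    """Per-node surplus: positive means traffic should move away from the node,
    negative means the node has room to absorb more."""
    targets = traffic_targets(sum(traffic.values()), bandwidths)
    return {node: traffic[node] - targets[node] for node in traffic}


# Hypothetical figures: HBM offers four times the bandwidth of DRAM,
# but most of the observed traffic currently hits DRAM.
bandwidths = {"dram": 90, "hbm": 360}
observed = {"dram": 300, "hbm": 200}
surplus = traffic_surplus(observed, bandwidths)
```

With these made-up numbers, DRAM carries 200 units more than its proportional share, so a balancer following this rule would move that much hot traffic toward HBM.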

“…Recently, LAWS [12] proposes a runtime library for Divide and Conquer applications in NUMA systems. It features a work stealing algorithm designed for NUMA systems, focused on reducing remote memory accesses and last-level cache pollution.…”

Section: Related Work (mentioning)

confidence: 99%

Dense Matrix Computations on NUMA Architectures with Distance-Aware Work Stealing

Alomairy, Miranda, Ltaief, et al. 2015

JSFI

We employ the dynamic runtime system OmpSs to decrease the overhead of data motion in the now ubiquitous non-uniform memory access (NUMA) high-concurrency environment of multicore processors. The dense numerical linear algebra algorithms of Cholesky factorization and symmetric matrix inversion are employed as representative benchmarks. Work stealing occurs within an innovative NUMA-aware scheduling policy to reduce data movement between NUMA nodes. The overall approach achieves separation of concerns by abstracting the complexity of the hardware from the end users so that high productivity can be achieved. Performance results on a large NUMA system show up to a twofold speedup over state-of-the-art implementations for both the Cholesky factorization and the symmetric matrix inversion, while the OmpSs-enabled code maintains strong similarity to its original sequential version.
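One common way to make work stealing reduce data movement between NUMA nodes is to have an idle worker try nearby victims first. The sketch below illustrates that general idea only; it is not the OmpSs scheduling policy, and the worker-to-node mapping, the distance matrix values, and the function name are hypothetical.

```python
import random


def victim_order(thief, worker_node, distance, rng=random):
    """Order candidate victims by NUMA distance from the thief, nearest first.

    worker_node maps worker id -> NUMA node id; distance[a][b] is the
    distance between nodes a and b (smaller means closer).
    """
    home = worker_node[thief]
    candidates = [w for w in worker_node if w != thief]
    rng.shuffle(candidates)  # randomize the order among equally distant victims
    # sorted() is stable, so the shuffled order survives within each distance tier
    return sorted(candidates, key=lambda w: distance[home][worker_node[w]])


# Hypothetical two-node machine: workers 0 and 1 on node 0, workers 2 and 3 on node 1.
worker_node = {0: 0, 1: 0, 2: 1, 3: 1}
distance = [[10, 21], [21, 10]]
order = victim_order(0, worker_node, distance)
```

Here worker 0 would first try to steal from worker 1 on its own node, and only then from workers 2 and 3 on the remote node, in random order.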

Work-Stealing for NUMA-enabled Architecture

Chen, Guo 2017

Task Scheduling for Multi-Core and Parallel Architectures
