Andi Drebes scite author profile

We present a joint scheduling and memory allocation algorithm for efficient execution of task-parallel programs on non-uniform memory architecture (NUMA) systems. Task and data placement decisions are based on a static description of the memory hierarchy and on runtime information about inter-task communication. Existing locality-aware scheduling strategies for fine-grained tasks have strong limitations: they are specific to some class of machines or applications, they do not handle task dependences, they require manual program annotations, or they rely on fragile profiling schemes. By contrast, our solution makes no assumption on the structure of programs or on the layout of data in memory. Experimental results, based on the OpenStream language, show that locality of accesses to main memory of scientific applications can be increased significantly on a 64-core machine, resulting in a speedup of up to 1.63× compared to a state-of-the-art work-stealing scheduler.

OCC: An Automated End-to-End Machine Learning Optimizing Compiler for Computing-In-Memory

Siemieniuk, Chelini, Khan, et al. 2022. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst.

Interactive visualization of cross-layer performance anomalies in dynamic task-parallel applications and systems

Drebes, Pop, Heydemann, et al. 2016

Language-Centric Performance Analysis of OpenMP Programs with Aftermath

Drebes, Bréjon, Pop, et al. 2016

Abstract. We present a new set of tools for the language-centric performance analysis and debugging of OpenMP programs that allows programmers to relate dynamic information from parallel execution to OpenMP constructs. Users can visualize execution traces, examine aggregate metrics on parallel loops and tasks, such as load imbalance or synchronization overhead, and obtain detailed information on specific events, such as the partitioning of a loop's iteration space, its distribution to workers according to the scheduling policy and fine-grain synchronization. Our work is based on the Aftermath performance analysis tool and a ready-to-use, instrumented version of the LLVM/clang OpenMP runtime with negligible overhead for tracing. By analyzing the performance of the MG application of the NPB suite, we show that language-centric performance analysis in general and our tools in particular can help improve the performance of large-scale OpenMP applications significantly.

Scalable Task Parallelism for NUMA

Drebes, Pop, Heydemann, et al. 2016

Dynamic task-parallel programming models are popular on shared-memory systems, promising enhanced scalability, load balancing and locality. These promises, however, are undermined by non-uniform memory access (NUMA). We show that using NUMA-aware task and data placement, it is possible to preserve the uniform hardware abstraction of contemporary task-parallel programming models for both computing and memory resources with high data locality. Our data placement scheme guarantees that all accesses to task output data target the local memory of the accessing core. The complementary task placement heuristic improves the locality of accesses to task input data on a best-effort basis. Our algorithms take advantage of data-flow style task parallelism, where the privatization of task data enhances scalability by eliminating false dependences and enabling fine-grained dynamic control over data placement. The algorithms are fully automatic, application-independent, performance-portable across NUMA machines, and adapt to dynamic changes. Placement decisions use information about inter-task data dependences readily available in the run-time system, and placement information from the operating system. On a 192-core system with 24 NUMA nodes, our optimizations achieve above 94% locality (fraction of local memory accesses), up to 5× better performance than NUMA-aware hierarchical work-stealing, and even 5.6× compared to static interleaved allocation. Finally, we show that state-of-the-art dynamic page migration by the operating system cannot catch up with frequent affinity changes between cores and data and thus fails to accelerate task-parallel applications.
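As a rough illustration of dependence-driven task placement and of the locality metric quoted in this abstract, here is a minimal Python sketch. It is not the authors' algorithm: the function names `pick_node` and `locality`, the input formats, and the "most input bytes wins" rule are assumptions made purely for illustration.

```python
from collections import defaultdict


def pick_node(inputs):
    """Pick the NUMA node holding the most input bytes of a task.

    `inputs` is a list of (node_id, size_in_bytes) pairs, one per input
    buffer. Hypothetical heuristic: ties go to the lowest node id.
    Returns None when the task has no inputs.
    """
    if not inputs:
        return None
    bytes_per_node = defaultdict(int)
    for node, size in inputs:
        bytes_per_node[node] += size
    # Order by most bytes first, then by lowest node id.
    return min(bytes_per_node, key=lambda n: (-bytes_per_node[n], n))


def locality(accesses):
    """Fraction of accesses whose data lives on the accessing core's node.

    `accesses` is a list of (core_node, data_node) pairs.
    """
    if not accesses:
        return 1.0
    local = sum(1 for core_node, data_node in accesses if core_node == data_node)
    return local / len(accesses)
```

With this sketch, a task whose inputs are split as 250 bytes on node 0 and 300 bytes on node 1 would be placed on node 1, and a trace with three local accesses out of four yields a locality of 0.75.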

Progressive Raising in Multi-level IR

Chelini, Drebes, Zinenko, et al. 2021

Multi-level intermediate representations (IR) show great promise for lowering the design costs for domain-specific compilers by providing a reusable, extensible, and non-opinionated framework for expressing domain-specific and high-level abstractions directly in the IR. But, while such frameworks support the progressive lowering of high-level representations to low-level IR, they do not raise in the opposite direction. Thus, the entry point into the compilation pipeline defines the highest level of abstraction for all subsequent transformations, limiting the set of applicable optimizations, in particular for general-purpose languages that are not semantically rich enough to model the required abstractions. We propose Progressive Raising, a complementary approach to the progressive lowering in multi-level IRs that raises from lower to higher-level abstractions to leverage domain-specific transformations for low-level representations. We further introduce Multi-Level Tactics, our declarative approach for progressive raising, implemented on top of the MLIR framework, and demonstrate the progressive raising from affine loop nests specified in a general-purpose language to high-level linear algebra operations. Our raising paths leverage subsequent high-level domain-specific transformations with significant performance improvements.

Index Terms: MLIR, progressive raising, multi-level intermediate representation

    /* instantiate the context */
    auto _i = m_Placeholder(), _j = m_Placeholder();
    auto _A = m_ArrayPlaceholder();
    auto matcher = m_Op(_A({2 * _i+1, _j+5}));

Listing 6: Declarative access pattern matcher.

    For(For(For(For(access_callback()))));
    auto access_callback = [&a](Body loop) {
      {
        AccessPatternContext pctx(/* MLIR ctx */);
        auto _a = m_Placeholder();
        auto _b = m_Placeholder();
        auto _c = m_Placeholder();
        auto _d = m_Placeholder();
        auto _C = m_ArrayPlaceholder();
        auto _A = m_ArrayPlaceholder();
        auto _B = m_ArrayPlaceholder();
        auto var0 = m_Op(_C({_a, _b, _c}));
        /* check the store is the last instruction in the block */
        auto var1 = m_Op(_C({_a, _b, _c}));
        auto var2 = m_Op(_A({_a, _c, _d}));
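The idea behind a declarative access pattern matcher such as the one in Listing 6 can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not the Multi-Level Tactics API: the function `match_access` and the encoding of each index as a `(coefficient, name, constant)` triple standing for `coefficient * name + constant` are hypothetical.

```python
def match_access(pattern, access):
    """Match an access pattern against a concrete array access.

    Both arguments have the form (array, [index, ...]), where each index
    is a (coefficient, name, constant) triple. Names in `pattern` are
    placeholders; names in `access` are concrete variables.
    Returns a dict mapping placeholders to concrete names, or None.
    """
    p_array, p_indices = pattern
    a_array, a_indices = access
    if len(p_indices) != len(a_indices):
        return None
    bindings = {p_array: a_array}
    for (p_coef, p_name, p_const), (a_coef, a_name, a_const) in zip(p_indices, a_indices):
        # Coefficients and constants must agree exactly.
        if (p_coef, p_const) != (a_coef, a_const):
            return None
        # A placeholder must bind to the same variable everywhere it occurs.
        if bindings.setdefault(p_name, a_name) != a_name:
            return None
    return bindings


# Pattern modelled on Listing 6: _A[2 * _i + 1][_j + 5]
pattern = ("_A", [(2, "_i", 1), (1, "_j", 5)])
result = match_access(pattern, ("X", [(2, "a", 1), (1, "b", 5)]))
```

Here `result` binds `_A` to `X`, `_i` to `a`, and `_j` to `b`; an access with a different coefficient or constant yields `None`.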

NUMA-aware scheduling and memory allocation for data-flow task-parallel applications

Drebes, Pop, Heydemann, et al. 2016

Dynamic task parallelism is a popular programming model on shared-memory systems. Compared to data-parallel loop-based concurrency, it promises enhanced scalability, load balancing and locality. These promises, however, are undermined by non-uniform memory access (NUMA) systems. We show that it is possible to preserve the uniform hardware abstraction of contemporary task-parallel programming models, for both computing and memory resources, while achieving near-optimal data locality. Our run-time algorithms for NUMA-aware task and data placement are fully automatic, application-independent, performance-portable across NUMA machines, and adapt to dynamic changes. Placement decisions use information about inter-task data dependences and reuse. This information is readily available in the run-time systems of modern task-parallel programming frameworks, and from the operating system regarding the placement of previously allocated memory. Our algorithms take advantage of data-flow style task parallelism, where the privatization of task data enhances scalability through the elimination of false dependences and enables fine-grained dynamic control over the placement of application data. We demonstrate that the benefits of dynamically managing data placement outweigh the privatization cost, even when comparing with target-specific optimizations through static, NUMA-aware data interleaving. Our implementation and the experimental evaluation on a set of high-performance benchmarks executing on a 192-core system with 24 NUMA nodes show that the fraction of local memory accesses can be increased to more than 99%, resulting in a speedup of up to 5× compared to a NUMA-aware hierarchical work-stealing baseline.

Fuse

Neill, Drebes, Pop 2017. ACM Trans. Archit. Code Optim.

Collecting hardware event counts is essential to understanding program execution behavior. Contemporary systems offer few Performance Monitoring Counters (PMCs), thus only a small fraction of hardware events can be monitored simultaneously. We present new techniques to acquire counts for all available hardware events with high accuracy by multiplexing PMCs across multiple executions of the same program, then carefully reconciling and merging the multiple profiles into a single, coherent profile. We present a new metric for assessing the similarity of statistical distributions of event counts and show that our execution profiling approach performs significantly better than Hardware Event Multiplexing.
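The basic idea of spreading events over several executions when only a few counters are available can be sketched as follows. This is a deliberately naive Python illustration, not the technique from the paper: the functions `plan_runs` and `merge_profiles` are hypothetical, and the careful reconciliation of profiles described in the abstract is not modelled (each event is assumed to be measured in exactly one run).

```python
def plan_runs(events, num_counters):
    """Split `events` into groups of at most `num_counters`, one group per run."""
    if num_counters < 1:
        raise ValueError("need at least one counter")
    return [events[i:i + num_counters] for i in range(0, len(events), num_counters)]


def merge_profiles(profiles):
    """Naively merge per-run {event: count} dicts into a single profile.

    Assumes every event appears in exactly one run; raises ValueError
    if an event was measured more than once.
    """
    merged = {}
    for profile in profiles:
        for event, count in profile.items():
            if event in merged:
                raise ValueError(f"event measured twice: {event}")
            merged[event] = count
    return merged
```

For five events and two counters, `plan_runs` yields three runs measuring two, two, and one event respectively; the per-run results can then be combined with `merge_profiles`.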

Topology-Aware and Dependence-Aware Scheduling and Memory Allocation for Task-Parallel Languages
