Abhinav Bhatele scite author profile

Exploring Traditional and Emerging Parallel Programming Models Using a Proxy Application

Abstract: Parallel machines are becoming more complex with increasing core counts and more heterogeneous architectures. However, the commonly used parallel programming models, C/C++ with MPI and/or OpenMP, make it difficult to write source code that is easily tuned for many targets. Newer language approaches attempt to ease this burden by providing optimization features such as automatic load balancing, overlap of computation and communication, message-driven execution, and implicit data layout optimizations. In this paper, we compare several implementations of LULESH, a proxy application for shock hydrodynamics, to determine strengths and weaknesses of different programming models for parallel computation. We focus on four traditional (OpenMP, MPI, MPI+OpenMP, CUDA) and four emerging (Chapel, Charm++, Liszt, Loci) programming models. In evaluating these models, we focus on programmer productivity, performance, and ease of applying optimizations.

Combing the Communication Hairball: Visualizing Parallel Execution Traces using Logical Time

Isaacs, Bremer, Jusufi, et al. 2014. IEEE Trans. Visual. Comput. Graphics

Fig. 1: Logical timeline and clustered logical timeline views from Ravel, a tool for visualizing parallel execution traces. Events are represented by boxes, colored by their wall-clock delay. The use of logical time reveals communication patterns and leverages developers' understanding of their program's structure. We use the logical time structure to cluster on any metric, which allows us to represent large-scale traces using explorable clusters while still depicting messages with full timelines for a subset of processes.

Abstract: With the continuous rise in complexity of modern supercomputers, optimizing the performance of large-scale parallel programs is becoming increasingly challenging. Simultaneously, the growth in scale magnifies the impact of even minor inefficiencies: potentially millions of compute hours and megawatts in power consumption can be wasted on avoidable mistakes or sub-optimal algorithms. This makes performance analysis and optimization critical elements in the software development process. One of the most common forms of performance analysis is to study execution traces, which record a history of per-process events and interprocess messages in a parallel application. Trace visualizations allow users to browse this event history and search for insights into the observed performance behavior. However, current visualizations are difficult to understand even for small process counts and do not scale gracefully beyond a few hundred processes. Organizing events in time leads to a virtually unintelligible conglomerate of interleaved events, and moderately high process counts overtax even the largest display. As an alternative, we present a new trace visualization approach based on transforming the event history into logical time inferred directly from happened-before relationships. This emphasizes the code's structural behavior, which is much more familiar to the application developer. The original timing data, or other information, is then encoded through color, leading to a more intuitive visualization. Furthermore, we use the discrete nature of logical timelines to cluster processes according to their local behavior, leading to a scalable visualization of even long traces on large process counts. We demonstrate our system using two case studies on large-scale parallel codes.
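As a rough illustration of what "logical time inferred from happened-before relationships" can mean, the sketch below assigns Lamport-style logical steps to the events of a small trace. It is not Ravel's algorithm, which the abstract does not specify; the trace layout, the `Event` tuple, and the function name `assign_logical_time` are all assumptions made for this example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical event record: (kind, message id).
# kind is "send", "recv", or "local"; local events carry no message id.
Event = Tuple[str, Optional[str]]


def assign_logical_time(trace: Dict[int, List[Event]]) -> Dict[int, List[int]]:
    """Assign a Lamport-style logical step to every event.

    An event's step is one more than the larger of (a) the previous
    event on the same process and (b) the matching send, when the
    event is a receive. Wall-clock timestamps are not used at all.
    """
    steps: Dict[int, List[int]] = {p: [] for p in trace}
    send_step: Dict[str, int] = {}

    progressed = True
    while progressed:
        progressed = False
        for p, events in trace.items():
            # Advance this process until it blocks on an unstamped send.
            while len(steps[p]) < len(events):
                kind, msg = events[len(steps[p])]
                prev = steps[p][-1] if steps[p] else 0
                if kind == "recv":
                    if msg not in send_step:
                        break  # matching send not stamped yet; retry later
                    step = max(prev, send_step[msg]) + 1
                else:
                    step = prev + 1
                    if kind == "send":
                        send_step[msg] = step
                steps[p].append(step)
                progressed = True

    if any(len(steps[p]) < len(trace[p]) for p in trace):
        raise ValueError("receive without a matching send, or cyclic dependency")
    return steps


if __name__ == "__main__":
    example = {
        0: [("local", None), ("send", "a"), ("recv", "b")],
        1: [("recv", "a"), ("send", "b")],
    }
    print(assign_logical_time(example))
```

In this toy trace, process 1 cannot receive message `a` before process 0 has sent it, so its first event lands after that send regardless of the recorded wall-clock times. A delay metric could then be attached to each event as a color, as the abstract describes.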

Improving communication performance in dense linear algebra via topology aware collectives

Solomonik, Bhatele, Demmel. 2011

Recent results have shown that topology aware mapping reduces network contention in communication-intensive kernels on massively parallel machines. We demonstrate that on mesh interconnects, topology aware mapping also allows for the utilization of highly efficient topology aware collectives. We map novel 2.5D dense linear algebra algorithms to exploit rectangular collectives on cuboid partitions allocated by a Blue Gene/P supercomputer. Our mappings allow the algorithms to exploit optimized line multicasts and reductions. Commonly used 2D algorithms cannot be mapped in this fashion. On 16,384 nodes (65,536 cores) of Blue Gene/P, 2.5D algorithms that exploit rectangular collectives are significantly faster than 2D matrix multiplication (MM) and LU factorization, up to 8.7x and 2.1x, respectively. These speed-ups are due to communication reduction (up to 95.6% for 2.5D MM with respect to 2D MM). We also derive novel LogP-based performance models for rectangular broadcasts and reductions. Using those, we model the performance of matrix multiplication and LU factorization on a hypothetical exascale architecture.

Massively Parallel Simulations of Spread of Infectious Diseases over Realistic Social Networks

Bhatele, Yeom, Jain, et al. 2017

Overcoming the Scalability Challenges of Epidemic Simulations on Blue Waters

Yeom, Bhatele, Bisset, et al. 2014

Visualizing Hierarchical Performance Profiles of Parallel Codes Using CallFlow

Nguyen, Bhatele, Jain, et al. 2021. IEEE Trans. Visual. Comput. Graphics

Auto-tuning Parameter Choices in HPC Applications using Bayesian Optimization

Menon, Bhatele, Gamblin. 2020

Characterizing Parallel Scientific Applications on Commodity Clusters: An Empirical Study of a Tapered Fat-Tree

León, Karlin, Bhatele, et al. 2016

Abhinav Bhatele

Exploring Traditional and Emerging Parallel Programming Models Using a Proxy Application
