2010
DOI: 10.1007/s11227-010-0418-y
Designing parallel loop self-scheduling schemes using the hybrid MPI and OpenMP programming model for multi-core grid systems

Abstract: Loop scheduling on parallel and distributed systems has been thoroughly investigated in the past. However, none of these studies considered the multi-core architecture feature of emerging grid systems. Although many studies have proposed employing the hybrid MPI and OpenMP programming model to exploit different levels of parallelism in a distributed system with multi-core computers, none of them were aimed at parallel loop self-scheduling. Therefore, this paper investigates how to employ the hybrid…

Cited by 12 publications (12 citation statements). References 22 publications.
“…SMV is an application kernel from Sparse Linear Algebra that performs a multiplication between a sparse matrix and a dense vector. Besides finding applications in several scientific and engineering domains, sparse matrix-vector multiplication is a frequently studied application kernel within the context of loop scheduling. We extracted the SMV kernel from the Conjugate Gradient application from the NAS Parallel Benchmarks (NPB).…”
Section: Evaluation Methodology (mentioning)
confidence: 99%
“…Besides finding applications in several scientific and engineering domains [24], sparse matrix-vector multiplication is a frequently studied application kernel within the context of loop scheduling [25,26]. We extracted the SMV kernel from the Conjugate Gradient application from the NAS Parallel Benchmarks (NPB) [27]. In the SMV kernel, the sparse matrix is stored in compressed row format so that memory can be saved and data affinity exploited.…”
Section: Application Kernels (mentioning)
confidence: 99%
“…A common approach is to use MPI [1] for inter-node communication and OpenMP [4] for programming the shared-memory systems [6]. In the context of DLS techniques, hierarchical loop scheduling (HLS) [14] was one of the earliest efforts to use the MPI+OpenMP programming model. In HLS, a free worker (MPI process) requests a chunk from the master rank, which calculates and assigns the chunk based on a certain performance function [30].…”
Section: Background and Related Work (mentioning)
confidence: 99%
“…The OpenMP threads (workers) require synchronization before requesting and executing chunks; i.e., only the main thread is allowed to call MPI communication functions, such as MPI_Send and MPI_Recv [14]. Otherwise, a more complex implementation is needed to allow individual OpenMP threads to perform MPI calls.…”
Section: Introduction (mentioning)
confidence: 99%
“…In [17-19], new results are presented for loops with dependencies. Recent research results [20,21] have been reported on designing loop self-scheduling methods for grids. In [10,22,23], the heterogeneity of different cluster systems was considered in order to achieve better load balancing.…”
Section: Related Work (mentioning)
confidence: 99%