1995
DOI: 10.1109/71.342126
CCL: a portable and tunable collective communication library for scalable parallel computers

Cited by 68 publications (41 citation statements)
References 22 publications
“…The better scalability may be due to various reasons, including larger memory and more efficient all-to-all communication subroutines available on the SP2. Interested readers may refer to [14] for more information on all-to-all communications. The emphasis here is that when an algorithm is not ideally scalable, its scalability does vary with machine parameters.…”
Section: Results (mentioning)
confidence: 99%
“…The listed communication cost of the PPT algorithm is based on a square 2-D torus with p processors (i.e., 2-D mesh, wraparound, square) [13]. If a hypercube topology or a multistage Omega network is assumed, the communication cost would be log(p)·r + 12(p − 1)·b and log(p)·r + 8(p − 1)·n₁·b for single systems and systems with multiple right sides, respectively [12,14].…”
Section: Fig. 2, An Alternative Range Comparison Algorithm (mentioning)
confidence: 99%
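The cost expressions quoted above can be sketched numerically. This is a minimal illustration assuming the usual latency/bandwidth reading of such models: r is the per-message startup cost, b the per-element transfer cost, p the processor count, and n1 the number of right-hand sides; these interpretations are inferred from context, not taken from the cited papers.

```python
import math

def cost_single_system(p, r, b):
    """Assumed hypercube/Omega cost for a single system: log(p)*r + 12*(p-1)*b."""
    return math.log2(p) * r + 12 * (p - 1) * b

def cost_multiple_rhs(p, r, b, n1):
    """Assumed cost with n1 right-hand sides: log(p)*r + 8*(p-1)*n1*b."""
    return math.log2(p) * r + 8 * (p - 1) * n1 * b

# Example: 16 processors, startup 10, per-element cost 0.5, 4 right sides.
print(cost_single_system(16, 10.0, 0.5))     # 40 + 90 = 130.0
print(cost_multiple_rhs(16, 10.0, 0.5, 4))   # 40 + 240 = 280.0
```

The log(p) term counts startup latencies along the hypercube dimensions, while the (p − 1) terms count per-element traffic, so startup dominates for small messages and bandwidth for large ones.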
“…Note that when f = 6, a parent node receives 5 messages, so the reception costs accumulate to exactly balance the message latency, 5 · r = L. As computation cost increases, the best degree decreases. It is interesting to consider the range [1,2]. Values of f smaller than 2 do not produce meaningful f -nomial trees.…”
Section: Modeling f-Nomial Trees (mentioning)
confidence: 99%
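The balance condition in the quote above can be illustrated with a toy model: in an f-nomial tree a parent receives f − 1 messages per round, so reception cost (f − 1)·r balances the overlapped message latency L when f = L/r + 1. This is a simplified sketch of that trade-off, not the cited paper's exact cost function.

```python
def balanced_degree(L, r):
    """Degree f at which accumulated reception cost (f-1)*r equals latency L.

    Assumes r divides L evenly, as in the quoted example (5 * r = L).
    """
    return int(L / r) + 1

# Five receptions of cost 10 exactly balance a latency of 50, giving f = 6.
print(balanced_degree(50, 10))  # 6
```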
“…Reduction collectives entail both communication (data transfer) and processing (data reduction operations), and therefore efficient implementations must consider the characteristics of the network, the processor, and the interactions between them. Over the years, many researchers have dedicated significant effort to derive optimal and scalable algorithms [1,2,3,4,5,8]. However, with respect to the underlying system characteristics, all of this work commonly assumed reduction processing must be performed by the host CPU.…”
Section: Introduction (mentioning)
confidence: 99%
“…Early work on collective communication implements the reduction operation as an inverse broadcast and does not try to optimize the protocols based on different buffer sizes [1]. Other work already handles allreduce as a combination of basic routines; e.g., [2] already proposed the combine-to-all (allreduce) as a combination of distributed combine (reduce-scatter) and collect (allgather).…”
Section: Introduction And Related Work (mentioning)
confidence: 99%
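The decomposition mentioned in the last statement, allreduce as a reduce-scatter followed by an allgather, can be sketched over plain Python lists. This simulates p processes in one address space purely for illustration; real libraries (e.g. MPI's `MPI_Allreduce`) implement the same idea with actual message exchanges.

```python
def allreduce_sum(vectors):
    """Sum-allreduce via reduce-scatter + allgather over p equal-length vectors.

    Each inner list plays the role of one process's input buffer.
    """
    p = len(vectors)
    n = len(vectors[0])
    chunk = n // p  # assume n divisible by p for simplicity

    # Reduce-scatter: process i ends up owning the fully reduced i-th chunk.
    owned = []
    for i in range(p):
        lo, hi = i * chunk, (i + 1) * chunk
        owned.append([sum(v[j] for v in vectors) for j in range(lo, hi)])

    # Allgather: every process collects all reduced chunks.
    full = [x for part in owned for x in part]
    return [full[:] for _ in range(p)]

vecs = [[1, 2, 3, 4], [10, 20, 30, 40]]
print(allreduce_sum(vecs))  # [[11, 22, 33, 44], [11, 22, 33, 44]]
```

Splitting the reduction across processes this way keeps every link busy, which is why the two-phase decomposition scales better than reducing everything at a root and broadcasting back.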