Keeneland: Bringing Heterogeneous GPU Computing to the Computational Science Community

Vetter, Jeffrey S.; Glassbrook, Richard; Dongarra, Jack; Schwan, Karsten; Loftis, Bruce; McNally, Stephen; Meredith, Jeremy; Rogers, James H.; Roth, Philip C.; Spafford, Kyle; Yalamanchili, S.

doi:10.1109/mcse.2011.83

Cited by 98 publications

(70 citation statements)

References 12 publications


“…Keeneland has 120 compute nodes, each with dual-socket, six-core Intel X5660 2.8 GHz Westmere processors and 3 GPUs per node, with 24GB of DDR3 host memory. The interconnect is single rail, QDR Infiniband [41].…”

Section: Experiments Setup (mentioning)

confidence: 99%

On the communication complexity of 3D FFTs and its implications for Exascale

Czechowski, Battaglino, McClanahan, et al. 2012

Proceedings of the 26th ACM International Conference on Supercomputing


This paper revisits the communication complexity of large-scale 3D fast Fourier transforms (FFTs) and asks what impact trends in current architectures will have on FFT performance at exascale. We analyze both memory hierarchy traffic and network communication to derive suitable analytical models, which we calibrate against current software implementations; we then evaluate the models to make predictions about potential scaling outcomes at exascale, based on extrapolating current technology trends. Of particular interest is the performance impact of choosing high-density processors, typified today by graphics co-processors (GPUs), as the base processor for an exascale system. Among various observations, a key prediction is that although inter-node all-to-all communication is expected to be the bottleneck of distributed FFTs, intra-node communication (expressed precisely in terms of the relative balance among compute capacity, memory bandwidth, and network bandwidth) will play a critical role.


“…Our evaluation in this section and the previous sections is conducted on Keeneland [11] cluster, a National Science Foundation Track2D Experimental System based on the HP SL390 powered with Nvidia Tesla M2070 GPUs in Oak Ridge National Laboratory. Each compute node in Keeneland has two Intel Xeon X5660 CPUs, 24 GB main memory, 3 GPU devices connected through 2 IO hubs; nodes are connected via single rail, QDR Infiniband.…”

Section: Discussion (mentioning)

confidence: 99%

Efficient Intranode Communication in GPU-Accelerated Systems

Aji, Dinan, et al. 2012

2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum


Current implementations of MPI are unaware of accelerator memory (i.e., GPU device memory) and require programmers to explicitly move data between memory spaces. This approach is inefficient, especially for intranode communication, where it can result in several extra copy operations. In this work, we integrate GPU-awareness into a popular MPI runtime system and develop techniques to significantly reduce the cost of intranode communication involving one or more GPUs. Experimental results show up to a 2x increase in bandwidth, resulting in an average 4.3% improvement in the total execution time of a halo exchange benchmark.


“…Our experiments with double-precision Cholesky and QR factorizations, on the heterogeneous Keeneland system [23] at the Oak Ridge National Laboratory, demonstrate great scalability from one to 100 nodes using all CPUs and GPUs. In addition, we apply our framework to the other two possible environments: clusters without GPUs, and a shared system with CPUs and multiple GPUs.…”

Section: Introduction (mentioning)

confidence: 91%

A scalable framework for heterogeneous GPU-based clusters

Song, Dongarra 2012

Proceedings of the Twenty-Fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures

Self Cite


GPU-based heterogeneous clusters continue to draw attention from vendors and HPC users due to their high energy efficiency and much improved single-node computational performance; however, little parallel software can efficiently use all CPU cores and all GPUs on a heterogeneous system. On a heterogeneous cluster, the performance of a GPU (or a compute node) increases at a much faster rate than the performance of the PCI-Express connection (or the interconnection network), so communication eventually becomes the bottleneck of the entire system. To overcome the bottleneck, we developed a multilevel partitioning and distribution method that guarantees a near-optimal communication volume. We have also extended heterogeneous tile algorithms to work on distributed-memory GPU clusters. Our main idea is to execute a serial program that generates hybrid-size tasks and to follow a dataflow programming model to fire the tasks on different compute nodes. We devised a distributed dynamic scheduling runtime system to schedule tasks and to transfer data between hybrid CPU-GPU compute nodes transparently. The runtime system employs a novel distributed task-assignment protocol to solve data dependencies between tasks without coordination between processing units. The runtime system on each node consists of a number of CPU compute threads, a number of GPU compute threads, a task generation thread, an MPI communication thread, and a CUDA communication thread. By overlapping computation and communication through dynamic scheduling, we are able to attain a high performance of 75 TFlops for Cholesky factorization on the heterogeneous Keeneland system [23] using 100 nodes, each with twelve CPU cores and three GPUs. Moreover, our framework can also deliver high performance on distributed-memory clusters without GPUs and on shared systems with multiple GPUs.
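To put the 75 TFlops figure in per-node terms, the arithmetic can be sketched as below. The total throughput, node count, cores per node, and GPUs per node come from the abstract; the peak rates (`ASSUMED_GPU_PEAK_GFLOPS`, `ASSUMED_CORE_PEAK_GFLOPS`) are illustrative assumptions that do not appear on this page, so the efficiency value is only a rough estimate, not a result reported by the authors.

```python
# Figures quoted in the abstract above.
TOTAL_TFLOPS = 75.0          # Cholesky factorization, full run
NODES = 100
CPU_CORES_PER_NODE = 12
GPUS_PER_NODE = 3

# ASSUMPTIONS (not stated on this page): nominal double-precision peaks
# for one Tesla M2070 GPU and one 2.8 GHz Westmere core (4 flops/cycle).
ASSUMED_GPU_PEAK_GFLOPS = 515.0
ASSUMED_CORE_PEAK_GFLOPS = 2.8 * 4


def per_node_gflops(total_tflops: float, nodes: int) -> float:
    """Average sustained throughput per node, in GFlops."""
    return total_tflops * 1000.0 / nodes


def node_peak_gflops(gpus: int, gpu_peak: float,
                     cores: int, core_peak: float) -> float:
    """Nominal peak of one node: all GPUs plus all CPU cores."""
    return gpus * gpu_peak + cores * core_peak


sustained = per_node_gflops(TOTAL_TFLOPS, NODES)
peak = node_peak_gflops(GPUS_PER_NODE, ASSUMED_GPU_PEAK_GFLOPS,
                        CPU_CORES_PER_NODE, ASSUMED_CORE_PEAK_GFLOPS)
efficiency = sustained / peak

print(f"sustained per node: {sustained:.0f} GFlops")
print(f"assumed peak per node: {peak:.1f} GFlops")
print(f"fraction of assumed peak: {efficiency:.1%}")
```

Only the per-node average (75 TFlops over 100 nodes) follows directly from the abstract; changing the assumed peak rates changes the efficiency estimate accordingly.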

