Petascale computing with accelerators

Kistler, Michael; Gunnels, John A.; Brokenshire, Daniel; Benton, Brad

doi:10.1145/1504176.1504212

Cited by 18 publications

(20 citation statements)

References 13 publications


“…Since the same matrix between tasks can be reused, the order of the four tasks is like T0, T1, T3, T2 by using the "bounce corner turn" [18] method. When T1 is executed, matrix A1 does not need to be transferred, neither do B2 for T3 and A2 for T2.…”

Section: Software Pipelining (mentioning)

confidence: 99%


Optimizing Linpack Benchmark on GPU-Accelerated Petascale Supercomputer

Wang, Yang, et al. 2011. J. Comput. Sci. Technol.


In this paper we present the programming of the Linpack benchmark on the TianHe-1 system, the first petascale supercomputer system of China and the largest GPU-accelerated heterogeneous system attempted to date. A hybrid programming model consisting of MPI, OpenMP and streaming computing is described to exploit the task, thread and data parallelism of Linpack. We explain how we optimized the load distribution across the CPUs and GPUs using a two-level adaptive method and describe the implementation in detail. To overcome the low bandwidth of CPU-GPU communication, we present a software pipelining technique to hide the communication overhead. Combined with other traditional optimizations, the Linpack we developed achieved 196.7 GFLOPS on a single compute element of TianHe-1. This result is 70.1% of the peak compute capability and 3.3 times faster than the result obtained with the vendor's library. On the full configuration of TianHe-1 our optimizations resulted in a Linpack performance of 0.563 PFLOPS, which made TianHe-1 the 5th fastest supercomputer on the Top500 list in November 2009.
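The task ordering quoted for this paper can be illustrated with a small transfer count. The following is only a sketch under stated assumptions: the mapping of tasks to operand blocks (T0 = A1 and B1, T1 = A1 and B2, T2 = A2 and B1, T3 = A2 and B2) is inferred from the quoted statement and is not given explicitly there, and the accelerator is assumed to keep exactly one A block and one B block resident at a time. The names TASKS and count_transfers are hypothetical.

```python
# Hypothetical mapping of tasks to the operand blocks they multiply.
# This mapping is an assumption inferred from the quoted citation statement.
TASKS = {
    "T0": ("A1", "B1"),
    "T1": ("A1", "B2"),
    "T2": ("A2", "B1"),
    "T3": ("A2", "B2"),
}


def count_transfers(order):
    """Count host-to-accelerator block transfers for a given task order.

    Assumes the accelerator holds one A block and one B block at a time,
    so a block is transferred only when it differs from the resident one.
    """
    resident_a = None
    resident_b = None
    transfers = 0
    for name in order:
        a_block, b_block = TASKS[name]
        if a_block != resident_a:
            transfers += 1
            resident_a = a_block
        if b_block != resident_b:
            transfers += 1
            resident_b = b_block
    return transfers


natural_order = count_transfers(["T0", "T1", "T2", "T3"])
bounce_order = count_transfers(["T0", "T1", "T3", "T2"])
print(natural_order, bounce_order)
```

Under these assumptions the reordered sequence moves one block fewer than the natural sequence, because every task after the first shares one operand with its predecessor.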


“…Using Cell accelerators [26], in 2008 IBM built the first heterogeneous petascale supercomputer called Roadrunner [18]. This system was very different than a GPU-accelerated system.…”

Section: Related Work (mentioning)

confidence: 99%

Optimizing Linpack Benchmark on GPU-Accelerated Petascale Supercomputer

Wang, Yang, et al. 2011. J. Comput. Sci. Technol.


“…A node in such heterogeneous clusters was typically built with multicore CPUs and a single accelerator (e.g., a GPU). Recently, more and more cluster systems have started to have multiple accelerators per node to deal with large size problems [1], [2], [3]. Multiple accelerators per node may enlarge the benefits of a heterogeneous system, especially for massively data-parallel applications.…”

Section: Introduction (mentioning)

confidence: 99%

“…Moreover, it is usually used as a yardstick of the performance of supercomputers because the TOP500 supercomputer list [10] ranks supercomputers by their performance on the LINPACK benchmark. The LINPACK benchmark requires $\frac{2}{3}n^3 + O(n^2)$ double-precision floating-point operations to solve a system of linear equations of order n. Reducing the operation count (e.g., using the Strassen algorithm for matrix multiplication) is not allowed. Under this constraint, any optimizations can be applied to the algorithm in order to achieve the best performance for the target system.…”

Section: Introduction (mentioning)

confidence: 99%

Accelerating LINPACK with MPI-OpenCL on Clusters of Multi-GPU Nodes

Nah, Lee, et al. 2015. IEEE Trans. Parallel Distrib. Syst.


OpenCL is an open standard for writing parallel applications for heterogeneous computing systems. Since its usage is restricted to a single operating system instance, programmers need to use a mix of OpenCL and MPI to program a heterogeneous cluster. In this paper, we introduce an MPI-OpenCL implementation of the LINPACK benchmark for a cluster with multi-GPU nodes. The LINPACK benchmark is one of the most widely used benchmark applications for evaluating high performance computing systems. Our implementation is based on High Performance LINPACK (HPL) and uses the blocked LU decomposition algorithm. We argue that optimizations aimed at reducing the overhead of CPUs are necessary to overcome the performance gap between the CPUs and the multiple GPUs. Our LINPACK implementation achieves 93.69 Tflops (46 percent of the theoretical peak) on the target cluster with 49 nodes, each node containing two eight-core CPUs and four GPUs.
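The arithmetic behind the figures quoted for this paper can be made concrete with a short calculation. The sketch below is illustrative only: it evaluates just the leading (2/3)n^3 term of the LINPACK operation count (the lower-order O(n^2) term is deliberately omitted because its exact coefficient is not given here), and it back-computes the theoretical peak implied by an achieved rate and an efficiency fraction. The function names are hypothetical, and the problem size used in the example is an arbitrary value, not one reported by the authors.

```python
def linpack_leading_flops(n):
    """Leading term of the LINPACK operation count, (2/3) * n**3.

    The O(n**2) term is omitted; this is an approximation for large n.
    """
    return 2.0 * n ** 3 / 3.0


def implied_peak(achieved, efficiency):
    """Peak rate implied by an achieved rate and an efficiency fraction."""
    return achieved / efficiency


# Figures taken from the abstract above: 93.69 Tflops at 46 percent of peak.
peak_tflops = implied_peak(93.69, 0.46)
print(f"implied theoretical peak: {peak_tflops:.1f} Tflops")

# Arbitrary example size (an assumption, not a reported value).
n = 100_000
seconds = linpack_leading_flops(n) / (93.69e12)
print(f"approximate solve time for n={n}: {seconds:.1f} s")
```

The implied peak is only as precise as the rounded 46 percent figure, so it should be read as an estimate.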


“…processing elements (SPEs). The CBE is currently used in scientific computing on both large [2][3][4] and small scales [5,6] due to its high floating-point throughput. The CBE allows SIMD instructions to be used without resorting to assembly language and provides a great deal of programmer control over memory management.…”

Section: Introduction (mentioning)

confidence: 99%