Double-Precision FPUs in High-Performance Computing: An Embarrassment of Riches?

Domke, Jens; Matsumura, Kazuaki; Wahib, Mohamed; Zhang, Haoyu; Yashima, Keita; Tsuchikawa, Toshiki; Tsuji, Yohei; Podobas, Artur; Matsuoka, Satoshi

doi:10.1109/ipdps.2019.00019

Cited by 15 publications

(7 citation statements)

References 42 publications


“…The HPC community has been testing Arm-based architectures for a few years now [1], [2], [3], and Supercomputer Fugaku [4] is the first large-scale system in the top-end of the TOP500 list, which demonstrates the competitiveness of Arm in a space which recently had been dominated by Intel, AMD, and Nvidia. The benefits of Arm CPUs paired with high bandwidth memory, as in the case of Fujitsu's A64FX processor [5], for the HPC field are clear: (1) Arm CPUs are highly customizable, energy efficient, and there is an existing ecosystem of software, compilers, tools, etc., which is readily available (unlike for the K computer with its SPARC CPU); and (2) most applications executed on HPC systems tend to be memory-bandwidth-bound, as we have shown in a previous study [6]. Although, a different compute-to-bandwidth ratio, as found in A64FX, might challenge this view in individual cases resulting in a greater influence by the compiler onto the performance.…”

Section: Introduction (mentioning)

confidence: 84%

“…ECP proxy-apps and RIKEN Fiber mini-apps are collections of so called proxy applications which are smaller representative codes and inputs for production applications commonly executed on supercomputers in the USA and Japan. We have studied these codes previously [6], [11], and we refer the reader to these publications for details.…”

Section: Benchmarks - From Micro To Macro Level (mentioning)

confidence: 99%


A64FX -- Your Compiler You Must Decide!

Domke¹

2021

Preprint

Self Cite


The current number one of the TOP500 list, Supercomputer Fugaku, has demonstrated that CPU-only HPC systems aren't dead and CPUs can be used for more than just being the host controller for discrete accelerators. While the specifications of the chip and overall system architecture, and benchmarks submitted to various lists, like TOP500 and Green500, etc., are clearly highlighting the potential, the proliferation of Arm into the HPC business is rather recent and hence the software stack might not yet be fully matured and tuned. We test three state-of-the-art compiler suites against a broad set of benchmarks. Our measurements show that orders of magnitude in performance can be gained by deviating from the recommended usage model of the A64FX compute nodes.


“…Although our method is limited to inner product based computations, it extends the application range of hardware with limited (or no) FP32/FP64 resources and fast low-precision processing units for general purpose workloads. Consequently, we can consider reducing the number of FP64 (or even FP32) FPUs, as discussed by Domke et al [3], by exchanging them with low-precision FPUs such as Tensor Cores. Our rationale is supported by the following situations.…”

Section: Discussion (mentioning)

confidence: 99%

DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions

Mukunoki, Ozaki, Ogita, et al. 2020

Lecture Notes in Computer Science


This paper proposes a method for implementing dense matrix multiplication on FP64 (DGEMM) and FP32 (SGEMM) using Tensor Cores on NVIDIA's graphics processing units (GPUs). Tensor Cores are special processing units that perform 4 × 4 matrix multiplications on FP16 inputs with FP32 precision, and return the result on FP32. The proposed method adopts the Ozaki scheme, an accurate matrix multiplication algorithm based on error-free transformation for matrix multiplication. The proposed method has three prominent advantages: first, it can be built upon the cublasGemmEx routine using Tensor Core operations; second, it can achieve higher accuracy than standard DGEMM, including the correctly-rounded result; third, it ensures bit-level reproducibility even for different numbers of cores and threads. The achievable performance of the method depends on the absolute-value range of each element of the input matrices. For example, when the matrices were initialized with random numbers over a dynamic range of 1E+9, our DGEMM-equivalent implementation achieved up to approximately 980 GFlops of FP64 operation on the Titan RTX GPU (with 130 TFlops on Tensor Cores), although cublasDgemm can achieve only 539 GFlops on FP64 floating-point units. Our results reveal the possibility of utilizing hardware with limited FP32/FP64 resources and fast low-precision processing units (such as AI-oriented processors) for general-purpose workloads.

Keywords: Tensor cores • FP16 • Half-precision • Low-precision • Matrix multiplication • GEMM • Linear algebra • Accuracy • Reproducibility
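The idea behind the error-free transformation mentioned in this abstract can be illustrated with a short sketch. It is not the authors' implementation: it runs in plain Python, uses ordinary doubles to stand in for FP16 inputs and FP32 products, and works on a dot product rather than a full GEMM. The names `split_slices`, `fits_fp32`, `sliced_dot`, and `exact_dot` are hypothetical. The sketch also assumes finite inputs of moderate magnitude and skips the shared per-row and per-column scaling that a matrix version would need to keep the accumulation inside each low-precision GEMM exact.

```python
import math
import struct
from fractions import Fraction

SLICE_BITS = 11  # FP16 significand width: 10 stored bits plus the implicit leading 1


def split_slices(x, bits=SLICE_BITS):
    """Split a finite float into slices with at most `bits` significant bits each.

    The slices sum back to x exactly, which makes this an error-free transformation.
    """
    slices = []
    rest = x
    while rest != 0.0:
        mant, exp = math.frexp(rest)  # rest == mant * 2**exp with 0.5 <= |mant| < 1
        # Keep only the leading `bits` bits of the remainder.
        head = math.ldexp(math.trunc(math.ldexp(mant, bits)), exp - bits)
        slices.append(head)
        rest -= head  # exact, because head is a prefix of rest's bits
    return slices


def fits_fp32(value):
    """Return True if value survives a round trip through IEEE binary32 unchanged."""
    return struct.unpack("f", struct.pack("f", value))[0] == value


def sliced_dot(xs, ys):
    """Dot product assembled only from slice-by-slice products.

    Each partial product has at most 2 * SLICE_BITS = 22 significant bits,
    so a binary32 multiplier could produce it without rounding.
    """
    partials = []
    for x, y in zip(xs, ys):
        for xi in split_slices(x):
            for yj in split_slices(y):
                p = xi * yj
                assert fits_fp32(p), "partial product would round in FP32"
                partials.append(p)
    # Assumption of this sketch: one accurate summation at the end.
    return math.fsum(partials)


def exact_dot(xs, ys):
    """Reference dot product in exact rational arithmetic."""
    return sum(Fraction(x) * Fraction(y) for x, y in zip(xs, ys))


if __name__ == "__main__":
    xs = [0.1, 2.5, -3.14159]
    ys = [1.7, -0.3, 0.25]
    print("slices of 0.1:", split_slices(0.1))
    print("sliced dot   :", sliced_dot(xs, ys))
    print("exact dot    :", float(exact_dot(xs, ys)))
```

Because every slice product is exact, all rounding is confined to the final summation. That property is what lets a scheme of this kind trade many low-precision multiplications for one high-precision result, and it also hints at why the cost grows with the dynamic range of the inputs: wider ranges need more slices.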


“…Domke et al.⁵¹ note that conventional chips allocated a large portion of silicon area to DP computing units, however, recent processors, including KNM, allocate a large portion of chip area to single/half-precision/integer units. They study the impact of this change on the performance of HPC applications, when they run on KNL, KNM, and a Broadwell CPU.…”

Section: Machine Learning (mentioning)

confidence: 99%

A survey on evaluating and optimizing performance of Intel Xeon Phi

Mittal

2020

Concurrency and Computation


Summary: Intel's Xeon Phi combines the parallel processing power of a many‐core accelerator with the programming ease of CPUs. In this paper, we present a survey of works that study the architecture of Phi and use it as an accelerator for a broad range of applications. We review performance optimization strategies as well as the factors that bottleneck the performance of Phi. We also review works that perform comparison or collaborative execution of Phi with CPUs and GPUs. This paper will be useful for researchers and developers in the area of computer‐architecture and high‐performance computing.
