2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
DOI: 10.1109/ipdpsw.2017.36
Scaling Deep Learning Workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing

Cited by 20 publications (14 citation statements)
References 11 publications
“…There are several policies governing where data is homed. A common high-performance configuration [12], which is also the one we used in our study, is the quadrant mode. Quadrant mode means that the physical cores are divided into four logical parts, where each logical part is assigned two memory controllers; each logical group is treated as a unique Non-Uniform Memory-Access (NUMA) node, allowing the operating system to perform data-locality optimizations.…”
Section: A Hardware and Software Environmentmentioning
confidence: 99%
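The excerpt above describes how the operating system sees each logical group of KNL cores as a NUMA node and can exploit data locality. On Linux, that topology can be inspected and used with `numactl`; the following is a hedged sketch (the node number and the `./train` binary are placeholders, not from the cited study, and the number of nodes the OS exposes depends on the configured cluster and MCDRAM modes):

```shell
# List the NUMA nodes the OS exposes, with their CPUs and memory sizes.
# The node count reflects the configured cluster mode (e.g. quadrant, SNC-4).
numactl --hardware

# Bind a workload's threads and memory allocations to a single node so
# accesses stay local to that group's memory controllers.
numactl --cpunodebind=0 --membind=0 ./train
```

Binding both CPUs and memory to the same node is what lets the data-locality optimizations mentioned in the excerpt take effect.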
“…Manycore machines, including KNL, are widely used for deep learning, as standalone devices or within clusters, e.g. [29], [30]. SVM training on multicore and manycore architectures was proposed by You et al [31].…”
Section: E Evaluation Of Quantized Representationmentioning
confidence: 99%
“…A heterogeneous system is composed of general-purpose CPUs and special-purpose hardware accelerators, such as GPUs, Xeon Phi, FPGAs, or TPUs. This concept covers a wide range of systems, from powerful computing nodes capable of executing teraflops [2] to integrated CPU-GPU chips [3]. This architecture not only significantly increases computing power but also improves energy efficiency.…”
Section: Introductionmentioning
confidence: 99%