2020
DOI: 10.1109/tpds.2019.2928289
Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect

Abstract: High-performance multi-GPU computing has become an inevitable trend due to the ever-increasing demand for computation capability in emerging domains such as deep learning, big data, and planet-scale simulations. However, the lack of a deep understanding of how modern GPUs can be connected, and of the real impact of state-of-the-art interconnect technology on multi-GPU application performance, has become a hurdle. In this paper, we fill the gap by conducting a thorough evaluation of the five latest types of modern GPU interconnect…

Cited by 143 publications (71 citation statements) · References 41 publications
“…Further important questions are how the hardware components are composed to avoid bottlenecks. Li et al [100] have performed a comprehensive performance evaluation of recent GPU interconnects. In terms of energy consumption, Wang et al [181] provide evaluations that compare FPGAs to GPUs.…”
Section: Infrastructure (mentioning; confidence: 99%)
“…When the filtering stage is applied at the GPU, the AllGather collective will be applied on data residing in GPU memory and not on data residing in the CPU memory (as is the case when the filtering is applied on the CPU). Applying AllGather on data residing on the GPU incurs the extra cost of moving data across the PCIe interconnect, even when the GPUDirect [36,49] Technology is enabled.…”
Section: Filtering Stage (mentioning; confidence: 99%)
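The quoted passage above turns on whether a collective operates on device-resident or host-resident buffers. As a minimal sketch (not the cited paper's code; buffer sizes and names are illustrative), the CUDA/MPI snippet below issues an AllGather directly on GPU allocations through a CUDA-aware MPI; whether the transfer is staged through host memory or moved peer-to-peer via GPUDirect depends on the MPI build and the node's interconnect topology.

```cuda
// Sketch: AllGather over device-resident buffers with a CUDA-aware MPI.
// Each rank contributes `count` floats from GPU memory; the gathered result
// also lands in GPU memory. Illustrative only -- not the cited paper's code.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const size_t count = 1 << 20;                 // per-rank element count (assumed)
    float *d_send = NULL, *d_recv = NULL;
    cudaMalloc((void **)&d_send, count * sizeof(float));
    cudaMalloc((void **)&d_recv, (size_t)nranks * count * sizeof(float));

    // A CUDA-aware MPI accepts device pointers directly; without that support,
    // the buffers would first have to be staged through host memory.
    MPI_Allgather(d_send, (int)count, MPI_FLOAT,
                  d_recv, (int)count, MPI_FLOAT, MPI_COMM_WORLD);

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}
```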
“…Commonly, in a single compute node, there are multiple GPUs (e.g. ORNL's Summit has six GPUs, LLNL's Sierra and TokyoTech's Tsubame have four GPUs), which are connected to the CPUs by PCIe or NVLink [36]. We launch a number of MPI ranks per compute node equivalent to the number of GPUs, i.e.…”
Section: D (mentioning; confidence: 99%)
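A common way to realize the "one MPI rank per GPU" mapping mentioned above is to derive a node-local rank and bind each rank to one device. The sketch below does this with an MPI-3 shared-memory communicator split; this is an assumed approach for illustration, not necessarily the cited paper's launch scheme.

```cuda
// Sketch: bind one MPI rank to one GPU within a node.
// The node-local rank comes from MPI_Comm_split_type with MPI_COMM_TYPE_SHARED;
// launcher-provided environment variables are another common option.
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int local_rank = 0;
    MPI_Comm_rank(node_comm, &local_rank);

    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    cudaSetDevice(local_rank % ngpus);            // rank i on a node drives GPU i

    printf("node-local rank %d -> GPU %d of %d\n",
           local_rank, local_rank % ngpus, ngpus);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```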
“…Data is fetched over a high-latency, low-bandwidth channel during kernel execution and page migrations incur additional overhead due to fault handling. Given that on current systems, GPU memory bandwidth is an order-of-magnitude higher than that of the CPU-GPU interconnect [33], a device-only placement policy would appear to be the natural choice when programmability is not a concern. Notwithstanding, placement decisions are more nuanced in practice for several reasons.…”
Section: Introduction (mentioning; confidence: 99%)
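The trade-off described above (demand paging over a slow CPU-GPU link versus device-resident data served at full GPU memory bandwidth) can be seen with CUDA managed memory. The sketch below is illustrative and assumes a single-GPU system; dropping the prefetch leaves the kernel's first accesses to fault and migrate pages across PCIe or NVLink.

```cuda
// Sketch: managed memory initialized on the host, then either demand-migrated
// on first GPU access or explicitly prefetched into device memory beforehand.
#include <cuda_runtime.h>

__global__ void scale(float *x, size_t n, float a) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const size_t n = 1 << 24;                     // illustrative problem size
    const int dev = 0;
    cudaSetDevice(dev);

    float *x = NULL;
    cudaMallocManaged(&x, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) x[i] = 1.0f;   // pages now resident on the host

    // Without this call, the kernel's first touches trigger page faults and
    // migrations over the CPU-GPU interconnect; with it, the data is placed in
    // device memory up front, so the kernel runs at GPU-memory bandwidth.
    cudaMemPrefetchAsync(x, n * sizeof(float), dev, 0);

    scale<<<(unsigned)((n + 255) / 256), 256>>>(x, n, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(x);
    return 0;
}
```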