2021 IEEE Symposium on High-Performance Interconnects (HOTI)
DOI: 10.1109/hoti52880.2021.00017
Accelerating CPU-based Distributed DNN Training on Modern HPC Clusters using BlueField-2 DPUs

Cited by 18 publications (11 citation statements)
References 8 publications
“…Three resource-constrained devices were used in this work: Nvidia Mellanox BlueField-2 DPU (Nvidia Corporation, Sunnyvale, CA, USA): These cards have been previously validated in state-of-the-art models and have demonstrated efficacy in fields such as task offloading, isolation, and acceleration for security, networking, and storage [25, 27, 31]. Raspberry Pi 4 (Raspberry Pi Foundation, Cambridge, UK): This device and its predecessor and later models are widely used in the Internet of Things (IoT) and various home projects.…”
Section: Methods
Confidence: 99%
“…In the realm of deep learning, DPUs have found application in various stages of model training, including data augmentation and validation. Leveraging DPUs for these tasks, Jain et al. [27] achieved up to a 15% increase in training performance. Their subsequent work [28] demonstrated consistent performance improvements for CNNs and Transformer models, in both weak and strong scaling scenarios across multiple nodes.…”
Section: Previous Work
Confidence: 99%
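The offloading pattern this statement describes (moving data augmentation off the critical training path) can be sketched in plain Python. The sketch below is hypothetical: it overlaps augmentation with training using a host-side background process, whereas the cited work runs this stage on the BlueField-2's Arm cores, and the augment and train_step functions are stand-ins rather than the authors' code.

```python
import multiprocessing as mp
import numpy as np

def augment(batch):
    # Stand-in augmentation: add noise and flip the last axis.
    noisy = batch + np.random.normal(0.0, 0.01, batch.shape)
    return noisy[:, :, ::-1]

def augmentation_worker(out_queue, num_batches, batch_shape):
    # In the DPU setting this loop would run on the BlueField's Arm
    # cores; here it is an ordinary host process for illustration.
    for _ in range(num_batches):
        raw = np.random.rand(*batch_shape).astype(np.float32)
        out_queue.put(augment(raw))
    out_queue.put(None)  # Sentinel: no more batches.

def train_step(batch):
    # Stand-in for a forward/backward pass; returns a fake loss.
    return float(np.mean(batch ** 2))

if __name__ == "__main__":
    queue = mp.Queue(maxsize=4)  # Bounded queue applies back-pressure.
    worker = mp.Process(target=augmentation_worker,
                        args=(queue, 10, (32, 224, 224)))
    worker.start()
    while (batch := queue.get()) is not None:
        loss = train_step(batch)  # Overlaps with the next augmentation.
        print(f"loss={loss:.4f}")
    worker.join()
```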
“…Therefore, RDMA applications can directly access the memory of remote hosts with RNICs, achieving much lower latency and higher throughput without CPU mediation. Currently, RDMA has been widely used in a number of data center applications, especially distributed machine learning training tasks and distributed storage clusters [67], [16], [39], [101], [41], [10], [40], [60], [23], [8], [100], [56], [103], [11], [105], [37], [21], [7], [5].…”
Section: A. RDMA Basics
Confidence: 99%
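As a minimal sketch of the mechanism the quote describes, the snippet below registers a memory region with an RNIC through rdma-core's pyverbs bindings. The address/rkey pair it prints is what a remote peer needs in order to read or write this buffer directly, without local CPU mediation. The device name "mlx5_0" is an assumption, and queue-pair setup plus the out-of-band rkey exchange are omitted.

```python
# Requires rdma-core's pyverbs bindings and an RDMA-capable NIC.
from pyverbs.device import Context
from pyverbs.pd import PD
from pyverbs.mr import MR
import pyverbs.enums as e

# Device name is an assumption; list local devices with `ibv_devices`.
ctx = Context(name="mlx5_0")
pd = PD(ctx)

# Register 4 KiB of memory and allow remote peers to read/write it.
access = (e.IBV_ACCESS_LOCAL_WRITE |
          e.IBV_ACCESS_REMOTE_READ |
          e.IBV_ACCESS_REMOTE_WRITE)
mr = MR(pd, 4096, access)

# A peer that learns this (address, rkey) pair can post RDMA READ/WRITE
# work requests that the RNIC serves without involving the local CPU.
print(f"buffer address: {mr.buf:#x}, rkey: {mr.rkey:#x}")
```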
“…Metrics. We pick two typical RDMA applications: distributed machine learning training [37], [52], and cloud storage [21], [90]. In the former scenario, such as a data-parallel parameter-server (PS) architecture, multiple training nodes transfer batched data of the same size to the parameter server after each training step.…”
Section: Impact on Real Applications
Confidence: 99%
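The data-parallel parameter-server flow mentioned here reduces to a simple aggregation loop, sketched below with numpy arrays standing in for RDMA-transported gradient buffers. All names are hypothetical; in the cited setting, each worker's equally sized batch would be written into server memory over RDMA rather than passed as a function argument.

```python
import numpy as np

class ParameterServer:
    """Toy PS: holds the model and averages equally sized gradients."""

    def __init__(self, num_params, lr=0.1):
        self.weights = np.zeros(num_params, dtype=np.float32)
        self.lr = lr

    def update(self, gradients):
        # In the RDMA setting each gradient array would arrive as an
        # equally sized buffer written directly into server memory.
        avg = np.mean(np.stack(gradients), axis=0)
        self.weights -= self.lr * avg
        return self.weights

def worker_gradient(weights, rng):
    # Stand-in for one local training step on a data shard.
    return rng.normal(size=weights.shape).astype(np.float32)

if __name__ == "__main__":
    ps = ParameterServer(num_params=8)
    rng = np.random.default_rng(0)
    for step in range(3):
        grads = [worker_gradient(ps.weights, rng) for _ in range(4)]
        w = ps.update(grads)
        print(f"step {step}: |w| = {np.linalg.norm(w):.4f}")
```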