2021 IEEE Symposium on High-Performance Interconnects (HOTI)
DOI: 10.1109/hoti52880.2021.00017
Accelerating CPU-based Distributed DNN Training on Modern HPC Clusters using BlueField-2 DPUs

Cited by 18 publications (11 citation statements)
References 8 publications
“…Three resource-constrained devices were used in this work: Nvidia Mellanox BlueField-2 DPU (Nvidia Corporation, Sunnyvale, CA, USA): These cards have been previously validated in state-of-the-art models and have demonstrated efficacy in fields such as task offloading, isolation, and acceleration for security, networking, and storage [25, 27, 31]. Raspberry Pi 4 (Raspberry Pi Foundation, Cambridge, UK): This device and its predecessor and later models are widely used in the Internet of Things (IoT) and various home projects.…”
Section: Methods
Confidence: 99%
“…In the realm of deep learning, DPUs have found application in various stages of model training, including data augmentation and validation. Leveraging DPUs for these tasks, Jain et al. [27] achieved up to a 15% increase in training performance. Their subsequent work [28] demonstrated consistent performance improvements for CNNs and Transformer models, in both weak and strong scaling scenarios across multiple nodes.…”
Section: Previous Work
Confidence: 99%
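The offloading pattern this statement describes (moving data augmentation off the critical training path) can be sketched in plain Python. The sketch below is hypothetical: it overlaps augmentation with training using a host-side background process, whereas the cited work runs this stage on the BlueField-2's Arm cores, and the augment and train_step functions are stand-ins rather than the authors' code.

```python
import multiprocessing as mp
import numpy as np

def augment(batch):
    # Stand-in augmentation: add noise and flip the last axis.
    noisy = batch + np.random.normal(0.0, 0.01, batch.shape)
    return noisy[:, :, ::-1]

def augmentation_worker(out_queue, num_batches, batch_shape):
    # In the DPU setting this loop would run on the BlueField's Arm
    # cores; here it is an ordinary host process for illustration.
    for _ in range(num_batches):
        raw = np.random.rand(*batch_shape).astype(np.float32)
        out_queue.put(augment(raw))
    out_queue.put(None)  # Sentinel: no more batches.

def train_step(batch):
    # Stand-in for a forward/backward pass; returns a fake loss.
    return float(np.mean(batch ** 2))

if __name__ == "__main__":
    queue = mp.Queue(maxsize=4)  # Bounded queue applies back-pressure.
    worker = mp.Process(target=augmentation_worker,
                        args=(queue, 10, (32, 224, 224)))
    worker.start()
    while (batch := queue.get()) is not None:
        loss = train_step(batch)  # Overlaps with the next augmentation.
        print(f"loss={loss:.4f}")
    worker.join()
```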
“…Therefore, RDMA applications can directly access the memory of remote hosts with RNICs, achieving much lower latency and higher throughput without CPU mediation. Currently, RDMA has been widely used in a number of data center applications, especially distributed machine learning training tasks and distributed storage clusters [67], [16], [39], [101], [41], [10], [40], [60], [23], [8], [100], [56], [103], [11], [105], [37], [21], [7], [5].…”
Section: A. RDMA Basics
Confidence: 99%
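As a minimal sketch of the mechanism the quote describes, the snippet below registers a memory region with an RNIC through rdma-core's pyverbs bindings. The address/rkey pair it prints is what a remote peer needs in order to read or write this buffer directly, without local CPU mediation. The device name "mlx5_0" is an assumption, and queue-pair setup plus the out-of-band rkey exchange are omitted.

```python
# Requires rdma-core's pyverbs bindings and an RDMA-capable NIC.
from pyverbs.device import Context
from pyverbs.pd import PD
from pyverbs.mr import MR
import pyverbs.enums as e

# Device name is an assumption; list local devices with `ibv_devices`.
ctx = Context(name="mlx5_0")
pd = PD(ctx)

# Register 4 KiB of memory and allow remote peers to read/write it.
access = (e.IBV_ACCESS_LOCAL_WRITE |
          e.IBV_ACCESS_REMOTE_READ |
          e.IBV_ACCESS_REMOTE_WRITE)
mr = MR(pd, 4096, access)

# A peer that learns this (address, rkey) pair can post RDMA READ/WRITE
# work requests that the RNIC serves without involving the local CPU.
print(f"buffer address: {mr.buf:#x}, rkey: {mr.rkey:#x}")
```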
“…Metrics. We pick two typical RDMA applications: distributed machine learning training [37], [52], and cloud storage [21], [90]. In the former scenario, such as a data-parallel parameter-server (PS) architecture, multiple training nodes transfer batched data of the same size to the parameter server after each training step.…”
Section: Impact on Real Applications
Confidence: 99%
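The data-parallel parameter-server flow mentioned here reduces to a simple aggregation loop, sketched below with numpy arrays standing in for RDMA-transported gradient buffers. All names are hypothetical; in the cited setting, each worker's equally sized batch would be written into server memory over RDMA rather than passed as a function argument.

```python
import numpy as np

class ParameterServer:
    """Toy PS: holds the model and averages equally sized gradients."""

    def __init__(self, num_params, lr=0.1):
        self.weights = np.zeros(num_params, dtype=np.float32)
        self.lr = lr

    def update(self, gradients):
        # In the RDMA setting each gradient array would arrive as an
        # equally sized buffer written directly into server memory.
        avg = np.mean(np.stack(gradients), axis=0)
        self.weights -= self.lr * avg
        return self.weights

def worker_gradient(weights, rng):
    # Stand-in for one local training step on a data shard.
    return rng.normal(size=weights.shape).astype(np.float32)

if __name__ == "__main__":
    ps = ParameterServer(num_params=8)
    rng = np.random.default_rng(0)
    for step in range(3):
        grads = [worker_gradient(ps.weights, rng) for _ in range(4)]
        w = ps.update(grads)
        print(f"step {step}: |w| = {np.linalg.norm(w):.4f}")
```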