2022
DOI: 10.1109/mm.2021.3139027
Optimizing Distributed DNN Training Using CPUs and BlueField-2 DPUs

Cited by 5 publications (4 citation statements) | References 8 publications
“…However, they did not find an offloading scheme that could optimally accelerate the training of all models in their study. In an extension of this work [11], the same authors show a consistent improvement for Convolutional Neural Networks (CNNs) and Transformer models with weak and strong scaling on multiple nodes.…”
Section: Related Work
confidence: 61%
“…As in [10,11], our proposal makes use of the DPU in a Deep Learning environment, but unlike those works, where it is used in the training phase of the model, in our case the card will be used to perform filtering tasks that help an already trained model reduce its inference workload. We were motivated to try using a DPU as a filter for a video stream because of its ability to alleviate the load on the system.…”
Section: Related Work
confidence: 99%
“…HPC applications could benefit from DPU devices by offloading part of their load to them. For example, when training deep neural networks, the data augmentation or validation stages could be offloaded to less powerful accelerators such as DPUs [3]. In turn, large distributed multiphysics simulations could offload the halo exchange operation, making DPUs responsible for communicating and computing the halo among neighbors.…”
Section: Discussion
confidence: 99%
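To make the offload pattern above concrete: a minimal sketch that emulates the host/DPU split with a separate process, assuming the DPU's Arm cores run an ordinary Linux worker. The multiprocessing queues stand in for the host-to-DPU transport (RDMA or sockets in practice), and the augmentations shown are arbitrary examples, not the pipeline from [3].

```python
import multiprocessing as mp
import numpy as np

def augment_worker(in_q: mp.Queue, out_q: mp.Queue):
    """Runs on the offload device (here: a separate process standing in
    for the DPU): applies cheap data augmentation to raw batches so the
    host stays free for the forward/backward passes."""
    while True:
        batch = in_q.get()
        if batch is None:  # sentinel: no more work
            break
        flipped = batch[:, :, ::-1]  # horizontal flip
        noisy = flipped + np.random.normal(0.0, 0.01, batch.shape)
        out_q.put(noisy.astype(np.float32))

if __name__ == "__main__":
    in_q, out_q = mp.Queue(), mp.Queue()
    worker = mp.Process(target=augment_worker, args=(in_q, out_q))
    worker.start()
    for _ in range(3):  # stand-in for a data loader feeding raw batches
        in_q.put(np.random.rand(8, 32, 32).astype(np.float32))
    for _ in range(3):
        augmented = out_q.get()  # host would train on these batches
        print("got augmented batch", augmented.shape)
    in_q.put(None)
    worker.join()
```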
“…Leveraging DPUs for these tasks, Jain et al. [27] achieved up to a 15% increase in training performance. Their subsequent work [28] demonstrated consistent performance improvements for CNNs and Transformer models, both in weak and strong scaling scenarios across multiple nodes.…”
Section: Previous Work
confidence: 92%