Proceedings of the 46th International Symposium on Computer Architecture 2019
DOI: 10.1145/3307650.3322259
Accelerating distributed reinforcement learning with in-switch computing

Cited by 92 publications (60 citation statements)
References 19 publications
“…Indeed, the fact that communication is a major performance bottleneck in DDL is well-known [32], and many works [10,35,39,44,58,66] proposed various optimizations to achieve high-bandwidth collective communication specialized for DDL. Besides, a recent body of work, primarily within the ML community, developed gradient compression methods [1,2,42,63,67] to reduce communication time by sending a smaller amount of data, albeit at the cost of reduced training quality due to the lossy nature of compression.…”
Section: Model
confidence: 99%
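As an illustration of the lossy gradient compression the excerpt refers to, here is a minimal top-k sparsification sketch in Python; the function names and the 1% compression ratio are illustrative, not taken from any of the cited methods.

```python
import numpy as np

def topk_compress(grad: np.ndarray, k: int):
    """Keep only the k largest-magnitude entries of a flattened gradient.

    Returns (indices, values); the receiver treats everything else as zero,
    which is what makes the compression lossy.
    """
    flat = grad.ravel()
    # argpartition finds the k largest-|g| positions without a full sort.
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]

def topk_decompress(idx: np.ndarray, vals: np.ndarray, shape):
    """Rebuild a dense gradient with zeros in the dropped positions."""
    flat = np.zeros(int(np.prod(shape)), dtype=vals.dtype)
    flat[idx] = vals
    return flat.reshape(shape)

# Example: compress a 1M-element gradient down to 1% of its entries.
g = np.random.randn(1_000_000).astype(np.float32)
idx, vals = topk_compress(g, k=10_000)
g_hat = topk_decompress(idx, vals, g.shape)
```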
“…Efficient communication in DDL. Several efforts optimize DDL communication, ranging from designing high-performance PS software [43] and transfer schedulers [20,25,50], to improving collective communication in heterogeneous network fabrics [10,28] and within multi-GPU servers [66], to developing in-network reduction systems [35,39,44,57,58], to customizing network congestion protocols and architecture [18]. OmniReduce leverages data sparsity to optimize communication and is complementary to these efforts.…”
Section: Other Related Work
confidence: 99%
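A minimal sketch of the data-sparsity idea the excerpt attributes to OmniReduce, assuming a simple fixed-block layout: only blocks that contain non-zero gradient data are shipped and summed. The block size and helper names are hypothetical.

```python
import numpy as np

BLOCK = 256  # illustrative block size; real systems tune this

def nonzero_blocks(grad: np.ndarray):
    """Yield (block_index, block) only for blocks that carry data.

    Skipping all-zero blocks is the core of sparsity-aware aggregation:
    idle regions of the gradient never hit the wire.
    """
    for i in range(0, grad.size, BLOCK):
        blk = grad[i:i + BLOCK]
        if np.any(blk):
            yield i // BLOCK, blk

def aggregate(worker_grads):
    """Sum the per-block contributions of all workers into a dense result."""
    out = np.zeros_like(worker_grads[0])
    for g in worker_grads:
        for b, blk in nonzero_blocks(g):
            out[b * BLOCK : b * BLOCK + blk.size] += blk
    return out

# Two workers whose gradients are mostly zero: only two blocks are summed.
g1 = np.zeros(1024, dtype=np.float32); g1[0:4] = 1.0
g2 = np.zeros(1024, dtype=np.float32); g2[512:516] = 2.0
total = aggregate([g1, g2])
```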
“…Table I summarizes the related works using hardware accelerators for In-Network Computing: [43], NetDebug [44], Lake [45], and iSwitch [46].…”
Section: State of the Art
confidence: 99%
“…iSwitch [46] proposes a distributed solution that uses in-network computing to move gradient aggregation from server nodes to FPGA-based switches, reducing the number of network hops each aggregation requires. Gradient aggregation is an operation used in Reinforcement Learning (RL) to train Artificial Intelligence (AI) applications.…”
Section: B. FPGA-based Hardware Accelerators
confidence: 99%
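The excerpt describes iSwitch's approach of aggregating gradients inside the switch rather than at an end-host parameter server. The sketch below models only the aggregation logic in Python; the real system performs this per packet on FPGA hardware, and the class and method names here are illustrative.

```python
import numpy as np

class Switch:
    """Toy model of an aggregating switch: it sums one gradient
    contribution per worker, then "multicasts" the result back,
    so each value traverses only worker -> switch -> worker."""

    def __init__(self, num_workers: int):
        self.num_workers = num_workers
        self.buffer = None      # running sum of gradients seen so far
        self.received = 0

    def on_gradient(self, grad: np.ndarray):
        """Accumulate one worker's gradient; return the sum once all arrive."""
        self.buffer = grad.copy() if self.buffer is None else self.buffer + grad
        self.received += 1
        if self.received == self.num_workers:
            result, self.buffer, self.received = self.buffer, None, 0
            return result       # aggregated gradient, broadcast to every worker
        return None             # still waiting for the other workers

# Three workers push gradients; the switch returns the sum on the last one.
sw = Switch(num_workers=3)
grads = [np.full(4, w, dtype=np.float32) for w in (1.0, 2.0, 3.0)]
agg = None
for g in grads:
    result = sw.on_gradient(g)
    if result is not None:
        agg = result
print(agg)  # [6. 6. 6. 6.]
```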
“…In many networks, the data and packet rate reduction offered by the former is required to make this possible. Indeed, in-switch aggregation has seen great success in aiding ML, both for training [20] and for direct execution [21]. We make use of the following standard classification algorithms on a fixed-size representation to attempt to single out the CCA (congestion control algorithm) in use:…”
Section: II. TCP Congestion Control Classification
confidence: 99%
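For the classification step the excerpt mentions, a minimal sketch using one standard classifier on a fixed-size feature vector per flow; the feature dimensions, labels, and synthetic data below are placeholders, not the representation used in the cited work.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical fixed-size representation: each flow becomes a vector of
# summary statistics (e.g., binned cwnd growth, RTT variance, loss rate).
# The synthetic data only illustrates the classification step itself.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 16))       # 600 flows, 16 features each
y = rng.integers(0, 3, size=600)     # 3 CCA labels, e.g. cubic / bbr / reno

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.2f}")  # near chance on random data
```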