2019
DOI: 10.1147/jrd.2019.2947013

BlueConnect: Decomposing all-reduce for deep learning on heterogeneous network hierarchy

Cited by 64 publications (44 citation statements)
References 9 publications
“…Indeed, the fact that communication is a major performance bottleneck in DDL is well known [32], and many works [10,35,39,44,58,66] have proposed optimizations to achieve high-bandwidth collective communication specialized for DDL. In addition, a recent body of work, primarily within the ML community, has developed gradient compression methods [1,2,42,63,67] to reduce communication time by sending a smaller amount of data, albeit at the cost of reduced training quality due to the lossy nature of compression.…”
Section: Model
confidence: 99%
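
The compression methods cited in this statement differ in detail, but the common idea is to transmit a small lossy summary of each gradient instead of the full tensor. Below is a minimal top-k sparsification sketch in numpy for intuition only; it is not the algorithm of any specific cited work, and the function names and the 1% keep-ratio are illustrative assumptions.

import numpy as np

def topk_sparsify(grad, k):
    # Keep only the k largest-magnitude entries; send (indices, values)
    # instead of the dense tensor.
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]

def densify(idx, values, shape):
    # Receiver side: rebuild a dense (lossy) approximation of the gradient.
    flat = np.zeros(int(np.prod(shape)), dtype=values.dtype)
    flat[idx] = values
    return flat.reshape(shape)

grad = np.random.randn(1024, 1024).astype(np.float32)
idx, vals = topk_sparsify(grad, k=grad.size // 100)   # keep roughly 1% of entries
approx = densify(idx, vals, grad.shape)
print("bytes on the wire:", idx.nbytes + vals.nbytes, "vs dense:", grad.nbytes)

The lossiness mentioned in the statement is visible here: the reconstruction discards about 99% of the entries, which is exactly the accuracy-versus-communication tradeoff those methods navigate (often with error-feedback corrections not shown in this sketch).
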
“…Efficient communication in DDL. Several efforts optimize DDL communication, ranging from designing high-performance PS software [43] and transfer schedulers [20,25,50], to improving collective communication in heterogeneous network fabrics [10,28] and within multi-GPU servers [66], to developing in-network reduction systems [35,39,44,57,58], to customizing network congestion protocols and architecture [18]. OmniReduce leverages data sparsity to optimize communication and is complementary to these efforts.…”
Section: Other Related Work
confidence: 99%
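
Several of the collective-communication efforts cited here, including the BlueConnect paper this report covers (per its title), build on the standard identity that an all-reduce can be expressed as a reduce-scatter followed by an all-gather. The single-process numpy sketch below simulates that decomposition on a flat logical ring purely for intuition; it is not BlueConnect's implementation, which further splits these phases to match a heterogeneous network hierarchy, and all names here are illustrative.

import numpy as np

def simulated_ring_allreduce(per_rank_data):
    # Number of simulated ranks; each rank's vector is split into p chunks.
    p = len(per_rank_data)
    bufs = [np.array_split(np.asarray(d, dtype=np.float64), p) for d in per_rank_data]

    # Phase 1: ring reduce-scatter. After p-1 steps, rank r holds the fully
    # reduced chunk (r + 1) % p.
    for step in range(p - 1):
        msgs = [(r, (r - step) % p, bufs[r][(r - step) % p].copy()) for r in range(p)]
        for src, c, data in msgs:
            bufs[(src + 1) % p][c] = bufs[(src + 1) % p][c] + data

    # Phase 2: ring all-gather. Each step forwards a fully reduced chunk one
    # hop around the ring until every rank holds every chunk.
    for step in range(p - 1):
        msgs = [(r, (r + 1 - step) % p, bufs[r][(r + 1 - step) % p].copy()) for r in range(p)]
        for src, c, data in msgs:
            bufs[(src + 1) % p][c] = data

    return [np.concatenate(chunks) for chunks in bufs]

# Sanity check against a direct elementwise sum.
p, n = 4, 16
data = [np.arange(n, dtype=np.float64) * (r + 1) for r in range(p)]
result = simulated_ring_allreduce(data)
assert all(np.allclose(out, sum(data)) for out in result)
print("all simulated ranks hold the reduced vector")

The point of the decomposition is that each phase is itself a standard collective, so it can be scheduled onto whatever bandwidth is available at each level of the network rather than treating all-reduce as a single monolithic step.
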
“…Moreover, several recent works have optimized all-gather algorithms under additional, specific constraints: in Reference [8], Kang et al. provided a solution for intergroup cooperation, rapidly accelerating data gathering between two disjoint process sets; in Reference [29], Zhou et al. analyzed and improved all-gather behavior for multi-/many-core processors in compute clusters; and in Reference [2], Cho et al. presented an efficient communication library, with an all-gather implementation, for distributed deep learning that is highly optimized for popular GPU-based platforms.…”
Section: Regular All-gather Algorithms in Use
confidence: 99%
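
For context on the "regular all-gather algorithms" this statement discusses, the sketch below simulates one classic textbook variant, recursive doubling, in which the set of gathered chunks doubles at every step so that p ranks finish in log2(p) exchange rounds. It is a generic algorithm for illustration, not the implementation of any cited library; names and sizes are assumptions.

import numpy as np

def recursive_doubling_allgather(per_rank_chunk):
    # p must be a power of two for this simple variant.
    p = len(per_rank_chunk)
    # have[r] maps chunk index -> chunk currently known to simulated rank r.
    have = [{r: np.asarray(per_rank_chunk[r])} for r in range(p)]
    k = 1
    while k < p:
        # Snapshot first so both partners exchange their pre-step contents.
        snapshot = [dict(h) for h in have]
        for r in range(p):
            have[r].update(snapshot[r ^ k])   # swap everything with partner r XOR k
        k *= 2
    return [np.concatenate([have[r][c] for c in range(p)]) for r in range(p)]

p = 8
chunks = [np.full(4, r, dtype=np.float32) for r in range(p)]
gathered = recursive_doubling_allgather(chunks)
assert all(np.array_equal(g, gathered[0]) for g in gathered)
print("every rank gathered", gathered[0].size, "elements in", p.bit_length() - 1, "rounds")

A ring all-gather (as in the sketch after the previous statement) instead takes p-1 smaller steps; which variant wins depends on message sizes and, as the cited works emphasize, on the topology underneath.
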
“…On the system side, one line of work further optimizes the communication primitives and communication strategies to take advantage of properties of the underlying ML workload. Recent examples in this direction include (Hashemi et al., 2019; Jayarajan et al., 2019; Cho et al., 2019; Jia et al., 2019; Wang et al., 2018d). Another line of work tries to automatically optimize the tradeoff introduced by these system relaxation techniques (e.g., the communication frequency, which is often a hyperparameter).…”
Section: System Optimization and Automatic Tradeoff Management
confidence: 99%
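
The communication-frequency hyperparameter mentioned in this statement can be illustrated with a toy local-SGD loop: simulated workers take gradient steps on their own quadratic objectives and their models are averaged only every sync_every steps. The problem, names, and numbers below are purely hypothetical; the sketch only shows the qualitative tradeoff (fewer synchronization rounds, more disagreement between workers), not any cited method.

import numpy as np

rng = np.random.default_rng(0)
p, d = 4, 8
targets = rng.normal(size=(p, d))       # each worker's local optimum

def run(sync_every, steps=200, lr=0.1):
    # Local SGD on f_i(w) = 0.5 * ||w - target_i||^2 for p simulated workers.
    w = np.zeros((p, d))
    rounds, max_drift = 0, 0.0
    for t in range(1, steps + 1):
        w -= lr * (w - targets)                              # local gradient step
        max_drift = max(max_drift, float(np.linalg.norm(w - w.mean(axis=0))))
        if t % sync_every == 0:
            w[:] = w.mean(axis=0)                            # model averaging (an all-reduce)
            rounds += 1
    return rounds, max_drift

for h in (1, 10, 50):
    rounds, drift = run(h)
    print(f"sync every {h:>2} steps: {rounds:>3} communication rounds, max worker drift {drift:.3f}")

Automatic tradeoff management, in the sense of the statement, amounts to tuning knobs like sync_every (or a compression level) to balance communication cost against this kind of drift.
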