Due to their hunger for big data, modern deep learning models are trained in parallel, often in distributed environments, where communication of model updates is the bottleneck. Various update compression techniques (e.g., quantization, sparsification, dithering) [2,47,48,22] have been proposed in recent years as a successful tool to alleviate this problem. In this work, we introduce a new, remarkably simple, and theoretically and practically effective compression technique, which we call natural compression (C_nat). Our technique is applied individually to all entries of the update vector to be compressed and works by randomized rounding to the nearest (negative or positive) power of two. C_nat is "natural" since the nearest power of two of a real number expressed as a float can be obtained without any computation, simply by ignoring the mantissa. We show that, compared to no compression, C_nat increases the second moment of the compressed vector by a factor of only 9/8, which means that its effect on the convergence speed of popular training algorithms, such as distributed SGD, is negligible. However, the communication savings enabled by C_nat are substantial, leading to a 3-4× improvement in overall theoretical running time. For applications requiring more aggressive compression, we generalize C_nat to natural dithering, which we prove is exponentially better than the immensely popular random dithering technique [13,39]. Our compression operators can be used on their own or in combination with existing operators for a more aggressive combined effect. Finally, we show that C_nat is particularly effective in the in-network aggregation (INA) [40] framework for distributed training, where update aggregation is done on a switch that can only perform integer computations.
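To make the randomized rounding rule concrete, here is a minimal NumPy sketch of the idea described in the abstract: each non-zero entry is rounded to one of its two nearest powers of two, with probabilities chosen so the compressed value is unbiased. The function name `natural_compression` and its interface are ours for illustration, not the paper's code.

```python
import numpy as np

def natural_compression(x, rng=None):
    # Round every entry to one of its two nearest powers of two so that
    # the result is unbiased: E[C(x)] = x (zero entries stay zero).
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=np.float64)
    sign, mag = np.sign(x), np.abs(x)
    out = np.zeros_like(mag)
    nz = mag > 0
    low = np.exp2(np.floor(np.log2(mag[nz])))   # nearest power of two below |x|
    high = 2.0 * low                            # nearest power of two above |x|
    p_up = (mag[nz] - low) / low                # rounding-up probability = (|x|-low)/(high-low)
    out[nz] = np.where(rng.random(p_up.shape) < p_up, high, low)
    return sign * out

g = np.array([0.0, -0.75, 3.3])
print(natural_compression(g))  # e.g. [ 0.  -1.   4. ] or [ 0.  -0.5  2. ]
```

In an actual float implementation the lower power of two would come directly from the exponent bits (i.e., by dropping the mantissa), which is the "no computation" property the abstract refers to; the sketch above uses logarithms only for readability.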
Compressed communication, in the form of sparsification or quantization of stochastic gradients, is employed to reduce communication costs in distributed data-parallel training of deep neural networks. However, there is a discrepancy between theory and practice: while the theoretical analysis of most existing compression methods assumes compression is applied to the gradients of the entire model, many practical implementations operate individually on the gradients of each layer of the model. In this paper, we prove that layer-wise compression is, in theory, better, because its convergence rate is upper bounded by that of entire-model compression for a wide range of biased and unbiased compression methods. However, despite this theoretical bound, our experimental study of six well-known methods shows that convergence in practice may or may not be better, depending on the model being trained and the compression ratio. Our findings suggest that it would be advantageous for deep learning frameworks to include support for both layer-wise and entire-model compression.
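The following sketch (our own illustration, using top-k as one example of the well-known compressors this line of work studies) contrasts the two settings: applying a compressor once to the concatenated gradient of the whole model versus applying it separately to each layer's gradient.

```python
import numpy as np

def top_k(v, k):
    # Keep the k largest-magnitude entries of v and zero out the rest.
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

# Hypothetical per-layer gradients, for illustration only.
layer_grads = [np.random.randn(1000), np.random.randn(500)]

# Entire-model compression: one call on the concatenated gradient.
flat = np.concatenate(layer_grads)
entire_model = top_k(flat, k=150)

# Layer-wise compression: a separate call per layer, with the same
# overall budget split proportionally across layers.
ratio = 150 / flat.size
layer_wise = [top_k(g, k=max(1, round(ratio * g.size))) for g in layer_grads]
```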
Efficient collective communication is crucial to parallel-computing applications such as distributed training of large-scale recommendation systems and natural language processing models. Existing collective communication libraries focus on optimizing operations for dense inputs, resulting in the transmission of many zeros when inputs are sparse. This runs counter to current trends that see increasing data sparsity in large models. We propose OmniReduce, an efficient streaming aggregation system that exploits sparsity to maximize effective bandwidth use by sending only non-zero data blocks. We demonstrate that this idea is beneficial and accelerates distributed training by up to 8.2×. Even at 100 Gbps, OmniReduce delivers 1.4-2.9× better performance for network-bottlenecked DNNs.
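Below is a small, self-contained sketch of the core idea as stated in the abstract (transmit only non-zero data blocks and sum them at the aggregator); the function names and block size are hypothetical and not taken from OmniReduce's actual API.

```python
import numpy as np

def nonzero_blocks(grad, block_size=4):
    # Split a gradient into fixed-size blocks and keep only the blocks
    # containing at least one non-zero value, plus their block indices.
    n_blocks = -(-grad.size // block_size)               # ceiling division
    padded = np.zeros(n_blocks * block_size, dtype=grad.dtype)
    padded[:grad.size] = grad
    blocks = padded.reshape(n_blocks, block_size)
    keep = np.flatnonzero(np.any(blocks != 0, axis=1))   # non-zero blocks only
    return keep, blocks[keep]                            # only these are sent

def aggregate(streams, total_size, block_size=4):
    # Receiver side: sum the sparse block streams from all workers into a
    # dense result (a stand-in for the aggregator's role).
    n_blocks = -(-total_size // block_size)
    result = np.zeros((n_blocks, block_size))
    for idx, payload in streams:
        result[idx] += payload
    return result.reshape(-1)[:total_size]

# Two workers with mostly-zero gradients; only their non-zero blocks travel.
g1 = np.array([0., 0., 0., 0., 1., 2., 0., 0.])
g2 = np.array([0., 0., 3., 0., 0., 0., 0., 0.])
summed = aggregate([nonzero_blocks(g1), nonzero_blocks(g2)], total_size=8)
# summed == [0. 0. 3. 0. 1. 2. 0. 0.]
```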