Rachee Singh scite author profile

Large ML models and datasets have necessitated the use of multi-GPU systems for distributed model training. To harness the power offered by multi-GPU systems, it is critical to eliminate bottlenecks in inter-GPU communication -a problem made challenging by the heterogeneous nature of interconnects. In this work, we present TACCL, a synthesizer for collective communication primitives for large-scale multi-GPU systems. TACCL encodes a profiled topology and input size into a synthesis problem to generate optimized communication algorithms. TACCL is built on top of the standard NVIDIA Collective Communication Library (NCCL), allowing it to be a drop-in replacement for GPU communication in frameworks like PyTorch with minimal changes. TACCL generates algorithms for communication primitives like ALLGATHER, ALLTOALL, and ALLREDUCE that are up to 3× faster than NCCL. Using TACCL's algorithms speeds up the end-to-end training of an internal mixture of experts model by 17%. By decomposing the optimization problem into parts and leveraging the symmetry in multi-GPU topologies, TACCL synthesizes collectives for up-to 80-GPUs in less than 3 minutes, at least two orders of magnitude faster than other synthesis-based state-of-the-art collective communication libraries.

show abstract

Radwan

Singh¹,

Ghobadi

Foerster

et al. 2018

View full text Add to dashboard Cite

Surviving switch failures in cloud datacenters

Singh

Mukhtar

Krishna

et al. 2021

SIGCOMM Comput. Commun. Rev.

View full text Add to dashboard Cite

Switch failures can hamper access to client services, cause link congestion and blackhole network traffic. In this study, we examine the nature of switch failures in the datacenters of a large commercial cloud provider through the lens of survival theory. We study a cohort of over 180,000 switches with a variety of hardware and software configurations and find that datacenter switches have a 98% likelihood of functioning uninterrupted for over 3 months since deployment in production. However, there is significant heterogeneity in switch survival rates with respect to their hardware and software: the switches of one vendor are twice as likely to fail compared to the others. We attribute the majority of switch failures to hardware impairments and unplanned power losses. We find that the in-house switch operating system, SONiC, boosts the survival likelihood of switches in datacenters by 1% by eliminating switch failures caused by software bugs in vendor switch OSes.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Rachee Singh

Characterizing the Nature and Dynamics of Tor Exit Blocking

Characterizing the Deployment and Performance of Multi-CDNs

Synthesizing Collective Communication Algorithms for Heterogeneous Networks with TACCL

Radwan

Surviving switch failures in cloud datacenters

Contact Info

Product

Resources

About