Deploying deep learning (DL) models across multiple compute devices to train large and complex models continues to grow in importance because of the demand for faster and more frequent training. Data parallelism (DP) is the most widely used parallelization strategy, but as the number of devices in data-parallel training grows, so does the communication overhead between devices. Additionally, a larger aggregate batch size per step leads to a loss of statistical efficiency, i.e., more epochs are required to converge to a desired accuracy. These factors affect overall training time, and beyond a certain number of devices, the speedup from DP begins to scale poorly. In addition to DP, each training step can be accelerated by exploiting model parallelism (MP). This work explores hybrid parallelization, where each data-parallel worker comprises more than one device, across which the model dataflow graph (DFG) is split using MP. We show that at scale, hybrid training will be more effective at minimizing end-to-end training time than DP alone. We project that for Inception-V3, GNMT, and BigLSTM, the hybrid strategy provides an end-to-end training speedup of at least 26.5%, 8%, and 22%, respectively, compared to what DP alone can achieve at scale.
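To make the hybrid layout concrete, here is a minimal PyTorch-style sketch of one data-parallel worker spanning two devices; the model, stage sizes, and device names are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """Toy model whose dataflow graph is split across two devices (MP)."""

    def __init__(self, dev0="cuda:0", dev1="cuda:1"):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to(dev0)
        self.stage1 = nn.Linear(4096, 10).to(dev1)

    def forward(self, x):
        x = self.stage0(x.to(self.dev0))
        # The only MP communication per step: activations cross the device boundary.
        return self.stage1(x.to(self.dev1))

# After torch.distributed.init_process_group(), each rank (one data-parallel
# worker, here spanning two GPUs) wraps the model; device_ids must stay None
# for a multi-device module. Gradients are all-reduced across ranks (DP)
# while each rank runs its forward/backward with MP internally.
# model = nn.parallel.DistributedDataParallel(TwoStageModel())
```

Under this layout, the same device count supports half as many DP ranks in the gradient all-reduce, which is the mechanism behind the communication and statistical-efficiency gains the projections above rest on.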
In this paper, we describe the performance and power benefits of our fine-pitch integration scheme on a Silicon Interconnect Fabric (Si-IF). We propose a Simple Universal Parallel intERface (SuperCHIPS) protocol enabled by fine-pitch dielet-to-interconnect-fabric assembly. We show that dramatic improvements in bandwidth, latency, and power are achievable through our integration scheme, in which small dielets (1-25 mm²) are attached to a rigid Si-IF at fine interconnect pitch (2-10 μm) and short inter-die distance (50-500 μm) using solderless metal-to-metal thermal compression bonding (TCB). Our simulations show that links on the Si-IF with short wire lengths (<500 μm) have excellent signal-transfer characteristics, with low channel loss (<-2 dB) and low crosstalk (<-15 dB). With fine interconnect pitches (<10 μm), our scheme can achieve a >5-25x improvement in data bandwidth. This can improve system performance by more than 20x compared to PCB-style integration and may even approach single-die SoC metrics in some cases. Furthermore, our protocol is simple and non-proprietary. We show that this scheme enables heterogeneous system integration using a dielet-based assembly method and provides a significant reduction in design and validation cost. System-level analysis of the heterogeneous integration scheme promises power savings of more than 15%, even for very small systems.
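A back-of-the-envelope sketch of the pitch-to-bandwidth scaling is below. All numbers are assumptions for illustration, not values from the paper: a ~50 μm conventional micro-bump pitch as the baseline, perimeter-style I/O along one facing edge of a 25 mm² dielet, and a fixed per-pin signaling rate (plausible given the short, low-loss <500 μm links):

```python
EDGE_MM = 5    # one edge of a 25 mm^2 dielet (assumed square)
PIN_GBPS = 2   # assumed per-pin rate, held constant across pitches

def edge_ios(pitch_um, edge_mm=EDGE_MM):
    """Number of I/O pads that fit along one die edge at a given pitch."""
    return int(edge_mm * 1000 / pitch_um)

baseline = edge_ios(50)  # assumed conventional baseline pitch
for pitch_um in (10, 5, 2):
    ios = edge_ios(pitch_um)
    print(f"{pitch_um:>2} um pitch: {ios:5d} I/Os per edge, "
          f"{ios / baseline:4.1f}x baseline, ~{ios * PIN_GBPS / 1000:.1f} Tb/s/edge")
```

Under these assumed numbers, scaling the pitch from 50 μm down to 10 μm and 2 μm yields 5x and 25x more I/Os per edge, matching the >5-25x bandwidth range quoted above; an area-array bonding scheme would scale quadratically with pitch and do even better.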