2021
DOI: 10.48550/arxiv.2102.00554
Preprint

Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks

Abstract: The growing energy and performance costs of deep learning have driven the community to reduce the size of neural networks by selectively pruning components. Similarly to their biological counterparts, sparse networks generalize just as well as, if not better than, the original dense networks. Sparsity can reduce the memory footprint of regular networks to fit mobile devices, as well as shorten training time for ever-growing networks. In this paper, we survey prior work on sparsity in deep learning and provide an …

Cited by 43 publications (60 citation statements)
References 126 publications
“…We assume all sparse approaches use the coordinate (COO) format to store the sparse gradient, which consumes 2𝑘 storage, i.e., 𝑘 values plus 𝑘 indexes. There are other sparse formats (see [22] for an overview), but format selection for a given sparsity is not the topic of this work. To model the communication overhead, we assume bidirectional and direct point-to-point communication between the compute nodes, and use the classic latency-bandwidth cost model.…”
Section: Algorithms (mentioning)
confidence: 99%
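To make the 2k storage figure concrete, here is a minimal sketch (plain NumPy; the names coo_compress and send_cost are hypothetical illustrations, not from the cited work) of storing a top-k-sparsified gradient in COO form as k indices plus k values, together with the classic latency-bandwidth (alpha-beta) cost of sending it point-to-point:

```python
import numpy as np

def coo_compress(grad: np.ndarray, k: int):
    """Keep the k largest-magnitude entries of a flattened gradient.

    Returns (indices, values): 2*k numbers in total, which is the
    COO storage cost quoted in the excerpt above.
    """
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # k flat indices
    return idx.astype(np.int64), flat[idx]        # k indices + k values

def send_cost(k: int, alpha: float, beta: float) -> float:
    """Latency-bandwidth (alpha-beta) cost of one point-to-point
    message carrying a COO-compressed gradient of 2*k words."""
    return alpha + beta * 2 * k

# Example: a 1M-parameter gradient, keeping 1% of the entries.
g = np.random.randn(1_000_000)
idx, vals = coo_compress(g, k=10_000)
print(idx.shape, vals.shape)                       # (10000,) (10000,)
print(send_cost(k=10_000, alpha=1e-6, beta=4e-9))  # illustrative constants
```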
“…Only the nonzero values of the distributed gradients are accumulated across all processes. See [22] for an overview of gradient and other sparsification approaches in deep learning.…”
Section: Introduction (mentioning)
confidence: 99%
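A small sketch of that accumulation step, under the assumption of a top-k/COO-style exchange and with the processes simulated in a list rather than over a real network (NumPy; sparse_allreduce is a hypothetical name, not a collective from any library):

```python
import numpy as np

def sparse_allreduce(contributions, n_params: int) -> np.ndarray:
    """Sum sparse gradients from several (simulated) processes: only the
    nonzero entries each process contributes are scatter-added."""
    total = np.zeros(n_params)
    for idx, vals in contributions:
        np.add.at(total, idx, vals)   # accumulate the nonzeros only
    return total

# Two simulated processes, each sending 3 nonzeros of a 10-parameter gradient.
p0 = (np.array([1, 4, 7]), np.array([0.5, -0.2, 0.1]))
p1 = (np.array([4, 5, 9]), np.array([0.3, 0.7, -0.4]))
print(sparse_allreduce([p0, p1], n_params=10))
```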
“…Recently, there has been significant research interest in pruning techniques, and hundreds of different sparsification approaches have been proposed; please see the recent surveys of [15] and [25] for a comprehensive exposition. We categorize existing pruning methods as follows.…”
Section: Sparsification Techniques (mentioning)
confidence: 99%
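As one concrete instance of the simplest family in such a categorization, a one-shot unstructured magnitude-pruning sketch might look as follows (NumPy; magnitude_prune is a hypothetical name, not the method of any particular cited paper):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """One-shot unstructured magnitude pruning: zero out the `sparsity`
    fraction of weights with the smallest absolute value."""
    k = int(sparsity * weights.size)   # number of weights to remove
    if k == 0:
        return weights.copy()
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return weights * (np.abs(weights) > threshold)

W = np.random.randn(4, 4)
W_pruned = magnitude_prune(W, sparsity=0.75)
print(float((W_pruned == 0).mean()))   # roughly 0.75 of entries are zero
```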
“…The increasing computational and storage costs of deep learning models have led to significant academic and industrial interest in model compression, which is roughly the task of obtaining smaller-footprint models matching the accuracy of larger baseline models. Model compression is a rapidly-developing area, and several generic approaches have been investigated, among which pruning and quantization are among the most popular [16,25].…”
Section: Introduction (mentioning)
confidence: 99%
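Of the two compression families named in this excerpt, pruning is sketched above; a comparably minimal sketch of the other, symmetric per-tensor post-training int8 quantization, could look like this (NumPy; function names are hypothetical, chosen only to illustrate the idea):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor post-training quantization to int8."""
    scale = max(np.abs(weights).max() / 127.0, 1e-12)  # avoid divide-by-zero
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from the int8 codes."""
    return q.astype(np.float32) * scale

W = np.random.randn(8, 8).astype(np.float32)
q, s = quantize_int8(W)
print(float(np.abs(W - dequantize(q, s)).max()))  # error at most about scale/2
```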