Drawing Early-Bird Tickets: Towards More Efficient Training of Deep Networks
Preprint, 2019
DOI: 10.48550/arxiv.1909.11957

Cited by 22 publications (41 citation statements)
References 14 publications
“…The Lottery Ticket Hypothesis (LTH) states that typical dense neural networks contain a small sparse sub-network that can be trained to reach similar test accuracy in an equal number of steps (Frankle and Carbin, 2018). In view of that, follow-up works reveal that sparsity patterns might emerge at initialization, at the early stage of training (You et al., 2019; Chen et al., 2020b), or in dynamic forms throughout training (Evci et al., 2020) by updating model parameters and architecture topologies simultaneously. Among the recent findings is that the lottery ticket hypothesis holds for BERT models, i.e., the largest weights of the original network do form subnetworks that can be retrained alone to reach performance close to that of the full model (Prasanna et al., 2020; Chen et al., 2020a).…”
Section: The Lottery Ticket Hypothesis (mentioning)
confidence: 99%
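The "early stage of training" observation (You et al., 2019) is the core of the early-bird-ticket idea: the channel-pruning mask stabilizes long before the dense network converges, so full-model training can stop once consecutive masks barely change. Below is a rough, illustrative PyTorch sketch of that idea, assuming magnitude-based channel masks derived from BatchNorm scaling factors; the helper names (channel_mask, mask_distance, draw_early_bird, train_one_epoch), the pruning ratio, and the stopping threshold are assumptions for illustration, not the paper's exact procedure.

    import torch

    def channel_mask(model, prune_ratio=0.5):
        # Binary keep/drop mask over BatchNorm scaling factors, a common
        # proxy for channel importance in structured pruning.
        scales = torch.cat([m.weight.detach().abs().flatten()
                            for m in model.modules()
                            if isinstance(m, torch.nn.BatchNorm2d)])
        k = int(len(scales) * prune_ratio)
        threshold = scales.sort().values[k]
        return (scales > threshold).float()

    def mask_distance(m1, m2):
        # Normalized Hamming distance between two binary masks.
        return (m1 != m2).float().mean().item()

    def draw_early_bird(model, train_one_epoch, max_epochs=100, eps=0.1):
        # Train the dense model only until the mask stabilizes (distance < eps),
        # then return the mask so the pruned sub-network can be trained instead.
        prev = None
        for epoch in range(max_epochs):
            train_one_epoch(model)
            cur = channel_mask(model)
            if prev is not None and mask_distance(prev, cur) < eps:
                return cur, epoch  # early-bird ticket drawn
            prev = cur
        return cur, max_epochs - 1

Comparing only consecutive masks keeps the sketch short; a more robust variant would compare the current mask against a small window of recent masks before declaring it stable.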
“…Subnetworks that are found on the masked language modeling task transfer universally; those found on other tasks transfer in a limited fashion, if at all. Chen et al. (2020b) introduce EarlyBERT, which extends the work on finding lottery tickets in CNNs (You et al., 2019) to speed up both pre-training and fine-tuning for BERT models. You et al. (2019) realized that sparsity patterns might already emerge at the early stage of training.…”
Section: The Lottery Ticket Hypothesis (mentioning)
confidence: 99%
“…Pruning typically follows a three-step process of pre-training, pruning, and fine-tuning (Li et al., 2016). Pre-training is usually the most expensive component, but later work explores strategies for finding good pruned networks with minimal pre-training (You et al., 2019; Chen et al., 2020).…”
Section: Related Work (mentioning)
confidence: 99%
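To make the three-step pipeline concrete, here is a minimal PyTorch sketch of steps two and three (pruning, then fine-tuning), assuming step one (pre-training) has already produced model; the global 80% magnitude-pruning amount and the user-supplied finetune_fn are illustrative assumptions, not settings taken from the cited papers.

    import torch
    import torch.nn.utils.prune as prune

    def prune_and_finetune(model, finetune_fn, amount=0.8):
        # Step 2: globally remove the smallest-magnitude weights from all
        # Linear and Conv2d layers (standard magnitude pruning).
        targets = [(m, "weight") for m in model.modules()
                   if isinstance(m, (torch.nn.Linear, torch.nn.Conv2d))]
        prune.global_unstructured(targets,
                                  pruning_method=prune.L1Unstructured,
                                  amount=amount)
        # Step 3: fine-tune the sparse sub-network with a user-supplied loop.
        finetune_fn(model)
        # Make the pruning permanent by folding each mask into its weight.
        for module, name in targets:
            prune.remove(module, name)
        return model

Early-bird-style approaches shorten step one by drawing the mask after only a few epochs of pre-training, as sketched under the Lottery Ticket Hypothesis quotation above.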
“…However, resource-constrained training was not explored much until a few recent efforts on classification [18,32,36].…”
Section: Introduction (mentioning)
confidence: 99%