Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP 2020
DOI: 10.18653/v1/2020.blackboxnlp-1.19
Dissecting Lottery Ticket Transformers: Structural and Behavioral Study of Sparse Neural Machine Translation

Abstract: Recent work on the lottery ticket hypothesis has produced highly sparse Transformers for NMT while maintaining BLEU. However, it is unclear how such pruning techniques affect a model's learned representations. By probing Transformers as increasingly many low-magnitude weights are pruned away, we find that complex semantic information is the first to be degraded. Analysis of internal activations reveals that higher layers diverge most over the course of pruning, gradually becoming less complex than their dense counterparts…
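The pruning the abstract describes removes the lowest-magnitude weights at increasing sparsity levels. A minimal sketch of such magnitude pruning (function name and tie-breaking behavior are illustrative, not taken from the paper):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out roughly the fraction `sparsity` of weights with the
    smallest absolute value, leaving the rest untouched."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude serves as the pruning threshold;
    # ties at the threshold are pruned as well.
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask
```

Probing then compares the representations of such pruned models at several sparsity levels against the dense baseline.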

Cited by 4 publications (3 citation statements).
References 27 publications (29 reference statements).
“…LTH (Frankle and Carbin, 2018) has been widely explored in various applications of deep learning (Brix et al., 2020; Movva and Zhao, 2020; Girish et al., 2020). Most existing results focus on finding unstructured winning tickets via iterative magnitude pruning and rewinding in randomly initialized networks (Frankle et al., 2019; Renda et al., 2020), where each ticket is a single parameter.…”
Section: Structured and Unstructured LTHs
confidence: 99%
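The iterative magnitude pruning with rewinding that this citation refers to (Frankle et al., 2019) alternates training, pruning, and resetting the surviving weights to their initial values. A simplified sketch, where `train_step` stands in for a full training run and all names are illustrative:

```python
import numpy as np

def iterative_magnitude_prune(w_init, train_step, rounds=3, prune_frac=0.2):
    """Sketch of iterative magnitude pruning with weight rewinding:
    each round trains the masked network, prunes the smallest-magnitude
    surviving weights, then rewinds the survivors to their initial values."""
    mask = np.ones_like(w_init, dtype=bool)
    w = w_init.copy()
    for _ in range(rounds):
        w = train_step(w) * mask            # "train" the masked network
        alive = np.abs(w[mask])
        k = int(prune_frac * alive.size)    # prune a fraction per round
        if k:
            thresh = np.sort(alive)[k - 1]
            mask &= np.abs(w) > thresh
        w = w_init * mask                   # rewind survivors to init
    return mask, w
```

The returned mask defines the "winning ticket": the sparse subnetwork that is retrained from (near-)initial weights.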
“…The existence of such a collection of tickets, which is usually referred to as "winning tickets", indicates the potential of training a smaller network to achieve the full model's performance. LTH has been widely explored across various fields of deep learning (Frankle et al., 2019; You et al., 2019; Brix et al., 2020; Movva and Zhao, 2020; Girish et al., 2020).…”
Section: Introduction
confidence: 99%
“…However, this method also requires the pre-trained model in order to prune the initial model. Researchers have proposed methods that perform pruning on the untrained model [19][20][21][22]. Wang et al. [21] added a scalar 'gate value' to measure the effectiveness of each filter in the initial model.…”
Section: Related Work
confidence: 99%
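The gate-value idea in this citation scores each filter with a scalar and discards the low-scoring ones before training. A hypothetical sketch of that selection step (the function, parameter names, and keep-fraction policy are assumptions for illustration, not the method of Wang et al.):

```python
import numpy as np

def prune_filters_by_gate(filters: np.ndarray, gates: np.ndarray,
                          keep_frac: float = 0.5):
    """Keep the top `keep_frac` fraction of filters, ranked by the
    magnitude of their associated scalar gate values."""
    n_keep = max(1, int(keep_frac * len(gates)))
    order = np.argsort(np.abs(gates))[::-1]   # largest gates first
    keep = np.sort(order[:n_keep])            # preserve original order
    return filters[keep], keep
```

Because the gates are evaluated on the initial model, this style of pruning avoids the full pre-training pass that magnitude-based methods require.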