2020
DOI: 10.48550/arxiv.2005.03454
Preprint

Successfully Applying the Stabilized Lottery Ticket Hypothesis to the Transformer Architecture

Abstract: Sparse models require less memory for storage and enable faster inference by reducing the necessary number of FLOPs. This is relevant both for time-critical and on-device computations using neural networks. The stabilized lottery ticket hypothesis states that networks can be pruned after none or few training iterations, using a mask computed based on the unpruned converged model. On the transformer architecture and the WMT 2014 English→German and English→French tasks, we show that stabilized lottery ticket p…
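
The mask-then-rewind procedure described in the abstract can be illustrated with a minimal sketch, assuming PyTorch, a global magnitude criterion over weight matrices, and hypothetical `converged_model` / `early_checkpoint` objects; this is an illustration of the general idea, not the paper's actual implementation.

```python
import torch

def magnitude_masks(model, sparsity=0.6):
    """Binary masks that keep the largest-magnitude weights of a converged
    model; `sparsity` is the global fraction of weight-matrix entries to prune."""
    flat = torch.cat([p.detach().abs().flatten()
                      for p in model.parameters() if p.dim() > 1])
    k = int(sparsity * flat.numel())
    threshold = torch.kthvalue(flat, k).values if k > 0 else flat.new_tensor(-1.0)
    return {name: (p.detach().abs() > threshold).float()
            for name, p in model.named_parameters() if p.dim() > 1}

def apply_masks(model, masks):
    """Zero out pruned weights; re-apply after every optimizer step so the
    pruned positions stay at zero during retraining."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name].to(p.device))

# Usage sketch: the mask comes from the *converged* model, the weights are
# rewound to an early checkpoint ("none or few" training iterations), and the
# resulting sparse subnetwork is retrained.
masks = magnitude_masks(converged_model, sparsity=0.6)   # converged_model: assumption
model.load_state_dict(early_checkpoint)                  # early_checkpoint: assumption
apply_masks(model, masks)
# ... continue training, calling apply_masks(model, masks) after each step ...
```

Rewinding the surviving weights to an early checkpoint, rather than all the way back to the random initialization, is what distinguishes the stabilized variant from the original lottery ticket procedure.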

Cited by 4 publications (5 citation statements); references 2 publications.

“…(d) Success of LTs. While it is exciting to see widespread applicability of LTs in different domains (Brix, Bahar, and Ney 2020; Li et al. 2020; Venkatesh et al. 2020), the results presented in this paper suggest this success may be due to the underlying pruning algorithm (and transfer learning) rather than LT initializations themselves.…”
Section: Methods (mentioning)
confidence: 73%
“…In the context of compression approaches applied to NLP models, pruning embodies the idea of masking out weights that have low magnitude and do not contribute much to the output. The core intent of pruning is to first train a massive neural network, then mask out weights, and finally reach a sparse sub-network that does the heavy lifting of the full network (Brix et al., 2020) [123]. Behnke and Heafield [124] applied pruning to neural networks to speed up inference, using group lasso regularization to prune entire attention heads and feed-forward connections, speeding the model up by 51%.…”
Section: Pruning (mentioning)
confidence: 99%
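
As a rough illustration of the head-level pruning mentioned above, one can score each attention head by the norm of its slice of the output projection and mask the weakest heads. This is a hypothetical sketch (the tensor layout, head count, and keep ratio are assumptions), not the group-lasso training procedure of Behnke and Heafield.

```python
import torch

def head_scores(w_o, num_heads):
    """Score each head by the L2 norm of its column block in the output
    projection W_O (shape [d_model, d_model], columns grouped per head;
    this layout is an assumption for illustration)."""
    d_model = w_o.shape[1]
    head_dim = d_model // num_heads
    per_head = w_o.view(w_o.shape[0], num_heads, head_dim)
    return per_head.pow(2).sum(dim=(0, 2)).sqrt()

def head_mask(w_o, num_heads, keep_ratio=0.5):
    """Keep the highest-scoring heads and return a {0,1} mask over heads."""
    scores = head_scores(w_o, num_heads)
    k = max(1, int(keep_ratio * num_heads))
    mask = torch.zeros(num_heads)
    mask[scores.topk(k).indices] = 1.0
    return mask

# Example with random weights: 8 heads, d_model = 512, keep half the heads.
mask = head_mask(torch.randn(512, 512), num_heads=8, keep_ratio=0.5)
```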
“…Although IMP methods [84,2,18,5,46] [24,82,51]. Based on the IMP technique, researchers have found evidence of the LTH in various applications, including visual recognition [24], natural language processing [4,7,53], reinforcement learning [69,81], generative models [31], low-cost neural network ensembling [46], and improved robustness [8]. Although the LTH has been actively explored in the ANN domain, the LTH for SNNs is rarely studied.…”
Section: Lottery Ticket Hypothesis (mentioning)
confidence: 99%
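
For reference, the IMP procedure behind these results alternates training, global magnitude pruning, and rewinding. The sketch below reuses the magnitude_masks/apply_masks helpers from the sketch after the abstract; `train_for_a_while`, the round count, and the final sparsity are hypothetical placeholders, not a real API.

```python
import copy

def iterative_magnitude_pruning(model, train_for_a_while, rounds=5, final_sparsity=0.8):
    """Sketch of iterative magnitude pruning (IMP): train, prune a growing
    fraction of the smallest-magnitude weights, rewind the survivors, repeat."""
    init_state = copy.deepcopy(model.state_dict())      # weights to rewind to
    masks = None
    for r in range(1, rounds + 1):
        train_for_a_while(model, masks)                  # hypothetical training step
        sparsity = final_sparsity * r / rounds           # prune more each round
        masks = magnitude_masks(model, sparsity)
        model.load_state_dict(init_state)                # rewind surviving weights
        apply_masks(model, masks)
    return model, masks
```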
“…The discovered sub-networks and their corresponding initialization parameters are referred to as winning tickets. Based on the LTH, a line of work has successfully shown the existence of winning tickets across various tasks such as standard recognition tasks [80,22,24], reinforcement learning [69,81], natural language processing [4,7,53], and generative models [31]. Along the same line, our primary research objective is to investigate the existence of winning tickets in deep SNNs, which have a different type of neuronal dynamics from common ANNs.…”
Section: Introduction (mentioning)
confidence: 99%