Dissecting Lottery Ticket Transformers: Structural and Behavioral Study of Sparse Neural Machine Translation

Zuo

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Confer

et al. 2021

The Lottery Ticket Hypothesis suggests that an over-parametrized network consists of "lottery tickets", and training a certain collection of them (i.e., a subnetwork) can match the performance of the full model. In this paper, we study such a collection of tickets, which is referred to as "winning tickets", in extremely over-parametrized models, e.g., pre-trained language models. We observe that at certain compression ratios, the generalization performance of the winning tickets can not only match but also exceed that of the full model. In particular, we observe a phase transition phenomenon: As the compression ratio increases, generalization performance of the winning tickets first improves then deteriorates after a certain threshold. We refer to the tickets on the threshold as "super tickets". We further show that the phase transition is task and model dependent -as the model size becomes larger and the training data set becomes smaller, the transition becomes more pronounced. Our experiments on the GLUE benchmark show that the super tickets improve single task fine-tuning by 0.9 points on BERT-base and 1.0 points on BERT-large, in terms of task-average score. We also demonstrate that adaptively sharing the super tickets across tasks benefits multi-task learning 1 .

Section: Structured and Unstructured Lthsmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

Super Tickets in Pre-Trained Language Models: From Model Compression to Improving Generalization

Zuo

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Confer

et al. 2021

“…The Transformer architecture (Vaswani et al, 2017) became the backbone of the state-of-the-art models in a variety of tasks Raffel et al, 2019;Adiwardana et al, 2020;Brown et al, 2020). This spurred a significant interest in better understanding inner workings of these models (Vig and Belinkov, 2019;Clark et al, 2019;Kharitonov and Chaabouni, 2020;Hahn, 2020;Movva and Zhao, 2020;Chaabouni et al, 2021;Merrill et al, 2021;Sinha et al, 2021). Most of these works have focussed specifically on how models generalize and capture structure across samples that are similar.…”

Section: Introductionmentioning

confidence: 99%

How BPE Affects Memorization in Transformers

Kharitonov¹,

Baroni²,

Hupkes³

2021

Preprint

Training data memorization in NLP can both be beneficial (e.g., closed-book QA) and undesirable (personal data extraction). In any case, successful model training requires a non-trivial amount of memorization to store word spellings, various linguistic idiosyncrasies and common knowledge. However, little is known about what affects the memorization behavior of NLP models, as the field tends to focus on the equally important question of generalization. In this work, we demonstrate that the size of the subword vocabulary learned by Byte-Pair Encoding (BPE) greatly affects both ability and tendency of standard Transformer models to memorize training data, even when we control for the number of learned parameters. We find that with a large subword vocabulary size, Transformer models fit random mappings more easily and are more vulnerable to membership inference attacks. Similarly, given a prompt, Transformer-based language models with large subword vocabularies reproduce the training data more often. We conjecture this effect is caused by reduction in the sequences' length that happens as the BPE vocabulary grows. Our findings can allow a more informed choice of hyper-parameters, that is better tailored for a particular use-case.

“…Previous studies [3,17] have shown that pruned neural networks evolve to substantially different representations while striving to preserve overall accuracy. In Section 3, we have demonstrated that knowledge distillation can effectively mitigate both pruning and data induced bias in compressed networks.…”

Section: Explaining Model Bias Using Model Similaritymentioning

confidence: 99%

“…Literature on network pruning has been historically focused on accuracy [14,5] with recently work on robustness [7,22]. Movva and Zhao [17] investigated the impact of pruning on layer similarities of NLP models using LinearCKA [12]. Ansuini et al and Blakeney et al also investigated how pruning can change representations using similarity based measures [2,3].…”

Section: Related Workmentioning

confidence: 99%

Simon Says: Evaluating and Mitigating Bias in Pruned Neural Networks with Knowledge Distillation

Blakeney¹,

Huish²,

Yan³

et al. 2021

Preprint

In recent years the ubiquitous deployment of AI has posed great concerns in regards to algorithmic bias, discrimination, and fairness. Compared to traditional forms of bias or discrimination caused by humans, algorithmic bias generated by AI is more abstract and unintuitive therefore more difficult to explain and mitigate. A clear gap exists in the current literature on evaluating and mitigating bias in pruned neural networks. In this work, we strive to tackle the challenging issues of evaluating, mitigating, and explaining induced bias in pruned neural networks. Our paper makes three contributions. First, we propose two simple yet effective metrics, Combined Error Variance (CEV) and Symmetric Distance Error (SDE), to quantitatively evaluate the induced bias prevention quality of pruned models. Second, we demonstrate that knowledge distillation can mitigate induced bias in pruned neural networks, even with unbalanced datasets. Third, we reveal that model similarity has strong correlations with pruning induced bias, which provides a powerful method to explain why bias occurs in pruned neural networks. Our code is available at https://github.com/codestar12/pruning-distilation-bias Preprint. Under review.