Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.211

Losing Heads in the Lottery: Pruning Transformer Attention in Neural Machine Translation

Abstract: The attention mechanism is the crucial component of the transformer architecture. Recent research shows that most attention heads are not confident in their decisions and can be pruned after training. However, removing them before training a model results in lower quality. In this paper, we apply the lottery ticket hypothesis to prune heads in the early stages of training, instead of doing so on a fully converged model. Our experiments on machine translation show that it is possible to remove up to three-quarters…
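To make the idea in the abstract concrete, below is a minimal sketch (not the authors' implementation) of lottery-ticket-style early head pruning: after a short warm-up, heads are ranked by a simple confidence proxy and all but the top fraction are masked for the rest of training. The confidence proxy and the keep ratio are illustrative assumptions.

```python
import torch

def head_keep_mask(attn_weights: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """Build a {0,1} keep-mask over attention heads from warm-up statistics.

    attn_weights: [num_heads, tgt_len, src_len] attention probabilities averaged
    over a few warm-up batches. The 'confidence' proxy used here is how peaked
    each head's distribution is (mean max attention weight); this is an
    assumption for illustration, not the paper's exact criterion.
    """
    num_heads = attn_weights.size(0)
    confidence = attn_weights.max(dim=-1).values.mean(dim=-1)   # [num_heads]
    k = max(1, int(round(keep_ratio * num_heads)))
    keep = torch.zeros(num_heads)
    keep[confidence.topk(k).indices] = 1.0
    return keep  # multiply each head's output by its mask entry and continue training


# Toy example: 8 heads, keep the most confident quarter after warm-up.
warmup_attn = torch.softmax(torch.randn(8, 10, 12), dim=-1)
print(head_keep_mask(warmup_attn, keep_ratio=0.25))
```

The point of doing this early rather than after convergence is that the surviving heads still get most of the training budget, in line with the lottery ticket view of pruning.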

Cited by 39 publications (28 citation statements) · References 19 publications
“…In addition, we conduct a one-shot pruning for computational simplicity. We leave other importance measures and pruning schedules, which may help identify better generalized super tickets, for future work (Voita et al., 2019; Behnke and Heafield, 2020; Fan et al., 2019; Zhou et al., 2020; Sajjad et al., 2020). Searching Super Tickets Efficiently.…”
Section: Discussion (mentioning)
confidence: 99%
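For readers unfamiliar with the terminology in the quote above: one-shot pruning removes a fixed fraction of heads in a single step from an importance score, rather than over an iterative schedule. A minimal sketch, assuming per-head importance scores are already available (e.g., gradient-based sensitivity in the style of Michel et al., 2019); the function and variable names are illustrative.

```python
import torch

def one_shot_prune(importance: torch.Tensor, prune_fraction: float) -> torch.Tensor:
    """Return a {0,1} mask that removes the least-important heads in a single step.

    importance: [num_layers, num_heads] scores, higher = more important
    (how the scores are obtained is left to the caller).
    """
    flat = importance.flatten()
    n_prune = int(prune_fraction * flat.numel())
    mask = torch.ones_like(flat)
    if n_prune > 0:
        # Indices of the globally least important heads across all layers.
        mask[flat.topk(n_prune, largest=False).indices] = 0.0
    return mask.view_as(importance)


# Toy example: 6 layers x 8 heads, prune 75% of the heads in one shot.
scores = torch.rand(6, 8)
mask = one_shot_prune(scores, prune_fraction=0.75)
print(int(mask.sum().item()), "of", mask.numel(), "heads kept")
```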
“…If this becomes possible we would be able to quantify the degree to which the world's writing systems have become on balance less logographic over time, an interesting computational twist on Gelb's original intuition. [35] For a different approach to this issue see Beinborn, Zesch, and Gurevych (2016), who train a model to predict spelling difficulty, based on corpora of spelling errors in three languages. [36] We note in passing that such burden of proof of broader interest is inconsistently applied across areas of computational linguistics.…”
Section: Discussion (mentioning)
confidence: 99%
“…One difficulty that naturally arises in the transformer setting is how to select the appropriate representation of attention weights given multiple self-attention heads. There has been an increased research focus on analyzing the behavior of attention mechanisms in various flavors of transformer models in order to understand the linguistic function of the attention and also improve model compression schemes (Clark et al. 2019; Michel, Levy, and Neubig 2019; Vig and Belinkov 2019; Voita et al. 2019; Behnke and Heafield 2020; Wang et al. 2020; Rogers, Kovaleva, and Rumshisky 2021). While in-depth investigation into the precise role the multiple attention heads play for logography is outside the scope of this work, we opt for a simple strategy whereby we inspect multiple attention heads in the top layer of the decoder-encoder attention block.…”
Section: Investigation of Alternative Neural Attention Architectures (mentioning)
confidence: 99%
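As a rough illustration of the strategy described above (inspecting the heads of the top decoder-encoder attention block), the sketch below reads cross-attention weights from a Hugging Face seq2seq translation model. The checkpoint name, the example sentence pair, and the mean-over-heads aggregation are assumptions for the example, not the cited paper's setup; a recent transformers version is assumed.

```python
# Requires: pip install torch transformers sentencepiece
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Helsinki-NLP/opus-mt-en-de"   # example checkpoint; any seq2seq model works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

src = tokenizer("The attention mechanism is crucial.", return_tensors="pt")
tgt = tokenizer(text_target="Der Aufmerksamkeitsmechanismus ist entscheidend.",
                return_tensors="pt")

with torch.no_grad():
    out = model(input_ids=src.input_ids, attention_mask=src.attention_mask,
                labels=tgt.input_ids, output_attentions=True)

# cross_attentions: one tensor per decoder layer, [batch, num_heads, tgt_len, src_len].
top_layer = out.cross_attentions[-1][0]     # top layer, first (only) batch element
print("per-head shape:", top_layer.shape)   # [num_heads, tgt_len, src_len]
print("mean over heads:\n", top_layer.mean(dim=0))
```

Averaging over heads is only one option; the quote's point is precisely that different heads may need to be inspected individually.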
“…Recent studies have analyzed the roles of attention heads in Transformer models, either in language modeling (LM) (Michel et al., 2019; Clark et al., 2019; Jo and Myaeng, 2020) or in NMT (Voita et al., 2019; Behnke and Heafield, 2020; Michel et al., 2019). It has been shown that a set of attention heads might be redundant at inference and can be pruned with almost no loss in performance.…”
Section: Related Work (mentioning)
confidence: 99%
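A small sketch of the redundancy test this last quote alludes to: ablate heads one at a time at inference and keep them pruned whenever the quality metric barely moves. The mask handling and the `evaluate` callable are placeholders; a real setup would score BLEU or a similar metric on a held-out set.

```python
import torch

def find_redundant_heads(head_mask: torch.Tensor, evaluate, baseline: float,
                         tolerance: float = 0.1):
    """Greedily zero out heads whose removal costs at most `tolerance` metric points.

    head_mask: [num_layers, num_heads] tensor of ones, modified in place.
    evaluate:  callable that runs the model under the current mask and returns a score.
    """
    redundant = []
    for layer in range(head_mask.size(0)):
        for head in range(head_mask.size(1)):
            head_mask[layer, head] = 0.0
            if baseline - evaluate(head_mask) <= tolerance:
                redundant.append((layer, head))       # cheap to remove: keep it pruned
            else:
                head_mask[layer, head] = 1.0          # too costly: restore the head
    return redundant


# Toy usage with a dummy metric that only cares about layer-0 heads.
mask = torch.ones(2, 4)
dummy_eval = lambda m: 30.0 - 5.0 * (4 - m[0].sum().item())
print(find_redundant_heads(mask, dummy_eval, baseline=30.0))
```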