Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/2021.emnlp-main.133
Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent

Abstract: The capacity of neural networks like the widely adopted transformer is known to be very high. Evidence is emerging that they learn successfully due to inductive bias in the training routine, typically a variant of gradient descent (GD). To better understand this bias, we study the tendency for transformer parameters to grow in magnitude (ℓ2 norm) during training, and its implications for the emergent representations within self-attention layers. Empirically, we document norm growth in the training of transform…
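
The central quantity in the abstract, the ℓ2 norm of the full parameter vector, is easy to track during training. Below is a minimal sketch (not from the authors' codebase) of how one might log it; it assumes a PyTorch model, and the training loop shown in the comments is a placeholder.

import math
import torch

def global_l2_norm(model: torch.nn.Module) -> float:
    # Treat all parameters as one flat vector and return its l2 norm.
    squared_sum = 0.0
    for p in model.parameters():
        squared_sum += p.detach().float().pow(2).sum().item()
    return math.sqrt(squared_sum)

# Hypothetical usage inside a standard training loop; `model`, `optimizer`,
# `loss_fn`, and `loader` are assumed to be defined elsewhere.
# for step, (inputs, targets) in enumerate(loader):
#     optimizer.zero_grad()
#     loss = loss_fn(model(inputs), targets)
#     loss.backward()
#     optimizer.step()
#     if step % 100 == 0:
#         print(step, global_l2_norm(model))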

Cited by 10 publications (6 citation statements). References 22 publications.
“…The Transformer architecture (Vaswani et al., 2017) became the backbone of the state-of-the-art models in a variety of tasks (Raffel et al., 2019; Adiwardana et al., 2020; Brown et al., 2020). This spurred a significant interest in better understanding the inner workings of these models (Vig and Belinkov, 2019; Clark et al., 2019; Kharitonov and Chaabouni, 2020; Hahn, 2020; Movva and Zhao, 2020; Chaabouni et al., 2021; Merrill et al., 2021; Sinha et al., 2021). Most of these works have focussed specifically on how models generalize and capture structure across samples that are similar.…”
Section: Introduction
confidence: 99%
“…In §4.1, we claimed that the scaling effect of layer normalization has no effect on the decisions of our constructions for PARITY and FIRST. This is related to the property of approximate homogeneity studied by Merrill et al. (2021).…”
Section: A Correctness of Parity Construction
confidence: 96%
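
As a brief illustrative aside (not the exact formalization used in either paper): the normalization step of layer norm is invariant to positive rescaling of its input, so uniformly blowing up the pre-normalization activations, for instance through growing parameter norms, leaves the normalized output, and hence downstream decisions, unchanged; homogeneity is one way to describe how a function behaves under such rescaling. In LaTeX, ignoring the learned affine transform that follows normalization:

\mathrm{LN}(c\,x) = \frac{c\,x - \operatorname{mean}(c\,x)}{\operatorname{std}(c\,x)}
                 = \frac{c\,\bigl(x - \operatorname{mean}(x)\bigr)}{c\,\operatorname{std}(x)}
                 = \mathrm{LN}(x), \qquad c > 0;
\qquad f \text{ is } k\text{-homogeneous iff } f(c\,x) = c^{k} f(x) \ \text{for all } c > 0.
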
“…We also tested Euclidean distance, but it did not produce clear clusters. This is likely caused by the growth of the norm of the weight vector during training (Merrill et al., 2020), which is unrelated to the data at hand (§C). This may also explain questions that were previously left open (Qin et al., 2022).…”
Section: Clustering
confidence: 99%
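
The observation in the last excerpt, that a magnitude-sensitive distance is dominated by norm growth carrying no information about the data, can be seen with a toy comparison. Below is a minimal sketch (not from either paper) contrasting Euclidean distance with cosine similarity under uniform rescaling of two fixed vectors:

import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=64)
v = rng.normal(size=64)

def cosine(a, b):
    # Cosine similarity is invariant to positive rescaling of its inputs.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for scale in (1.0, 10.0, 100.0):  # mimic weight vectors whose norm has grown
    a, b = scale * u, scale * v
    print(scale, round(float(np.linalg.norm(a - b)), 3), round(cosine(a, b), 6))

# The Euclidean distance grows linearly with the common scale factor while the
# cosine similarity stays fixed, so norm growth alone can wash out clusters
# found with Euclidean distance but leaves angle-based measures untouched.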