Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.356
How LSTM Encodes Syntax: Exploring Context Vectors and Semi-Quantization on Natural Text

Abstract: Long Short-Term Memory recurrent neural networks (LSTMs) are widely used and known to capture informative long-term syntactic dependencies. However, how such information is reflected in their internal vectors for natural text has not yet been sufficiently investigated. We analyze them by learning a language model where syntactic structures are implicitly given. We empirically show that the context update vectors, i.e. outputs of internal gates, are approximately quantized to binary or ternary values to help the la…
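The abstract's central claim — that context update vectors are approximately quantized to binary or ternary values — can be probed with a small numpy sketch. This is a hedged illustration, not the paper's code: `ternary_deviation` is a hypothetical helper that measures how close gate-like activations lie to the ternary set {-1, 0, 1}, and the tanh-squashed random vectors stand in for real LSTM gate outputs.

```python
import numpy as np

# Hypothetical helper (not from the paper): mean absolute distance from
# each activation to its nearest level in the ternary set {-1, 0, 1},
# a rough proxy for the "semi-quantization" described in the abstract.
def ternary_deviation(values):
    levels = np.array([-1.0, 0.0, 1.0])
    dists = np.abs(values[:, None] - levels[None, :])
    return dists.min(axis=1).mean()

rng = np.random.default_rng(0)
saturated = np.tanh(10.0 * rng.normal(size=1000))  # sharply saturated gates
diffuse = np.tanh(0.5 * rng.normal(size=1000))     # mildly squashed gates

# Saturated activations cluster near the ternary levels; diffuse ones do not.
assert ternary_deviation(saturated) < ternary_deviation(diffuse)
```

A low deviation on real gate outputs would indicate the near-discrete behavior the paper reports; the thresholds and sampling here are purely illustrative.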

Cited by 6 publications (4 citation statements)
References 23 publications
“…Experiments suggest that LSTMs trained on synthetic tasks learn to implement counter memory (Weiss et al., 2018; Suzgun et al., 2019a), and that they fail on tasks requiring stacks and other deeper models of structure (Suzgun et al., 2019b). Similarly, Shibata et al. (2020) found that LSTM language models trained on natural language data acquire saturated representations approximating counters.…”
Section: NLP and Formal Language Theory
confidence: 90%
“…Phonology To study whether our models learn phonologically meaningful representations, we study our high-dimensional hidden representation for each item of our vocabulary, as suggested in Madsen et al. (2021). We reduce the dimensionality of our encoded representations using PCA (Pearson, 1901) and t-SNE (van der Maaten and Hinton, 2008) and look at the emerging underlying organisation of the phonetic space, as was done in Jacobs and Mailhot (2019) and Shibata et al. (2020) for, respectively, seq2seq phonetic and LSTM syntactic representation analysis.…”
Section: Synchronic Probes
confidence: 99%
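The PCA-then-t-SNE probing recipe described in the excerpt above can be sketched with scikit-learn. This is a minimal sketch under stated assumptions: the random matrix stands in for real LSTM hidden states, and the dimensions and perplexity are illustrative choices, not values from the cited work.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in for real hidden representations: 200 vocabulary items,
# each represented by a 512-dimensional hidden state.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 512))

# Step 1: denoise and compress with PCA before the nonlinear embedding.
reduced = PCA(n_components=50).fit_transform(hidden)

# Step 2: project to 2-D with t-SNE to inspect the emerging organisation.
embedded = TSNE(n_components=2, perplexity=30.0,
                init="pca", random_state=0).fit_transform(reduced)

print(embedded.shape)  # (200, 2)
```

Running t-SNE on PCA-reduced vectors rather than the raw 512-dimensional states is the usual practice: it suppresses noise directions and makes the pairwise-distance computation considerably cheaper.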
“…Similarly, they cannot reliably reverse strings (Hao et al., 2018; Merrill, 2019). Shibata et al. (2020) show that LSTM language models trained on natural language acquire semi-saturated representations where the gates tightly cluster around discrete values. Thus, sLSTMs appear to be a promising formal model of the counting behavior of LSTMs on both synthetic and natural tasks.…”
Section: Saturated Networks as Automata
confidence: 99%