Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges 2019
DOI: 10.18653/v1/w19-3905

LSTM Networks Can Perform Dynamic Counting

Abstract: In this paper, we systematically assess the ability of standard recurrent networks to perform dynamic counting and to encode hierarchical representations. All the neural models in our experiments are designed to be small-sized networks, both to prevent them from memorizing the training sets and to visualize and interpret their behaviour at test time. Our results demonstrate that Long Short-Term Memory (LSTM) networks can learn to recognize the well-balanced parenthesis language (Dyck-1) and the shuffles of multiple Dyck-1 languages…
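As a rough illustration of the counting mechanism the paper studies (a sketch of the target language, not of the authors' networks), Dyck-1 membership can be decided with a single counter that goes up on '(' and down on ')':

```python
def is_dyck1(s: str) -> bool:
    """Recognize the well-balanced parenthesis language (Dyck-1) with a
    single counter -- the kind of dynamic counting an LSTM cell state
    can emulate."""
    count = 0
    for ch in s:
        if ch == "(":
            count += 1              # one more unmatched open bracket
        elif ch == ")":
            count -= 1              # close the most recent open bracket
            if count < 0:           # a ')' with nothing left to match
                return False
        else:
            return False            # symbol outside the Dyck-1 alphabet
    return count == 0               # accept only if everything is closed


assert is_dyck1("(()())")
assert not is_dyck1("())(")
```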

Cited by 48 publications (64 citation statements). References 31 publications.
“…There is empirical evidence that the hidden states of LSTM sequence-to-sequence models trained to perform machine translation track sequence length by implementing something akin to a counter that increments during encoding and decrements during decoding (Shi et al., 2016). These results are consistent with theoretical and empirical findings showing that LSTMs can efficiently implement counting mechanisms (Weiss et al., 2018; Suzgun et al., 2019a; Merrill, 2020). Our experiments will show that tracking absolute token position by implementing something akin to these counters makes extrapolation difficult.…”
Section: Related Work (supporting; confidence: 64%)
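A toy illustration (ours, not from the cited work) of the length-tracking behaviour described above: the counter rises by one per source token during encoding and falls by one per target token during decoding, returning to zero exactly when the output is as long as the input.

```python
def simulate_length_counter(source_tokens, target_tokens):
    """Simulate the counter-like dynamics described above: increment once
    per source token (encoding), decrement once per target token
    (decoding); a final value of zero means the lengths matched."""
    count = 0
    for _ in source_tokens:    # encoding phase
        count += 1
    for _ in target_tokens:    # decoding phase
        count -= 1
    return count


assert simulate_length_counter(["a", "b", "c"], ["x", "y", "z"]) == 0
assert simulate_length_counter(["a", "b"], ["x", "y", "z"]) == -1
```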
“…Our results on the parentheses corpora do not necessarily prove that the LSTMs trained on the Nesting Parentheses corpus aren't encoding and utilizing hierarchical structure. In fact, previous research shows that LSTMs are able to successfully model stack-based hierarchical languages (Suzgun et al., 2019b; Yu et al., 2019; Suzgun et al., 2019a). What our results do indicate is that, in order for LSTMs to model human language, being able to model hierarchical structure is similar in utility to having access to a non-hierarchical ability to "look back" at one relevant dependency.…”
Section: Discussion (supporting; confidence: 53%)
“…On the other hand, a long line of research has sought to understand the capabilities of recurrent neural models such as the LSTM (Hochreiter and Schmidhuber, 1997). Recently, Weiss et al. (2018) and Suzgun et al. (2019a) showed that LSTMs are capable of recognizing counter languages such as Dyck-1 and a^n b^n by learning to perform counting-like behavior. Suzgun et al. (2019a) showed that LSTMs can recognize shuffles of multiple Dyck-1 languages, also known as Shuffle-Dyck.…”
Section: Introduction (mentioning; confidence: 99%)
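A minimal sketch of why Shuffle-Dyck is a counter language rather than a stack language: bracket pairs from different Dyck-1 components need not nest with respect to each other, so one independent counter per bracket type suffices (the two-pair bracket inventory below is assumed for illustration).

```python
def is_shuffle_dyck(s: str, pairs=(("(", ")"), ("[", "]"))) -> bool:
    """Recognize a shuffle of Dyck-1 languages: each bracket pair is
    tracked by its own independent counter, with no stack needed."""
    counters = {open_b: 0 for open_b, _ in pairs}
    opener_of = {close_b: open_b for open_b, close_b in pairs}
    for ch in s:
        if ch in counters:
            counters[ch] += 1                 # open bracket of this type
        elif ch in opener_of:
            counters[opener_of[ch]] -= 1      # close bracket of this type
            if counters[opener_of[ch]] < 0:   # closed with nothing open
                return False
        else:
            return False                      # symbol outside the alphabet
    return all(c == 0 for c in counters.values())


# "([)]" is not well-nested, but each bracket type is balanced on its own,
# so it belongs to the shuffle of the two Dyck-1 languages.
assert is_shuffle_dyck("([)]")
assert not is_shuffle_dyck("(]")
```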