Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.384
How Can Self-Attention Networks Recognize Dyck-n Languages?

Abstract: We focus on the recognition of Dyck-n (Dn) languages with self-attention (SA) networks, a task that has been deemed difficult for these networks. We compare the performance of two variants of SA, one with a starting symbol (SA+) and one without (SA−). Our results show that SA+ is able to generalize to longer sequences and deeper dependencies. For D2, we find that SA− completely breaks down on long sequences, whereas the accuracy of SA+ is 58.82%. We find attention maps learned by SA+ to be amenable to interpretation and compatible with a stack-based language recognizer.
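The abstract's point of comparison is the classical stack-based recognizer for Dyck-n. For reference, here is a minimal Python sketch of that recognizer (not from the paper; the three-bracket alphabet is an illustrative choice standing in for the n bracket types):

```python
# Classical stack-based Dyck-n recognizer, for reference.
PAIRS = {")": "(", "]": "[", "}": "{"}  # illustrative Dyck-3 alphabet

def is_dyck(s: str) -> bool:
    """Return True iff s is well-nested over the bracket pairs above."""
    stack = []
    for ch in s:
        if ch in PAIRS.values():   # opening bracket: push it
            stack.append(ch)
        elif ch in PAIRS:          # closing bracket: must match stack top
            if not stack or stack.pop() != PAIRS[ch]:
                return False
        else:
            return False           # symbol outside the bracket alphabet
    return not stack               # accept only if every bracket is closed

assert is_dyck("([]{})") and not is_dyck("([)]")
```

Recognizing Dyck-n is the canonical test of hierarchical (context-free) structure: a correct recognizer must track arbitrarily deep nesting, which is what makes the task a natural stress test for self-attention.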

Cited by 12 publications (13 citation statements) | References 16 publications
“…A number of studies have considered self-attention models, especially in the past year. Ebrahimi et al. (2020) investigated self-attention models using Dyck languages, and claimed that self-attention models with a starting symbol are able to generalise to longer sequences and deeper structures without learning recursion, as competitive LSTM models do. In contrast to us, they studied models trained autoregressively only.…”
Section: Related Work
confidence: 99%
“…In practice, Bhattamishra et al. (2020) show transformers can learn tasks requiring counting, and that they struggle when more complicated structural representations are required. Ebrahimi et al. (2020) find that attention patterns of certain heads can emulate bounded stacks, but that this ability falls off sharply for longer sequences. Thus, the abilities of trained LSTMs and transformers appear to be predicted by the classes of problems solvable by their saturated counterparts.…”
Section: NLP and Formal Language Theory
confidence: 84%
“…Bhattamishra et al. (2020a) prove a soft-attention network with positional masking (but no positional encodings) can solve Dyck-1 but not Dyck-2. Despite the expressivity issues theoretically posed by the above work, empirical findings have shown Transformers can learn Dyck-k from finite samples and outperform LSTMs (Ebrahimi et al., 2020). Our work addresses the theory-practice discrepancy by using positional encodings and modeling Dyck-(k,D).…”
Section: Related Work
confidence: 97%
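The Dyck-1 vs. Dyck-2 gap in the statement above is easy to make concrete: Dyck-1 membership reduces to a single running counter, the kind of quantity counting-style attention can track, while Dyck-2 must also remember which bracket type is open. A minimal sketch, assuming the standard definitions:

```python
def is_dyck1(s: str) -> bool:
    """Dyck-1 over '(' / ')': a single counter suffices.
    Accept iff the depth never goes negative and ends at zero."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:              # closed more than opened so far: reject
            return False
    return depth == 0

assert is_dyck1("(()())") and not is_dyck1("())(")
# Counting alone cannot scale to Dyck-2: a counter tracking opens
# minus closes over two bracket types would accept the ill-nested
# string "[)", so bracket identity (a stack) is required.
```

Dyck-(k,D) in the last sentence denotes Dyck-k restricted to nesting depth at most D, the bounded-depth variant the citing work models.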
“…Please refer to details in Appendix B.4. Ebrahimi et al. (2020): the second layer of a two-layer Transformer trained on Dyck-k often produces virtually hard attention, where tokens attend to the stack-top open bracket (or start token). It also explains why such a pattern is found less systematically as input depth increases, as (6) is hard to learn and generalize to unbounded depth in practice.…”
Section: Second Layer - Depth Matching
confidence: 99%
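The attention pattern described in that last statement, each token attending to the stack-top open bracket or else the start token, can be written down directly. A sketch with a hypothetical helper, assuming a start symbol at position 0 and one convention for closing brackets (attend after the pop):

```python
def stack_top_targets(tokens):
    """For each position, the index a 'virtually hard' attention head
    would select: the currently open (stack-top) bracket, or 0 (the
    start token) when the stack is empty. tokens[0] is assumed to be
    a start symbol such as '<s>'."""
    stack, targets = [], []
    for i, tok in enumerate(tokens):
        if tok in "([":              # opening bracket: push its index
            stack.append(i)
        elif tok in ")]" and stack:  # closing bracket: pop its partner
            stack.pop()
        targets.append(stack[-1] if stack else 0)
    return targets

print(stack_top_targets(["<s>", "(", "[", "]", ")"]))
# -> [0, 1, 2, 1, 0]: '[' attends to itself, ']' falls back to the
#    enclosing '(', and the final ')' falls back to '<s>'.
```

Under this convention a closing bracket attends to the bracket that encloses it once its own match is popped; the quoted statement leaves the exact convention to the cited analysis.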