Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.384
How Can Self-Attention Networks Recognize Dyck-n Languages?

Abstract: We focus on the recognition of Dyck-n (Dn) languages with self-attention (SA) networks, a task that has been deemed difficult for these networks. We compare the performance of two variants of SA, one with a starting symbol (SA+) and one without (SA−). Our results show that SA+ is able to generalize to longer sequences and deeper dependencies. For D2, we find that SA− completely breaks down on long sequences, whereas the accuracy of SA+ is 58.82%. We find attention maps learned by SA+ to be amenable to interpretation and compatible with a stack-based language recognizer.
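The abstract's point of comparison is the classical stack-based recognizer for Dyck-n. For reference, here is a minimal Python sketch of that recognizer (not from the paper; the three-bracket alphabet is an illustrative choice standing in for the n bracket types):

```python
# Classical stack-based Dyck-n recognizer, for reference.
PAIRS = {")": "(", "]": "[", "}": "{"}  # illustrative Dyck-3 alphabet

def is_dyck(s: str) -> bool:
    """Return True iff s is well-nested over the bracket pairs above."""
    stack = []
    for ch in s:
        if ch in PAIRS.values():   # opening bracket: push it
            stack.append(ch)
        elif ch in PAIRS:          # closing bracket: must match stack top
            if not stack or stack.pop() != PAIRS[ch]:
                return False
        else:
            return False           # symbol outside the bracket alphabet
    return not stack               # accept only if every bracket is closed

assert is_dyck("([]{})") and not is_dyck("([)]")
```

Recognizing Dyck-n is the canonical test of hierarchical (context-free) structure: a correct recognizer must track arbitrarily deep nesting, which is what makes the task a natural stress test for self-attention.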

Cited by 12 publications (13 citation statements) | References 16 publications
“…A number of studies have considered self-attention models, especially in the past year. Ebrahimi et al. (2020) investigated self-attention models using Dyck languages, and claimed that self-attention models with a starting symbol are able to generalise to longer sequences and deeper structures without learning recursion, as competitive LSTM models do. In contrast to us, they studied models trained autoregressively only.…”
Section: Related Work
confidence: 99%
“…In practice, Bhattamishra et al. (2020) show transformers can learn tasks requiring counting, and that they struggle when more complicated structural representations are required. Ebrahimi et al. (2020) find that attention patterns of certain heads can emulate bounded stacks, but that this ability falls off sharply for longer sequences. Thus, the abilities of trained LSTMs and transformers appear to be predicted by the classes of problems solvable by their saturated counterparts.…”
Section: NLP and Formal Language Theory
confidence: 84%
“…Bhattamishra et al. (2020a) prove a soft-attention network with positional masking (but no positional encodings) can solve Dyck-1 but not Dyck-2. Despite the expressivity issues theoretically posed by the above work, empirical findings have shown Transformers can learn Dyck-k from finite samples and outperform LSTMs (Ebrahimi et al., 2020). Our work addresses the theory-practice discrepancy by using positional encodings and modeling Dyck-(k,D).…”
Section: Related Work
confidence: 97%
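The Dyck-1 vs. Dyck-2 gap in the statement above is easy to make concrete: Dyck-1 membership reduces to a single running counter, the kind of quantity counting-style attention can track, while Dyck-2 must also remember which bracket type is open. A minimal sketch, assuming the standard definitions:

```python
def is_dyck1(s: str) -> bool:
    """Dyck-1 over '(' / ')': a single counter suffices.
    Accept iff the depth never goes negative and ends at zero."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:              # closed more than opened so far: reject
            return False
    return depth == 0

assert is_dyck1("(()())") and not is_dyck1("())(")
# Counting alone cannot scale to Dyck-2: a counter tracking opens
# minus closes over two bracket types would accept the ill-nested
# string "[)", so bracket identity (a stack) is required.
```

Dyck-(k,D) in the last sentence denotes Dyck-k restricted to nesting depth at most D, the bounded-depth variant the citing work models.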
“…Please refer to details in Appendix B.4. Ebrahimi et al. (2020): the second layer of a two-layer Transformer trained on Dyck-k often produces virtually hard attention, where tokens attend to the stack-top open bracket (or start token). It also explains why such a pattern is found less systematically as input depth increases, as (6) is hard to learn and generalize to unbounded depth in practice.…”
Section: Second Layer - Depth Matching
confidence: 99%
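The attention pattern described in that last statement, each token attending to the stack-top open bracket or else the start token, can be written down directly. A sketch with a hypothetical helper, assuming a start symbol at position 0 and one convention for closing brackets (attend after the pop):

```python
def stack_top_targets(tokens):
    """For each position, the index a 'virtually hard' attention head
    would select: the currently open (stack-top) bracket, or 0 (the
    start token) when the stack is empty. tokens[0] is assumed to be
    a start symbol such as '<s>'."""
    stack, targets = [], []
    for i, tok in enumerate(tokens):
        if tok in "([":              # opening bracket: push its index
            stack.append(i)
        elif tok in ")]" and stack:  # closing bracket: pop its partner
            stack.pop()
        targets.append(stack[-1] if stack else 0)
    return targets

print(stack_top_targets(["<s>", "(", "[", "]", ")"]))
# -> [0, 1, 2, 1, 0]: '[' attends to itself, ']' falls back to the
#    enclosing '(', and the final ')' falls back to '<s>'.
```

Under this convention a closing bracket attends to the bracket that encloses it once its own match is popped; the quoted statement leaves the exact convention to the cited analysis.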