2020
DOI: 10.48550/arxiv.2010.04303
Preprint

How Can Self-Attention Networks Recognize Dyck-n Languages?

Abstract: We focus on the recognition of Dyck-n (D_n) languages with self-attention (SA) networks, a task that has been deemed difficult for these networks. We compare the performance of two variants of SA, one with a starting symbol (SA+) and one without (SA−). Our results show that SA+ is able to generalize to longer sequences and deeper dependencies. For D_2, we find that SA− completely breaks down on long sequences, whereas SA+ reaches an accuracy of 58.82%. We find the attention maps learned by SA+ to be am…
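As a rough illustration of the task described in the abstract (a sketch, not the paper's implementation), the snippet below checks Dyck-n membership with a stack and shows the only difference between the two input variants: SA+ prepends a start symbol (here a hypothetical "<bos>" token), while SA− feeds the raw bracket sequence. The bracket vocabulary and token names are assumptions for illustration.

```python
# Sketch of the Dyck-n recognition setup, assuming a fixed set of bracket pairs
# and a hypothetical "<bos>" start token for the SA+ variant (not the authors' code).

def is_dyck(seq, pairs=(("(", ")"), ("[", "]"))):
    """Check Dyck-n membership with a stack: each closer must match the most
    recently opened bracket, and the stack must be empty at the end."""
    opens = {o: c for o, c in pairs}
    closers = {c: o for o, c in pairs}
    stack = []
    for tok in seq:
        if tok in opens:
            stack.append(tok)
        elif tok in closers:
            if not stack or stack.pop() != closers[tok]:
                return False
        else:
            return False  # symbol outside the bracket vocabulary
    return not stack


def make_input(seq, with_start_symbol):
    """Prepare the token sequence for SA+ (start symbol prepended) or SA- (raw)."""
    return (["<bos>"] + list(seq)) if with_start_symbol else list(seq)


if __name__ == "__main__":
    for s in ["([])[]", "([)]", "(((("]:
        print(s, is_dyck(s), make_input(s, with_start_symbol=True))
```

In the paper's framing, the interesting question is whether the self-attention model can learn this stack-like matching behaviour from data; the checker above only defines the gold labels for such an experiment.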

Cited by 1 publication (1 citation statement)
References: 17 publications
“…Connections between LSTMs and counter automata have also been established empirically (Suzgun et al., 2019a) and theoretically (Merrill et al., 2020). More recently, multiple works have investigated the ability of Transformers to recognize various regular, context-free (Ebrahimi et al., 2020; Yao et al., 2021; Bhattamishra et al., 2020b), and mildly context-sensitive languages (Wang, 2021).…”
Section: H. Additional Related Work
Citation type: mentioning
Confidence: 99%