Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/2021.emnlp-main.49

The Devil is in the Detail: Simple Tricks Improve Systematic Generalization of Transformers

Abstract: Recently, many datasets have been proposed to test the systematic generalization ability of neural networks. The companion baseline Transformers, typically trained with default hyper-parameters from standard tasks, are shown to fail dramatically. Here we demonstrate that by revisiting model configurations as basic as scaling of embeddings, early stopping, relative positional embedding, and Universal Transformer variants, we can drastically improve the performance of Transformers on systematic generalization. W…
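
The abstract names embedding scaling as one of the revisited configuration choices. Below is a minimal sketch, in PyTorch, of what that choice typically looks like: token embeddings multiplied by the square root of the model dimension before positional information is added. The module and variable names are illustrative assumptions, not the authors' released code.

```python
import math
import torch
import torch.nn as nn

class ScaledEmbedding(nn.Module):
    """Token embedding with the sqrt(d_model) scaling discussed in the abstract."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.scale = math.sqrt(d_model)  # embedding scaling factor

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Scaling keeps embedding magnitudes comparable to the positional
        # encodings that are added afterwards.
        return self.embed(token_ids) * self.scale
```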

Cited by 31 publications (58 citation statements)
References 18 publications (10 reference statements)
“…Improving generalization: Many approaches have been recently proposed to examine and improve generalization, including the effect of training size and architecture, data augmentation (Andreas, 2020; Akyürek et al., 2021; Guo et al., 2021), data sampling (Oren et al., 2021), model architecture (Bogin et al., 2021b; Chen et al., 2020), intermediate representations, and different training techniques (Oren et al., 2020; Csordás et al., 2021).…”
Section: Related Work
confidence: 99%
“…Improving Generalization: Many approaches to examining and improving compositional generalization have been proposed, including specialized architectures with an inductive bias for compositional generalization (Bogin et al., 2021b; Gordon et al., 2020), data augmentation (Andreas, 2020; Akyürek et al., 2021; Guo et al., 2021), modifications to training methodology (Oren et al., 2020; Csordás et al., 2021), and meta-learning (Conklin et al., 2021; Lake, 2019). Data-based approaches have the advantage of being model-agnostic and hence can be used in conjunction with pretrained models.…”
Section: Related Work
confidence: 99%
“…Early self-attention mechanisms added representations of the absolute positions of tokens to their inputs (Vaswani et al., 2017). However, we use representations of relative positions, or distances between tokens, in line with recent work showing that relative attention is advantageous, particularly on length-generalization tasks (Shaw et al., 2018; Csordás et al., 2021). By considering logarithmic distances, our model is also encouraged to attend to more recent tokens during decoding, which can be desirable when programs consist of multiple smaller parts.…”
Section: Baseline Transformer
confidence: 99%
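
The citation statement above describes relative-position attention with logarithmic distances. The sketch below is a hedged reconstruction, in PyTorch, of one common way to realize this: a learned per-head bias indexed by a log-bucketed relative distance, added to the attention logits, so nearby (more recent) tokens fall into finer-grained buckets. Function and class names are assumptions for illustration, not the cited paper's implementation.

```python
import torch
import torch.nn as nn

def log_bucket(rel_dist: torch.Tensor, num_buckets: int = 16) -> torch.Tensor:
    # Map signed relative distances to a small set of buckets whose width
    # grows roughly logarithmically with |distance|; past and future tokens
    # use separate halves of the bucket range.
    sign = (rel_dist < 0).long() * (num_buckets // 2)
    mag = rel_dist.abs().clamp(min=1).float()
    idx = torch.log2(mag).long().clamp(max=num_buckets // 2 - 1)
    return sign + idx

class RelativeBias(nn.Module):
    """Learned per-head attention bias over log-bucketed relative distances."""

    def __init__(self, num_heads: int, num_buckets: int = 16):
        super().__init__()
        self.bias = nn.Embedding(num_buckets, num_heads)
        self.num_buckets = num_buckets

    def forward(self, q_len: int, k_len: int) -> torch.Tensor:
        # rel[i, j] = i - j (query position minus key position)
        rel = torch.arange(q_len)[:, None] - torch.arange(k_len)[None, :]
        buckets = log_bucket(rel, self.num_buckets)
        # Returns a (num_heads, q_len, k_len) bias to add to attention logits.
        return self.bias(buckets).permute(2, 0, 1)
```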