“…Models based on Transformers (Vaswani et al., 2017), such as BERT (Devlin et al., 2018) and its variants (Yang et al., 2019; Lan et al., 2019; Raffel et al., 2019), yield state-of-the-art results on many NLP tasks, such as language modeling (Child et al., 2019; Sukhbaatar et al., 2019; Rae et al., 2019; Kitaev et al., 2020), question answering (Lan et al., 2019; Zaheer et al., 2020; Beltagy et al., 2020), and summarization (Zhang et al., 2019). However, existing studies show that they exhibit poor compositional generalization.…”