“…Whether our models have learned to solve tasks in robust and generalizable ways has been a topic of much recent interest. Challenging test sets have shown that many state-of-the-art NLP models struggle with compositionality (Kim and Linzen, 2020; Yu and Ettinger, 2020; White et al., 2020), and find it difficult to pass the myriad stress tests for social (May et al., 2019; Nangia et al., 2020) and/or linguistic competencies (Geiger et al., 2018; Naik et al., 2018; Glockner et al., 2018; White et al., 2018; Warstadt et al., 2019; Gauthier et al., 2020; Hossain et al., 2020; Jeretic et al., 2020; Lewis et al., 2020; Saha et al., 2020; Schuster et al., 2020; Sugawara et al., 2020). Yet, challenge sets may suffer from performance instability (Liu et al., 2019a; Rozen et al., 2019) and often lack sufficient statistical power (Card et al., 2020), suggesting that, although they may be valuable assessment tools, they are not sufficient for ensuring that our models have achieved the learning targets we set for them.…”