Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) 2020
DOI: 10.18653/v1/2020.emnlp-main.731

COGS: A Compositional Generalization Challenge Based on Semantic Interpretation

Abstract: Natural language is characterized by compositionality: the meaning of a complex expression is constructed from the meanings of its constituent parts. To facilitate the evaluation of the compositional abilities of language processing architectures, we introduce COGS, a semantic parsing dataset based on a fragment of English. The evaluation portion of COGS contains multiple systematic gaps that can only be addressed by compositional generalization; these include new combinations of familiar syntactic structures,…
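The abstract's core idea — that the meaning of a complex expression is built from the meanings of its parts — can be illustrated with a toy rule-based parser. This is a hypothetical sketch, not the COGS data format or any code from the paper: the lexicon, role names (`agent`, `theme`), and the `parse` function are all invented for illustration.

```python
# Toy compositional semantic parser: maps a simple
# "Subject verb the object" sentence to a logical form
# assembled from per-word meanings in a small lexicon.
LEXICON = {
    "ate": ("eat", "agent", "theme"),
    "saw": ("see", "agent", "theme"),
}

def parse(sentence):
    """Compose a logical form from the sentence's parts, e.g.
    'Emma ate the cake' -> 'eat(agent=Emma, theme=cake)'."""
    words = sentence.rstrip(".").split()
    subj, verb, obj = words[0], words[1], words[-1]
    pred, role1, role2 = LEXICON[verb]
    return f"{pred}({role1}={subj}, {role2}={obj})"

print(parse("Emma ate the cake"))  # eat(agent=Emma, theme=cake)
```

Because the output is assembled rule-by-rule from the lexicon, any new subject/object combination is handled automatically — exactly the kind of systematic generalization that COGS tests whether learned models can achieve.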

Cited by 128 publications (209 citation statements)
References 37 publications
“…Whether our models have learned to solve tasks in robust and generalizable ways has been a topic of much recent interest. Challenging test sets have shown that many state-of-the-art NLP models struggle with compositionality (Kim and Linzen, 2020; Yu and Ettinger, 2020; White et al., 2020), and find it difficult to pass the myriad stress tests for social (May et al., 2019; Nangia et al., 2020) and/or linguistic competencies (Geiger et al., 2018; Naik et al., 2018; Glockner et al., 2018; White et al., 2018; Warstadt et al., 2019; Gauthier et al., 2020; Hossain et al., 2020; Jeretic et al., 2020; Lewis et al., 2020; Saha et al., 2020; Schuster et al., 2020; Sugawara et al., 2020). Yet, challenge sets may suffer from performance instability (Liu et al., 2019a; Rozen et al., 2019) and often lack sufficient statistical power (Card et al., 2020), suggesting that, although they may be valuable assessment tools, they are not sufficient for ensuring that our models have achieved the learning targets we set for them.…”
Section: Challenge Sets and Adversarial Settings
confidence: 99%
“…The task is framed as a sequence generation task. We use the recently proposed COGS dataset (Kim and Linzen, 2020).…”
Section: Results On Compositional Generalization Challenge and Semantic Parsing
confidence: 99%
“…However, this is unlikely to produce human-like learning and generalisation, particularly in terms of extrapolation beyond the training distribution. For example, Kim and Linzen (2020) find that neural models of semantic parsing struggle to generalise from shallower structures (e.g., Ava saw the ball in the bottle on the table) to more deeply nested structures (e.g., …).…”
Section: Discussion
confidence: 99%
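The depth-generalization failure described in the statement above (shallow vs. deeply nested prepositional phrases) can be made concrete with a toy generator for nested noun phrases. This is an illustrative sketch only — it is not COGS's actual data-generation code, and the preposition list is an invented placeholder.

```python
# Build a noun phrase with recursively nested prepositional phrases,
# e.g. ["ball", "bottle", "table"] -> "the ball in the bottle on the table".
# Training sets can cap the nesting depth; a depth-generalization test
# set then uses longer noun chains than any seen in training.
def nest_pps(nouns):
    preps = ["in", "on", "beside"]  # placeholder prepositions
    phrase = f"the {nouns[0]}"
    for i, noun in enumerate(nouns[1:]):
        phrase += f" {preps[i % len(preps)]} the {noun}"
    return phrase

print(nest_pps(["ball", "bottle", "table"]))
# the ball in the bottle on the table
```

A model that has learned the recursive rule should handle four- or five-noun chains even if it was only trained on two- or three-noun ones; the cited finding is that standard neural semantic parsers often do not.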