We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios. With Dynabench, dataset creation, model development, and model assessment can directly inform each other, leading to more robust and informative benchmarks. We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform, and address potential objections to dynamic benchmarking as a new standard for the field.
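To make the collection protocol concrete, here is a minimal sketch of one human-and-model-in-the-loop round, assuming a stub keyword-based sentiment classifier and a simulated second annotator. This is an illustration of the loop's logic, not Dynabench's actual implementation.

```python
# Minimal sketch of a Dynabench-style human-and-model-in-the-loop
# collection round. The stub model and validation step are
# illustrative assumptions, not Dynabench's actual code.
POSITIVE_CUES = {"great", "love", "wonderful"}

def target_model(text):
    """Stub sentiment classifier standing in for the target model."""
    words = set(text.lower().split())
    return "positive" if words & POSITIVE_CUES else "negative"

def collect_round(candidates, human_validate):
    """Keep examples the model misclassifies but a second human
    labels the same way as the author (model-fooling examples)."""
    dataset = []
    for text, author_label in candidates:
        model_label = target_model(text)
        if model_label != author_label and human_validate(text) == author_label:
            dataset.append({"text": text, "label": author_label,
                            "model_prediction": model_label})
    return dataset

# Simulated annotator submissions: the sarcastic example fools the stub.
candidates = [
    ("Oh great, another delay. Just wonderful.", "negative"),
    ("I love this phone.", "positive"),
]
fooled = collect_round(candidates, human_validate=lambda t: "negative")
print(fooled)
```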
Neural language models (LMs) perform well on tasks that require sensitivity to syntactic structure. Drawing on the syntactic priming paradigm from psycholinguistics, we propose a novel technique to analyze the representations that enable such success. By establishing a gradient similarity metric between structures, this technique allows us to reconstruct the organization of the LMs' syntactic representational space. We use this technique to demonstrate that LSTM LMs' representations of different types of sentences with relative clauses are organized hierarchically in a linguistically interpretable manner, suggesting that the LMs track abstract properties of the sentence.[1]

[1] Wells et al. (2009) measured priming effects for relative clauses, not dative constructions. For work on priming in production with dative constructions, see Kaschak et al. (2011).
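A minimal sketch of the priming-as-similarity idea follows: briefly adapt an LM on sentences of one structure, then measure how much surprisal drops on sentences of another structure; larger reductions indicate more similar representations. The paper studies LSTM LMs; GPT-2 is used here only as a readily available stand-in, and the prime/target sentences are illustrative placeholders, not the paper's stimuli.

```python
# Sketch of priming-based similarity between syntactic structures.
# Assumption: GPT-2 stands in for the paper's LSTM LMs.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def mean_surprisal(model, sentences):
    """Average per-token surprisal (negative log-probability) in nats."""
    model.eval()
    total, count = 0.0, 0
    with torch.no_grad():
        for s in sentences:
            ids = tokenizer(s, return_tensors="pt").input_ids
            out = model(ids, labels=ids)
            # out.loss is mean cross-entropy over the predicted tokens
            total += out.loss.item() * (ids.size(1) - 1)
            count += ids.size(1) - 1
    return total / count

def adapt(model, sentences, lr=1e-4, epochs=2):
    """Fine-tune briefly on 'prime' sentences (the adaptation step)."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for s in sentences:
            ids = tokenizer(s, return_tensors="pt").input_ids
            loss = model(ids, labels=ids).loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

# Illustrative prime/target sets: two object relative clauses.
primes = ["The author that the critics praised wrote a new book."]
targets = ["The actor that the director hired missed the rehearsal."]

baseline = mean_surprisal(GPT2LMHeadModel.from_pretrained("gpt2"), targets)
adapted = adapt(GPT2LMHeadModel.from_pretrained("gpt2"), primes)
primed = mean_surprisal(adapted, targets)

# Larger surprisal reduction = more similar structures, on this account.
print(f"adaptation effect (nats/token): {baseline - primed:.4f}")
```

Repeating this for every prime/target pair of constructions yields the similarity matrix from which the representational hierarchy can be reconstructed.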
Syntactically ambiguous sentences that are disambiguated in favor of a less preferred parse are read more slowly than their unambiguous counterparts. This is called a garden-path effect. Recent self-paced reading studies have found that this effect decreases as participants are exposed to many such syntactically ambiguous sentences over the course of an experiment. This decrease has been interpreted as evidence that readers rapidly calibrate their expectations to a new environment in order to minimize their surprise when they encounter these initially unexpected syntactic structures (Fine, Jaeger, Farmer, & Qian, 2013). However, such expectation recalibration, referred to as syntactic adaptation, is only one possible explanation for the observed decrease in the garden-path effect: recent studies have argued that this decrease may instead be driven by increased familiarity with the self-paced reading paradigm (task adaptation; Harrington Stack, James, & Watson, 2018). The goal of this paper is to tease apart these two explanations. We demonstrate that there is evidence for rapid syntactic adaptation over and above task adaptation. In a series of power analyses, we show that a large number of participants is necessary to have adequate power to detect this effect of syntactic adaptation. The issue of power is exacerbated in experiments designed to detect modulations of the basic syntactic adaptation effect: such experiments are likely to be underpowered even with 1200 participants. We conclude that syntactic adaptation can in fact be detected using self-paced reading, but that other paradigms (e.g., eye-tracking) may be more effective for studying this phenomenon.
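The power analyses can be illustrated with a simulation-based (Monte Carlo) sketch: simulate garden-path effects that shrink over the course of the experiment, fit a regression per simulated experiment, and count how often the adaptation slope reaches significance. All effect sizes and noise parameters below are assumptions for demonstration, not the values estimated in the paper (the actual analysis scripts are on the OSF link below).

```python
# Illustrative Monte Carlo power analysis for a syntactic adaptation
# effect. Effect sizes and noise parameters are assumed, not the
# paper's estimates.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulate_experiment(n_subj, n_items=24, gp_effect=40.0,
                        adaptation_slope=-0.8, sd_resid=120.0,
                        sd_subj=30.0):
    """Simulate per-trial garden-path effects (ambiguous minus
    unambiguous RT, in ms) that shrink linearly over item order."""
    rows = []
    for _ in range(n_subj):
        subj_shift = rng.normal(0, sd_subj)
        for pos in range(n_items):
            effect = gp_effect + adaptation_slope * pos + subj_shift
            rows.append((pos, effect + rng.normal(0, sd_resid)))
    return np.array(rows)

def detects_adaptation(data, alpha=0.05):
    """Regress the garden-path effect on trial position; 'detected'
    means a significantly negative slope."""
    res = stats.linregress(data[:, 0], data[:, 1])
    return res.slope < 0 and res.pvalue < alpha

def power(n_subj, n_sims=500):
    return np.mean([detects_adaptation(simulate_experiment(n_subj))
                    for _ in range(n_sims)])

for n in (50, 200, 800):
    print(f"n = {n:4d}: estimated power = {power(n):.2f}")
```

Under these assumed parameters, power stays well below conventional thresholds at small sample sizes and only approaches adequacy with hundreds of participants, mirroring the paper's conclusion that such designs demand very large samples.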
The stimuli for all experiments, as well as the scripts used to create the experiments, generate the plots, and run all the analyses, are available on OSF: https://osf.io/qd8ye/
There has been increased awareness that individuals need not have a binary gender identity (i.e., male or female); rather, gender identities exist on a spectrum. With this increased awareness, there has also been an increase in the use of they as a singular pronoun when referring to individuals with a non-binary gender identity. Has the processing of singular they changed along with this change in its usage? Previous studies have demonstrated that sentences in which they is co-indexed with singular antecedents are judged acceptable and are easy to process, but only if the antecedents are non-referential and/or have ambiguous gender; co-indexing they with referential antecedents with unambiguous gender (e.g., Mary) results in lower acceptability ratings and greater processing effort. We investigated whether participants who frequently interacted with individuals with a non-binary gender identity, and/or identified as having a non-binary gender themselves, would show a similar processing profile for sentences in which themselves was co-indexed with singular antecedents. We found a significant P600 effect for sentences in which themselves was co-indexed with singular referential antecedents with unambiguous gender, but no P600 effect when the antecedents were non-referential and/or had ambiguous gender. This pattern of results is consistent with behavioural results from previous studies, suggesting that the change in the usage of singular they has not resulted in a corresponding change in the way in which this pronoun is processed.
Prediction has been proposed as an overarching principle that explains human information processing in language and beyond. To what degree can processing difficulty in syntactically complex sentences, one of the major concerns of psycholinguistics, be explained by predictability, as estimated using computational language models? A precise, quantitative test of this question requires a much larger-scale data collection effort than has been undertaken in the past. We present the Syntactic Ambiguity Processing Benchmark, a dataset of self-paced reading times from 2000 participants, who read a diverse set of complex English sentences. This dataset makes it possible to measure the processing difficulty associated with individual syntactic constructions, and even individual sentences, precisely enough to rigorously test the predictions of computational models of language comprehension. We find that the predictions of language models with two different architectures sharply diverge from the reading time data: they dramatically underpredict processing difficulty, fail to predict relative difficulty across different syntactically ambiguous constructions, and only partially explain item-wise variability. These findings suggest that prediction is most likely insufficient on its own to explain human syntactic processing.
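The core quantity in this prediction-based account is per-word surprisal from a language model, which is then compared against reading times at the same words. Below is a minimal sketch of that first step, using GPT-2 as a readily available stand-in (the benchmark's actual models, stimuli, and reading-time data are not reproduced here).

```python
# Sketch of the surprisal computation behind prediction-based accounts:
# per-token surprisal for a garden-path sentence, to be compared with
# self-paced reading times at the same positions. GPT-2 is a stand-in,
# not one of the benchmark's evaluated models.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def word_surprisals(sentence):
    """Per-token surprisal in bits, for every token after the first."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = logprobs[range(ids.size(1) - 1), ids[0, 1:]]
    tokens = tokenizer.convert_ids_to_tokens(ids[0, 1:])
    bits = (-token_lp / torch.log(torch.tensor(2.0))).tolist()
    return list(zip(tokens, bits))

# Classic garden-path sentence; disambiguation occurs at "fell".
for tok, s in word_surprisals("The horse raced past the barn fell."):
    print(f"{tok:>10s}  {s:6.2f} bits")
```

A surprisal spike at the disambiguating word is what a prediction-based account expects; the benchmark's finding is that the size of such spikes substantially underpredicts the measured slowdowns.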
Given the increasingly prominent role NLP models (will) play in our lives, it is important for human expectations of model behavior to align with actual model behavior. Using Natural Language Inference (NLI) as a case study, we investigate the extent to which human-generated explanations of models' inference decisions align with how models actually make these decisions. More specifically, we define three alignment metrics that quantify how well natural language explanations align with model sensitivity to input words, as measured by integrated gradients. We then evaluate eight different models (the base and large versions of BERT, RoBERTa and ELECTRA, as well as an RNN and a bag-of-words model), and find that the BERT-base model has the highest alignment with human-generated explanations across all alignment metrics. Focusing on transformers, we find that the base versions tend to have higher alignment with human-generated explanations than their larger counterparts, suggesting that increasing the number of model parameters leads, in some cases, to worse alignment with human explanations. Finally, we find that a model's alignment with human explanations is not predicted by the model's accuracy, suggesting that accuracy and alignment are complementary ways to evaluate models.
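Integrated gradients attributes a model's prediction to input features by accumulating gradients along a straight-line path from a baseline input to the actual input. The sketch below shows the mechanism on a toy embedding classifier; the model, vocabulary, and the final alignment check are illustrative assumptions, not the paper's NLI models or its three metrics.

```python
# Minimal integrated-gradients sketch for word-level attribution in a
# toy classifier. Model, vocabulary, and alignment check are assumed
# for illustration; the paper applies IG to trained NLI models.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab = {"<pad>": 0, "a": 1, "man": 2, "is": 3, "sleeping": 4, "awake": 5}

class BowClassifier(nn.Module):
    def __init__(self, vocab_size, dim=16):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, 2)  # e.g., entailment vs. contradiction
    def forward_from_embeddings(self, e):
        return self.out(e.mean(dim=1))

model = BowClassifier(len(vocab))

def integrated_gradients(tokens, target, steps=50):
    """IG over embeddings: (x - baseline) times the average gradient
    along the straight-line path from a zero baseline to the input."""
    ids = torch.tensor([[vocab[t] for t in tokens]])
    x = model.emb(ids).detach()
    baseline = torch.zeros_like(x)
    grads = torch.zeros_like(x)
    for alpha in torch.linspace(0, 1, steps):
        point = (baseline + alpha * (x - baseline)).requires_grad_(True)
        score = model.forward_from_embeddings(point)[0, target]
        score.backward()
        grads += point.grad
    ig = (x - baseline) * grads / steps
    return ig.sum(dim=-1)[0]  # one attribution score per word

tokens = ["a", "man", "is", "sleeping"]
attributions = integrated_gradients(tokens, target=1)

# Toy alignment check: do high-attribution words appear in a human
# explanation? (A stand-in for the paper's three alignment metrics.)
explanation_words = {"sleeping", "awake"}
for tok, a in zip(tokens, attributions.tolist()):
    mark = "<- in explanation" if tok in explanation_words else ""
    print(f"{tok:>10s} {a:+.4f} {mark}")
```

The alignment metrics in the paper then quantify, per example, how well these word-level attribution scores match the words that humans cite in their explanations.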