Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)
DOI: 10.18653/v1/s19-1026

Probing What Different NLP Tasks Teach Machines about Function Word Comprehension

Abstract: We introduce a set of nine challenge tasks that test for the understanding of function words. These tasks are created by structurally mutating sentences from existing datasets to target the comprehension of specific types of function words (e.g., prepositions, wh-words). Using these probing tasks, we explore the effects of various pretraining objectives for sentence encoders (e.g., language modeling, CCG supertagging, and natural language inference (NLI)) on the learned representations. Our results show that pr…
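
As a rough, hypothetical illustration of the probing setup the abstract describes, the sketch below fits a linear classifier on frozen sentence representations to separate original sentences from mutated ones. The encode function is a stand-in for whichever pretrained encoder (language modeling, CCG, NLI, etc.) is being probed; it returns random vectors here only so the sketch runs end to end, and it does not reproduce the authors' actual pipeline.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def encode(sentences):
        # Stand-in for a frozen pretrained sentence encoder (e.g. one
        # pretrained on language modeling, CCG supertagging, or NLI).
        # Random fixed vectors keep the sketch self-contained and runnable.
        rng = np.random.default_rng(0)
        return rng.normal(size=(len(sentences), 128))

    originals = ["I wonder who called.", "She put the cup on the table."]
    mutated = ["I wonder when called.", "She put the cup at the table."]

    X = encode(originals + mutated)
    y = np.array([1] * len(originals) + [0] * len(mutated))  # 1 = original

    # Only the linear probe is trained; the encoder stays frozen, so probe
    # accuracy reflects what the representations already capture about
    # function words.
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    print(probe.score(X, y))

Training and scoring on the same four toy sentences is purely illustrative; the paper's setting evaluates probes on held-out original/mutated pairs.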

Cited by 80 publications (67 citation statements: 2 supporting, 65 mentioning, 0 contrasting). References 34 publications.

Citation statements (ordered by relevance):
“…We use well-established datasets for our probing tasks, including the edge-probing suite from Tenney et al. (2019b), function-word-oriented tasks from Kim et al. (2019), and sentence-level probing datasets (SentEval; Conneau et al., 2018).…”
Section: Probing Tasks (mentioning; confidence: 99%)
“…We use the following five datasets: AJ-CoLA tests a model's understanding of general grammaticality using the Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2019b), which is drawn from 22 theoretical linguistics publications. The other tasks concern the behaviors of specific classes of function words, using the dataset by Kim et al. (2019): AJ-WH tests whether a model can detect that a wh-word in a sentence has been swapped with another wh-word, which requires identifying the antecedent associated with the wh-word. AJ-Def tests whether a model can detect that the definite/indefinite articles in a given sentence have been swapped.…”
Section: Probing Tasks (mentioning; confidence: 99%)
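
The AJ-WH and AJ-Def mutations described above are easy to picture with a small sketch. The word inventories and the swap-first-match strategy below are assumptions made for illustration, not the generation procedure of Kim et al. (2019):

    import random

    # Hypothetical inventories; the actual tasks are built by mutating
    # parsed corpus sentences, not by scanning flat word lists like these.
    WH_WORDS = {"who", "what", "when", "where", "why", "which", "how"}
    ARTICLES = {"a", "an", "the"}

    def swap_function_word(sentence, inventory, rng=random):
        # Replace the first word found in `inventory` with a different
        # member, yielding a minimally mutated (likely unacceptable)
        # sentence of the kind AJ-WH and AJ-Def ask models to flag.
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok.lower() in inventory:
                tokens[i] = rng.choice(sorted(inventory - {tok.lower()}))
                return " ".join(tokens), True
        return sentence, False  # no function word of this class present

    mutated, changed = swap_function_word(
        "I know where she bought the book", WH_WORDS)
    # e.g. -> "I know why she bought the book", changed == True

A probing classifier then labels such sentences as original or mutated; near-chance accuracy suggests the encoder's representations carry little information about that class of function words.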
“…If, on the other hand, interpretability is defined as the possibility to provide a post-hoc, compact natural-language explanation of why a certain output was produced in response to a certain input, then humans and complex artificial models can, in principle, be equally interpretable. Saliency maps (Simonyan, Vedaldi, & Zisserman, 2013), behavioral testing (Ribeiro, Wu, Guestrin, & Singh, 2020), probing methods (Bolukbasi, Chang, Zou, Saligrama, & Kalai, 2016; Bordia & Bowman, 2019; Gardner et al., 2020; Kim et al., 2019; Linzen & Baroni, 2020), and adversarial attacks (I. I.…”
Section: Interpretability (mentioning; confidence: 99%)
“…A portion of past work on analyzing pre-trained encoders is based mainly on clean data. As mentioned in Tenney et al. (2019a), these studies can be roughly divided into two categories: (1) designing controlled tasks to probe whether a specific linguistic phenomenon is captured by models (Conneau et al., 2018; Peters et al., 2019; Tenney et al., 2019b; Kim et al., 2019), or (2) decomposing the model structure and exploring what linguistic property is encoded (Tenney et al., 2019a; Jawahar et al., 2019; Clark et al., 2019). However, these studies do not analyze how grammatical errors affect model behaviors.…”
Section: Related Work (mentioning; confidence: 99%)