Proceedings of the 24th Conference on Computational Natural Language Learning 2020
DOI: 10.18653/v1/2020.conll-1.4
TaxiNLI: Taking a Ride up the NLU Hill

Abstract: Pre-trained Transformer-based neural architectures have consistently achieved state-of-the-art performance in the Natural Language Inference (NLI) task. Since NLI examples encompass a variety of linguistic, logical, and reasoning phenomena, it remains unclear which specific concepts are learnt by the trained systems and where they can achieve strong generalization. To investigate this question, we propose a taxonomic hierarchy of categories that are relevant for the NLI task. We introduce TAXINLI, a new dataset…
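As a rough illustration of the category-wise evaluation the abstract describes, the sketch below reports per-category accuracy for an NLI classifier over examples annotated with taxonomic labels. The toy examples, category names, and `predict` stub are hypothetical, not TaxiNLI's actual data or pipeline.

```python
# Sketch: per-category NLI accuracy over taxonomically annotated examples.
# The example data and category names are illustrative, not from TaxiNLI.
from collections import defaultdict

# Each example: (premise, hypothesis, gold_label, taxonomic_category)
examples = [
    ("All dogs bark.", "Some dogs bark.", "entailment", "quantifier"),
    ("It rained on Monday.", "Monday was dry.", "contradiction", "negation"),
    ("She bought a car.", "She owns a vehicle.", "entailment", "lexical"),
]

def predict(premise: str, hypothesis: str) -> str:
    """Placeholder for a trained NLI model (e.g., a fine-tuned Transformer)."""
    return "entailment"  # stub prediction

correct, total = defaultdict(int), defaultdict(int)
for premise, hypothesis, gold, category in examples:
    total[category] += 1
    if predict(premise, hypothesis) == gold:
        correct[category] += 1

for category in sorted(total):
    print(f"{category}: {correct[category] / total[category]:.2%} "
          f"({correct[category]}/{total[category]})")
```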

Cited by 20 publications (26 citation statements). References 32 publications.
“…This ability to analyze the specific kinds of reasoning transformers have become proficient in is a clear advantage psychometrics has over typical NLP evaluations. The NLP community is becoming increasingly aware of the need to construct more fine-grained evaluation benchmarks (Wang et al., 2018; Joshi et al., 2020b), and we believe our work complements these efforts nicely.…”
Section: Discussion
confidence: 80%
“…Besides the GLUE diagnostic, other taxonomies have been proposed, such as TaxiNLI (Joshi et al., 2020b). Although TaxiNLI includes some types of reasoning that have no clear analogue in GLUE, many of its categories are quite similar.…”
Section: Related Work
confidence: 99%
“…Additionally, it might consist of alternate "explanations": features correlated with the task label in the dataset while not being task-relevant, which models can exploit to give the impression of good performance at the task itself. Two analysis methods have emerged to address this limitation: 1) diagnostic examples, where a small number of samples in a test set are annotated with linguistic phenomena of interest, and task accuracy is reported on these samples (Williams et al., 2018; Joshi et al., 2020). However, it is difficult to determine whether models perform well on diagnostic examples because they actually learn the linguistic competency, or because they exploit spurious correlations in the data (Gururangan et al., 2018; Poliak et al., 2018).…”
Section: Background and Related Work
confidence: 99%
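The passage above notes that apparent gains on diagnostic examples can come from spurious correlations rather than genuine linguistic competence. One common probe for such artifacts is a hypothesis-only baseline: if a classifier that never sees the premise still beats chance, the labels leak through surface cues (cf. Gururangan et al., 2018; Poliak et al., 2018). The sketch below is a minimal version of that probe; the toy data and model choice are illustrative assumptions, not the cited papers' setup.

```python
# Sketch: a hypothesis-only baseline as a probe for annotation artifacts.
# Above-chance accuracy without the premise suggests spurious cues.
# Data here is a toy stand-in; a real check would use the full training set.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

hypotheses = [
    "Some dogs bark.", "Monday was dry.", "She owns a vehicle.",
    "Nobody was home.", "The cat is asleep.", "Everyone left early.",
]
labels = ["entailment", "contradiction", "entailment",
          "contradiction", "neutral", "neutral"]

# Train a classifier that sees only the hypothesis, never the premise.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(hypotheses, labels)

# Predictions from hypotheses alone; compare held-out accuracy to chance.
print(clf.predict(["Some cats bark.", "Tuesday was dry."]))
```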
“…Indeed, the fact that pretrained transformers can be used to create meaningful clusters has been shown in other recent works (cf. Aharoni and Goldberg (2020); Joshi et al. (2020)).…”
Section: Dreca
confidence: 99%
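As a minimal sketch of the clustering idea mentioned above, the snippet below embeds sentences with a pretrained Transformer and groups them with k-means. The checkpoint name, sentence set, and cluster count are illustrative assumptions; the cited works' exact pipelines differ.

```python
# Sketch: clustering sentences via pretrained Transformer embeddings,
# in the spirit of Aharoni and Goldberg (2020). Model and k are assumed.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

sentences = [
    "The stock market fell sharply today.",
    "Investors sold shares amid recession fears.",
    "The recipe calls for two cups of flour.",
    "Knead the dough until it is smooth.",
]

# Assumed checkpoint; any sentence-level encoder would do.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences)

# Two clusters for this toy set: finance vs. cooking.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
for sentence, cluster in zip(sentences, kmeans.labels_):
    print(cluster, sentence)
```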