Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.314

OCNLI: Original Chinese Natural Language Inference

Abstract: Despite the tremendous recent progress on natural language inference (NLI), driven largely by large-scale investment in new datasets (e.g., SNLI, MNLI) and advances in modeling, most progress has been limited to English due to a lack of reliable datasets for most of the world's languages. In this paper, we present the first large-scale NLI dataset (consisting of ∼56,000 annotated sentence pairs) for Chinese, called the Original Chinese Natural Language Inference dataset (OCNLI). Unlike recent attempts at extending…
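
For readers who want to inspect the dataset programmatically, here is a minimal sketch for loading one OCNLI split. It assumes a local copy of the JSON-lines files from the official release; the field names (sentence1, sentence2, label) follow the released data, but the file path is a hypothetical placeholder.

```python
import json
from collections import Counter

def load_ocnli(path):
    """Read one OCNLI split stored as JSON lines (one example per line)."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # sentence1 = premise, sentence2 = hypothesis; label is one of
            # entailment / neutral / contradiction (per the official release).
            examples.append((record["sentence1"], record["sentence2"], record["label"]))
    return examples

# Hypothetical path to a downloaded split.
train = load_ocnli("ocnli/train.json")
print(len(train))                               # number of annotated pairs
print(Counter(label for _, _, label in train))  # label distribution
```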

Cited by 48 publications (31 citation statements). References 41 publications.
Citation statements: 2 supporting, 26 mentioning, 0 contrasting.
“…We verify our findings with three popular English NLI datasets, SNLI (Bowman et al., 2015), MultiNLI (Williams et al., 2018b), and ANLI (Nie et al., 2020), and one Chinese dataset, OCNLI (Hu et al., 2020a). It is thus less likely that our findings result from some quirk of English or a particular tokenization strategy.…”
supporting
confidence: 77%
“…We train all models on MNLI, and evaluate on in-distribution (SNLI and MNLI) and out-of-distribution datasets (ANLI). We independently verify results of (a) using both our fine-tuned model using HuggingFace Transformers (Hu et al., 2020a). Bold marks the highest value per metric (red shows the model is insensitive to permutation).…”
Section: Results
mentioning
confidence: 99%
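
The statement above describes the standard transfer protocol for NLI: fine-tune on MNLI, then evaluate in and out of distribution. As a rough illustration (not the cited paper's exact setup), the evaluation step with HuggingFace Transformers might look like the sketch below; the checkpoint roberta-large-mnli is an off-the-shelf stand-in for the authors' own fine-tuned model.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Off-the-shelf MNLI checkpoint, standing in for a custom fine-tuned model.
name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.eval()

premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."

inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# Read the label order from the model config instead of hard-coding it.
pred = model.config.id2label[int(logits.argmax(dim=-1))]
print(pred)  # expected: ENTAILMENT for this pair
```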
“…Another approach brings computational linguists directly into the crowdsourcing process. This was recently demonstrated at a small scale by Hu et al. (2020) with OCNLI: they show that it is possible to significantly improve data quality by making small interventions during the crowdsourcing process, like offering additional bonus payments for examples that avoid overused words and constructions, without significantly limiting annotators' freedom to independently construct creative examples.…”
Section: Improving Validity
mentioning
confidence: 99%
“…In this writing task, we provide a context passage drawn from the Open American National Corpus (Ide and Suderman, 2006). Inspired by Hu et al. (2020), we ask workers to write two questions per passage with four answer choices each. We direct workers to ensure that the questions are answerable given the passage and that there is only one correct answer for each question.…”
Section: Writing Examples
mentioning
confidence: 99%