Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.774
Uncertain Natural Language Inference

Abstract: We introduce Uncertain Natural Language Inference (UNLI), a refinement of Natural Language Inference (NLI) that shifts away from categorical labels, targeting instead the direct prediction of subjective probability assessments. We demonstrate the feasibility of collecting annotations for UNLI by relabeling a portion of the SNLI dataset under a probabilistic scale, where items even with the same categorical label differ in how likely people judge them to be true given a premise. We describe a direct scalar regr…
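The abstract mentions a direct scalar regression approach. Below is a minimal sketch of what such a model could look like, assuming a BERT-style encoder via the Hugging Face transformers library; the encoder name, head design, sigmoid output, MSE loss, and the example pair and label are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of direct scalar regression for UNLI: predict a
# subjective probability in [0, 1] for a premise/hypothesis pair.
# Encoder choice and head design here are assumptions for illustration.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class UnliRegressor(nn.Module):
    def __init__(self, encoder_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Single scalar head; sigmoid squashes the logit into [0, 1]
        # so the output reads as a probability judgment.
        self.head = nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token representation
        return torch.sigmoid(self.head(cls)).squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = UnliRegressor()

# Premise/hypothesis pair with a scalar label, as in UNLI annotation.
batch = tokenizer(
    ["A man is sleeping on a park bench."],
    ["The man is tired."],
    return_tensors="pt", padding=True, truncation=True,
)
pred = model(batch["input_ids"], batch["attention_mask"])
target = torch.tensor([0.8])  # hypothetical human probability judgment
loss = nn.functional.mse_loss(pred, target)  # regression loss on the scalar
```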


Cited by 38 publications (42 citation statements)
References 33 publications (22 reference statements)
“…In this case, a piece of evidence contradicts a relative clause in the claim but does not refute the entire claim. Similar problems regarding the uncertainty of NLI tasks have been pointed out in previous works (Zaenen et al., 2005; Pavlick and Kwiatkowski, 2019; Chen et al., 2020a).…”
Section: Claim Labeling (supporting)
confidence: 76%
“…However, we find that the decision between REFUTED and NOTENOUGHINFO can be ambiguous in many-hop claims, and even the high-quality, trained annotators from Appen (rather than MTurk) cannot consistently choose the correct label from these two classes. Recent works (Pavlick and Kwiatkowski, 2019; Chen et al., 2020a) have raised concerns over the uncertainty of NLI tasks with categorical labels and proposed shifting to a probabilistic scale. Since this work mainly targets many-hop retrieval, we combine REFUTED and NOTENOUGHINFO into a single class, NOT-SUPPORTED.…”
Section: Introduction (mentioning)
confidence: 99%
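The label merge this excerpt describes is a simple relabeling step. A hypothetical sketch follows; the class names come from the excerpt, but the function name is illustrative and not from the cited paper's code.

```python
# Hypothetical sketch of the label collapse described above:
# REFUTED and NOTENOUGHINFO fold into a single NOT-SUPPORTED class.
def collapse_label(label: str) -> str:
    return "SUPPORTED" if label == "SUPPORTED" else "NOT-SUPPORTED"

assert collapse_label("REFUTED") == "NOT-SUPPORTED"
assert collapse_label("NOTENOUGHINFO") == "NOT-SUPPORTED"
assert collapse_label("SUPPORTED") == "SUPPORTED"
```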
“…It gained tremendous popularity again 10 years later, with the release of the large-scale Stanford Natural Language Inference dataset (SNLI; Bowman et al., 2015), which facilitated training neural models and was followed by several other datasets of that nature (Williams et al., 2018; Nie et al., 2019). But, among other criticisms of the task, it has been shown that people generally do not agree on entailment annotations (Pavlick and Kwiatkowski, 2019), and new variants of the task have been proposed that shift away from categorical labels to ordinal or numeric values denoting plausibility (Zhang et al., 2017; Sakaguchi and Van Durme, 2018; Chen et al., 2020). In this paper we focus on the defeasibility of textual entailments, a less well-studied phenomenon in this context.…”
Section: Background and Related Work (mentioning)
confidence: 99%
“…(Williams et al., 2018), JOCI (Zhang et al., 2017), and DNC (Poliak et al., 2018)). In their study, annotators had to select the degree to which a premise entails a hypothesis on a scale (Chen et al., 2020), instead of choosing among discrete labels. Pavlick and Kwiatkowski (2019) show that even though these datasets are reported to have high agreement scores, specific examples suffer from inherent disagreements.…”
Section: Results (mentioning)
confidence: 99%
“…In light of the low agreement on explicit modeling of the task of complement coercion, we turn to a different crowdsourcing approach that has proven successful for many linguistic phenomena: using NLI, as discussed above (§2). NLI has been used to collect data for a wide range of linguistic phenomena: Paraphrase Inference, Anaphora Resolution, Numerical Reasoning, Implicatures, and more (White et al., 2017; Poliak et al., 2018; Jeretic et al., 2020; Yanaka et al., 2020; Naik et al., 2018) (see Poliak (2020)). Therefore, we take a similar approach, with similar methodologies, and use NLI as an evaluation setup for the complement coercion phenomenon.…”
Section: NLI for Complement Coercion (mentioning)
confidence: 99%