Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems 2020
DOI: 10.18653/v1/2020.eval4nlp-1.10
A survey on Recognizing Textual Entailment as an NLP Evaluation

Abstract: Recognizing Textual Entailment (RTE) was proposed as a unified evaluation framework to compare semantic understanding of different NLP systems. In this survey paper, we provide an overview of different approaches for evaluating and understanding the reasoning capabilities of NLP systems. We then focus our discussion on RTE by highlighting prominent RTE datasets as well as advances in RTE datasets that focus on specific linguistic phenomena and can be used to evaluate NLP systems on a fine-grained level. We con…

Cited by 23 publications (12 citation statements) · References 62 publications (70 reference statements)
“…Our LingFeatured NLI dataset focuses on a specific linguistic phenomenon: the opposition of factivity vs. nonfactivity and the relation of these categories to semantic features such as entailment, contradiction and neutrality. We conclude that the specified datasets allow for a better specialization of ML models to narrow their scope of features to generalize (see [Poliak, 2020]). The three most important features of our dataset are as follows:…”
Section: Language Materials and Our Dataset
confidence: 85%
“…As mentioned in the introduction, the NLI task (Dagan et al., 2006, 2013), sometimes called Recognizing Textual Entailment (RTE), was extensively studied by the NLP community over the past several years as a semantic reasoning benchmark (see Poliak, 2020; Storks et al., 2019, for surveys). The field of fact verification (Vlachos and Riedel, 2014) also recently gained increased attention (Bekoulis et al., 2021; Kotonya and Toni, 2020; Guo et al., 2022; Zeng et al., 2021), sharing similar pair-wise semantic inference challenges, together with evidence retrieval.…”
Section: Related Work
confidence: 99%
“…We follow recent work that tests for an expanded range of inference patterns in RTE systems (Bernardy and Chatzikyriakidis, 2019) by evaluating how well RTE models capture specific linguistic phenomena, such as pragmatic inferences (Jeretic et al., 2020), veridicality, and others (Pavlick and Callison-Burch, 2016; White et al., 2017; Dasgupta et al., 2018; Naik et al., 2018; Glockner et al., 2018; Kim et al., 2019; Kober et al., 2019; Richardson et al., 2020; Yanaka et al., 2020; Vashishtha et al., 2020; Poliak, 2020).…”
Section: Related Work
confidence: 99%