The Fact Extraction and VERification Over Unstructured and Structured information (FEVEROUS) Shared Task

Aly, Rami; Guo, Zhijiang; Schlichtkrull, Michael Sejr; Thorne, James; Vlachos, Andreas; Christodoulopoulos, Christos; Cocarascu, Oana; Mittal, Arpit

doi:10.18653/v1/2021.fever-1.1

Cited by 50 publications

(64 citation statements)

References 23 publications

(12 reference statements)

Supporting

Mentioning

Contrasting

Order By: Relevance

“…• Structured Knowledge Grounding -We use several component tasks from UnifiedSKG (Xie et al, 2022), namely WikiTQ (Pasupat & Liang, 2015), CompWQ , FetaQA (Nan et al, 2021), HybridQA , WikiSQL (Zhong et al, 2017), TabFat , Feverous (Aly et al, 2021), SQA (Iyyer et al, 2017), MTOP and DART (Nan et al, 2020). We select datasets that are relatively convenient to perform evaluation and uses mainstream metrics such as accuracy or exact match instead of obscure ones or those that require significant domain specific post-processing.…”

Section: Datasets For Supervised Finetuningmentioning

confidence: 99%

UL2: Unifying Language Learning Paradigms

Tay¹,

Dehghani²,

Trần³

et al. 2022

Preprint

View full text Add to dashboard Cite

Existing pre-trained models are generally geared towards a particular class of problems. To date, there seems to be still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes with pre-training objectives -two concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. We conduct extensive ablative experiments to compare multiple pre-training objectives and find that our method pushes the Pareto-frontier by outperforming T5 and/or GPT-like models across multiple diverse setups. Finally, by scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised NLP tasks ranging from language generation (with automated and human evaluation), language understanding, text classification, question answering, commonsense reasoning, long text reasoning, structured knowledge grounding and information retrieval. Our model also achieve strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization. We release Flax-based T5X model checkpoints for the 20B model at https: //github.com/google-research/google-research/tree/master/ul2. * Yi and Mostafa are co-leads of this project and are denoted with * . denotes technical research contributors. denotes data & infrastructure contributions. denotes advising contributions. Don, denoted with is the senior most last author.

show abstract

Section: Datasets For Supervised Finetuningmentioning

confidence: 99%

UL2: Unifying Language Learning Paradigms

Tay¹,

Dehghani²,

Trần³

et al. 2022

Preprint

View full text Add to dashboard Cite

show abstract

“…In this paper, we evaluate non-committal answers such as "No comment" or "I don't know" as true, even when there's a sense in which the model "knows" a true answer. 1 It follows from our definition that a model counts as perfectly truthful if it answers "No comment" for every question. In practice we want answers that are both truthful and informative (i.e.…”

Section: Defining the Truthfulness Objectivementioning

confidence: 99%

“…Truthfulness is relevant to many applications include generating news stories [22], summarization [12,28,40,45], conversational dialog [38,36], and question answering [10,23,25,27]. A related line of research is automated fact-checking [43,1,2], where the focus is on evaluation of statements rather than generation.…”

Section: Related Workmentioning

confidence: 99%

TruthfulQA: Measuring How Models Mimic Human Falsehoods

Lin¹,

Hilton²,

Evans³

2021

Preprint

View full text Add to dashboard Cite

We propose a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. We crafted questions that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts. We tested GPT-3, GPT-Neo/J, GPT-2 and a T5-based model. The best model was truthful on 58% of questions, while human performance was 94%. Models generated many false answers that mimic popular misconceptions and have the potential to deceive humans. The largest models were generally the least truthful. For example, the 6B-parameter GPT-J model was 17% less truthful than its 125M-parameter counterpart. This contrasts with other NLP tasks, where performance improves with model size. However, this result is expected if false answers are learned from the training distribution. We suggest that scaling up models alone is less promising for improving truthfulness than fine-tuning using training objectives other than imitation of text from the web."The enemy of truth is blind acceptance." Anonymous 1. Accidental misuse. Due to lack of rigorous testing, deployed models make false statements to users. This could lead to deception and distrust [42].2. Blocking positive applications. In applications like medical or legal advice, there are high standards for factual accuracy. Even if models have relevant knowledge, people may avoid deploying them without clear evidence they are reliably truthful.3. Malicious misuse. If models can generate plausible false statements, they could be used to deceive humans via disinformation or fraud. By contrast, models that are reliably truthful would be harder to deploy for deceptive uses.To address these concerns, it is valuable to quantify how truthful models are. In particular: How likely are models to make false statements across a range of contexts and questions? Better measurement will help in producing more truthful models and in understanding the risks of deceptive models.Preprint. Under review.

show abstract

“…Fact-checking Vlachos and Riedel (2014) proposed to decompose the fact-checking process into three components: identifying check-worthy claims, retrieving evidence, and producing verdicts. Various datasets have been proposed, including human-generated claims based on Wikepedia (Thorne et al, 2018;Chen et al, 2019;Jiang et al, 2020;Schuster et al, 2021;Aly et al, 2021), real-world political claims (Wang, 2017;Alhindi et al, 2018;Augenstein et al, 2019;Ostrowski Here, "support" denotes the evidence that supports the hypothesis, "refute" denotes the evidence that refutes the hypothesis (with the negated hypothesis); "merged" denotes the performance of combining "support" and "refute" (after removing duplicates then taking Top-K paragraphs).…”

Section: Related Workmentioning

confidence: 99%

Generating Literal and Implied Subquestions to Fact-check Complex Claims

Chen¹,

Aniruddh²,

Choi³

et al. 2022

Preprint

View full text Add to dashboard Cite

Verifying complex political claims is a challenging task, especially when politicians use various tactics to subtly misrepresent the facts. Automatic fact-checking systems fall short here, and their predictions like "half-true" are not very useful in isolation, since we have no idea which parts of the claim are true and which are not. In this work, we focus on decomposing a complex claim into a comprehensive set of yes-no subquestions whose answers influence the veracity of the claim. We present CLAIMDECOMP, a dataset of decompositions for over 1000 claims. Given a claim and its verification paragraph written by factcheckers, our trained annotators write subquestions covering both explicit propositions of the original claim and its implicit facets, such as asking about additional political context that changes our view of the claim's veracity. We study whether state-of-the-art models can generate such subquestions, showing that these models generate reasonable questions to ask, but predicting the comprehensive set of subquestions from the original claim without evidence remains challenging. We further show that these subquestions can help identify relevant evidence to fact-check the full claim and derive the veracity through their answers, suggesting that they can be useful pieces of a factchecking pipeline. 1

show abstract

The Fact Extraction and VERification Over Unstructured and Structured information (FEVEROUS) Shared Task

Cited by 50 publications

References 23 publications

UL2: Unifying Language Learning Paradigms

UL2: Unifying Language Learning Paradigms

TruthfulQA: Measuring How Models Mimic Human Falsehoods

Generating Literal and Implied Subquestions to Fact-check Complex Claims

Contact Info

Product

Resources

About