Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) 2020
DOI: 10.18653/v1/2020.emnlp-main.468

Training Question Answering Models From Synthetic Data

Abstract: Question and answer generation is a data augmentation method that aims to improve question answering (QA) models given the limited amount of human-labeled data. However, a considerable gap remains between synthetic and human-generated question-answer pairs. This work aims to narrow this gap by taking advantage of large language models and explores several factors such as model size, quality of pretrained models, scale of data synthesized, and algorithmic choices. On the SQuAD1.1 question answering task, we ach…

Cited by 94 publications (83 citation statements)
References 22 publications
“…SQuAD 1.1 dev set is used to select the best model during training. As a baseline for QA data generation, we implemented a three-stage pipeline similar to the state-of-the-art approach of Puri et al. (2020). We call this baseline QGen, which generates a question given a passage and an extracted span, q ∼ p(q|a, c).…”
Section: Methods
confidence: 99%
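The quoted QGen baseline samples a question conditioned on a passage and an extracted answer span, q ∼ p(q|a, c). The pipeline's overall shape can be sketched as below; this is a minimal, hypothetical stand-in, not the cited implementation: both `extract_answer_spans` and `generate_question` are stubs where trained neural models (an answer extractor and a seq2seq question generator) would sit in practice.

```python
# Hedged sketch of a three-stage synthetic QA pipeline: (1) extract candidate
# answer spans from a passage, (2) generate a question per span, q ~ p(q|a, c),
# producing (question, answer, context) triples for downstream filtering.
# Both model stages below are crude stubs, not trained models.

def extract_answer_spans(context):
    """Stage 1 stub: propose candidate answer spans a from the passage c.
    A trained extractor would predict spans; here we pick capitalized tokens."""
    return [tok for tok in context.split() if tok[0].isupper()]

def generate_question(answer, context):
    """Stage 2 stub: sample a question q ~ p(q | a, c).
    A template stands in for a learned seq2seq generator."""
    return f"What does the passage say about {answer}?"

def synthesize_qa_pairs(context):
    """Run stages 1-2 to produce synthetic (question, answer, context) triples."""
    pairs = []
    for answer in extract_answer_spans(context):
        question = generate_question(answer, context)
        pairs.append((question, answer, context))
    return pairs

passage = "Tesla was born in Smiljan in 1856."
pairs = synthesize_qa_pairs(passage)
```

A third stage, round-trip consistency filtering, would then prune these triples; that step is described in the filtering citation on this page.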
“…Recent work has used the round-trip filtering method (Puri et al., 2020) to prune the synthetic QA set and improve data quality. This method consists of two steps: (1) using an RC model to provide answers to the generated questions; (2) dropping the QA pairs for which the answer of the RC model does not match the span-detected answer.…”
Section: Filtering
confidence: 99%
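The two filtering steps quoted above can be sketched as follows. This is a minimal illustration under stated assumptions: the reading-comprehension model is a hypothetical stub (a real filter would use a trained RC model predicting answer spans), and the exact-match criterion mirrors the "drop on mismatch" rule in the quote.

```python
# Hedged sketch of round-trip consistency filtering: (1) an RC model answers
# each generated question; (2) QA pairs where the RC answer disagrees with the
# span used to generate the question are dropped. The RC model is a stub.

def _norm(tok):
    """Normalize a token for comparison (lowercase, strip punctuation)."""
    return tok.lower().strip("?.,")

def rc_model_answer(question, context):
    """Stub RC model: return the first context token that also appears in the
    question. A trained reader would instead predict an answer span."""
    q_tokens = {_norm(t) for t in question.split()}
    for tok in context.split():
        if _norm(tok) in q_tokens:
            return tok
    return ""

def roundtrip_filter(qa_pairs, context):
    """Keep only (question, answer) pairs the RC model can reproduce."""
    kept = []
    for question, answer in qa_pairs:
        if rc_model_answer(question, context) == answer:
            kept.append((question, answer))
    return kept
```

For example, given the context "Paris is the capital of France.", the pair ("Where is Paris?", "Paris") survives because the stub RC model recovers "Paris", while a pair whose generated answer the RC model cannot reproduce is dropped.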
“…These approaches use generative adversarial networks (Zhao et al., 2018b) and population-based optimization algorithms (Alzantot et al., 2018). Previous work has also presented methods to generate questions for reading comprehension (Heilman and Smith, 2010; Rus et al., 2011; Alberti et al., 2019; Puri et al., 2020a), online tutoring (Lindberg et al., 2013), factual QA (Serban et al., 2016) and visual question generation (Mostafazadeh et al., 2016). A comprehensive survey on neural question generation can be found in Pan et al. (2019).…”
Section: Related Work
confidence: 99%