Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) 2020
DOI: 10.18653/v1/2020.emnlp-main.468

Training Question Answering Models From Synthetic Data

Abstract: Question and answer generation is a data augmentation method that aims to improve question answering (QA) models given the limited amount of human-labeled data. However, a considerable gap remains between synthetic and human-generated question-answer pairs. This work aims to narrow this gap by taking advantage of large language models and explores several factors such as model size, quality of pretrained models, scale of data synthesized, and algorithmic choices. On the SQuAD1.1 question answering task, we ach…

Cited by 94 publications (83 citation statements)
References 22 publications
“…SQuAD 1.1 dev set is used to select the best model during training. As a baseline for QA data generation, we implemented a three-stage pipeline similar to the state-of-the-art approach of Puri et al. (2020). We call this baseline QGen, which generates a question given a passage and an extracted span, q ∼ p(q|a, c).…”
Section: Methods
confidence: 99%
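The quoted QGen baseline samples a question conditioned on a passage and an extracted answer span, q ∼ p(q|a, c). The pipeline's overall shape can be sketched as below; this is a minimal, hypothetical stand-in, not the cited implementation: both `extract_answer_spans` and `generate_question` are stubs where trained neural models (an answer extractor and a seq2seq question generator) would sit in practice.

```python
# Hedged sketch of a three-stage synthetic QA pipeline: (1) extract candidate
# answer spans from a passage, (2) generate a question per span, q ~ p(q|a, c),
# producing (question, answer, context) triples for downstream filtering.
# Both model stages below are crude stubs, not trained models.

def extract_answer_spans(context):
    """Stage 1 stub: propose candidate answer spans a from the passage c.
    A trained extractor would predict spans; here we pick capitalized tokens."""
    return [tok for tok in context.split() if tok[0].isupper()]

def generate_question(answer, context):
    """Stage 2 stub: sample a question q ~ p(q | a, c).
    A template stands in for a learned seq2seq generator."""
    return f"What does the passage say about {answer}?"

def synthesize_qa_pairs(context):
    """Run stages 1-2 to produce synthetic (question, answer, context) triples."""
    pairs = []
    for answer in extract_answer_spans(context):
        question = generate_question(answer, context)
        pairs.append((question, answer, context))
    return pairs

passage = "Tesla was born in Smiljan in 1856."
pairs = synthesize_qa_pairs(passage)
```

A third stage, round-trip consistency filtering, would then prune these triples; that step is described in the filtering citation on this page.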
“…Recent work has used the round-trip filtering method (Puri et al., 2020) to prune the synthetic QA set and improve data quality. This method consists of two steps: (1) using an RC model to provide answers to the generated questions; (2) dropping the QA pairs for which the answer of the RC model does not match the span-detected answer.…”
Section: Filtering
confidence: 99%
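The two filtering steps quoted above can be sketched as follows. This is a minimal illustration under stated assumptions: the reading-comprehension model is a hypothetical stub (a real filter would use a trained RC model predicting answer spans), and the exact-match criterion mirrors the "drop on mismatch" rule in the quote.

```python
# Hedged sketch of round-trip consistency filtering: (1) an RC model answers
# each generated question; (2) QA pairs where the RC answer disagrees with the
# span used to generate the question are dropped. The RC model is a stub.

def _norm(tok):
    """Normalize a token for comparison (lowercase, strip punctuation)."""
    return tok.lower().strip("?.,")

def rc_model_answer(question, context):
    """Stub RC model: return the first context token that also appears in the
    question. A trained reader would instead predict an answer span."""
    q_tokens = {_norm(t) for t in question.split()}
    for tok in context.split():
        if _norm(tok) in q_tokens:
            return tok
    return ""

def roundtrip_filter(qa_pairs, context):
    """Keep only (question, answer) pairs the RC model can reproduce."""
    kept = []
    for question, answer in qa_pairs:
        if rc_model_answer(question, context) == answer:
            kept.append((question, answer))
    return kept
```

For example, given the context "Paris is the capital of France.", the pair ("Where is Paris?", "Paris") survives because the stub RC model recovers "Paris", while a pair whose generated answer the RC model cannot reproduce is dropped.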
“…These approaches use generative adversarial networks (Zhao et al., 2018b) and population-based optimization algorithms (Alzantot et al., 2018). Previous work has also presented methods to generate questions for reading comprehension (Heilman and Smith, 2010; Rus et al., 2011; Alberti et al., 2019; Puri et al., 2020a), online tutoring (Lindberg et al., 2013), factual QA (Serban et al., 2016) and visual question generation (Mostafazadeh et al., 2016). A comprehensive survey on neural question generation can be found in Pan et al. (2019).…”
Section: Related Work
confidence: 99%