Elsevier OA CC-By Corpus

Kershaw, Daniel; Koeling, Rob

doi:10.48550/arxiv.2008.00774

Cited by 1 publication

(1 citation statement)

References 9 publications

(13 reference statements)

“…Using Dialogizer, we generate four ConvQA datasets for use in experiments. These datasets are developed by leveraging four source-text datasets from diverse domains: Wikipedia, PubMed, CC-News (Hamborg et al, 2017), and Elsevier OA CC-By (Kershaw and Koeling, 2020). Each dataset is named after its corresponding source dataset, namely WikiDialog2, PubmedDialog, CC-newsDialog, and ElsevierDialog.…”

Section: Generated Datasets (mentioning)

confidence: 99%

Dialogizer: Context-aware Conversational-QA Dataset Generation from Textual Sources

Hwang, Kim, Bae et al. 2023

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

To address the data scarcity issue in Conversational question answering (ConvQA), a dialog inpainting method, which utilizes documents to generate ConvQA datasets, has been proposed. However, the original dialog inpainting model is trained solely on the dialog reconstruction task, resulting in the generation of questions with low contextual relevance due to insufficient learning of question-answer alignment. To overcome this limitation, we propose a novel framework called Dialogizer, which has the capability to automatically generate ConvQA datasets with high contextual relevance from textual sources. The framework incorporates two training tasks: question-answer matching (QAM) and topic-aware dialog generation (TDG). Moreover, re-ranking is conducted during the inference phase based on the contextual relevance of the generated questions. Using our framework, we produce four ConvQA datasets by utilizing documents from multiple domains as the primary source. Through automatic evaluation using diverse metrics, as well as human evaluation, we validate that our proposed framework exhibits the ability to generate datasets of higher quality compared to the baseline dialog inpainting model.
