Proceedings of the 2020 ACM SIGIR International Conference on Theory of Information Retrieval (ICTIR 2020)
DOI: 10.1145/3409256.3409836

Sanitizing Synthetic Training Data Generation for Question Answering over Knowledge Graphs

Abstract: Synthetic data generation is important for training and evaluating neural models for question answering over knowledge graphs. The quality of the data and the partitioning of the datasets into training, validation, and test splits impact the performance of the models trained on this data. If the synthetic data generation depends on templates, as is the predominant approach for this task, there may be a leakage of information via a shared basis of templates across data splits if the partitioning is not performed …


Cited by 10 publications (8 citation statements). References 31 publications.
“…We compare model performance on the original versus rewritten NL questions in our samples. Specifically, we use neural KGQA models trained on the DBNQA * [16] and GrailQA [11] datasets. 1 The question we seek to answer is how quality improvements on the input NL questions impact the answer prediction effectiveness of the models.…”
Section: Methods
confidence: 99%
“…The extant datasets are taken as the basis to extract templates for both formal queries and NL questions, and those templates are then instantiated with different entity and predicate bindings. DBNQA* [16] partitions DBNQA [12] into training, validation, and test splits based on the underlying templates, avoiding leakage of information between training and test splits. The instances are identical to DBNQA, and so we use DBNQA* in our experiments.…”
Section: Related Work
confidence: 99%
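The template-based partitioning described in the statement above can be sketched as follows. This is a minimal illustration, not the authors' actual code: the `template_id` field, function name, and split ratios are assumptions. The key property is that all instances derived from the same template land in the same split, so no template is shared between training and test data.

```python
import random
from collections import defaultdict

def split_by_template(instances, ratios=(0.8, 0.1, 0.1), seed=42):
    """Partition QA instances into train/valid/test splits such that all
    instances sharing a template fall into the same split, preventing
    template leakage across splits."""
    by_template = defaultdict(list)
    for inst in instances:
        by_template[inst["template_id"]].append(inst)

    # Shuffle template IDs deterministically, then allocate whole
    # templates (not individual instances) to each split.
    template_ids = sorted(by_template)
    random.Random(seed).shuffle(template_ids)

    n = len(template_ids)
    n_train = int(ratios[0] * n)
    n_valid = int(ratios[1] * n)
    groups = (template_ids[:n_train],
              template_ids[n_train:n_train + n_valid],
              template_ids[n_train + n_valid:])
    return tuple([inst for t in g for inst in by_template[t]] for g in groups)
```

A naive instance-level random split would place different instantiations of the same template in both train and test; splitting at the template level is what removes that leak.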
“…More broadly, in Knowledge-Graph Question Answering (KG-QA), work has exploited KGs to generate synthetic data in unseen domains (Linjordet, 2020; Trivedi et al., 2017; Linjordet and Balog, 2020). Our work extends visually-grounded questions with valid common sense KG triplets.…”
Section: Related Work
confidence: 99%
“…In practice, though, the datasets often still contain redundant data points. The redundancies inherent in text data, such as paraphrases, synonyms, etc., can be especially problematic, resulting in train-test leaks [19,24,29]. For instance, the training and test sets of the ELI5 dataset [13] for question answering were created using TF-IDF as a heuristic to eliminate redundancies between them.…”
Section: Background and Related Work
confidence: 99%
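The TF-IDF heuristic mentioned above for detecting train-test redundancy can be sketched as follows. This is an illustrative reconstruction, not the ELI5 authors' exact procedure: the similarity threshold, function names, and the simple tokenization are assumptions. A test question whose TF-IDF cosine similarity to any training question exceeds the threshold is flagged as a likely train-test leak.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute simple TF-IDF vectors (raw term frequency * smoothed IDF)
    for whitespace-tokenized, lowercased documents."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def flag_near_duplicates(train_qs, test_qs, threshold=0.9):
    """Return indices of test questions that are near-duplicates of some
    training question under TF-IDF cosine similarity."""
    vecs = tfidf_vectors(train_qs + test_qs)
    train_vecs, test_vecs = vecs[:len(train_qs)], vecs[len(train_qs):]
    return [i for i, tv in enumerate(test_vecs)
            if any(cosine(tv, trv) >= threshold for trv in train_vecs)]
```

As the citing work notes, such a surface-level heuristic misses paraphrases and synonyms, which is precisely why redundancies can survive into the final splits.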