Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021
DOI: 10.18653/v1/2021.naacl-main.278

Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training

Abstract: Prior work on Data-To-Text Generation, the task of converting knowledge graph (KG) triples into natural text, focused on domain-specific benchmark datasets. In this paper, however, we verbalize the entire English Wikidata KG, and discuss the unique challenges associated with a broad, open-domain, large-scale verbalization. We further show that verbalizing a comprehensive, encyclopedic KG like Wikidata can be used to integrate structured KGs and natural language corpora. In contrast to the many architectures that…
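As a rough illustration of the Data-to-Text setup the abstract describes, the sketch below shows how a handful of Wikidata-style triples could be linearized into a single input string for a sequence-to-sequence verbalization model. This is a minimal, assumed format: the <S>/<R>/<O> delimiters and the example labels are illustrative and not the exact convention used in the paper's TEKGEN/KELM pipeline.

```python
# Minimal sketch (not the authors' code): linearizing KG triples into a flat
# string that a fine-tuned seq2seq model could verbalize into a sentence.

from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object), already label-resolved


def linearize_triples(triples: List[Triple]) -> str:
    """Concatenate triples into one input string for a seq2seq verbalizer.

    The <S>/<R>/<O> delimiters are an assumed convention, not the exact
    format used by the paper.
    """
    parts = []
    for subj, rel, obj in triples:
        parts.append(f"<S> {subj} <R> {rel} <O> {obj}")
    return " ".join(parts)


if __name__ == "__main__":
    # One entity's subgraph: a few triples sharing a subject (illustrative labels).
    triples = [
        ("Osamu Tezuka", "occupation", "manga artist"),
        ("Osamu Tezuka", "country of citizenship", "Japan"),
    ]
    print(linearize_triples(triples))
    # A fine-tuned T5-style model would then map this string to a sentence such as
    # "Osamu Tezuka is a Japanese manga artist."
```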

Cited by 78 publications (87 citation statements). References 36 publications.
“…We adapted the large-scale TEKGEN corpus (Agarwal et al., 2021) for T2G and G2T tasks and confirmed the benefit of SCST-based fine-tuning approach over CE-trained baselines.…”
Section: Introduction
Mentioning confidence: 61%
“…In this work, we are interested in leveraging the power of PLMs for both G2T and T2G generation tasks, and will demonstrate the strength of our approach by improving upon the best results of the WebNLG+ 2020 Challenge (rev 3.0) as reported by Castro Ferreira et al. (2020a) for both T2G (Semantic Parsing) and G2T (Data-to-Text) tasks. We will also present results for the TEKGEN Corpus (Agarwal et al., 2021) to show performance on a different, much larger dataset. To illustrate the task of generation, Fig.…”
Section: Introduction
Mentioning confidence: 99%
“…pretraining. Using knowledge information for pretraining language models (Sun et al., 2019; Guu et al., 2020; Wang et al., 2021b; Agarwal et al., 2021; Verga et al., 2021) has recently grown in popularity and has achieved substantial improvements on knowledge-driven tasks such as question answering and named entity recognition. Instead of using knowledge information for improving downstream knowledge-driven tasks, we focus on using knowledge information for improving the generation capability of the language model itself.…”
Section: Knowledge-enhanced pretraining
Mentioning confidence: 99%
“…However, to ensure wider linguistic variety as well as accuracy of the mapping, we use verbalizations of knowledge graph triples that are synthesized through a sequence-to-sequence model. Concretely, we use generated sentences from KELM (Agarwal et al., 2020), which are not grounded with Wikidata IDs, and generate a post-hoc mapping back to Wikidata. For example, given the sentence: "The Slice of Life manga series The Film Lives On was written by Osamu Tezuka." we map it to the Wikidata triple (Q11332517, P50, Q193300).…”
Section: The WikiNLDB Dataset
Mentioning confidence: 99%
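The excerpt above describes mapping KELM's generated sentences back to Wikidata IDs post hoc. The sketch below shows one simplistic way such a mapping could work, assuming a label-to-ID lookup; it is not the WikiNLDB authors' actual pipeline, and the tiny dictionaries stand in for a full Wikidata label index.

```python
# Minimal sketch (assumption, not the cited paper's method): map surface labels
# in a generated sentence back to Wikidata QIDs/PIDs via a label lookup.

LABEL_TO_QID = {
    "The Film Lives On": "Q11332517",
    "Osamu Tezuka": "Q193300",
}
RELATION_TO_PID = {
    "was written by": "P50",
}


def map_sentence_to_triple(sentence: str):
    """Return (subject_qid, pid, object_qid) if one known relation phrase and
    exactly two known entity labels occur in the sentence; otherwise None."""
    for phrase, pid in RELATION_TO_PID.items():
        if phrase in sentence:
            entities = [qid for label, qid in LABEL_TO_QID.items() if label in sentence]
            if len(entities) == 2:
                subj, obj = entities  # subject/object ordering is a simplification here
                return (subj, pid, obj)
    return None


sentence = ("The Slice of Life manga series The Film Lives On was written by "
            "Osamu Tezuka.")
print(map_sentence_to_triple(sentence))  # ('Q11332517', 'P50', 'Q193300')
```

A real mapping would need entity disambiguation (many labels share a QID string) and alignment of the relation phrase to the correct property, which is why the cited work treats this as a non-trivial post-hoc grounding step.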