2020
DOI: 10.48550/arxiv.2010.12688
Preprint

Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training

Abstract: Generating natural sentences from Knowledge Graph (KG) triples, known as Data-To-Text Generation, is a task with many datasets for which numerous complex systems have been developed. However, no prior work has attempted to perform this generation at scale by converting an entire KG into natural text. In this paper, we verbalize the entire Wikidata KG, and create a KG-Text aligned corpus in the training process. We discuss the challenges in verbalizing an entire KG versus verbalizing smaller datasets. We fur…
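
To make the task concrete, here is a toy Python illustration of a single Data-To-Text training pair: a small set of Wikidata-style triples for one entity and a target natural-language sentence. The triples and the sentence are invented for illustration and are not drawn from the paper's corpus.

# One illustrative (triples, sentence) pair for KG verbalization.
# Entity names, relations, and the target sentence are invented examples.
triples = [
    ("Marie Curie", "occupation", "physicist"),
    ("Marie Curie", "award received", "Nobel Prize in Physics"),
]
target = "Marie Curie was a physicist who received the Nobel Prize in Physics."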

Cited by 8 publications (13 citation statements).
References 19 publications.
“…As Fig. 4 shows, Oshin et al. [69] propose pipelines for training the TEKGEN model and generating the KELM corpus. The authors evaluate using two open-domain question answering datasets and one knowledge-probing dataset.…”
Section: Generate Data Based on Big Model and Knowledge Graph (mentioning)
confidence: 99%
“…Note that since the mapping from m to s is many-to-one, semantic ambiguity may exist. To mitigate this ambiguity and implement the triple-to-text conversion, the pre-trained Text-to-Text Transfer Transformer (T5) model is fine-tuned on our training corpus [13]. Since T5 is pre-trained on billions of sentences, it can take context into account when generating the reconstructed text.…”
Section: Semantic Ambiguity (mentioning)
confidence: 99%
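
The statement above describes fine-tuning T5 to convert triples into text. Below is a minimal inference sketch using the Hugging Face transformers library; the "t5-small" checkpoint, the flat "head relation tail" linearization, and the beam-search settings are illustrative assumptions, and a checkpoint actually fine-tuned on a KG-text aligned corpus would be needed to produce fluent verbalizations.

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Off-the-shelf checkpoint used only as a stand-in for a fine-tuned model.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def verbalize(head: str, relation: str, tail: str) -> str:
    # Linearize the triple into a flat prompt (assumed input format).
    prompt = f"{head} {relation} {tail}"
    inputs = tokenizer(prompt, return_tensors="pt")
    # Beam search tends to keep the generated sentence close to the input facts.
    output_ids = model.generate(**inputs, max_length=64, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(verbalize("Marie Curie", "award received", "Nobel Prize in Physics"))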
“…The implementation of the semantic symbol abstraction is to align the input text m with the triplet (h, r, t), where h, r and t denote the head entity, the relation and the tail entity of the knowledge graph, respectively. It is realized by a Text2KG alignment algorithm, as shown in Table 1 [13]. Note that for each sentence in the input text, all triples whose h and t appear in the sentence are matched.…”
Section: B. Semantic Symbol Abstraction (mentioning)
confidence: 99%
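
The statement above specifies the matching rule: for each sentence, keep every triple whose head and tail both occur in it. Below is a minimal Python sketch of that rule; the substring matching, lowercasing, and data structures are simplifying assumptions, since the cited Text2KG alignment algorithm involves more than exact string overlap (e.g., entity aliases).

from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head entity h, relation r, tail entity t)

def align(sentences: List[str], triples: List[Triple]) -> List[Tuple[str, List[Triple]]]:
    """Pair each sentence with all triples whose h and t appear in it."""
    aligned = []
    for sent in sentences:
        low = sent.lower()
        # Keep a triple only if both its head and its tail occur in the sentence.
        matched = [t for t in triples if t[0].lower() in low and t[2].lower() in low]
        aligned.append((sent, matched))
    return aligned

sentences = ["Marie Curie received the Nobel Prize in Physics in 1903."]
triples = [("Marie Curie", "award received", "Nobel Prize in Physics"),
           ("Marie Curie", "country of citizenship", "Poland")]
print(align(sentences, triples))  # only the award triple matches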