Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021
DOI: 10.18653/v1/2021.naacl-main.278

Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training

Abstract: Prior work on Data-To-Text Generation, the task of converting knowledge graph (KG) triples into natural text, focused on domain-specific benchmark datasets. In this paper, however, we verbalize the entire English Wikidata KG, and discuss the unique challenges associated with a broad, open-domain, large-scale verbalization. We further show that verbalizing a comprehensive, encyclopedic KG like Wikidata can be used to integrate structured KGs and natural language corpora. In contrast to the many architectures that…
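As a rough illustration of the Data-to-Text setup the abstract describes, the sketch below shows how a handful of Wikidata-style triples could be linearized into a single input string for a sequence-to-sequence verbalization model. This is a minimal, assumed format: the <S>/<R>/<O> delimiters and the example labels are illustrative and not the exact convention used in the paper's TEKGEN/KELM pipeline.

```python
# Minimal sketch (not the authors' code): linearizing KG triples into a flat
# string that a fine-tuned seq2seq model could verbalize into a sentence.

from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object), already label-resolved


def linearize_triples(triples: List[Triple]) -> str:
    """Concatenate triples into one input string for a seq2seq verbalizer.

    The <S>/<R>/<O> delimiters are an assumed convention, not the exact
    format used by the paper.
    """
    parts = []
    for subj, rel, obj in triples:
        parts.append(f"<S> {subj} <R> {rel} <O> {obj}")
    return " ".join(parts)


if __name__ == "__main__":
    # One entity's subgraph: a few triples sharing a subject (illustrative labels).
    triples = [
        ("Osamu Tezuka", "occupation", "manga artist"),
        ("Osamu Tezuka", "country of citizenship", "Japan"),
    ]
    print(linearize_triples(triples))
    # A fine-tuned T5-style model would then map this string to a sentence such as
    # "Osamu Tezuka is a Japanese manga artist."
```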

Cited by 78 publications (87 citation statements). References 36 publications.
“…We adapted the large-scale TEKGEN corpus (Agarwal et al., 2021) for T2G and G2T tasks and confirmed the benefit of SCST-based fine-tuning approach over CE-trained baselines.…”
Section: Introduction
Mentioning confidence: 61%
“…In this work, we are interested in leveraging the power of PLMs for both G2T and T2G generation tasks, and will demonstrate the strength of our approach by improving upon the best results of the WebNLG+ 2020 Challenge (rev 3.0) as reported by Castro Ferreira et al. (2020a) for both T2G (Semantic Parsing) and G2T (Data-to-Text) tasks. We will also present results for the TEKGEN Corpus (Agarwal et al., 2021) to show performance on a different, much larger dataset. To illustrate the task of generation, Fig.…”
Section: Introduction
Mentioning confidence: 99%
“…pretraining. Using knowledge information for pretraining language models (Sun et al., 2019; Guu et al., 2020; Wang et al., 2021b; Agarwal et al., 2021; Verga et al., 2021) has recently grown in popularity and has achieved substantial improvements on knowledge-driven tasks such as question answering and named entity recognition. Instead of using knowledge information for improving downstream knowledge-driven tasks, we focus on using knowledge information for improving the generation capability of the language model itself.…”
Section: Knowledge-enhanced pretraining
Mentioning confidence: 99%
“…However, to ensure wider linguistic variety as well as accuracy of the mapping, we use verbalizations of knowledge graph triples that are synthesized through a sequence-to-sequence model. Concretely, we use generated sentences from KELM (Agarwal et al., 2020), which are not grounded with Wikidata IDs, and generate a post-hoc mapping back to Wikidata. For example, given the sentence: "The Slice of Life manga series The Film Lives On was written by Osamu Tezuka." we map it to the Wikidata triple (Q11332517, P50, Q193300).…”
Section: The WikiNLDB Dataset
Mentioning confidence: 99%
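The excerpt above describes mapping KELM's generated sentences back to Wikidata IDs post hoc. The sketch below shows one simplistic way such a mapping could work, assuming a label-to-ID lookup; it is not the WikiNLDB authors' actual pipeline, and the tiny dictionaries stand in for a full Wikidata label index.

```python
# Minimal sketch (assumption, not the cited paper's method): map surface labels
# in a generated sentence back to Wikidata QIDs/PIDs via a label lookup.

LABEL_TO_QID = {
    "The Film Lives On": "Q11332517",
    "Osamu Tezuka": "Q193300",
}
RELATION_TO_PID = {
    "was written by": "P50",
}


def map_sentence_to_triple(sentence: str):
    """Return (subject_qid, pid, object_qid) if one known relation phrase and
    exactly two known entity labels occur in the sentence; otherwise None."""
    for phrase, pid in RELATION_TO_PID.items():
        if phrase in sentence:
            entities = [qid for label, qid in LABEL_TO_QID.items() if label in sentence]
            if len(entities) == 2:
                subj, obj = entities  # subject/object ordering is a simplification here
                return (subj, pid, obj)
    return None


sentence = ("The Slice of Life manga series The Film Lives On was written by "
            "Osamu Tezuka.")
print(map_sentence_to_triple(sentence))  # ('Q11332517', 'P50', 'Q193300')
```

A real mapping would need entity disambiguation (many labels share a QID string) and alignment of the relation phrase to the correct property, which is why the cited work treats this as a non-trivial post-hoc grounding step.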