2020
DOI: 10.48550/arxiv.2010.12688
Preprint

Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training

Abstract: Generating natural sentences from Knowledge Graph (KG) triples, known as Data-To-Text Generation, is a task with many datasets for which numerous complex systems have been developed. However, no prior work has attempted to perform this generation at scale by converting an entire KG into natural text. In this paper, we verbalize the entire Wikidata KG, and create a KG-Text aligned corpus in the training process. We discuss the challenges in verbalizing an entire KG versus verbalizing smaller datasets. We fur…
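
To make the task concrete, here is a toy Python illustration of a single Data-To-Text training pair: a small set of Wikidata-style triples for one entity and a target natural-language sentence. The triples and the sentence are invented for illustration and are not drawn from the paper's corpus.

# One illustrative (triples, sentence) pair for KG verbalization.
# Entity names, relations, and the target sentence are invented examples.
triples = [
    ("Marie Curie", "occupation", "physicist"),
    ("Marie Curie", "award received", "Nobel Prize in Physics"),
]
target = "Marie Curie was a physicist who received the Nobel Prize in Physics."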

Cited by 8 publications (13 citation statements).
References 19 publications.
“…As Fig. 4 shows, Oshin et al. [69] propose pipelines for training the TEKGEN model and generating the KELM corpus. The authors evaluate using two open-domain question answering datasets and one knowledge-probing dataset.…”
Section: Generate Data Based on Big Model and Knowledge Graph (mentioning)
confidence: 99%
“…Note that since the mapping from m to s is many-to-one, semantic ambiguity may exist. To mitigate this ambiguity and implement the triple-to-text conversion, the pre-trained Text-to-Text Transfer Transformer (T5) model is fine-tuned on our training corpus [13]. Since T5 is pre-trained on billions of sentences, it can take context into account when generating the reconstructed text.…”
Section: Semantic Ambiguity (mentioning)
confidence: 99%
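
The statement above describes fine-tuning T5 to convert triples into text. Below is a minimal inference sketch using the Hugging Face transformers library; the "t5-small" checkpoint, the flat "head relation tail" linearization, and the beam-search settings are illustrative assumptions, and a checkpoint actually fine-tuned on a KG-text aligned corpus would be needed to produce fluent verbalizations.

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Off-the-shelf checkpoint used only as a stand-in for a fine-tuned model.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def verbalize(head: str, relation: str, tail: str) -> str:
    # Linearize the triple into a flat prompt (assumed input format).
    prompt = f"{head} {relation} {tail}"
    inputs = tokenizer(prompt, return_tensors="pt")
    # Beam search tends to keep the generated sentence close to the input facts.
    output_ids = model.generate(**inputs, max_length=64, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(verbalize("Marie Curie", "award received", "Nobel Prize in Physics"))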
“…The implementation of the semantic symbol abstraction is to align the input text m with the triplet (h, r, t), where h, r and t denote the head entity, the relation and the tail entity of the knowledge graph, respectively. It is realized by a Text2KG alignment algorithm, as shown in Table 1 [13]. Note that for each sentence in the input text, all triples whose h and t appear in the sentence are matched.…”
Section: B. Semantic Symbol Abstraction (mentioning)
confidence: 99%
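
The statement above specifies the matching rule: for each sentence, keep every triple whose head and tail both occur in it. Below is a minimal Python sketch of that rule; the substring matching, lowercasing, and data structures are simplifying assumptions, since the cited Text2KG alignment algorithm involves more than exact string overlap (e.g., entity aliases).

from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head entity h, relation r, tail entity t)

def align(sentences: List[str], triples: List[Triple]) -> List[Tuple[str, List[Triple]]]:
    """Pair each sentence with all triples whose h and t appear in it."""
    aligned = []
    for sent in sentences:
        low = sent.lower()
        # Keep a triple only if both its head and its tail occur in the sentence.
        matched = [t for t in triples if t[0].lower() in low and t[2].lower() in low]
        aligned.append((sent, matched))
    return aligned

sentences = ["Marie Curie received the Nobel Prize in Physics in 1903."]
triples = [("Marie Curie", "award received", "Nobel Prize in Physics"),
           ("Marie Curie", "country of citizenship", "Poland")]
print(align(sentences, triples))  # only the award triple matches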