2021 · Preprint
DOI: 10.48550/arxiv.2107.02137

ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation

Yu Sun, Shuohuan Wang, Shikun Feng, et al.

Abstract: Pre-trained models have achieved state-of-the-art results in various Natural Language Processing (NLP) tasks. Recent works such as T5 [1] and GPT-3 have shown that scaling up pre-trained language models can improve their generalization abilities. In particular, the GPT-3 model with 175 billion parameters shows strong task-agnostic zero-shot/few-shot learning capabilities. Despite their success, these large-scale models are trained on plain texts without introducing knowledge such as linguistic knowledge and world knowledge. […]

Cited by 57 publications (92 citation statements)
References 48 publications (67 reference statements)
“…Retrieval in language models. Several retrieval-based methods have recently been developed for question answering, controllable generation, and machine translation (Guu et al., 2020; Lee et al., 2019; Lewis et al., 2020; Sun et al., 2021; Borgeaud et al., 2021). The general scheme in such methods is to combine a parametric model (like a BERT-style masked language model or a pre-trained seq2seq model) with a non-parametric retrieval system.…”
Section: Related Work (mentioning)
confidence: 99%
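
The retrieve-then-read scheme summarized in the statement above can be illustrated with a small, self-contained sketch: a non-parametric retriever picks supporting passages, which are concatenated with the query before a parametric model generates the answer. The toy corpus, the TF-IDF retriever, and the `parametric_generate` stub are assumptions for illustration, not the pipeline of any of the cited systems.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy document store standing in for a large retrieval corpus.
corpus = [
    "ERNIE 3.0 fuses an auto-regressive network and an auto-encoding network.",
    "GPT-3 has 175 billion parameters.",
    "Retrieval-augmented models condition generation on retrieved text.",
]

vectorizer = TfidfVectorizer()
corpus_vectors = vectorizer.fit_transform(corpus)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Non-parametric component: nearest neighbours in TF-IDF space."""
    scores = cosine_similarity(vectorizer.transform([query]), corpus_vectors)[0]
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

def parametric_generate(prompt: str) -> str:
    """Placeholder for a pre-trained masked LM or seq2seq model (not implemented here)."""
    return f"<model output conditioned on: {prompt!r}>"

def retrieve_then_generate(query: str) -> str:
    # Condition the parametric model on the retrieved context.
    context = " ".join(retrieve(query))
    return parametric_generate(f"context: {context} question: {query}")

print(retrieve_then_generate("How many parameters does GPT-3 have?"))
```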
“…Pre-trained language models have achieved remarkable improvements in many NLP tasks, and many variants of PTMs have been proposed, for example GPT, GPT-2 and GPT-3 (Radford et al., 2018, 2019; Brown et al., 2020), BERT (Devlin et al., 2019), XLNet (Yang et al., 2019), ALBERT (Lan et al., 2019), ERNIE, BART (Lewis et al., 2020) and RoBERTa (Liu et al., 2019b). […] structure is modified, and knowledge-aware tasks are designed (Zhang et al., 2019; Liu et al., 2020b; Sun et al., 2021; Liu et al., 2020a; Su et al., 2021). For example, ERNIE 3.0 (Sun et al., 2021) appends triples, e.g., (Andersen, Write, Nightingale), ahead of the original input sentence, and designs tasks to predict the relation "Write" in the triple.…”
Section: Related Work (mentioning)
confidence: 99%
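
The input layout described in the statement above, a knowledge-graph triple placed ahead of the sentence with its relation masked so the model must predict it, can be sketched as follows. The special tokens and the helper `build_knowledge_masked_input` are hypothetical and only mimic the idea; they are not ERNIE 3.0's actual preprocessing code.

```python
# Hypothetical sketch: prepend a (head, relation, tail) triple to the input
# sentence and mask the relation token as the prediction target.
MASK = "[MASK]"
SEP = "[SEP]"

def build_knowledge_masked_input(triple, sentence, mask_relation=True):
    """Prepend the triple to the sentence; optionally mask the relation."""
    head, relation, tail = triple
    relation_token = MASK if mask_relation else relation
    tokens = [head, relation_token, tail, SEP] + sentence.split()
    label = relation if mask_relation else None
    return tokens, label

tokens, label = build_knowledge_masked_input(
    ("Andersen", "Write", "Nightingale"),
    "Hans Christian Andersen wrote the fairy tale The Nightingale.",
)
print(tokens)  # ['Andersen', '[MASK]', 'Nightingale', '[SEP]', 'Hans', ...]
print(label)   # 'Write' -- the target of the relation-prediction task
```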
“…In addition, most large-scale models are trained in an auto-regressive way, but [6] shows that such models have poorer performance with traditional fine-tuning when adapting to downstream language understanding tasks. In order to solve these problems, a unified framework called ERNIE 3.0 [2] was proposed to train large-scale knowledge-enhanced models on large-scale plain texts and a large-scale knowledge graph by fusing the auto-regressive network and the auto-encoding network.…”
Section: Large-scale Pre-training (mentioning)
confidence: 99%
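
A minimal sketch of the idea referred to in the statement above, one shared backbone trained jointly with an auto-encoding (bidirectional) objective and an auto-regressive (left-to-right) objective, is given below in PyTorch. The tiny Transformer, the two heads, and the toy losses are assumptions for illustration, not the ERNIE 3.0 architecture or training code.

```python
import torch
import torch.nn as nn

class UnifiedLM(nn.Module):
    """Shared Transformer backbone with an auto-encoding head and an auto-regressive head."""
    def __init__(self, vocab_size=1000, hidden=64, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, layers)
        self.mlm_head = nn.Linear(hidden, vocab_size)  # auto-encoding (understanding)
        self.ar_head = nn.Linear(hidden, vocab_size)   # auto-regressive (generation)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        causal = nn.Transformer.generate_square_subsequent_mask(token_ids.size(1))
        bi_states = self.backbone(x)                   # full (bidirectional) attention
        causal_states = self.backbone(x, mask=causal)  # left-to-right attention
        return self.mlm_head(bi_states), self.ar_head(causal_states)

model = UnifiedLM()
loss_fn = nn.CrossEntropyLoss()
tokens = torch.randint(0, 1000, (2, 16))  # toy batch of token ids

mlm_logits, ar_logits = model(tokens)
# Toy losses: a real masked LM would only score masked positions.
mlm_loss = loss_fn(mlm_logits.transpose(1, 2), tokens)
ar_loss = loss_fn(ar_logits[:, :-1].transpose(1, 2), tokens[:, 1:])  # next-token targets
(mlm_loss + ar_loss).backward()  # both objectives update the shared backbone
```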
“…A significant improvement has been achieved on various natural language processing tasks by knowledge-enhanced pre-trained models of the base or large model size, such as ERNIE, ERNIE 2.0, and SpanBERT [51], where the base/large model sizes correspond to a 12/24-layer Transformer, respectively. In order to explore the effectiveness of knowledge-enhanced large-scale pre-trained models, a Continual Multi-Paradigms Unified Pre-training Framework named ERNIE 3.0 Framework is proposed in [2] to pre-train models on a massive unsupervised corpus including plain texts and knowledge graphs. Specifically, the ERNIE 3.0 Framework allows collaborative pre-training among multi-task paradigms, in which various types of pre-training tasks are incrementally deployed in the corresponding task paradigm to enable the model to learn different levels of knowledge, i.e., valuable lexical, syntactic and semantic information, more effectively.…”
Section: Model Distillation Of Language Models (mentioning)
confidence: 99%
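
The continual multi-paradigm scheduling described above can be caricatured with a short sketch in which pre-training tasks are deployed incrementally under their task paradigm and then sampled during training. The paradigm names, task names, and uniform sampling are invented for illustration and are not ERNIE 3.0's actual schedule.

```python
import random
from collections import defaultdict

class ContinualPretrainingSchedule:
    """Toy registry of pre-training tasks grouped by task paradigm."""
    def __init__(self):
        self.tasks_by_paradigm = defaultdict(list)

    def deploy_task(self, paradigm: str, task_name: str) -> None:
        """Incrementally add a new pre-training task without removing earlier ones."""
        self.tasks_by_paradigm[paradigm].append(task_name)

    def sample_task(self) -> tuple[str, str]:
        """Pick a paradigm, then one of its currently deployed tasks."""
        paradigm = random.choice(list(self.tasks_by_paradigm))
        return paradigm, random.choice(self.tasks_by_paradigm[paradigm])

schedule = ContinualPretrainingSchedule()
schedule.deploy_task("understanding", "knowledge-masked language modeling")
schedule.deploy_task("understanding", "sentence reordering")
schedule.deploy_task("generation", "document language modeling")

for step in range(3):
    print(step, schedule.sample_task())
```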