2021 · Preprint
DOI: 10.48550/arxiv.2107.02137

ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation

Yu Sun, Shuohuan Wang, Shikun Feng, et al.

Abstract: Pre-trained models have achieved state-of-the-art results in various Natural Language Processing (NLP) tasks. Recent works such as T5 [1] and GPT-3 have shown that scaling up pre-trained language models can improve their generalization abilities. In particular, the GPT-3 model with 175 billion parameters shows strong task-agnostic zero-shot/few-shot learning capabilities. Despite their success, these large-scale models are trained on plain texts without introducing knowledge such as linguistic knowledge and world knowledge. […]

Cited by 57 publications (92 citation statements)
References 48 publications (67 reference statements)
“…Retrieval in language models. Several retrieval-based methods have recently been developed for question answering, controllable generation, and machine translation (Guu et al., 2020; Lee et al., 2019; Lewis et al., 2020; Sun et al., 2021; Borgeaud et al., 2021). The general scheme in such methods is to combine a parametric model (like a BERT-style masked language model or a pre-trained seq2seq model) with a non-parametric retrieval system.…”
Section: Related Work (mentioning)
confidence: 99%
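
The retrieve-then-read scheme summarized in the statement above can be illustrated with a small, self-contained sketch: a non-parametric retriever picks supporting passages, which are concatenated with the query before a parametric model generates the answer. The toy corpus, the TF-IDF retriever, and the `parametric_generate` stub are assumptions for illustration, not the pipeline of any of the cited systems.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy document store standing in for a large retrieval corpus.
corpus = [
    "ERNIE 3.0 fuses an auto-regressive network and an auto-encoding network.",
    "GPT-3 has 175 billion parameters.",
    "Retrieval-augmented models condition generation on retrieved text.",
]

vectorizer = TfidfVectorizer()
corpus_vectors = vectorizer.fit_transform(corpus)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Non-parametric component: nearest neighbours in TF-IDF space."""
    scores = cosine_similarity(vectorizer.transform([query]), corpus_vectors)[0]
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

def parametric_generate(prompt: str) -> str:
    """Placeholder for a pre-trained masked LM or seq2seq model (not implemented here)."""
    return f"<model output conditioned on: {prompt!r}>"

def retrieve_then_generate(query: str) -> str:
    # Condition the parametric model on the retrieved context.
    context = " ".join(retrieve(query))
    return parametric_generate(f"context: {context} question: {query}")

print(retrieve_then_generate("How many parameters does GPT-3 have?"))
```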
“…Pre-trained language models have achieved remarkable improvements in many NLP tasks, and many variants of PTMs have been proposed, for example GPT, GPT-2 and GPT-3 (Radford et al., 2018, 2019; Brown et al., 2020), BERT (Devlin et al., 2019), XLNet (Yang et al., 2019), ALBERT (Lan et al., 2019), ERNIE, BART (Lewis et al., 2020) and RoBERTa (Liu et al., 2019b). […] structure is modified, and knowledge-aware tasks are designed (Zhang et al., 2019; Liu et al., 2020b; Sun et al., 2021; Liu et al., 2020a; Su et al., 2021). For example, ERNIE 3.0 (Sun et al., 2021) appends triples, e.g., (Andersen, Write, Nightingale), ahead of the original input sentence, and designs tasks to predict the relation "Write" in the triple.…”
Section: Related Work (mentioning)
confidence: 99%
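
The input layout described in the statement above, a knowledge-graph triple placed ahead of the sentence with its relation masked so the model must predict it, can be sketched as follows. The special tokens and the helper `build_knowledge_masked_input` are hypothetical and only mimic the idea; they are not ERNIE 3.0's actual preprocessing code.

```python
# Hypothetical sketch: prepend a (head, relation, tail) triple to the input
# sentence and mask the relation token as the prediction target.
MASK = "[MASK]"
SEP = "[SEP]"

def build_knowledge_masked_input(triple, sentence, mask_relation=True):
    """Prepend the triple to the sentence; optionally mask the relation."""
    head, relation, tail = triple
    relation_token = MASK if mask_relation else relation
    tokens = [head, relation_token, tail, SEP] + sentence.split()
    label = relation if mask_relation else None
    return tokens, label

tokens, label = build_knowledge_masked_input(
    ("Andersen", "Write", "Nightingale"),
    "Hans Christian Andersen wrote the fairy tale The Nightingale.",
)
print(tokens)  # ['Andersen', '[MASK]', 'Nightingale', '[SEP]', 'Hans', ...]
print(label)   # 'Write' -- the target of the relation-prediction task
```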
“…In addition, most large-scale models are trained in an auto-regressive way, but [6] shows that such models have poorer performance with traditional fine-tuning when adapting to downstream language understanding tasks. In order to solve these problems, a unified framework called ERNIE 3.0 [2] was proposed to train large-scale knowledge-enhanced models on large-scale plain texts and a large-scale knowledge graph by fusing the auto-regressive network and the auto-encoding network.…”
Section: Large-scale Pre-training (mentioning)
confidence: 99%
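
A minimal sketch of the idea referred to in the statement above, one shared backbone trained jointly with an auto-encoding (bidirectional) objective and an auto-regressive (left-to-right) objective, is given below in PyTorch. The tiny Transformer, the two heads, and the toy losses are assumptions for illustration, not the ERNIE 3.0 architecture or training code.

```python
import torch
import torch.nn as nn

class UnifiedLM(nn.Module):
    """Shared Transformer backbone with an auto-encoding head and an auto-regressive head."""
    def __init__(self, vocab_size=1000, hidden=64, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, layers)
        self.mlm_head = nn.Linear(hidden, vocab_size)  # auto-encoding (understanding)
        self.ar_head = nn.Linear(hidden, vocab_size)   # auto-regressive (generation)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        causal = nn.Transformer.generate_square_subsequent_mask(token_ids.size(1))
        bi_states = self.backbone(x)                   # full (bidirectional) attention
        causal_states = self.backbone(x, mask=causal)  # left-to-right attention
        return self.mlm_head(bi_states), self.ar_head(causal_states)

model = UnifiedLM()
loss_fn = nn.CrossEntropyLoss()
tokens = torch.randint(0, 1000, (2, 16))  # toy batch of token ids

mlm_logits, ar_logits = model(tokens)
# Toy losses: a real masked LM would only score masked positions.
mlm_loss = loss_fn(mlm_logits.transpose(1, 2), tokens)
ar_loss = loss_fn(ar_logits[:, :-1].transpose(1, 2), tokens[:, 1:])  # next-token targets
(mlm_loss + ar_loss).backward()  # both objectives update the shared backbone
```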
“…A significant improvement has been achieved on various natural language processing tasks by knowledge-enhanced pre-trained models of the base or large model size, such as ERNIE, ERNIE 2.0, and SpanBERT [51], where the base/large model sizes correspond to a 12/24-layer Transformer, respectively. In order to explore the effectiveness of knowledge-enhanced large-scale pre-trained models, a Continual Multi-Paradigms Unified Pre-training Framework named ERNIE 3.0 Framework is proposed in [2] to pre-train models on a massive unsupervised corpus including plain texts and knowledge graphs. Specifically, the ERNIE 3.0 Framework allows collaborative pre-training among multi-task paradigms, in which various types of pre-training tasks are incrementally deployed in the corresponding task paradigm to enable the model to learn different levels of knowledge, i.e., valuable lexical, syntactic and semantic information, more effectively.…”
Section: Model Distillation Of Language Models (mentioning)
confidence: 99%
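
The continual multi-paradigm scheduling described above can be caricatured with a short sketch in which pre-training tasks are deployed incrementally under their task paradigm and then sampled during training. The paradigm names, task names, and uniform sampling are invented for illustration and are not ERNIE 3.0's actual schedule.

```python
import random
from collections import defaultdict

class ContinualPretrainingSchedule:
    """Toy registry of pre-training tasks grouped by task paradigm."""
    def __init__(self):
        self.tasks_by_paradigm = defaultdict(list)

    def deploy_task(self, paradigm: str, task_name: str) -> None:
        """Incrementally add a new pre-training task without removing earlier ones."""
        self.tasks_by_paradigm[paradigm].append(task_name)

    def sample_task(self) -> tuple[str, str]:
        """Pick a paradigm, then one of its currently deployed tasks."""
        paradigm = random.choice(list(self.tasks_by_paradigm))
        return paradigm, random.choice(self.tasks_by_paradigm[paradigm])

schedule = ContinualPretrainingSchedule()
schedule.deploy_task("understanding", "knowledge-masked language modeling")
schedule.deploy_task("understanding", "sentence reordering")
schedule.deploy_task("generation", "document language modeling")

for step in range(3):
    print(step, schedule.sample_task())
```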