2019
DOI: 10.48550/arxiv.1904.09223
Preprint

ERNIE: Enhanced Representation through Knowledge Integration

Abstract: We present a novel language representation model enhanced by knowledge called ERNIE (Enhanced Representation through kNowledge IntEgration). Inspired by the masking strategy of BERT (Devlin et al., 2018), ERNIE is designed to learn language representation enhanced by knowledge masking strategies, which includes entity-level masking and phrase-level masking. Entity-level strategy masks entities which are usually composed of multiple words. Phrase-level strategy masks the whole phrase which is composed of several…
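To make the abstract's distinction concrete, here is a minimal sketch (not the authors' code) contrasting BERT-style random token masking with the knowledge masking described above, where entity-level and phrase-level spans are masked as whole units. The span boundaries below are hand-labelled assumptions; in ERNIE they come from entity recognition and phrase analysis applied during data preparation.

```python
# Sketch: BERT-style random token masking vs ERNIE-style knowledge masking.
# Span boundaries are illustrative assumptions, not output of real NER/chunking.
import random

tokens = ["New", "York", "is", "the", "largest", "city", "in", "the", "United", "States"]
entity_spans = [(0, 2), (8, 10)]   # "New York", "United States"
phrase_spans = [(3, 6)]            # "the largest city" (assumed chunking)

def bert_style_mask(tokens, prob=0.15):
    """Basic MLM: each token is masked independently at random."""
    return [t if random.random() > prob else "[MASK]" for t in tokens]

def knowledge_mask(tokens, spans):
    """Mask every token of one randomly chosen multi-word span,
    so the model must predict the whole unit from its context."""
    masked = list(tokens)
    start, end = random.choice(spans)
    for i in range(start, end):
        masked[i] = "[MASK]"
    return masked

print(bert_style_mask(tokens))               # e.g. ['New', '[MASK]', 'is', ...]
print(knowledge_mask(tokens, entity_spans))  # e.g. ['[MASK]', '[MASK]', 'is', ...]
print(knowledge_mask(tokens, phrase_spans))  # ['New', 'York', 'is', '[MASK]', '[MASK]', '[MASK]', ...]
```

In the entity-level and phrase-level cases the model cannot rely on the remaining tokens of the masked unit, so it must recover the whole entity or phrase from the surrounding context.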

Cited by 337 publications (290 citation statements)
References 15 publications
“…Besides, MLM randomly masks out some independent words, which are the smallest semantic units in English but may not have complete semantics in other languages, such as Chinese. Thus, ERNIE (Baidu) (Sun et al., 2019b) introduces entity-level and phrase-level masking, where multiple words that represent the same semantic meaning are masked. This achieves good transferability on Chinese NLP tasks.…”
Section: Generative Learning
Citation type: mentioning; confidence: 99%
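The statement above stresses that in Chinese the basic BERT unit is a single character, which is often not a complete semantic unit. The following rough sketch illustrates that point; the character segmentation and the entity span are assumptions chosen for the example, not output of a real tokenizer.

```python
# Illustration: character-level masking vs entity-level masking in Chinese.
# "哈利波特是一系列奇幻小说" ~ "Harry Potter is a series of fantasy novels".
chars = list("哈利波特是一系列奇幻小说")   # character-level tokens
entity = (0, 4)                            # "哈利波特" (Harry Potter) as one entity (assumed span)

# Masking a single character leaves "哈[MASK]波特", which is trivially
# recoverable from the other characters of the entity itself.
char_masked = chars.copy()
char_masked[1] = "[MASK]"

# Masking the whole entity forces the model to recover it from context.
entity_masked = chars.copy()
for i in range(*entity):
    entity_masked[i] = "[MASK]"

print("".join(char_masked))    # 哈[MASK]波特是一系列奇幻小说
print("".join(entity_masked))  # [MASK][MASK][MASK][MASK]是一系列奇幻小说
```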
“…(Peters et al., 2018) serves as the baseline. GPT (Radford et al., 2018), BERT Large (Devlin et al., 2019), T5 (Raffel et al., 2020), and ERNIE (Sun et al., 2019b) have different architectures. RoBERTa, XLM (Lample and Conneau, 2019), and SpanBERT (Joshi et al., 2020) share the same architecture as BERT Large but employ different pre-training methods.…”
Section: Pre-training
Citation type: mentioning; confidence: 99%
“…Contrastively, our work is integrating knowledge from a large MoE model. Sun et al. (2019) proposed to integrate knowledge by using knowledge masking strategies. Please note our knowledge integration is different from theirs.…”
Section: Knowledge Integration
Citation type: mentioning; confidence: 99%
“…Task-aware Language models. A recent line of works has been focused on bridging the gap between the self-supervision task and the downstream tasks which is inherent to multi-purpose pretrained models (Sun et al. 2019; Tian et al. 2020; Chang et al. 2020). In (Joshi et al. 2020), spans of texts are masked rather than single tokens, resulting in a language model oriented to span-selection tasks.…”
Section: Related Work
Citation type: mentioning; confidence: 99%
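The last statement contrasts knowledge masking with masking contiguous spans of text rather than single tokens (Joshi et al. 2020). As a hedged illustration of that span-masking idea (not the SpanBERT implementation), the sketch below masks random contiguous spans, with span lengths drawn from a geometric distribution, until a masking budget is reached; the budget and parameter values are assumptions for the example.

```python
# Sketch of span masking: mask contiguous spans until ~15% of tokens are masked.
import numpy as np

def span_mask(tokens, mask_budget=0.15, p=0.2, max_span=10, rng=None):
    rng = rng or np.random.default_rng()
    masked = list(tokens)
    n_to_mask = max(1, int(len(tokens) * mask_budget))
    n_masked = 0
    while n_masked < n_to_mask:
        span_len = min(int(rng.geometric(p)), max_span)   # geometric span length
        start = int(rng.integers(0, len(tokens)))          # random span start
        for i in range(start, min(start + span_len, len(tokens))):
            if masked[i] != "[MASK]":
                masked[i] = "[MASK]"
                n_masked += 1
    return masked

print(span_mask("the quick brown fox jumps over the lazy dog".split()))
```

Compared with the knowledge masking sketched earlier, the masked unit here is defined purely by position and length rather than by entity or phrase boundaries.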