Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL) 2019
DOI: 10.18653/v1/k19-1030

Improving Pre-Trained Multilingual Model with Vocabulary Expansion

Abstract: Recently, pre-trained language models have achieved remarkable success in a broad range of natural language processing tasks. However, in a multilingual setting, it is extremely resource-consuming to pre-train a deep language model over large-scale corpora for each language. Instead of exhaustively pre-training monolingual language models independently, an alternative solution is to pre-train a powerful multilingual deep language model over large-scale corpora in hundreds of languages. However, the vocabulary si…
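As a rough illustration of the vocabulary-expansion setting the abstract describes, the sketch below adds new wordpieces to a pre-trained multilingual BERT tokenizer and resizes the embedding matrix accordingly. It assumes the Hugging Face transformers library and hypothetical example tokens; it is a generic sketch, not the paper's own approach.

```python
# Minimal sketch of vocabulary expansion, assuming the Hugging Face
# "transformers" library; a generic illustration, not the paper's method.
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")

# Hypothetical out-of-vocabulary words for some target language or domain.
new_tokens = ["wortschatz", "erweiterung"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix so the new tokens receive (randomly initialized)
# rows; in practice these rows must then be learned or initialized from
# aligned vectors.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```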

Cited by 18 publications (18 citation statements)
References 58 publications (50 reference statements)
“…Our vector space alignment strategy is inspired by cross-lingual word vector alignment (e.g., Mikolov et al. (2013b); Smith et al. (2017)). A related method was recently applied by Wang et al. (2019a) to map cross-lingual word vectors into the multilingual BERT wordpiece vector space.…”
Section: Vector Space Alignment
Citation type: mentioning (confidence: 99%)
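The cross-lingual alignment this excerpt refers to (Mikolov et al., 2013b; Smith et al., 2017) learns a linear map between two embedding spaces from a seed dictionary of translation pairs. Below is a minimal NumPy sketch of the orthogonal (Procrustes) variant; the variable names and random data are illustrative assumptions, not the cited papers' code.

```python
import numpy as np

def procrustes_align(X, Y):
    """Orthogonal map W minimizing ||X @ W - Y||_F (Procrustes solution).

    X, Y: (n, d) arrays whose i-th rows embed the same seed-dictionary
    entry in the source and target vector spaces, respectively.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Illustrative data: 5,000 hypothetical translation pairs in 300-d spaces.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 300))
Y = rng.normal(size=(5000, 300))

W = procrustes_align(X, Y)
mapped = X @ W  # source-space vectors expressed in the target space
```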
“…Moreover, SciBERT (Beltagy et al., 2019) found that an in-domain vocabulary is helpful but not significantly so; we attribute this to the inefficiency of implicitly learning an in-domain vocabulary. To represent OOV words in multilingual settings, the mixture mapping method (Wang et al., 2019) utilized a mixture of English subword embeddings, but it has been shown to be ineffective for domain-specific words by Tai et al. (2020). ExBERT (Tai et al., 2020) applied an extension module to adapt an augmenting embedding for the in-domain vocabulary, but it still requires extensive continued pre-training.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
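For context, the mixture mapping idea mentioned in this excerpt represents a word outside the pre-trained vocabulary as a weighted combination of existing subword embeddings. The sketch below is a simplified, similarity-weighted version with softmax weights; the alignment step that Wang et al. (2019) perform before mixing, and the temperature value, are assumptions made for illustration.

```python
import numpy as np

def mixture_map(word_vec, subword_embs, temperature=0.1):
    """Approximate an out-of-vocabulary word as a softmax-weighted mixture
    of pre-trained subword embeddings (simplified sketch).

    word_vec: (d,) vector for the OOV word, assumed already mapped into
              the subword embedding space.
    subword_embs: (V, d) pre-trained subword embedding matrix.
    """
    sims = subword_embs @ word_vec / (
        np.linalg.norm(subword_embs, axis=1) * np.linalg.norm(word_vec) + 1e-9
    )
    weights = np.exp(sims / temperature)
    weights /= weights.sum()
    return weights @ subword_embs  # (d,) mixture embedding

# Illustrative usage with random data (768-d embeddings, 1,000 subwords).
rng = np.random.default_rng(0)
emb = mixture_map(rng.normal(size=768), rng.normal(size=(1000, 768)))
```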
“…Here, we add new tokens to the vocabulary and increase the model size, motivated by prior work [29]. Since this increases the number of network parameters, these models are used as a secondary baseline to be compared with the surrogates.…”
Section: Additional Tokens
Citation type: mentioning (confidence: 99%)
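The parameter growth this excerpt treats as a secondary baseline is easy to quantify: each token appended to the vocabulary adds one embedding row of hidden_size parameters. A small sketch, again assuming the Hugging Face transformers API and a hypothetical count of 1,000 new tokens:

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-multilingual-cased")
before = model.num_parameters()

# Hypothetical: append 1,000 new tokens to the vocabulary.
n_new = 1000
model.resize_token_embeddings(model.config.vocab_size + n_new)
after = model.num_parameters()

# The increase equals n_new * hidden_size (768 for multilingual BERT).
print(f"Parameters: {before:,} -> {after:,} (+{after - before:,})")
```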