“…For NER, PhoBERT_large produces 1.1 points higher F1 than PhoBERT_base. In addition, PhoBERT_base obtains 2+ points higher F1 than the previous SOTA feature- and neural-network-based models VnCoreNLP-NER and BiLSTM-CNN-CRF (Ma and Hovy, 2016).

NER (F1):
    BiLSTM-CNN-CRF (Ma and Hovy, 2016)        88.6
    VNER (Nguyen et al., 2019b)               89.6
    BiLSTM-CNN-CRF + ETNLP [♠]                91.1
    VnCoreNLP-NER + ETNLP [♠]                 91.3
    XLM-R_base (our result)                   92.0
    XLM-R_large (our result)                  92.8
    PhoBERT_base                              93.6
    PhoBERT_large                             94.7

NLI (accuracy):
    BiLSTM-max (Conneau et al., 2018)         66.4
    mBiLSTM (Artetxe and Schwenk, 2019)       72.0
    multilingual BERT (Devlin et al., 2019)   69.5
    XLM MLM+TLM (Conneau and Lample, 2019)    76.6
    XLM-R_base (Conneau et al., 2020)         75.4
    XLM-R_large (Conneau et al., 2020)        79.7
    PhoBERT_base                              78.5
    PhoBERT_large                             80.0

[♠]: models trained with the set of 15K BERT-based ETNLP word embeddings (Vu et al., 2019).…”
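The NER scores above are F1 values over predicted entities. As a minimal sketch (not the paper's actual evaluation script), entity-level F1 can be computed by treating each entity as a (start, end, type) span and counting exact matches between gold and predicted sets; the spans below are hypothetical:

```python
def entity_f1(gold, pred):
    """Entity-level F1: an entity counts as correct only if its
    span boundaries and type both match a gold entity exactly."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                          # exact span+type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: 3 gold entities, 2 predicted correctly,
# 1 predicted with the wrong type (LOC mislabeled as ORG).
gold = [(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG")]
pred = [(0, 2, "PER"), (5, 6, "ORG"), (9, 11, "ORG")]
print(round(entity_f1(gold, pred), 4))  # precision = recall = 2/3
```

In practice, benchmark evaluations typically use a standard scorer (e.g. the CoNLL script or seqeval) rather than a hand-rolled metric, but the counting logic is the same.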