PhoBERT: Pre-trained language models for Vietnamese

Nguyen, Dat Quoc; Nguyen, Anh Tuan

doi:10.18653/v1/2020.findings-emnlp.92

Cited by 206 publications

(86 citation statements)

References 23 publications

“…In this paper, we propose a model based on the BERT architecture. We use the BERT architecture published in the research of VinAI [16]. The PhoBERT model is optimized using the RoBERTa training procedure and is trained on 20GB of Vietnamese text data.…”

Section: Model Architecture (unclassified)

“…The PhoBERT model is optimized using the RoBERTa training procedure and is trained on 20GB of Vietnamese text data. The results published in [16] show that using a BERT model as the word-embedding layer yields better results than other deep learning methods, because BERT represents words in context better than earlier traditional word-embedding methods such as word2vec or GloVe.…”

Section: Model Architecture (unclassified)

“…For the BERT model, we use PhoBERT [16] with a hidden size of 768 dimensions and a total of 12 transformer layers. The model's learning rate was tested over the values 2e-5, 3e-5, and 4e-5, and the best value, 2e-5, was selected.…”

Section: Implementation Details (unclassified)
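
The learning-rate selection described in the statement above can be sketched roughly as follows. This is a minimal illustration, not the citing authors' code: `FineTuneConfig`, `select_learning_rate`, and the `evaluate` callback are hypothetical names, and `evaluate` stands in for whatever fine-tuning and validation routine a reader supplies. Only the hidden size, layer count, and candidate rates come from the quoted text.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class FineTuneConfig:
    # Values quoted in the citation statement above.
    hidden_size: int = 768
    num_transformer_layers: int = 12
    # Hypothetical field: the learning rate used for one run.
    learning_rate: float = 2e-5


def select_learning_rate(
    candidates: Iterable[float],
    evaluate: Callable[[FineTuneConfig], float],
) -> Tuple[float, float]:
    """Return (best_rate, best_score) over the candidate learning rates.

    `evaluate` is a caller-supplied placeholder that fine-tunes with the
    given config and returns a validation score (higher is better).
    """
    best_rate, best_score = None, float("-inf")
    for rate in candidates:
        score = evaluate(FineTuneConfig(learning_rate=rate))
        if score > best_score:
            best_rate, best_score = rate, score
    if best_rate is None:
        raise ValueError("candidates must not be empty")
    return best_rate, best_score


# Candidate rates listed in the citation statement.
LEARNING_RATE_GRID = (2e-5, 3e-5, 4e-5)
```

In practice `evaluate` would wrap a full fine-tuning run on a validation split; the sketch only fixes the shape of the search.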

Applying the BERT Model to the Problem of Classifying Records by Retention Period (Ứng Dụng Mô Hình Bert Cho Bài Toán Phân Loại Hồ Sơ Theo Thời Hạn Bảo Quản)

Sáu¹,

Toanh²

2021

TNUJST

Archiving records at competent agencies and organizations is essential to managing and organizing document preservation. However, the growing number of archived records and the many different types of documents regulating archiving are now leading to document overload during archiving. Classifying records by retention period is therefore a very important step in preservation, helping to optimize the holdings of archive rooms and reduce document preservation costs. To help address this problem, this paper presents a study evaluating the effectiveness of the BERT model compared with traditional machine learning algorithms and deep learning models on real-world datasets of archived records labeled by retention period at agencies. The results show that the BERT model achieves the best results, with a precision of 93.10%, a recall of 90.68%, and an F1 score of 91.49%. These results indicate that applying the BERT model to build systems that support classifying records by retention period is entirely feasible.
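
As a quick worked check on how the three reported figures relate, the snippet below computes the harmonic mean of the reported precision and recall. The function name is illustrative. The abstract does not say how its F1 was averaged, so the comparison is only a sanity check, not a statement about the authors' method.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported in the abstract, in percent.
reported_precision = 93.10
reported_recall = 90.68
reported_f1 = 91.49

harmonic_f1 = f1_from_precision_recall(reported_precision, reported_recall)
print(f"harmonic mean: {harmonic_f1:.2f}, reported F1: {reported_f1:.2f}")
# prints "harmonic mean: 91.87, reported F1: 91.49"
```

The small gap would be consistent with per-class (macro) averaging, where the averaged F1 need not equal the harmonic mean of the averaged precision and recall; that reading is an assumption, since the abstract does not state the averaging scheme.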

“…However, there have been studies [12] which show that monolingual models are generally more performant than multilingual models due to the differing sizes of pretraining data and a more accurate tokenization scheme [11]. This is very much apparent in pre-trained monolingual models in various languages, such as IndoBERT [30] for Indonesian, PhoBERT [31] for Vietnamese, WangchanBERTa [32] for Thai, whereby these monolingual models constantly outperform their multilingual counterparts in downstream tasks.…”

Section: Sundanese Language Modeling (mentioning)

confidence: 99%

Pre-Trained Transformer-Based Language Models for Sundanese

Wongso

Lucky

Suhartono

2021

Preprint

The Sundanese language has over 32 million speakers worldwide, but the language has reaped little to no benefits from the recent advances in natural language understanding. Like other low-resource languages, the only alternative is to fine-tune existing multilingual models. In this paper, we pre-trained three monolingual Transformer-based language models on Sundanese data. When evaluated on a downstream text classification task, we found that most of our monolingual models outperformed larger multilingual models despite the smaller overall pre-training data. In the subsequent analyses, our models benefited strongly from the Sundanese pre-training corpus size and do not exhibit socially biased behavior. We released our models for other researchers and practitioners to use.

“…The fourth model is W2V SentiWord in which the Word2vec vector is input to one channel and the sentiment word vector [21] is input to the other channel. The last model is W2V BERT in which two inputs to the two channels are the Word2vec vector and the BERT feature vector [15], respectively. First, these tables show that the accuracy of 2CV is remarkably better than IWV for both the LSTM-based and CNN-based models on all tested datasets.…”

Section: Performance Comparison (mentioning)

confidence: 99%

A Two-Channel Model for Representation Learning in Vietnamese Sentiment Classification Problem

Nguyen¹,

Vu²,

Nguyen³

2020

JCC

Sentiment classification (SC) aims to determine whether a document conveys a positive or negative opinion. Due to the rapid development of the digital world, SC has become an important research topic that affects many aspects of our life. In SC based on machine learning, the representation of the document strongly influences its accuracy. Word Embedding (WE)-based techniques, i.e., Word2vec techniques, have proved beneficial for the SC problem. However, Word2vec is often not enough to represent the semantics of Vietnamese documents with complex sentences. In this paper, we propose a new representation learning model called a two-channel vector to learn a higher-level feature of a document in SC. Our model uses two neural networks to learn the semantic feature, i.e., Word2vec, and the syntactic feature, i.e., Part of Speech tag (POS). The two features are then combined and input to a Softmax function to make the final classification. We carry out intensive experiments on 4 recent Vietnamese sentiment datasets to evaluate the performance of the proposed architecture. The experimental results demonstrate that the proposed model can significantly enhance the accuracy of SC problems compared to two single models and a state-of-the-art ensemble method.
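
The final step described in this abstract, combining two channel features and passing them to a softmax, might look roughly like the sketch below. It assumes that "combined" means concatenation followed by a single linear layer, which the abstract does not confirm; `classify_two_channel` and all numeric values are illustrative only, and the two input vectors stand in for the outputs of the semantic and syntactic networks.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_two_channel(
    semantic: Sequence[float],
    syntactic: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """Concatenate both channel features, apply a linear layer, then softmax."""
    combined = list(semantic) + list(syntactic)  # assumption: concatenation
    logits = [
        sum(w * x for w, x in zip(row, combined)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Toy inputs with arbitrary values: 2 semantic dims, 1 syntactic dim, 2 classes.
probs = classify_two_channel(
    semantic=[0.5, -0.2],
    syntactic=[1.0],
    weights=[[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]],
    bias=[0.0, 0.0],
)
```

Here `probs` holds one probability per class and sums to 1; in a trained model the weights and bias would be learned jointly with the two channel networks.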
