BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese

Tran, Nguyen Luong; Le, Duong Minh; Nguyen, Dat Quoc

doi:10.48550/arxiv.2109.09701

Cited by 2 publications

(1 citation statement)

References 0 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Furthermore, the phobert-base model is the small architecture that is adapted to such a small dataset as the VieCap4H dataset, leading to a quick training time, which helps us conduct more experiments. We also try PhoBERT-large, BARTPho-syllable and BARTPho-word [19] pre-trained models, but it does not seem to operate well. The reason may be that the large architectures are not suitable for the small dataset as VieCap4H (contains 8032 samples).…”

Section: Language Embeddingmentioning

confidence: 99%

vieCap4H Challenge 2021: A transformer-based method for Healthcare Image Captioning in Vietnamese

Cao

Trinh

Nguyen

et al. 2022

JCSCE

View full text Add to dashboard Cite

The automatic image caption generation is attractive to both Computer Vision and Natural Language Processing research community because it lies in the gap between these two fields. Within the vieCap4H contest organized by VLSP 2021, we participate and present a Transformer-based solution for image captioning in the healthcare domain. In detail, we use grid features as visual presentation and pre-training a BERT-based language model from PhoBERT-base pre-trained model to obtain language presentation used in the Adaptive Decoder module in the RSTNet model. Besides, we indicate a suitable schedule with the self-critical training sequence (SCST) technique to achieve the best results. Through experiments, we achieve an average of 30.3% BLEU score on the public-test round and 28.9% on the private-test round, which ranks 3rd and 4th, respectively. Source code is available at https://github.com/caodoanh2001/uit-vlsp-viecap4h-solution.

show abstract

Section: Language Embeddingmentioning

confidence: 99%

vieCap4H Challenge 2021: A transformer-based method for Healthcare Image Captioning in Vietnamese

Cao

Trinh

Nguyen

et al. 2022

JCSCE

View full text Add to dashboard Cite

show abstract

A Novel Pretrained General-purpose Vision Language Model for the Vietnamese Language

Vu,

Pham,

Tran

2024

ACM Trans. Asian Low-Resour. Lang. Inf. Process.

View full text Add to dashboard Cite

Lying in the cross-section of computer vision and natural language processing, vision language models are capable of processing images and text at once. These models are helpful in various tasks: text generation from image and vice versa, image-text retrieval, or visual navigation. Besides building a model trained on a dataset for a task, people also study general-purpose models to utilize many datasets for multitasks. Their two primary applications are image captioning and visual question answering. For English, large datasets and foundation models are already abundant. However, for Vietnamese, they are still limited. To expand the language range, this work proposes a pretrained general-purpose image-text model named VisualRoBERTa. A dataset of 600K images with captions (translated MS COCO 2017 from English to Vietnamese) is introduced to pretrain VisualRoBERTa. The model’s architecture is built using Convolutional Neural Network and Transformer blocks. Fine-tuning VisualRoBERTa shows promising results on the ViVQA dataset with 34.49% accuracy, 0.4173 BLEU 4, and 0.4390 RougeL (in visual question answering task), and best outcomes on the sViIC dataset with 0.6685 BLEU 4, 0.6320 RougeL (in image captioning task).

show abstract

BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese

Cited by 2 publications

References 0 publications

vieCap4H Challenge 2021: A transformer-based method for Healthcare Image Captioning in Vietnamese

vieCap4H Challenge 2021: A transformer-based method for Healthcare Image Captioning in Vietnamese

A Novel Pretrained General-purpose Vision Language Model for the Vietnamese Language

Contact Info

Product

Resources

About