“…In the Transformer architecture, we use $N_{encoder} = 3$, $N_{decoder} = 3$, $d_{model} = 512$, and 8 attention heads. During inference, we use three values $max\_length \in \{20, 22, 23\}$, apply beam search with $beam\_size \in \{3, 4, 5\}$, and use 50 submissions in the public-test round for evaluation. For the private test, we use three values $max\_length \in \{22, 23, 24\}$.…”
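A minimal sketch of this setup, assuming PyTorch; the decoding routine is a hypothetical placeholder, not the authors' code, and the private-test beam sizes are an assumption since the excerpt only states them for the public round.

```python
from itertools import product

import torch.nn as nn

# Architecture from the excerpt: 3 encoder layers, 3 decoder layers,
# d_model = 512, and 8 attention heads.
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=3,
    num_decoder_layers=3,
)

# Public-test inference grid: three max lengths crossed with three beam sizes.
public_grid = list(product([20, 22, 23], [3, 4, 5]))
# Private-test max lengths per the excerpt; beam sizes are not stated there,
# so reusing the public ones here is an assumption.
private_grid = list(product([22, 23, 24], [3, 4, 5]))

for max_length, beam_size in public_grid:
    # beam_search_decode is a hypothetical stand-in for the actual decoder call.
    # outputs = beam_search_decode(model, src, max_length=max_length,
    #                              beam_size=beam_size)
    print(f"decoding with max_length={max_length}, beam_size={beam_size}")
```

Enumerating the grid this way makes the sweep explicit: nine public-test configurations, each of which would produce one candidate submission toward the 50 allowed.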