2020 International Conference on Communication and Signal Processing (ICCSP) 2020
DOI: 10.1109/iccsp48568.2020.9182105
A Review on Automatic Image Captioning Techniques

Cited by 6 publications (4 citation statements); References 13 publications
“…The efficiency of the model has been compared on two different datasets, Flickr 8k and Flickr 30k, against state-of-the-art methods using different captioning metrics. The Flickr 8k and Flickr 30k datasets, as signified by their names, consist of 8,000 and 82,783 images, respectively, with five different captions for each image describing the salient entities and features [16]. Two search methods, beam search (with beamwidths of 5 and 3) and greedy search, have been computed on the output probabilities from the model to evaluate the BLEU scores for each dataset.…”
Section: Results and Analysis
confidence: 99%
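The beam and greedy decoding strategies mentioned in the citation above can be sketched as follows. This is a simplified illustration, not the cited model's implementation: it treats the per-step token distributions as fixed and independent, whereas a real caption decoder conditions each step on the partial caption generated so far.

```python
import math

def beam_search(step_probs, beam_width=5):
    """Decode a token sequence from per-step token probabilities.

    step_probs: list of dicts mapping token -> probability at each step
    (a simplification: real decoders condition each distribution on the
    tokens already generated). Returns the highest-scoring sequence.
    """
    beams = [([], 0.0)]  # (tokens so far, cumulative log-probability)
    for probs in step_probs:
        candidates = []
        for tokens, score in beams:
            for token, p in probs.items():
                candidates.append((tokens + [token], score + math.log(p)))
        # keep only the beam_width best partial captions
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]

def greedy_search(step_probs):
    """Greedy decoding: take the single most likely token at each step."""
    return [max(probs, key=probs.get) for probs in step_probs]
```

With beamwidth 1, beam search reduces to greedy search; wider beams (such as the 5 and 3 used above) keep several partial captions alive, which in a real conditioned decoder can recover sequences a greedy choice would discard.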
“…Faster R-CNN can also be incorporated to generate text descriptions of a specific image in sequence with LSTM and RNN, which resulted in a BLEU-1 score of about 59.8 [15]. The Flickr 30k dataset has established its importance and advantage in the field of automatic image captioning (AIC), being preferred for depicting the most remarkable data in images by finding the relationships among the different objects present in an image [16]. A hybrid bidirectional LSTM approach with CNN [17] has also shown significant outcomes for image feature extraction and captioning on the full Flickr datasets, achieving a true positive rate of 86% and a false positive rate of 10%.…”
Section: Introduction
confidence: 99%
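The true positive and false positive rates quoted for [17] follow the standard definitions from a binary confusion matrix. A minimal sketch, with hypothetical boolean predictions and labels:

```python
def tpr_fpr(predictions, labels):
    """True positive rate and false positive rate from boolean
    predictions and ground-truth labels of equal length."""
    tp = sum(1 for p, y in zip(predictions, labels) if p and y)
    fp = sum(1 for p, y in zip(predictions, labels) if p and not y)
    fn = sum(1 for p, y in zip(predictions, labels) if not p and y)
    tn = sum(1 for p, y in zip(predictions, labels) if not p and not y)
    return tp / (tp + fn), fp / (fp + tn)
```

A TPR of 86% with an FPR of 10%, as reported, means 86% of true items were recovered while only 10% of negatives were wrongly flagged.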
“…This comparison spans several evaluation metrics, including BLEU (1 to 4), CIDEr and METEOR. In [18], the paper puts the spotlight on some of the advancements in the image captioning task up to early 2020, discussing various approaches including N-cut, color-based segmentation and hybrid engines. It also discusses how model engineering and incorporating more hyper-parameters improve the overall pipeline and yield the best accuracy for such models.…”
Section: Related Work
confidence: 99%
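Of the metrics named above, BLEU-1 is the simplest: clipped unigram precision of the candidate caption against the reference captions. A minimal sketch of that core computation (hypothetical captions; it omits BLEU's brevity penalty and the higher-order n-grams used by BLEU-2 through BLEU-4):

```python
from collections import Counter

def bleu1(references, candidate):
    """Clipped unigram precision (the core of BLEU-1).

    references: list of tokenized reference captions for one image;
    candidate: tokenized candidate caption. Each candidate token is
    credited at most as many times as it appears in any single
    reference ("clipping"), so repeating a word cannot inflate the score.
    """
    cand_counts = Counter(candidate)
    max_ref = Counter()
    for ref in references:
        for tok, n in Counter(ref).items():
            max_ref[tok] = max(max_ref[tok], n)
    clipped = sum(min(n, max_ref[tok]) for tok, n in cand_counts.items())
    return clipped / len(candidate)
```

CIDEr and METEOR extend this idea with TF-IDF-weighted n-grams and synonym/stem matching, respectively, which is why reviews typically report all three.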
“…As a popular challenge involving sequence modeling, photo caption generation has attracted various state-of-the-art (SOTA) approaches. For example, a Convolutional Neural Network (CNN, also called ConvNet) is combined with a language architecture such as a Recurrent Neural Network (RNN) in a CNN-RNN-based framework [3]. This work uses the standard encoder-decoder architecture: a pre-trained CNN builds feature vectors, which are then fed into an RNN decoder that generates the language description.…”
Section: Introduction
confidence: 99%
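The encoder-decoder data flow described above can be sketched at the shape level. This is an untrained toy with random weights, standing in for the real components: the "encoder" replaces a pre-trained CNN with a random projection, and the "decoder" is a single hand-rolled RNN cell whose hidden state is initialized from the image features; the tiny vocabulary is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<start>", "a", "dog", "runs", "<end>"]
feat_dim = hid_dim = 8          # feature vector initializes the hidden state
vocab_size = len(vocab)

# "Encoder": stand-in for a pre-trained CNN mapping an image to a
# fixed-length feature vector (here just a random projection of pixels).
W_enc = rng.normal(size=(16, feat_dim))
def encode(image_pixels):
    return np.tanh(image_pixels @ W_enc)

# "Decoder": a minimal RNN cell emitting one token per step.
W_h = rng.normal(size=(hid_dim, hid_dim))
W_x = rng.normal(size=(vocab_size, hid_dim))
W_out = rng.normal(size=(hid_dim, vocab_size))

def decode(features, max_len=5):
    h = features                 # image features seed the hidden state
    token = 0                    # index of <start>
    caption = []
    for _ in range(max_len):
        x = np.eye(vocab_size)[token]        # one-hot embedding of last token
        h = np.tanh(h @ W_h + x @ W_x)       # recurrent update
        token = int(np.argmax(h @ W_out))    # greedy token choice
        if vocab[token] == "<end>":
            break
        caption.append(vocab[token])
    return caption

image = rng.normal(size=16)
caption = decode(encode(image))
```

In a trained system the random matrices would be learned end to end (or the CNN taken from a pre-trained classifier), and the greedy `argmax` would typically be replaced by the beam search discussed earlier.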