Proceedings of the 28th ACM International Conference on Multimedia 2020
DOI: 10.1145/3394171.3413753

Multimodal Attention with Image Text Spatial Relationship for OCR-Based Image Captioning

Abstract: OCR-based image captioning is the task of automatically describing images based on reading and understanding written text contained in images. Compared to conventional image captioning, this task is more challenging, especially when the image contains multiple text tokens and visual objects. The difficulties originate from how to make full use of the knowledge contained in the textual entities to facilitate sentence generation and how to predict a text token based on the limited information provided by the ima…

Cited by 48 publications (35 citation statements); references 33 publications.
“…Moreover, many complex attention mechanisms have been proposed. [24] presents a multimodal attention network to manage information from different modalities. [25] proposes a multistage attention mechanism which operates in a coarse-to-fine manner to ensure global consistency and local accuracy.…”
Section: B. Attention Mechanism
confidence: 99%
“…Visual-based image captioning models exploit features generated from images. Multimodal image captioning approaches exploit other modes of features in addition to image-based features, such as candidate captions and text detected in images (Wang et al., 2020).…”
Section: Related Work
confidence: 99%
“…Images taken by the blind may have quality issues, such as overexposure, but can better represent the real use case for visually impaired people. Based on these two datasets, several models [29, 31, 33, 35] have been proposed to improve the quality of text-aware captions. Wang et al. [29] propose to encode the intrinsic spatial relationships between OCR tokens to generate more complete scene-text information.…”
Section: Related Work
confidence: 99%
“…With the goal of describing the visual world to visually impaired people, it is essential to comprehend such scene text beyond pure visual recognition [2,10,14,18,23,28,32]. Therefore, more recent works focus on the text-aware image captioning task [24,29,31,33,35], which aims to describe an image in natural sentences covering the scene-text information in the image.…”
Section: Introduction
confidence: 99%