Proceedings of the 28th ACM International Conference on Multimedia 2020
DOI: 10.1145/3394171.3413753

Multimodal Attention with Image Text Spatial Relationship for OCR-Based Image Captioning

Abstract: OCR-based image captioning is the task of automatically describing images based on reading and understanding written text contained in images. Compared to conventional image captioning, this task is more challenging, especially when the image contains multiple text tokens and visual objects. The difficulties originate from how to make full use of the knowledge contained in the textual entities to facilitate sentence generation and how to predict a text token based on the limited information provided by the ima…

Cited by 48 publications (35 citation statements); references 33 publications.
“…Moreover, many complex attention mechanisms have been proposed. [24] presents a multimodal attention network to manage information from different modalities. [25] proposes a multistage attention mechanism which operates in a coarse-to-fine manner to ensure global consistency and local accuracy.…”
Section: B. Attention Mechanism
confidence: 99%
“…Visual-based image captioning models exploit features generated from images. Multimodal image captioning approaches exploit other modes of features in addition to image-based features, such as candidate captions and text detected in images (Wang et al., 2020).…”
Section: Related Work
confidence: 99%
“…Images taken by the blind may have quality issues, such as overexposure, but can better represent the real use case for visually impaired people. Based on these two datasets, several models [29, 31, 33, 35] have been proposed to improve the quality of text-aware captions. Wang et al. [29] propose to encode the intrinsic spatial relationships between OCR tokens to generate more complete scene-text information.…”
Section: Related Work
confidence: 99%
“…With the goal of describing the visual world to visually impaired people, it is essential to comprehend such scene text beyond pure visual recognition [2,10,14,18,23,28,32]. Therefore, more recent works focus on the text-aware image captioning task [24,29,31,33,35], which aims to describe an image in natural sentences covering the scene-text information in the image.…”
Section: Introduction
confidence: 99%