2019
DOI: 10.48550/arxiv.1906.05963
Preprint

Image Captioning: Transforming Objects into Words

Abstract: Image captioning models typically follow an encoder-decoder architecture which uses abstract image feature vectors as input to the encoder. One of the most successful algorithms uses feature vectors extracted from the region proposals obtained from an object detector. In this work we introduce the Object Relation Transformer, that builds upon this approach by explicitly incorporating information about the spatial relationship between input detected objects through geometric attention. Quantitative and qualitat…
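The geometric attention described in the abstract scores pairs of detected regions using both their appearance features and their relative bounding-box geometry. The snippet below is a minimal PyTorch sketch of that idea, assuming region features plus (cx, cy, w, h) boxes; the feature dimensions, the geometry encoding, and the function names are illustrative and are not the paper's released implementation.

import torch
import torch.nn.functional as F


def relative_geometry(boxes):
    # Pairwise geometry features between boxes given as (cx, cy, w, h).
    cx, cy, w, h = boxes.unbind(-1)
    dx = torch.log((cx[:, None] - cx[None, :]).abs().clamp(min=1e-3) / w[:, None])
    dy = torch.log((cy[:, None] - cy[None, :]).abs().clamp(min=1e-3) / h[:, None])
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    return torch.stack([dx, dy, dw, dh], dim=-1)   # (N, N, 4)


def geometric_attention(appearance, boxes, w_q, w_k, w_g):
    # Appearance-based attention logits, modulated by learned geometric weights.
    q, k = appearance @ w_q, appearance @ w_k
    logits = (q @ k.T) / (q.shape[-1] ** 0.5)                   # (N, N) appearance logits
    geo = F.relu(relative_geometry(boxes) @ w_g).squeeze(-1)    # (N, N) geometric weights
    weights = F.softmax(logits + torch.log(geo.clamp(min=1e-6)), dim=-1)
    return weights @ appearance                                  # attended region features


# Toy usage: 3 detected regions with 8-dim appearance features.
torch.manual_seed(0)
feats = torch.randn(3, 8)
boxes = torch.tensor([[0.3, 0.4, 0.2, 0.3],
                      [0.6, 0.5, 0.3, 0.4],
                      [0.5, 0.7, 0.1, 0.2]])
out = geometric_attention(feats, boxes,
                          torch.randn(8, 8), torch.randn(8, 8), torch.randn(4, 1))
print(out.shape)   # torch.Size([3, 8])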

Cited by 22 publications (35 citation statements)
References 19 publications (35 reference statements)
“…To further leverage the visual cues, an attention mechanism is usually utilized [4,6,42] to focus on specific visual features. Moreover, recent models apply self-attention [16,43] or use an expressive visual Transformer [12] as an encoder [23]. Our work uses the expressive embedding of CLIP for visual representation.…”
Section: Related Work
confidence: 99%
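The excerpt above mentions using CLIP's embedding as the visual representation. A minimal sketch with the Hugging Face transformers CLIP API might look as follows; the checkpoint name and image path are placeholders, and the citing work's actual pipeline may differ.

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                       # placeholder image path
inputs = processor(images=image, return_tensors="pt")
image_embedding = model.get_image_features(**inputs)    # pooled visual representation, shape (1, 512)
print(image_embedding.shape)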
“…To produce the caption itself, a textual decoder is employed. Early works have used LSTM variants [8,38,39], while recent works [16,26] adopted the improved transformer architecture [36]. Built upon the transformer, one of the most notable works is BERT [11], demonstrating the dominance of the newly introduced paradigm.…”
Section: Related Work
confidence: 99%
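The excerpt contrasts earlier LSTM decoders with transformer-based caption decoders. Below is a minimal PyTorch sketch of a transformer decoder attending over encoded image features to predict the next caption token; the vocabulary size, model width, and number of regions are placeholder values, not taken from any cited work.

import torch
import torch.nn as nn

vocab_size, d_model = 1000, 256
embed = nn.Embedding(vocab_size, d_model)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=2,
)
to_logits = nn.Linear(d_model, vocab_size)

image_feats = torch.randn(1, 36, d_model)                  # encoded region features (memory)
caption_prefix = torch.randint(0, vocab_size, (1, 5))      # tokens generated so far
causal_mask = torch.triu(torch.full((5, 5), float("-inf")), diagonal=1)

hidden = decoder(embed(caption_prefix), image_feats, tgt_mask=causal_mask)
next_token_logits = to_logits(hidden[:, -1])               # scores for the next word
print(next_token_logits.shape)                             # torch.Size([1, 1000])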
“…There is a rich literature on image captioning studying different model structures and learning approaches. Recent works have proposed enhanced attention-based models to improve the performance, such as ORT [14], AoANet [16], M² Transformer [8], X-LAN [37], and RSTNet [57]. Besides, researchers have explored to leverage semantic attributes [26,53], scene graphs [50], and graph convolutional networks [52] for captioning.…”
Section: A Related Work On Image Captioning
confidence: 99%
“…Prior captioning approaches have involved attention mechanisms and their variants to capture spatial relationship between objects [6] for generating captions. In this paper, we explore both object detection (image tagging) and captioning as features for assisting the ad text-to-image query task.…”
Section: Textual Description Of Images: Object Detection and Image Ca...
confidence: 99%
“…For each image the model returns a list of inferred tags with confidence scores (we used tags with confidence above 0.8). For captioning, we used an object relation transformer (ORT) model [6] with a Faster R-CNN object detector. For our analysis, we trained two ORT models: one on Microsoft COCO 2014 Captions dataset [13] (COCO-captions model), and the other on Conceptual Captions dataset [21] (CC-captions model).…”
Section: Image Tags and Caption Metadata
confidence: 99%
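As a tiny illustration of the tag-filtering step described above (keeping only tags whose confidence exceeds 0.8), a sketch might look like this; the tag names, scores, and function name are invented for illustration.

def filter_tags(tagged, threshold=0.8):
    # Keep tag names whose detector confidence exceeds the threshold.
    return [tag for tag, score in tagged if score > threshold]

tags = [("dog", 0.95), ("frisbee", 0.82), ("grass", 0.61)]
print(filter_tags(tags))   # ['dog', 'frisbee']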