2019
DOI: 10.1016/j.jvcir.2018.12.027
Scene graph captioner: Image captioning based on structural visual representation

Cited by 80 publications (33 citation statements)
References 17 publications
“…representations have been used for multiple computer vision applications, including image retrieval [52], image captioning [53], and image generation [54]. In our work, we employ the rule-based scene graph parser from [55] to extract object candidates and their corresponding attributes from captions, and make use of this information to train a localization network for WSSS.…”
Section: F Scene Graphsmentioning
confidence: 99%
“…These works can be sorted by attention network structure and by how the attention weights are calculated. (1) Single-layer vs. multi-layer: some studies (e.g., [3,4,19-21]) employ a single-layer implementation of an attention mechanism, taking the hidden state as the query vector to extract visual features at each step, while the studies [8,16,17,22,23] chose a multi-layer attention implementation in their decoder. (2) Involving extra clues in attention weights or not: for example, studies [11,24,25] obtained their attention weights only from previous attention calculations, while the authors of [18] combined geometry clues with previously calculated weights into the final attention weights.…”
Section: Related Workmentioning
confidence: 99%
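The single-layer variant described above can be sketched as standard additive attention, with the decoder hidden state acting as the query over visual feature regions. This is a minimal illustrative sketch, not the implementation from any cited study; all parameter names (`W_f`, `W_h`, `w`) are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def single_layer_attention(features, hidden, W_f, W_h, w):
    """One-layer additive attention: the decoder hidden state is the
    query over the visual feature regions.

    features: (num_regions, feat_dim) visual features
    hidden:   (hid_dim,) decoder hidden state (the query)
    W_f, W_h, w: learned projections (random arrays here, for illustration)
    Returns the attended context vector and the attention weights.
    """
    # Project regions and query into a shared space, score each region.
    scores = np.tanh(features @ W_f + hidden @ W_h) @ w  # (num_regions,)
    alpha = softmax(scores)                              # attention weights
    context = alpha @ features                           # weighted sum of regions
    return context, alpha

rng = np.random.default_rng(0)
R, D, H, A = 5, 8, 6, 4  # regions, feature dim, hidden dim, attention dim
feats = rng.normal(size=(R, D))
h = rng.normal(size=H)
ctx, alpha = single_layer_attention(
    feats, h,
    rng.normal(size=(D, A)), rng.normal(size=(H, A)), rng.normal(size=A),
)
```

A multi-layer variant would stack this step in the decoder, re-attending at each layer; the geometry-augmented variant of [18] would mix extra positional scores into `alpha` before normalization.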
“…Recently, a growing number of researchers have engaged in generating scene graphs [40], where each node denotes an object and each edge represents the relationship between a pair of nodes. Due to its comprehensive and coherent visually-grounded knowledge, the scene graph can contribute to a variety of AI tasks, such as image retrieval [13], image captioning [41,45], image generation [12], and visual Q&A [21,25,36]. The scene graph is formed of directed edges, each connecting two objects as a ⟨subject-predicate-object⟩ triplet.…”
Section: Introductionmentioning
confidence: 99%
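The structure described in the last excerpt (objects as nodes, directed ⟨subject-predicate-object⟩ triplets as edges, optionally with per-object attributes) can be represented very compactly. This is a hypothetical minimal sketch for illustration only; the class and method names are not from the cited works.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """Minimal scene graph: objects carry attribute lists, and directed
    edges are stored as <subject-predicate-object> triplets.
    (Illustrative structure; names are hypothetical.)"""
    objects: dict = field(default_factory=dict)   # object name -> attributes
    triplets: list = field(default_factory=list)  # (subject, predicate, object)

    def add_object(self, name, attributes=()):
        self.objects[name] = list(attributes)

    def add_relation(self, subject, predicate, obj):
        self.triplets.append((subject, predicate, obj))

g = SceneGraph()
g.add_object("man", attributes=["tall"])
g.add_object("horse", attributes=["brown"])
g.add_relation("man", "riding", "horse")
print(g.triplets)  # [('man', 'riding', 'horse')]
```

A rule-based parser like the one mentioned in the first excerpt would populate such a structure from a caption ("a tall man riding a brown horse") rather than from manual calls.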