Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018
DOI: 10.18653/v1/n18-1198

Object Counts! Bringing Explicit Detections Back into Image Captioning

Abstract: The use of explicit object detectors as an intermediate step to image captioning, which used to constitute an essential stage in early work, is often bypassed in the currently dominant end-to-end approaches, where the language model is conditioned directly on a mid-level image embedding. We argue that explicit detections provide rich semantic information, and can thus be used as an interpretable representation to better understand why end-to-end image captioning systems work well. We provide an in-depth analysis…
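To make the abstract's contrast concrete (a caption decoder conditioned on explicit detection output rather than only a mid-level CNN embedding), here is a minimal PyTorch sketch. It is an illustration, not the authors' model, and every class, argument, and dimension name below is hypothetical.

```python
import torch
import torch.nn as nn

class CountConditionedCaptioner(nn.Module):
    """Toy caption decoder whose initial LSTM state encodes explicit
    per-class object counts instead of a mid-level CNN embedding."""

    def __init__(self, num_classes, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.count_proj = nn.Linear(num_classes, hidden_dim)  # counts -> h0
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, object_counts, captions):
        # object_counts: (batch, num_classes) detector-derived count vector
        # captions: (batch, seq_len) token ids of the shifted gold caption
        h0 = torch.tanh(self.count_proj(object_counts)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        states, _ = self.lstm(self.word_embed(captions), (h0, c0))
        return self.out(states)  # (batch, seq_len, vocab_size) logits

# Usage: 80 detector classes (a COCO-sized label set), with three
# detections of class 17 in the first image of the batch.
model = CountConditionedCaptioner(num_classes=80, vocab_size=10000)
counts = torch.zeros(2, 80)
counts[0, 17] = 3.0
logits = model(counts, torch.randint(0, 10000, (2, 12)))
```

The design choice this sketch highlights is interpretability: because the conditioning vector is indexed by object class, one can read off exactly which detections the decoder was given, which is harder with an opaque image embedding.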


Cited by 28 publications (17 citation statements: 2 supporting, 15 mentioning, 0 contrasting); references 24 publications. Citing publications span 2018–2024.
“…Although our representation is really sparse on the object interactions, it captures the basic concept of the presence of more than one object of the same kind, and thus provides extra information. A similar trend was observed by Wang et al. [29], who further explored encoding the geometric and size information of objects into the representation, and by Yin and Ordonez [33], who learn interactions using a specified object-layout RNN.…”
Section: Image Captioning Results (supporting)
confidence: 70%
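The "more than one object of the same kind" signal in the statement above amounts to a per-class count vector built from a detector's output. The following is a small hedged sketch in plain Python; the (label, score) pair format and the 0.5 score threshold are assumptions for illustration, not details from the cited papers.

```python
from collections import Counter

def count_vector(detections, num_classes, score_threshold=0.5):
    """Turn raw detections, given as (class_label, confidence_score) pairs,
    into a fixed-length per-class count vector (hypothetical format)."""
    counts = Counter(label for label, score in detections
                     if score >= score_threshold)
    return [float(counts[c]) for c in range(num_classes)]

# Example: two detections of class 16 and one of class 29 pass the threshold,
# while the low-confidence class-0 detection is dropped.
dets = [(16, 0.92), (16, 0.81), (29, 0.77), (0, 0.30)]
vec = count_vector(dets, num_classes=80)  # vec[16] == 2.0, vec[29] == 1.0
```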
“…Our conclusion (Section 5) is that it is feasible to frame NLI as a generation task and that incorporating non-linguistic information is potentially profitable. However, in line with recent research evaluating Vision-Language (VL) models (Shekhar et al., 2017; Wang et al., 2018; Vu et al., 2018; Tanti et al., 2019), we also find that current architectures are unable to ground textual representations in image data sufficiently.…”
Section: Introduction (supporting)
confidence: 86%
“…This is especially the case since neural approaches to fundamental computer vision (CV) tasks have yielded significant improvements (LeCun et al., 2015), while also making it possible to use pretrained CV models in multimodal neural architectures, for example in tasks such as image captioning (Bernardi et al., 2016). However, recent work has cast doubt on the extent to which such models are truly exploiting image features in a multimodal space (Shekhar et al., 2017; Wang et al., 2018; Tanti et al., 2019). Indeed, Vu et al. (2018) also find that image data contributes less than expected to determining the semantic relationship between premise-hypothesis pairs in the classic RTE labelling task.…”
mentioning
confidence: 99%
“…Notable advances have been made in conditioning image captioning on semantic priors of objects by using object detectors [18,30]. This conditioning is only limited to the objects (or nouns) in the caption and ignores the remainder, while our POS approach achieves coordination for the entire sentence.…”
Section: Image Captioning (mentioning)
confidence: 99%