Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017)
DOI: 10.18653/v1/d17-1098
Guided Open Vocabulary Image Captioning with Constrained Beam Search

Abstract: Existing image captioning models do not generalize well to out-of-domain images containing novel scenes or objects. This limitation severely hinders the use of these models in real-world applications dealing with images in the wild. We address this problem using a flexible approach that enables existing deep captioning architectures to take advantage of image taggers at test time, without re-training. Our method uses constrained beam search to force the inclusion of selected tag words in the output, and fixed, …
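The abstract's central mechanism is a beam search in which hypotheses are grouped by which of the required tag words they have already produced, and only hypotheses that cover all tags can be returned. The snippet below is a minimal sketch of that idea, not the authors' implementation: the toy scorer `toy_next_word_logprobs`, the tiny vocabulary, and the beam size are assumptions made purely for illustration.

```python
import math

VOCAB = ["a", "dog", "cat", "sitting", "on", "the", "grass", "<eos>"]

def toy_next_word_logprobs(prefix):
    """Stand-in for a captioning decoder: returns a uniform distribution."""
    p = 1.0 / len(VOCAB)
    return {w: math.log(p) for w in VOCAB}

def constrained_beam_search(constraints, beam_size=3, max_len=8):
    """Keep a separate beam for every subset of satisfied constraint words and
    only return hypotheses from the beam in which all constraints appear."""
    # state key = frozenset of constraint words already emitted
    beams = {frozenset(): [((), 0.0)]}  # hypothesis = (tokens, cumulative log-prob)
    for _ in range(max_len):
        new_beams = {}
        for satisfied, hyps in beams.items():
            for tokens, score in hyps:
                # finished hypotheses are carried over unchanged
                if tokens and tokens[-1] == "<eos>":
                    new_beams.setdefault(satisfied, []).append((tokens, score))
                    continue
                for w, lp in toy_next_word_logprobs(tokens).items():
                    new_sat = satisfied | {w} if w in constraints else satisfied
                    new_beams.setdefault(new_sat, []).append((tokens + (w,), score + lp))
        # prune each constraint-state beam independently
        beams = {k: sorted(v, key=lambda h: h[1], reverse=True)[:beam_size]
                 for k, v in new_beams.items()}
    # only hypotheses that include every constraint word are valid outputs
    return beams.get(frozenset(constraints), [])

if __name__ == "__main__":
    for tokens, score in constrained_beam_search({"dog", "grass"}):
        print(" ".join(tokens), round(score, 2))
```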

Citation summary: cited by 167 publications (213 citation statements, 0 supporting / 213 mentioning / 0 contrasting); citing publications span 2019–2024. The paper lists 31 references.
“…The dimension of the image feature is 4,096. In order to introduce the novel objects into the final captions, a popular open-source pre-trained Faster-RCNN model [12] is adopted to detect and crop the objects in an image following [37], [10], [11]. Then, we reuse the VGG Net mentioned above to extract visual features of the detected objects.…”
Section: A. Experimental Settings (mentioning, confidence: 99%)
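The excerpt above describes a detect-then-encode pipeline: detect and crop objects with a pre-trained Faster R-CNN, then re-encode each crop with VGG to obtain a 4,096-d (fc7) feature. The sketch below illustrates one way to do this; the torchvision model choices, the 0.5 score threshold, and the 224×224 resize are assumptions, not the cited papers' exact setup.

```python
import torch
import torchvision
from torchvision import transforms

# Pre-trained detector and VGG-16 encoder (assumed torchvision weights).
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
vgg = torchvision.models.vgg16(weights="DEFAULT").eval()
# Truncate the VGG classifier after fc7 to get a 4,096-d feature per crop.
fc7 = torch.nn.Sequential(vgg.features, vgg.avgpool, torch.nn.Flatten(),
                          *list(vgg.classifier.children())[:5])

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def object_features(image, score_thresh=0.5):
    """image: float tensor (3, H, W) in [0, 1]. Returns a list of 4096-d features,
    one per detected object above the (assumed) score threshold."""
    dets = detector([image])[0]
    feats = []
    for box, score in zip(dets["boxes"], dets["scores"]):
        if score < score_thresh:
            continue
        x1, y1, x2, y2 = [int(v) for v in box]
        if x2 <= x1 or y2 <= y1:
            continue
        crop = image[:, y1:y2, x1:x2]
        feats.append(fc7(preprocess(crop).unsqueeze(0)).squeeze(0))
    return feats
```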
“…Compared approaches. To evaluate on the held-out MSCOCO split, the results of our proposed method are compared with DCC [9], NOC [35], LSTM-C [36], Base+T4 [37], NTB+G [10] and DNOC [11] to demonstrate its competitiveness. Among these methods, NTB+G and DNOC do not use the additional semantic data.…”
Section: A. Experimental Settings (mentioning, confidence: 99%)
“…To generalize better 'in the wild', we argue that captioning models should be able to leverage alternative data sources -such as object detection datasets -in order to describe objects not present in the caption corpora on which they are trained. Such objects which have detection annotations but are not present in caption corpora are referred to as novel objects and the task of describing images containing novel objects is termed novel object captioning [2,3,15,27,42,45,49]. Until now, novel object captioning approaches have been evaluated using a proof-ofconcept dataset introduced in [14].…”
Section: Introduction (mentioning, confidence: 99%)
“…The inability of unidirectional BS to consider both future and past contexts leads models to fill the blank with words that clash abruptly with the surrounding context (see red circles). Neural seq2seq models (… 2014; Bahdanau et al., 2014; Gehring et al., 2017; Vaswani et al., 2017) are widely used in text generation tasks, including neural machine translation (Wu et al., 2016; Vaswani et al., 2017), image captioning (Anderson et al., 2017), abstractive summarization (See et al., 2017), and dialogue generation (Mei et al., 2017). Unfortunately, given a well-trained neural seq2seq model or an unconditional neural language model (Mikolov et al., 2010), it is a daunting task to directly apply it to the text infilling task.…”
Section: Introduction (mentioning, confidence: 99%)