2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.01095
Fast, Diverse and Accurate Image Captioning Guided by Part-Of-Speech

Abstract: Image captioning is an ambiguous problem, with many suitable captions for an image. To address this ambiguity, beam search is the de facto method for sampling multiple captions. However, beam search is computationally expensive and known to produce generic captions [8,10]. To address this concern, some variational auto-encoder (VAE) [32] and generative adversarial net (GAN) [5,25] based methods have been proposed. Though diverse, GAN- and VAE-based methods are less accurate. In this paper, we first predict a meaningful summary of…
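To make the abstract's point about beam search concrete, here is a minimal, framework-agnostic sketch of beam-search decoding. The `step` function and all names are illustrative assumptions, not the paper's implementation; `step(prefix)` is assumed to return next-token log-probabilities given a partial caption.

```python
def beam_search(step, start_token, end_token, beam_size=3, max_len=20):
    """Generic beam-search decoder (illustrative sketch).

    `step(tokens)` is assumed to return a dict mapping each candidate
    next token to its log-probability given the prefix `tokens`.
    Returns the `beam_size` highest-scoring sequences found.
    """
    beams = [([start_token], 0.0)]  # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        # Expand every surviving prefix by every candidate token.
        candidates = []
        for prefix, score in beams:
            for token, logp in step(prefix).items():
                candidates.append((prefix + [token], score + logp))
        # Prune: keep only the top `beam_size` partial captions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == end_token else beams).append((prefix, score))
        if not beams:
            break
    return sorted(finished + beams, key=lambda c: c[1], reverse=True)[:beam_size]

# Toy usage with a fake model that always prefers "a", then "<end>".
toy = lambda prefix: {"a": -0.1, "cat": -2.3, "<end>": -0.5}
print(beam_search(toy, "<start>", "<end>", beam_size=2, max_len=5))
```

Because every beam extends the same few high-probability prefixes, the returned captions often differ by only a word or two, which is the genericness the abstract refers to; the cost also grows with the beam width, since each step scores `beam_size` times as many candidates as greedy decoding.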

Cited by 130 publications (101 citation statements)
References 26 publications

“…Our Seq-CVAE method obtains high scores on standard captioning metrics. We obtain accuracy comparable both to the very recently proposed POS approach [10], which uses a part-of-speech prior, and to the AG-CVAE method [40]. Both methods use additional information in the form of object vectors from a Faster-RCNN [32] during inference.…”
Section: Intention Model
confidence: 63%
“…For high-level control, one-hot encodings that represent observed objects or groups of objects are injected at the first step of the LSTM [40]. Very recently [10], more low-level control has also been discussed by conditioning on abstract representations of part-of-speech tags. Again, the conditioning was achieved by changing the initial LSTM input.…”
Section: Introduction
confidence: 99%
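The conditioning mechanism this citation describes can be made concrete. The following is a minimal PyTorch sketch, assuming a hypothetical decoder where a control vector (a one-hot object encoding as in [40], or an abstract POS representation as in [10]) is fed as the first LSTM input; module names and dimensions are illustrative, not taken from either cited paper.

```python
import torch
import torch.nn as nn

class ConditionedCaptionDecoder(nn.Module):
    """Toy LSTM decoder conditioned via its first input step (sketch)."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, control_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Project the control signal into the word-embedding space.
        self.control_proj = nn.Linear(control_dim, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, control, captions):
        # Step 0: the control vector is the initial LSTM input,
        # so it shapes every subsequent hidden state.
        first = self.control_proj(control).unsqueeze(1)   # (B, 1, E)
        words = self.embed(captions)                      # (B, T, E)
        inputs = torch.cat([first, words], dim=1)         # (B, T+1, E)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                           # next-word logits
```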
“…To the best of our knowledge, the POS tag information of language descriptions has not been introduced in the video captioning task. In image captioning, Deshpande et al. treated the entire POS tag sequence given by the benchmark dataset as a sample and divided them into 1024 categories by k-medoids clustering [10], which limits the diversity of POS sequence information. He et al. controlled the input of image representations based on the predefined POS tag information of each ground-truth word [16], which can hardly be obtained in a practical scenario.…”
Section: Captioning With POS Information
confidence: 99%
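A minimal sketch of the clustering idea this citation attributes to [10]: grouping POS tag sequences by k-medoids. The distance metric (edit distance here), the value of k, and all helper names are assumptions for illustration, not details from the cited paper.

```python
import random

def edit_distance(a, b):
    """Levenshtein distance between two POS tag sequences."""
    dp = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, tb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ta != tb))
    return dp[-1]

def k_medoids(sequences, k, iters=10, seed=0):
    """Cluster POS tag sequences around k medoid sequences (sketch)."""
    rng = random.Random(seed)
    medoids = rng.sample(sequences, k)
    for _ in range(iters):
        # Assignment step: each sequence joins its nearest medoid.
        clusters = [[] for _ in range(k)]
        for s in sequences:
            i = min(range(k), key=lambda m: edit_distance(s, medoids[m]))
            clusters[i].append(s)
        # Update step: the new medoid minimizes total in-cluster distance.
        medoids = [
            min(c, key=lambda s: sum(edit_distance(s, t) for t in c)) if c else medoids[i]
            for i, c in enumerate(clusters)
        ]
    return medoids, clusters

# Toy usage: cluster three short POS sequences into two groups.
seqs = [["DT", "NN", "VBZ"], ["DT", "JJ", "NN", "VBZ"], ["NN", "VBD", "RB"]]
medoids, clusters = k_medoids(seqs, k=2, iters=5)
```

Representing each cluster by its medoid sequence yields a fixed inventory (1024 categories in [10]) that a model can condition on, which is exactly why the citation notes it limits the diversity of POS sequence information.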
“…Prior video captioning methods also neglect the syntactic structure of a sentence during the generation process. Just as words are the basic components of a sentence, the part-of-speech (POS) [10] information of each word is the basic structure of its grammar. Therefore, the POS information of the generated sentence can act as prior knowledge to guide and regularize sentence generation, if it can be obtained beforehand.…”
Section: Introduction
confidence: 99%