2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.01095
Fast, Diverse and Accurate Image Captioning Guided by Part-Of-Speech

Abstract: Image captioning is an ambiguous problem, with many suitable captions for an image. To address this ambiguity, beam search is the de facto method for sampling multiple captions. However, beam search is computationally expensive and known to produce generic captions [8,10]. To address this concern, some variational auto-encoder (VAE) [32] and generative adversarial net (GAN) [5,25] based methods have been proposed. Though diverse, GAN- and VAE-based methods are less accurate. In this paper, we first predict a meaningful summary of…
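To make the abstract's point about beam search concrete, here is a minimal, framework-agnostic sketch of beam-search decoding. The `step` function and all names are illustrative assumptions, not the paper's implementation; `step(prefix)` is assumed to return next-token log-probabilities given a partial caption.

```python
def beam_search(step, start_token, end_token, beam_size=3, max_len=20):
    """Generic beam-search decoder (illustrative sketch).

    `step(tokens)` is assumed to return a dict mapping each candidate
    next token to its log-probability given the prefix `tokens`.
    Returns the `beam_size` highest-scoring sequences found.
    """
    beams = [([start_token], 0.0)]  # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        # Expand every surviving prefix by every candidate token.
        candidates = []
        for prefix, score in beams:
            for token, logp in step(prefix).items():
                candidates.append((prefix + [token], score + logp))
        # Prune: keep only the top `beam_size` partial captions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == end_token else beams).append((prefix, score))
        if not beams:
            break
    return sorted(finished + beams, key=lambda c: c[1], reverse=True)[:beam_size]

# Toy usage with a fake model that always prefers "a", then "<end>".
toy = lambda prefix: {"a": -0.1, "cat": -2.3, "<end>": -0.5}
print(beam_search(toy, "<start>", "<end>", beam_size=2, max_len=5))
```

Because every beam extends the same few high-probability prefixes, the returned captions often differ by only a word or two, which is the genericness the abstract refers to; the cost also grows with the beam width, since each step scores `beam_size` times as many candidates as greedy decoding.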

Cited by 130 publications (101 citation statements)
References 26 publications

“…Our Seq-CVAE method obtains high scores on standard captioning metrics. We obtain accuracy comparable both to the very recently proposed POS approach [10], which uses a part-of-speech prior, and to the AG-CVAE method [40]. Both methods use additional information in the form of object vectors from a Faster-RCNN [32] during inference.…”
Section: Intention Model
confidence: 63%
“…For high-level control, one-hot encodings that represent observed objects or groups of objects are injected at the first step of the LSTM [40]. Very recently [10], more low-level control has also been discussed by conditioning on abstract representations of part-of-speech tags. Again, the conditioning was achieved by changing the initial LSTM input.…”
Section: Introduction
confidence: 99%
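The conditioning mechanism this citation describes can be made concrete. The following is a minimal PyTorch sketch, assuming a hypothetical decoder where a control vector (a one-hot object encoding as in [40], or an abstract POS representation as in [10]) is fed as the first LSTM input; module names and dimensions are illustrative, not taken from either cited paper.

```python
import torch
import torch.nn as nn

class ConditionedCaptionDecoder(nn.Module):
    """Toy LSTM decoder conditioned via its first input step (sketch)."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, control_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Project the control signal into the word-embedding space.
        self.control_proj = nn.Linear(control_dim, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, control, captions):
        # Step 0: the control vector is the initial LSTM input,
        # so it shapes every subsequent hidden state.
        first = self.control_proj(control).unsqueeze(1)   # (B, 1, E)
        words = self.embed(captions)                      # (B, T, E)
        inputs = torch.cat([first, words], dim=1)         # (B, T+1, E)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                           # next-word logits
```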
“…To the best of our knowledge, the POS tag information of language descriptions has not been introduced in the video captioning task. In image captioning, Deshpande et al. treated the entire POS tag sequence given by the benchmark dataset as a sample and divided them into 1024 categories by k-medoids clustering [10], which limits the diversity of POS sequence information. He et al. controlled the input of image representations based on the predefined POS tag information of each ground-truth word [16], which can hardly be obtained in a practical scenario.…”
Section: Captioning With POS Information
confidence: 99%
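A minimal sketch of the clustering idea this citation attributes to [10]: grouping POS tag sequences by k-medoids. The distance metric (edit distance here), the value of k, and all helper names are assumptions for illustration, not details from the cited paper.

```python
import random

def edit_distance(a, b):
    """Levenshtein distance between two POS tag sequences."""
    dp = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, tb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ta != tb))
    return dp[-1]

def k_medoids(sequences, k, iters=10, seed=0):
    """Cluster POS tag sequences around k medoid sequences (sketch)."""
    rng = random.Random(seed)
    medoids = rng.sample(sequences, k)
    for _ in range(iters):
        # Assignment step: each sequence joins its nearest medoid.
        clusters = [[] for _ in range(k)]
        for s in sequences:
            i = min(range(k), key=lambda m: edit_distance(s, medoids[m]))
            clusters[i].append(s)
        # Update step: the new medoid minimizes total in-cluster distance.
        medoids = [
            min(c, key=lambda s: sum(edit_distance(s, t) for t in c)) if c else medoids[i]
            for i, c in enumerate(clusters)
        ]
    return medoids, clusters

# Toy usage: cluster three short POS sequences into two groups.
seqs = [["DT", "NN", "VBZ"], ["DT", "JJ", "NN", "VBZ"], ["NN", "VBD", "RB"]]
medoids, clusters = k_medoids(seqs, k=2, iters=5)
```

Representing each cluster by its medoid sequence yields a fixed inventory (1024 categories in [10]) that a model can condition on, which is exactly why the citation notes it limits the diversity of POS sequence information.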
“…Prior video captioning methods also neglect the syntactic structure of a sentence during the generation process. Just as words are the basic components of a sentence, the part-of-speech (POS) [10] information of each word is the basic structure of its grammar. Therefore, the POS information of the generated sentence can act as prior knowledge to guide and regularize sentence generation, if it can be obtained beforehand.…”
Section: Introduction
confidence: 99%