2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019
DOI: 10.1109/iccv.2019.00436
Sequential Latent Spaces for Modeling the Intention During Diverse Image Captioning

Abstract: Diverse and accurate vision+language modeling is an important goal to retain creative freedom and maintain user engagement. However, adequately capturing the intricacies of diversity in language models is challenging. Recent works commonly resort to latent variable models augmented with more or less supervision from object detectors or part-of-speech tags [10, 40]. Common to all those methods is the fact that the latent variable either only initializes the sentence generation process or is identical across the …

Cited by 50 publications (46 citation statements)
References 27 publications (35 reference statements)
“…VAEs (Kingma and Welling 2014) incorporate a form of non-determinism within a model, making them a suitable candidate for models which require diverse outputs. VAEs have been successfully applied to text processing and generation (Miao, Yu, and Blunsom 2016; Bowman et al. 2015), and are widely used for diverse text generation (Jain, Zhang, and Schwing 2017; Aneja et al. 2019; Lin et al. 2020).…”
Section: Latent Variable Models (mentioning)
confidence: 99%
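The non-determinism this statement refers to comes from sampling the latent vector via the reparameterization trick: repeated draws from the same posterior yield different latent codes, and hence different decoded captions. A minimal NumPy sketch (function and variable names are illustrative, not taken from any cited paper):

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).

    Drawing a fresh eps on every call is what makes the decoder
    non-deterministic, which is the source of output diversity."""
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu, log_var = np.zeros(16), np.zeros(16)  # toy standard-normal posterior
z1 = sample_latent(mu, log_var, rng)
z2 = sample_latent(mu, log_var, rng)
# z1 and z2 are two different draws from the same posterior; decoding
# each would produce a different caption.
```

In an actual captioning VAE, `mu` and `log_var` would be produced by an encoder network and the sample `z` fed to a caption decoder; this sketch only isolates the sampling step.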
“…Considering that LSTM units are complex and inherently sequential across time, J. Aneja, A. Deshpande, and A. G. Schwing [1] from group #11 developed a convolutional image captioning technique. In [58], J. Aneja et al. proposed SeqCVAE, which learns a latent space for every word position. K. Shuster et al. [59] from cluster #12 proposed PERSONALITY-CAPTIONS, where the goal is to be as engaging to humans as possible by incorporating controllable style and personality traits.…”
Section: Other Research Communities (mentioning)
confidence: 99%
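The per-word-position latent idea mentioned above can be sketched as a chain of latent variables, where the prior for each position is conditioned on the previous latent rather than a single code being reused for the whole sentence. A toy NumPy sketch under stated assumptions (the linear transition and all names are hypothetical, not the authors' actual Seq-CVAE implementation):

```python
import numpy as np

def sequential_latents(T, dim, rng):
    """Draw one latent z_t per word position t.

    The prior mean for z_t depends on z_{t-1} (here via a toy linear
    transition), in contrast to models that sample a single z once and
    reuse it across the entire sentence."""
    A = 0.5 * np.eye(dim)          # hypothetical transition matrix
    zs = []
    z_prev = np.zeros(dim)
    for _ in range(T):
        mu_t = A @ z_prev          # prior conditioned on previous latent
        z_t = mu_t + rng.standard_normal(dim)
        zs.append(z_t)
        z_prev = z_t
    return np.stack(zs)            # shape (T, dim)

rng = np.random.default_rng(1)
zs = sequential_latents(T=5, dim=8, rng=rng)
```

In the real model, each z_t would additionally condition a word decoder and be inferred with an encoder; the sketch only shows why the latent trajectory varies per position and per sample.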
“…New techniques such as Transformers, reinforcement learning, and GANs have been widely applied to solve image description problems, and unsupervised image captioning methods [92–95] have become a new research hotspot. The form of captioning has become more diverse, as it is no longer confined to the overall content of the image [58, 81, 96]. In addition, Vision-Language Pretraining (VLP) models are an emerging direction of image captioning and image understanding.…”
Section: Evolutionary Path of Image Captioning (mentioning)
confidence: 99%
“…Generative models for data that is inherently sequential often couple each input term with a corresponding latent variable (e.g. [8]; applications to image captioning in [1, 4], dialog generation in [21], and handwriting in [2]). Such models can be applied to other kinds of data by imposing a sequence, including Laplacian pyramid levels from images [3], sequences of resolutions [28], and multi-scale feature representations [26].…”
Section: Sequential Latent Space Models (mentioning)
confidence: 99%