2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019
DOI: 10.1109/iccv.2019.00436
Sequential Latent Spaces for Modeling the Intention During Diverse Image Captioning

Abstract: Diverse and accurate vision+language modeling is an important goal to retain creative freedom and maintain user engagement. However, adequately capturing the intricacies of diversity in language models is challenging. Recent works commonly resort to latent variable models augmented with more or less supervision from object detectors or part-of-speech tags [10, 40]. Common to all those methods is the fact that the latent variable either only initializes the sentence generation process or is identical across the …

Cited by 50 publications (46 citation statements)
References 27 publications (35 reference statements)
“…VAEs (Kingma and Welling 2014) incorporate a form of non-determinism within a model, making them a suitable candidate for models which require diverse outputs. VAEs have been successfully applied to text processing and generation (Miao, Yu, and Blunsom 2016; Bowman et al. 2015), and are widely used for diverse text generation (Jain, Zhang, and Schwing 2017; Aneja et al. 2019; Lin et al. 2020).…”
Section: Latent Variable Models (mentioning)
confidence: 99%
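The non-determinism this statement refers to comes from sampling the latent vector via the reparameterization trick: repeated draws from the same posterior yield different latent codes, and hence different decoded captions. A minimal NumPy sketch (function and variable names are illustrative, not taken from any cited paper):

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).

    Drawing a fresh eps on every call is what makes the decoder
    non-deterministic, which is the source of output diversity."""
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu, log_var = np.zeros(16), np.zeros(16)  # toy standard-normal posterior
z1 = sample_latent(mu, log_var, rng)
z2 = sample_latent(mu, log_var, rng)
# z1 and z2 are two different draws from the same posterior; decoding
# each would produce a different caption.
```

In an actual captioning VAE, `mu` and `log_var` would be produced by an encoder network and the sample `z` fed to a caption decoder; this sketch only isolates the sampling step.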
“…Considering that LSTM units are complex and inherently sequential across time, J. Aneja, A. Deshpande, and A. G. Schwing [1] from group #11 developed a convolutional image captioning technique. In [58], J. Aneja et al. proposed SeqCVAE, which learns a latent space for every word position. K. Shuster et al. [59] from cluster #12 proposed PERSONALITY-CAPTIONS, where the goal is to be as engaging to humans as possible by incorporating controllable style and personality traits.…”
Section: Other Research Communities (mentioning)
confidence: 99%
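The per-word-position latent idea mentioned above can be sketched as a chain of latent variables, where the prior for each position is conditioned on the previous latent rather than a single code being reused for the whole sentence. A toy NumPy sketch under stated assumptions (the linear transition and all names are hypothetical, not the authors' actual Seq-CVAE implementation):

```python
import numpy as np

def sequential_latents(T, dim, rng):
    """Draw one latent z_t per word position t.

    The prior mean for z_t depends on z_{t-1} (here via a toy linear
    transition), in contrast to models that sample a single z once and
    reuse it across the entire sentence."""
    A = 0.5 * np.eye(dim)          # hypothetical transition matrix
    zs = []
    z_prev = np.zeros(dim)
    for _ in range(T):
        mu_t = A @ z_prev          # prior conditioned on previous latent
        z_t = mu_t + rng.standard_normal(dim)
        zs.append(z_t)
        z_prev = z_t
    return np.stack(zs)            # shape (T, dim)

rng = np.random.default_rng(1)
zs = sequential_latents(T=5, dim=8, rng=rng)
```

In the real model, each z_t would additionally condition a word decoder and be inferred with an encoder; the sketch only shows why the latent trajectory varies per position and per sample.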
“…New techniques such as Transformers, reinforcement learning, and GANs have been widely applied to solve image description problems, and unsupervised image captioning methods [92–95] have become a new research hotspot. The form of captioning has become more diverse, as it is no longer confined to the overall content of the image [58, 81, 96]. In addition, Vision-Language Pretraining (VLP) models are an emerging direction of image captioning and image understanding.…”
Section: Evolutionary Path of Image Captioning (mentioning)
confidence: 99%
“…Generative models for data that is inherently sequential often couple each input term with a corresponding latent variable (e.g. [8]; applications to image captioning in [1, 4], dialog generation in [21], and handwriting in [2]). Such models can be applied to other kinds of data by imposing a sequence, including Laplacian pyramid levels from images [3], sequences of resolutions [28], and multi-scale feature representations [26].…”
Section: Sequential Latent Space Models (mentioning)
confidence: 99%