Imagine This! Scripts to Compositions to Videos

Gupta, Tanmay; Schwenk, Dustin; Farhadi, Ali; Hoiem, Derek; Kembhavi, Aniruddha

doi:10.48550/arxiv.1804.03608

Cited by 4 publications

(6 citation statements)

References 0 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Sequentially generating new data from the previous data is termed autoregressive. However, we consider some studies [243,244] autoregressive due to the sequential prediction of frames, similar to others, but without using GAN or VAE models. These models typically fuse the two domains, text and video, for learning joint embedding.…”

Section: Auto-regressive Modelsmentioning

confidence: 99%

“…In CRAFT [243], text-conditioned video creation is completed by a compositional retrieval task. Following the caption, the model sequentially predicts a temporal layout of objects and retrieves the Spatio-temporal entity segments from a video dataset, where the fused segments create the final video.…”

Section: Generationmentioning

confidence: 99%

“…Similar to T2S, another significant area is the text-guided video generation with principal contributions using GAN [253,258] and some variations in VAE [250,252] and retrieval-based models [243]. However, the main difference of this method compared to the first two is the dependency and temporal consistency between consistent frames of the video, hence the named dynamic.…”

Section: Generative Modelsmentioning

confidence: 99%

See 2 more Smart Citations

A Review of Multi-Modal Learning from the Text-Guided Visual Processing Viewpoint

Ullah

Lee

et al. 2022

Sensors

View full text Add to dashboard Cite

For decades, co-relating different data domains to attain the maximum potential of machines has driven research, especially in neural networks. Similarly, text and visual data (images and videos) are two distinct data domains with extensive research in the past. Recently, using natural language to process 2D or 3D images and videos with the immense power of neural nets has witnessed a promising future. Despite the diverse range of remarkable work in this field, notably in the past few years, rapid improvements have also solved future challenges for researchers. Moreover, the connection between these two domains is mainly subjected to GAN, thus limiting the horizons of this field. This review analyzes Text-to-Image (T2I) synthesis as a broader picture, Text-guided Visual-output (T2Vo), with the primary goal being to highlight the gaps by proposing a more comprehensive taxonomy. We broadly categorize text-guided visual output into three main divisions and meaningful subdivisions by critically examining an extensive body of literature from top-tier computer vision venues and closely related fields, such as machine learning and human–computer interaction, aiming at state-of-the-art models with a comparative analysis. This study successively follows previous surveys on T2I, adding value by analogously evaluating the diverse range of existing methods, including different generative models, several types of visual output, critical examination of various approaches, and highlighting the shortcomings, suggesting the future direction of research.

show abstract

Section: Auto-regressive Modelsmentioning

confidence: 99%

Section: Generationmentioning

confidence: 99%

Section: Generative Modelsmentioning

confidence: 99%

See 1 more Smart Citation

A Review of Multi-Modal Learning from the Text-Guided Visual Processing Viewpoint

Ullah

Lee

et al. 2022

Sensors

View full text Add to dashboard Cite

show abstract

“…Pre-processing CIFAR-10 [ Krizhevsky, 2009] 32 × 32 × 3 64 None CelebA 64 × 64 × 3 128 Centre-cropped, area downsampled ImageNet [Deng et al, 2009] 64 × 64 × 3 128 Area downsampled Flintstones [Gupta et al, 2018] Since the normal GAN has no encoder, it was not necessary to add additional hyper-parameters when adding the losses in this case. For instance, the losses for each component for GAN + adversarial Z are:…”

Section: Models Consideredmentioning

confidence: 99%

An Empirical Study of Generative Models with Encoders

Rubenstein,

Li,

Roblek

2018

Preprint

View full text Add to dashboard Cite

Generative adversarial networks (GANs) are capable of producing high quality image samples. However, unlike variational autoencoders (VAEs), GANs lack encoders that provide the inverse mapping for the generators, i.e., encode images back to the latent space. In this work, we consider adversarially learned generative models that also have encoders. We evaluate models based on their ability to produce high quality samples and reconstructions of real images. Our main contributions are twofold: First, we find that the baseline Bidirectional GAN (BiGAN) can be improved upon with the addition of an autoencoder loss, at the expense of an extra hyper-parameter to tune. Second, we show that comparable performance to BiGAN can be obtained by simply training an encoder to invert the generator of a normal GAN.

show abstract

“…The analysis of comics and mangas images recently sparked the computer vision and document analysis communities interest [2]. The digital version of manga can be used by the researchers to propose new algorithms to provide services such as dynamic visualization of manga [3], adding colors [12], generating animations [8], creating new kinds of recommender systems [7], etc.…”

Section: Introductionmentioning

confidence: 99%

Facial Landmark Detection for Manga Images

Stricker,

Augereau,

Kise

et al. 2018

Preprint

View full text Add to dashboard Cite

The topic of facial landmark detection has been widely covered for pictures of human faces, but it is still a challenge for drawings. Indeed, the proportions and symmetry of standard human faces are not always used for comics or mangas. The personal style of the author, the limitation of colors, etc. makes the landmark detection on faces in drawings a difficult task. Detecting the landmarks on manga images will be useful to provide new services for easily editing the character faces, estimating the character emotions, or generating automatically some animations such as lip or eye movements.This paper contains two main contributions: 1) a new landmark annotation model for manga faces, and 2) a deep learning approach to detect these landmarks. We use the "Deep Alignment Network", a multi stage architecture where the first stage makes an initial estimation which gets refined in further stages. The first results show that the proposed method succeed to accurately find the landmarks in more than 80% of the cases.

show abstract

Imagine This! Scripts to Compositions to Videos

Cited by 4 publications

References 0 publications

A Review of Multi-Modal Learning from the Text-Guided Visual Processing Viewpoint

A Review of Multi-Modal Learning from the Text-Guided Visual Processing Viewpoint

An Empirical Study of Generative Models with Encoders

Facial Landmark Detection for Manga Images

Contact Info

Product

Resources

About