2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.01092
RiFeGAN: Rich Feature Generation for Text-to-Image Synthesis From Prior Knowledge

Cited by 93 publications (42 citation statements)
References 16 publications
“…GAN-based Text-to-image generation. In the past few years, Generative Adversarial Networks (GANs) [18] have shown promising results on text-to-image generation [5,8,9,14,17,22,27–30,34,39,40,46,47,54,56–58,62–68]. GAN-INT-CLS [46] was the first to use a conditional GAN formulation for text-to-image generation.…”
Section: Related Work
confidence: 99%
“…Recently, exploring both the textual and visual representations has been studied in many challenging tasks. In the image domain, some works have dealt with image captioning [43], image grounding [21], and text-to-image synthesis [6]. In the video domain, some works focus on temporal localization using natural language [32,35,51], where the temporal boundary needs to be localized with a given natural language description.…”
Section: Textual and Visual Understanding
confidence: 99%
“…We note, however, that the FID score (which is reference-based and compares the distributions of real and synthetic images) has been observed to be more consistent with human judgement of image realism than IS (which is reference-free and makes no comparison to real images) (Heusel et al., 2017). We were not able to recompute other metrics for RiFeGAN (Cheng et al., 2020) and LeicaGAN (Qiao et al., 2019), as the pretrained models have not been made publicly available. In Table 5, "-" represents cases where the data was not reported or is reported in a non-comparable manner (besides FID values).…”
Section: A4 Note on Evaluation Metrics
confidence: 99%