2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.01245

Object-Driven Text-To-Image Synthesis via Adversarial Training

Abstract: In this paper, we propose Object-driven Attentive Generative Adversarial Networks (Obj-GANs) that allow object-centered text-to-image synthesis for complex scenes. Following the two-step (layout-image) generation process, a novel object-driven attentive image generator is proposed to synthesize salient objects by paying attention to the most relevant words in the text description and the pre-generated semantic layout. In addition, a new Fast R-CNN based object-wise discriminator is proposed to provide rich obj…
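The abstract describes attention from pre-generated layout objects to the most relevant words of the caption. As a rough illustration of that idea, here is a minimal PyTorch sketch of object-driven attention; the function name, shapes, and plain dot-product scoring are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def object_driven_attention(object_queries, word_features, pad_mask=None):
    """Attend each layout object to the most relevant words in the caption.

    object_queries: (B, N, D) -- one query per object; in the paper these
                    would be derived from the pre-generated semantic layout.
    word_features:  (B, T, D) -- word embeddings of the text description.
    pad_mask:       (B, T) bool -- True at padding positions (optional).
    Returns (B, N, D): a per-object summary of the relevant words.
    """
    scores = torch.bmm(object_queries, word_features.transpose(1, 2))  # (B, N, T)
    if pad_mask is not None:
        scores = scores.masked_fill(pad_mask.unsqueeze(1), float("-inf"))
    attn = F.softmax(scores, dim=-1)       # attention weights over words
    return torch.bmm(attn, word_features)  # (B, N, D) context vectors

# Toy usage: 2 captions, 3 layout objects, 7 words, 64-dim features.
queries = torch.randn(2, 3, 64)
words = torch.randn(2, 7, 64)
print(object_driven_attention(queries, words).shape)  # torch.Size([2, 3, 64])
```

The resulting per-object context vectors would then condition the image generator on the words most relevant to each object.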


Cited by 282 publications (263 citation statements). References 19 publications.
“…For pixels where bounding boxes of different objects overlap, their semantic labels are assigned by the object with the highest predicted mask weight. Unlike [12,21], where ground truth masks are adopted to guide learning of the shape generator, our model can learn semantic masks in a weakly-supervised manner. Even for objects with overlapping bounding boxes, like the person and surfboard in (f), the synthesized images and learned masks are consistent and semantically reasonable.…”
Section: Qualitative Results (mentioning)
confidence: 99%
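The overlap-resolution rule quoted above (each pixel takes the label of the object with the highest predicted mask weight) can be sketched in a few lines of NumPy. Everything here, including the function name and the background threshold, is a hypothetical illustration of the rule, not the citing paper's code.

```python
import numpy as np

def assign_semantic_labels(mask_weights, class_ids, background_id=0, threshold=0.5):
    """Resolve overlapping object masks into one semantic label map.

    mask_weights: (N, H, W) float array -- predicted per-object mask weights,
                  already placed at their bounding-box locations on the canvas.
    class_ids:    (N,) int array -- semantic class of each object.
    Pixels where every weight falls below `threshold` become background.
    """
    winner = mask_weights.argmax(axis=0)      # (H, W): strongest object per pixel
    labels = class_ids[winner]                # (H, W): its semantic class
    best = mask_weights.max(axis=0)
    labels[best < threshold] = background_id  # weakly covered pixels -> background
    return labels

# Toy usage: two overlapping 4x4 masks for classes 1 (person) and 2 (surfboard).
w = np.zeros((2, 4, 4))
w[0, :, :2] = 0.9   # person occupies the left columns
w[1, :, 1:] = 0.8   # surfboard overlaps it from column 1 onward
print(assign_semantic_labels(w, np.array([1, 2])))
```

In the overlapping column, the person wins because its predicted mask weight (0.9) exceeds the surfboard's (0.8), matching the rule described in the quote.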
“…Spatial-layout-conditioned image generation has been studied in the recent literature. In [16,12,11,21], layout and object information is utilized in text-to-image generation. [11] controls the locations of multiple objects in text-to-image generation by adding an object pathway to both the generator and the discriminator.…”
Section: Related Work (mentioning)
confidence: 99%
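The "object pathway" mentioned for [11] essentially places per-object features at their bounding-box locations so that the generator and discriminator can process objects separately from the global image pathway. A minimal sketch of that placement step, with assumed names and shapes, might look as follows.

```python
import torch

def place_object_features(obj_feats, boxes, canvas_size):
    """Scatter per-object feature vectors onto a spatial canvas at their boxes.

    obj_feats:   (N, C) -- one feature vector per object (e.g. from a label embedding).
    boxes:       (N, 4) integer (x0, y0, x1, y1) in feature-map coordinates.
    canvas_size: (H, W) of the feature map.
    Returns (C, H, W): a simple object-pathway input that a generator or
    discriminator could concatenate with its global-pathway features.
    """
    C = obj_feats.size(1)
    H, W = canvas_size
    canvas = torch.zeros(C, H, W)
    for feat, (x0, y0, x1, y1) in zip(obj_feats, boxes.tolist()):
        canvas[:, y0:y1, x0:x1] = feat.view(C, 1, 1)  # broadcast over the box
    return canvas

# Toy usage: two objects on an 8x8 feature map.
feats = torch.randn(2, 16)
boxes = torch.tensor([[0, 0, 4, 4], [3, 3, 8, 8]])
print(place_object_features(feats, boxes, (8, 8)).shape)  # torch.Size([16, 8, 8])
```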
“…Complementary image features derived from different models, such as ResNet and Faster R-CNN, are used for multiple image attention mechanisms [139]. Moreover, the reverse of image attention, which generates attended text features from image and text input, is used for text-to-image synthesis in [48] and [140].…”
Section: B. Attention-based Fusion (mentioning)
confidence: 99%
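The "reverse of image attention" described here swaps the roles of queries and values: instead of words attending to image regions, image regions attend to words, yielding attended text features. A minimal sketch, assuming features already projected into a shared space and plain dot-product scoring:

```python
import torch
import torch.nn.functional as F

def attend(queries, values):
    """Generic dot-product attention: summarize `values` for each query."""
    attn = F.softmax(queries @ values.transpose(-2, -1), dim=-1)
    return attn @ values

# Toy features: 5 image regions and 7 words, both 32-dim (real models would
# use learned projections to reach this common space).
regions = torch.randn(5, 32)
words = torch.randn(7, 32)

attended_image = attend(words, regions)  # usual direction: text queries image
attended_text = attend(regions, words)   # the "reverse": image queries text
print(attended_image.shape, attended_text.shape)  # (7, 32) and (5, 32)
```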
“…Despite the rapid progress and recent successes in object generation (e.g., celebrity faces, animals) [1,9,13] and scene generation [4,11,12,19,22,30,31], little attention has been paid to frameworks designed for stochastic semantic layout generation. Having a robust model for layout generation will not only allow us to generate reliable scene layouts, but also provide priors and means to infer latent relationships between objects, advancing progress in the scene understanding domain.…”
Section: Introduction (mentioning)
confidence: 99%