2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.01245

Object-Driven Text-To-Image Synthesis via Adversarial Training

Abstract: In this paper, we propose Object-driven Attentive Generative Adversarial Networks (Obj-GANs) that allow object-centered text-to-image synthesis for complex scenes. Following the two-step (layout-image) generation process, a novel object-driven attentive image generator is proposed to synthesize salient objects by paying attention to the most relevant words in the text description and the pre-generated semantic layout. In addition, a new Fast R-CNN based object-wise discriminator is proposed to provide rich obj…
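The abstract describes attention from pre-generated layout objects to the most relevant words of the caption. As a rough illustration of that idea, here is a minimal PyTorch sketch of object-driven attention; the function name, shapes, and plain dot-product scoring are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def object_driven_attention(object_queries, word_features, pad_mask=None):
    """Attend each layout object to the most relevant words in the caption.

    object_queries: (B, N, D) -- one query per object; in the paper these
                    would be derived from the pre-generated semantic layout.
    word_features:  (B, T, D) -- word embeddings of the text description.
    pad_mask:       (B, T) bool -- True at padding positions (optional).
    Returns (B, N, D): a per-object summary of the relevant words.
    """
    scores = torch.bmm(object_queries, word_features.transpose(1, 2))  # (B, N, T)
    if pad_mask is not None:
        scores = scores.masked_fill(pad_mask.unsqueeze(1), float("-inf"))
    attn = F.softmax(scores, dim=-1)       # attention weights over words
    return torch.bmm(attn, word_features)  # (B, N, D) context vectors

# Toy usage: 2 captions, 3 layout objects, 7 words, 64-dim features.
queries = torch.randn(2, 3, 64)
words = torch.randn(2, 7, 64)
print(object_driven_attention(queries, words).shape)  # torch.Size([2, 3, 64])
```

The resulting per-object context vectors would then condition the image generator on the words most relevant to each object.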


Cited by 282 publications (263 citation statements). References 19 publications.
“…For pixels where bounding boxes of different objects overlap, their semantic labels are assigned by the object with the highest predicted mask weight. Unlike [12,21], where ground truth masks are adopted to guide learning of the shape generator, our model can learn semantic masks in a weakly-supervised manner. Even for objects with overlapping bounding boxes, like the person and surfboard in (f), the synthesized images and learned masks are consistent and semantically reasonable.…”
Section: Qualitative Results (mentioning)
confidence: 99%
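The overlap-resolution rule quoted above (each pixel takes the label of the object with the highest predicted mask weight) can be sketched in a few lines of NumPy. Everything here, including the function name and the background threshold, is a hypothetical illustration of the rule, not the citing paper's code.

```python
import numpy as np

def assign_semantic_labels(mask_weights, class_ids, background_id=0, threshold=0.5):
    """Resolve overlapping object masks into one semantic label map.

    mask_weights: (N, H, W) float array -- predicted per-object mask weights,
                  already placed at their bounding-box locations on the canvas.
    class_ids:    (N,) int array -- semantic class of each object.
    Pixels where every weight falls below `threshold` become background.
    """
    winner = mask_weights.argmax(axis=0)      # (H, W): strongest object per pixel
    labels = class_ids[winner]                # (H, W): its semantic class
    best = mask_weights.max(axis=0)
    labels[best < threshold] = background_id  # weakly covered pixels -> background
    return labels

# Toy usage: two overlapping 4x4 masks for classes 1 (person) and 2 (surfboard).
w = np.zeros((2, 4, 4))
w[0, :, :2] = 0.9   # person occupies the left columns
w[1, :, 1:] = 0.8   # surfboard overlaps it from column 1 onward
print(assign_semantic_labels(w, np.array([1, 2])))
```

In the overlapping column, the person wins because its predicted mask weight (0.9) exceeds the surfboard's (0.8), matching the rule described in the quote.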
“…Spatial-layout-conditioned image generation has been studied in the recent literature. In [16,12,11,21], layout and object information is utilized in text-to-image generation. [11] controls the locations of multiple objects in text-to-image generation by adding an object pathway to both the generator and the discriminator.…”
Section: Related Work (mentioning)
confidence: 99%
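The "object pathway" mentioned for [11] essentially places per-object features at their bounding-box locations so that the generator and discriminator can process objects separately from the global image pathway. A minimal sketch of that placement step, with assumed names and shapes, might look as follows.

```python
import torch

def place_object_features(obj_feats, boxes, canvas_size):
    """Scatter per-object feature vectors onto a spatial canvas at their boxes.

    obj_feats:   (N, C) -- one feature vector per object (e.g. from a label embedding).
    boxes:       (N, 4) integer (x0, y0, x1, y1) in feature-map coordinates.
    canvas_size: (H, W) of the feature map.
    Returns (C, H, W): a simple object-pathway input that a generator or
    discriminator could concatenate with its global-pathway features.
    """
    C = obj_feats.size(1)
    H, W = canvas_size
    canvas = torch.zeros(C, H, W)
    for feat, (x0, y0, x1, y1) in zip(obj_feats, boxes.tolist()):
        canvas[:, y0:y1, x0:x1] = feat.view(C, 1, 1)  # broadcast over the box
    return canvas

# Toy usage: two objects on an 8x8 feature map.
feats = torch.randn(2, 16)
boxes = torch.tensor([[0, 0, 4, 4], [3, 3, 8, 8]])
print(place_object_features(feats, boxes, (8, 8)).shape)  # torch.Size([16, 8, 8])
```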
“…Complementary image features derived from different models, such as ResNet and Faster R-CNN, are used for multiple image attention mechanisms [139]. Moreover, the reverse of image attention, which generates attended text features from image and text input, is used for text-to-image synthesis in [48] and [140].…”
Section: B. Attention-based Fusion (mentioning)
confidence: 99%
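The "reverse of image attention" described here swaps the roles of queries and values: instead of words attending to image regions, image regions attend to words, yielding attended text features. A minimal sketch, assuming features already projected into a shared space and plain dot-product scoring:

```python
import torch
import torch.nn.functional as F

def attend(queries, values):
    """Generic dot-product attention: summarize `values` for each query."""
    attn = F.softmax(queries @ values.transpose(-2, -1), dim=-1)
    return attn @ values

# Toy features: 5 image regions and 7 words, both 32-dim (real models would
# use learned projections to reach this common space).
regions = torch.randn(5, 32)
words = torch.randn(7, 32)

attended_image = attend(words, regions)  # usual direction: text queries image
attended_text = attend(regions, words)   # the "reverse": image queries text
print(attended_image.shape, attended_text.shape)  # (7, 32) and (5, 32)
```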
“…Despite the rapid progress and recent successes in object generation (e.g., celebrity faces, animals) [1,9,13] and scene generation [4,11,12,19,22,30,31], little attention has been paid to frameworks designed for stochastic semantic layout generation. Having a robust model for layout generation will not only allow us to generate reliable scene layouts, but also provide priors and means to infer latent relationships between objects, advancing progress in the scene understanding domain.…”
Section: Introduction (mentioning)
confidence: 99%