2022
DOI: 10.1145/3528223.3530164

StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators

Abstract: Can a generative model be trained to produce images from a specific domain, guided only by a text prompt, without seeing any image? In other words: can an image generator be trained "blindly"? Leveraging the semantic power of large-scale Contrastive-Language-Image-Pre-training (CLIP) models, we present a text-driven method that allows shifting a generative model to new domains, without having to collect even a single image. We show that through natural language prompts and a few minutes of training, our method…
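The core idea in the abstract, adapting a generator with CLIP supervision alone, can be illustrated with a minimal sketch of a directional CLIP objective: the shift between a frozen copy of the generator and a trainable copy is aligned, in CLIP embedding space, with the shift between a source and a target text prompt. The generator handles, prompts, and hyperparameters below are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch (not the paper's exact code) of a directional CLIP loss for
# shifting a pretrained generator toward a text-described domain.
import torch
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model.float()  # keep everything in fp32 for stable gradients

def encode_text(prompt):
    tokens = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        feat = clip_model.encode_text(tokens).float()
    return F.normalize(feat, dim=-1)

def encode_image(img):
    # img: (N, 3, 224, 224); CLIP's input normalization is assumed to be applied upstream.
    return F.normalize(clip_model.encode_image(img).float(), dim=-1)

def directional_clip_loss(img_src, img_tgt, text_src, text_tgt):
    """Align the image-space direction (frozen -> trainable generator) with the
    text-space direction (source prompt -> target prompt) in CLIP embedding space."""
    dt = F.normalize(encode_text(text_tgt) - encode_text(text_src), dim=-1)
    di = F.normalize(encode_image(img_tgt) - encode_image(img_src), dim=-1)
    return (1.0 - (di * dt).sum(dim=-1)).mean()

# Training-loop sketch; G_frozen / G_train are hypothetical copies of a pretrained generator.
# for z in latent_batches:
#     img_src = G_frozen(z)   # stays in the source domain
#     img_tgt = G_train(z)    # nudged toward the text-described target domain
#     loss = directional_clip_loss(img_src, img_tgt, "photo", "sketch")
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```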

Cited by 294 publications (123 citation statements)
References 26 publications
“…To provide users with more control over the synthesis process, several works employ a segmentation map or spatial conditioning [4,17,54]. In the context of image editing, while most methods are generally limited to global edits [9,14,19,26], several works introduce a user-provided mask to specify the region that should be altered [3,7,13,34].…”
Section: Related Work (mentioning)
confidence: 99%
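As a concrete illustration of the mask-restricted editing mentioned in this excerpt, a binary mask can keep an edit inside the selected region while preserving the original content elsewhere; the tensor names below are hypothetical placeholders, not code from any of the cited works.

```python
# Mask-restricted editing sketch: blend an edited image into a user-provided
# region while leaving the rest of the original image untouched.
import torch

def apply_local_edit(original: torch.Tensor, edited: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """original, edited: (N, 3, H, W) images; mask: (N, 1, H, W) with values in [0, 1],
    where 1 marks the region that should be altered."""
    return mask * edited + (1.0 - mask) * original
```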
“…A widely adopted approach is the use of auxiliary models such as CLIP [35] to guide the optimization of pretrained generators towards maximizing text-to-image similarity scores [6,7]. Additionally, other works have used CLIP in conjunction with generative models for various tasks such as image manipulation [20,33], domain adaptation [14], style transfer [23], and even object segmentation [27,50]. Recently, large-scale text-to-image models demonstrated impressive image generation performance [5,12,38,39,42,43,54].…”
Section: Related Work (mentioning)
confidence: 99%
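The CLIP-guided optimization described in this excerpt can be sketched as follows: a latent code of a frozen, pretrained generator is optimized so that the CLIP embedding of the generated image moves toward the embedding of a text prompt. The stand-in generator, prompt, and hyperparameters are illustrative assumptions rather than any cited work's implementation.

```python
# Sketch of CLIP-guided latent optimization against a frozen generator.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model.float()

with torch.no_grad():
    text_feat = F.normalize(
        clip_model.encode_text(clip.tokenize(["a smiling face"]).to(device)).float(), dim=-1
    )

# Stand-in generator for illustration only; a real pretrained GAN would be used here.
class DummyGenerator(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(512, 3 * 64 * 64)

    def forward(self, z):
        return torch.tanh(self.fc(z)).view(-1, 3, 64, 64)

generator = DummyGenerator().to(device).eval()
for p in generator.parameters():
    p.requires_grad_(False)  # only the latent code is optimized

latent = torch.randn(1, 512, device=device, requires_grad=True)
opt = torch.optim.Adam([latent], lr=0.01)

for _ in range(200):
    img = F.interpolate(generator(latent), size=224, mode="bilinear")  # CLIP input size
    img_feat = F.normalize(clip_model.encode_image(img).float(), dim=-1)
    loss = 1.0 - (img_feat * text_feat).sum()  # maximize cosine similarity to the prompt
    opt.zero_grad()
    loss.backward()
    opt.step()
```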
“…Text-conditioned generation. The field of text-to-image generation has made significant progress in recent years, mainly using CLIP as a representation extractor. Many works use CLIP to optimize a latent vector in the representation space of a pretrained GAN [10,17,30,37], others utilize CLIP to provide classifier guidance for a pretrained diffusion model [3], and [5] employ CLIP to optimize a Deep Image Prior model [52] that correctly edits an image. Recently, the field has shifted from employing CLIP as a loss network for optimization to using it as a backbone in huge generative models [41,45], yielding impressive photorealistic results.…”
Section: Related Work (mentioning)
confidence: 99%
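For the classifier-guidance use of CLIP mentioned in this excerpt, a common pattern is to add the gradient of a CLIP text-image similarity with respect to the partially denoised sample at each reverse-diffusion step. The `denoise_step` function and the guidance scale below are hypothetical placeholders, not the cited works' implementations.

```python
# Sketch of CLIP-based guidance during diffusion sampling.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model.float()

with torch.no_grad():
    text_feat = F.normalize(
        clip_model.encode_text(clip.tokenize(["an oil painting"]).to(device)).float(), dim=-1
    )

def clip_guidance_grad(x, scale=100.0):
    """Gradient of CLIP text-image similarity w.r.t. the current (noisy) sample x."""
    x = x.detach().requires_grad_(True)
    img = F.interpolate(x, size=224, mode="bilinear")
    img_feat = F.normalize(clip_model.encode_image(img).float(), dim=-1)
    sim = (img_feat * text_feat).sum()
    (grad,) = torch.autograd.grad(sim, x)
    return scale * grad

# Sampling-loop sketch; `denoise_step` stands in for one reverse step of a pretrained diffusion model.
# x = torch.randn(1, 3, 256, 256, device=device)
# for t in reversed(range(num_steps)):
#     x = denoise_step(x, t)            # ordinary reverse-diffusion update
#     x = x + clip_guidance_grad(x)     # nudge the sample toward the text prompt
```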
“…Since the advent of CLIP [39], training large vision-language models (VLMs) has become a prominent paradigm for representation learning in computer vision. By observing huge corpora of paired images and captions crawled from the Web, these models learn a powerful and rich joint image-text embedding space, which has been employed in numerous visual tasks, including classification [60,61], segmentation [28,57], motion generation [49], image captioning [32,50], text-to-image generation [10,30,34,42,46] and image or video editing [3,5,7,17,24,37,54]. Recently, VLMs have also been a key component in text-to-image generative models [4,40,42,45], which rely on their textual representations to encapsulate the rich semantic meaning of the input text prompt.…”
Section: Introduction (mentioning)
confidence: 99%
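As a small illustration of how such a joint image-text embedding space is used downstream, the snippet below performs zero-shot classification with CLIP by comparing one image embedding against a handful of label prompts; the labels and image path are hypothetical.

```python
# Zero-shot classification sketch using CLIP's joint image-text embedding space.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]  # illustrative labels
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical image file
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    similarity = (image_feat @ text_feat.T).squeeze(0)  # cosine similarity per label

print(labels[int(similarity.argmax())])
```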