2021
DOI: 10.48550/arxiv.2103.17249
Preprint

StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery

Abstract: [Teaser figure: example text-driven manipulations for the prompts "Stone", "Mohawk hairstyle", "Without makeup", "Cute cat", "Lion", "Gothic church".] * Equal contribution, ordered alphabetically. Code and video are available at https://github.com/orpatashnik/StyleCLIP

Cited by 35 publications (64 citation statements) | References 39 publications
“…Second, it simplifies guidance when conditioning on information that is difficult to predict with a classifier (such as text). Since CLIP provides a score of how close an image is to a caption, several works have used it to steer generative models like GANs towards a user-defined text caption (Galatolo et al., 2021; Patashnik et al., 2021; Murdock, 2021; Gal et al., 2021). To apply the same idea to diffusion models, we can replace the classifier with a CLIP model in classifier guidance.…”
Section: Guided Diffusion (citation type: mentioning; confidence: 99%)
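To make the substitution concrete, here is a minimal PyTorch-style sketch of one CLIP-guided sampling step, assuming a hypothetical clip_model with an encode_image method and a precomputed unit-norm text embedding: the classifier's gradient of log p(y | x_t) is simply replaced by the gradient of the CLIP image-text similarity.

```python
import torch

def clip_guided_mean(x_t, mean, variance, clip_model, text_embed, scale=100.0):
    """One CLIP-guided reverse-diffusion step (sketch, assumed interfaces).

    Classifier guidance shifts the posterior mean by
    scale * Sigma * grad_x log p(y | x_t); here that gradient is
    replaced by the gradient of the CLIP image-text similarity.
    """
    x = x_t.detach().requires_grad_(True)
    img_embed = clip_model.encode_image(x)                      # (B, D), assumed API
    img_embed = img_embed / img_embed.norm(dim=-1, keepdim=True)
    sim = (img_embed * text_embed).sum()                        # cosine similarity
    grad = torch.autograd.grad(sim, x)[0]                       # d(sim) / d(x_t)
    return mean + scale * variance * grad                       # guided posterior mean
```

In practice the CLIP model is typically also trained on noised images (as in GLIDE) so that its gradients stay informative early in the reverse process.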
“…Besides producing impressive image samples, generative adversarial networks (GANs) [9] have been shown to learn meaningful latent spaces [18], with extensive studies on multiple derived spaces [15,44] and various knobs and controls for conditional human face generation [12,28,42]. Encoding an image into the GAN's latent space requires an optimization-based inversion process [19,45] or an external image encoder [30], which has limited reconstruction fidelity (or produces latent codes in much higher dimensions, outside the learned manifold).…”
Section: Related Work (citation type: mentioning; confidence: 99%)
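The optimization-based inversion mentioned above amounts to gradient descent on a reconstruction loss over the latent code. A minimal sketch follows, assuming a hypothetical generator(w) callable that maps a latent to an image; a pixel-wise loss stands in for the perceptual (e.g., LPIPS) and regularization terms real pipelines add.

```python
import torch
import torch.nn.functional as F

def invert(generator, target, w_init, num_steps=500, lr=0.05):
    """Optimization-based GAN inversion (sketch, assumed interfaces).

    Optimizes a latent code w so that generator(w) reconstructs
    `target`. Real implementations add perceptual losses and latent
    regularizers to keep w on the learned manifold.
    """
    w = w_init.detach().clone().requires_grad_(True)  # e.g., start from the mean latent
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(num_steps):
        recon = generator(w)                          # assumed: latent -> image
        loss = F.mse_loss(recon, target)              # pixel loss only, for brevity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()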
“…GANs are also frequently used to modify images based on natural language input (Nam et al., 2018; Li et al., 2020a; Xia et al., 2020). Lastly, CLIP (Radford et al., 2021) can be used in combination with a StyleGAN generator to make semantic edits in images, as exemplified by Patashnik et al. (2021).…”
Section: Related Work (citation type: mentioning; confidence: 99%)
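The simplest StyleCLIP variant performs exactly this kind of edit by latent optimization. Below is a hedged sketch using the same hypothetical generator and clip_model interfaces as above; it minimizes CLIP dissimilarity to a text prompt plus an L2 term that keeps the code near its starting point (the paper's full objective additionally includes an identity-preservation loss).

```python
import torch

def text_edit(generator, clip_model, text_embed, w_start,
              num_steps=200, lr=0.1, l2_lambda=0.008):
    """Text-driven latent optimization in the spirit of StyleCLIP (sketch).

    Minimizes the CLIP dissimilarity between the generated image and
    a text prompt, plus an L2 penalty keeping w close to the source
    latent so the edit stays local.
    """
    w = w_start.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(num_steps):
        img = generator(w)                                   # assumed: latent -> image
        emb = clip_model.encode_image(img)                   # assumed API
        emb = emb / emb.norm(dim=-1, keepdim=True)
        clip_loss = 1.0 - (emb * text_embed).sum(dim=-1).mean()
        loss = clip_loss + l2_lambda * ((w - w_start) ** 2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```

The L2 weight trades off edit strength against fidelity to the original image; StyleCLIP also describes faster alternatives (latent mappers and global directions) that avoid per-image optimization.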