2022
DOI: 10.1145/3528223.3530164

StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators

Abstract: Can a generative model be trained to produce images from a specific domain, guided only by a text prompt, without seeing any image? In other words: can an image generator be trained "blindly"? Leveraging the semantic power of large-scale Contrastive-Language-Image-Pre-training (CLIP) models, we present a text-driven method that allows shifting a generative model to new domains, without having to collect even a single image. We show that through natural language prompts and a few minutes of training, our method…
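The core idea in the abstract, adapting a generator with CLIP supervision alone, can be illustrated with a minimal sketch of a directional CLIP objective: the shift between a frozen copy of the generator and a trainable copy is aligned, in CLIP embedding space, with the shift between a source and a target text prompt. The generator handles, prompts, and hyperparameters below are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch (not the paper's exact code) of a directional CLIP loss for
# shifting a pretrained generator toward a text-described domain.
import torch
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model.float()  # keep everything in fp32 for stable gradients

def encode_text(prompt):
    tokens = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        feat = clip_model.encode_text(tokens).float()
    return F.normalize(feat, dim=-1)

def encode_image(img):
    # img: (N, 3, 224, 224); CLIP's input normalization is assumed to be applied upstream.
    return F.normalize(clip_model.encode_image(img).float(), dim=-1)

def directional_clip_loss(img_src, img_tgt, text_src, text_tgt):
    """Align the image-space direction (frozen -> trainable generator) with the
    text-space direction (source prompt -> target prompt) in CLIP embedding space."""
    dt = F.normalize(encode_text(text_tgt) - encode_text(text_src), dim=-1)
    di = F.normalize(encode_image(img_tgt) - encode_image(img_src), dim=-1)
    return (1.0 - (di * dt).sum(dim=-1)).mean()

# Training-loop sketch; G_frozen / G_train are hypothetical copies of a pretrained generator.
# for z in latent_batches:
#     img_src = G_frozen(z)   # stays in the source domain
#     img_tgt = G_train(z)    # nudged toward the text-described target domain
#     loss = directional_clip_loss(img_src, img_tgt, "photo", "sketch")
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```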

Cited by 294 publications (123 citation statements)
References 26 publications
“…To provide users with more control over the synthesis process, several works employ a segmentation map or spatial conditioning [4,17,54]. In the context of image editing, while most methods are generally limited to global edits [9,14,19,26], several works introduce a user-provided mask to specify the region that should be altered [3,7,13,34].…”
Section: Related Work (mentioning)
confidence: 99%
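As a concrete illustration of the mask-restricted editing mentioned in this excerpt, a binary mask can keep an edit inside the selected region while preserving the original content elsewhere; the tensor names below are hypothetical placeholders, not code from any of the cited works.

```python
# Mask-restricted editing sketch: blend an edited image into a user-provided
# region while leaving the rest of the original image untouched.
import torch

def apply_local_edit(original: torch.Tensor, edited: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """original, edited: (N, 3, H, W) images; mask: (N, 1, H, W) with values in [0, 1],
    where 1 marks the region that should be altered."""
    return mask * edited + (1.0 - mask) * original
```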
“…A widely adopted approach is the use of auxiliary models such as CLIP [35] to guide the optimization of pretrained generators towards maximizing text-to-image similarity scores [6,7]. Additionally, other works have used CLIP in conjunction with generative models for various tasks such as image manipulation [20,33], domain adaptation [14], style transfer [23], and even object segmentation [27,50]. Recently, large-scale text-to-image models demonstrated impressive image generation performance [5,12,38,39,42,43,54].…”
Section: Related Work (mentioning)
confidence: 99%
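The CLIP-guided optimization described in this excerpt can be sketched as follows: a latent code of a frozen, pretrained generator is optimized so that the CLIP embedding of the generated image moves toward the embedding of a text prompt. The stand-in generator, prompt, and hyperparameters are illustrative assumptions rather than any cited work's implementation.

```python
# Sketch of CLIP-guided latent optimization against a frozen generator.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model.float()

with torch.no_grad():
    text_feat = F.normalize(
        clip_model.encode_text(clip.tokenize(["a smiling face"]).to(device)).float(), dim=-1
    )

# Stand-in generator for illustration only; a real pretrained GAN would be used here.
class DummyGenerator(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(512, 3 * 64 * 64)

    def forward(self, z):
        return torch.tanh(self.fc(z)).view(-1, 3, 64, 64)

generator = DummyGenerator().to(device).eval()
for p in generator.parameters():
    p.requires_grad_(False)  # only the latent code is optimized

latent = torch.randn(1, 512, device=device, requires_grad=True)
opt = torch.optim.Adam([latent], lr=0.01)

for _ in range(200):
    img = F.interpolate(generator(latent), size=224, mode="bilinear")  # CLIP input size
    img_feat = F.normalize(clip_model.encode_image(img).float(), dim=-1)
    loss = 1.0 - (img_feat * text_feat).sum()  # maximize cosine similarity to the prompt
    opt.zero_grad()
    loss.backward()
    opt.step()
```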
“…Text-conditioned generation. The field of text-to-image generation has made significant progress in recent years, mainly using CLIP as a representation extractor. Many works use CLIP to optimize a latent vector in the representation space of a pretrained GAN [10,17,30,37], others utilize CLIP to provide classifier guidance for a pretrained diffusion model [3], and [5] employ CLIP to optimize a Deep Image Prior model [52] that correctly edits an image. Recently, the field has shifted from employing CLIP as a loss network for optimization to using it as a backbone in huge generative models [41,45], yielding impressive photorealistic results.…”
Section: Related Work (mentioning)
confidence: 99%
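For the classifier-guidance use of CLIP mentioned in this excerpt, a common pattern is to add the gradient of a CLIP text-image similarity with respect to the partially denoised sample at each reverse-diffusion step. The `denoise_step` function and the guidance scale below are hypothetical placeholders, not the cited works' implementations.

```python
# Sketch of CLIP-based guidance during diffusion sampling.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model.float()

with torch.no_grad():
    text_feat = F.normalize(
        clip_model.encode_text(clip.tokenize(["an oil painting"]).to(device)).float(), dim=-1
    )

def clip_guidance_grad(x, scale=100.0):
    """Gradient of CLIP text-image similarity w.r.t. the current (noisy) sample x."""
    x = x.detach().requires_grad_(True)
    img = F.interpolate(x, size=224, mode="bilinear")
    img_feat = F.normalize(clip_model.encode_image(img).float(), dim=-1)
    sim = (img_feat * text_feat).sum()
    (grad,) = torch.autograd.grad(sim, x)
    return scale * grad

# Sampling-loop sketch; `denoise_step` stands in for one reverse step of a pretrained diffusion model.
# x = torch.randn(1, 3, 256, 256, device=device)
# for t in reversed(range(num_steps)):
#     x = denoise_step(x, t)            # ordinary reverse-diffusion update
#     x = x + clip_guidance_grad(x)     # nudge the sample toward the text prompt
```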
“…Since the advent of CLIP [39], training large vision-language models (VLMs) has become a prominent paradigm for representation learning in computer vision. By observing huge corpora of paired images and captions crawled from the Web, these models learn a powerful and rich joint image-text embedding space, which has been employed in numerous visual tasks, including classification [60,61], segmentation [28,57], motion generation [49], image captioning [32,50], text-to-image generation [10,30,34,42,46] and image or video editing [3,5,7,17,24,37,54]. Recently, VLMs have also been a key component in text-to-image generative models [4,40,42,45], which rely on their textual representations to encapsulate the rich semantic meaning of the input text prompt.…”
Section: Introduction (mentioning)
confidence: 99%
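As a small illustration of how such a joint image-text embedding space is used downstream, the snippet below performs zero-shot classification with CLIP by comparing one image embedding against a handful of label prompts; the labels and image path are hypothetical.

```python
# Zero-shot classification sketch using CLIP's joint image-text embedding space.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]  # illustrative labels
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical image file
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    similarity = (image_feat @ text_feat.T).squeeze(0)  # cosine similarity per label

print(labels[int(similarity.argmax())])
```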