2021
DOI: 10.48550/arxiv.2112.05744
Preprint

More Control for Free! Image Synthesis with Semantic Diffusion Guidance

Abstract: Controllable image synthesis models allow creation of diverse images based on text instructions or guidance from an example image. Recently, denoising diffusion probabilistic models have been shown to generate more realistic imagery than prior methods, and have been successfully demonstrated in unconditional and class-conditional settings. We explore fine-grained, continuous control of this model class, and introduce a novel unified framework for semantic diffusion guidance, which allows either language or ima…
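
The abstract describes steering a pretrained diffusion model with language or image guidance signals. As a rough illustration of that general idea (not the authors' released code), the sketch below shows one DDPM sampling step whose mean is shifted by the gradient of a generic matching score; `eps_model`, `guidance_score`, and the guidance scale are assumed placeholders.

```python
# Minimal sketch of gradient-based semantic guidance for one DDPM sampling step.
# `eps_model(x_t, t)` and `guidance_score(x, t)` are assumed interfaces, not the
# paper's released code; the scale factor and schedules are illustrative only.
import torch

def guided_ddpm_step(x_t, t, eps_model, guidance_score, alphas_cumprod, betas,
                     guidance_scale=100.0):
    """One reverse-diffusion step whose mean is nudged by the gradient of a
    semantic matching score (e.g. a CLIP-style image-text similarity)."""
    a_bar = alphas_cumprod[t]
    a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
    beta = betas[t]

    with torch.no_grad():
        eps = eps_model(x_t, t)                            # predicted noise
        x0_hat = (x_t - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()
        # Standard DDPM posterior mean and variance.
        mean = (a_prev.sqrt() * beta / (1 - a_bar)) * x0_hat + \
               ((1 - beta).sqrt() * (1 - a_prev) / (1 - a_bar)) * x_t
        var = beta * (1 - a_prev) / (1 - a_bar)

    # Gradient of the matching score w.r.t. the noisy image steers the sample.
    x_in = x_t.detach().requires_grad_(True)
    score = guidance_score(x_in, t)                        # scalar: higher = better match
    grad = torch.autograd.grad(score, x_in)[0]

    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + guidance_scale * var * grad + var.sqrt() * noise
```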

Cited by 17 publications (17 citation statements)
References 40 publications
“…It was also suggested to edit images by synthesizing data in user-provided masks, while keeping the rest of the image intact [6,33]. Liu et al [31] guide a diffusion process with a text and an image, synthesising images similar to the given one, and aligned with the given text. Hertz et al [16] alter a text-to-image diffusion process by manipulating cross-attention layers, providing more fine-grained control over generated images, and can edit real images in cases where DDIM inversion provides meaningful attention maps.…”
Section: Related Work
confidence: 99%
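
The statement above notes that real images can be edited when DDIM inversion yields meaningful attention maps. For context, here is a minimal sketch of deterministic DDIM inversion under an assumed `eps_model` noise-prediction interface; it is illustrative and not taken from any of the cited works.

```python
# Minimal sketch of deterministic DDIM inversion. `eps_model` is an assumed
# noise-prediction interface, not any specific released model.
import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alphas_cumprod, num_steps):
    """Run the deterministic DDIM update forward in time, mapping a real image
    x0 to a latent x_T that reconstructs it when sampled back with eta = 0."""
    x = x0
    for t in range(num_steps - 1):
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t + 1]
        eps = eps_model(x, t)                                   # predicted noise at step t
        x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # current clean estimate
        x = a_next.sqrt() * x0_hat + (1 - a_next).sqrt() * eps  # move one step noisier
    return x
```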
“…Another line of works is based on CLIP, a vision-language model that learns a rich shared embedding space for images and text, by contrastive training on a dataset of 400 million (image, text) pairs collected from the internet. Some of them [Crowson et al 2022; Crowson 2021; Liu et al 2021; Kim and Ye 2021; Murdock 2021; Paiss et al 2022] combine a pretrained generative model [Brock et al 2018; Esser et al 2021a] with a CLIP model to steer the generative model to perform text-to-image synthesis. Utilizing CLIP along with a generative model was also used for text-based domain adaptation [Gal et al 2021] and text-to-image without training on text data [Zhou et al 2021; Wang et al 2022].…”
Section: Related Work
confidence: 99%
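
The works cited above steer a frozen pretrained generator with a CLIP model. A minimal sketch of that recipe, assuming generic `generator` and `encode_image` interfaces rather than any specific cited system, is given below.

```python
# Minimal sketch of CLIP-steered generation: optimize the latent of a frozen
# pretrained generator so its output matches a text prompt under a CLIP-like
# joint embedding. All callables are assumed placeholders.
import torch
import torch.nn.functional as F

def clip_steer(generator, encode_image, text_embedding, latent, steps=200, lr=0.05):
    """Gradient ascent on the cosine similarity between the generated image's
    embedding and a fixed text embedding; only the latent is updated."""
    latent = latent.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        image = generator(latent)                       # frozen generator, e.g. a GAN/VQGAN decoder
        image_emb = encode_image(image)                 # CLIP-style image embedding
        loss = -F.cosine_similarity(image_emb, text_embedding, dim=-1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return generator(latent).detach()
```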
“…Image-text matching models, typically CLIPs, have been extensively utilized to steer image generation models towards text conditions [Liu et al, 2021]. Based on CLIP, Ramesh et al [2022] propose a two-stage diffusion model DALL-E 2, where a prior generates a CLIP image embedding given a text caption, and a decoder generates an image conditioned on the image embedding.…”
Section: Related Work
confidence: 99%
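
To make the two-stage structure described in this statement concrete, here is a schematic sketch with placeholder `encode_text`, `prior`, and `decoder` callables; DALL-E 2's actual components are diffusion models whose sampling loops are abstracted away here.

```python
# Structural sketch of the prior-then-decoder pipeline described in the quote.
# All three callables are assumed placeholders, not DALL-E 2's real interfaces.
import torch

@torch.no_grad()
def two_stage_generate(caption, encode_text, prior, decoder):
    """Stage 1: predict a CLIP image embedding from the caption.
       Stage 2: decode an image conditioned on that image embedding."""
    text_emb = encode_text(caption)          # CLIP text embedding of the caption
    image_emb = prior(text_emb)              # prior: text embedding -> CLIP image embedding
    image = decoder(image_emb)               # decoder: image embedding -> pixels
    return image
```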