2021
DOI: 10.48550/arxiv.2112.05744
Preprint

More Control for Free! Image Synthesis with Semantic Diffusion Guidance

Abstract: Controllable image synthesis models allow creation of diverse images based on text instructions or guidance from an example image. Recently, denoising diffusion probabilistic models have been shown to generate more realistic imagery than prior methods, and have been successfully demonstrated in unconditional and class-conditional settings. We explore fine-grained, continuous control of this model class, and introduce a novel unified framework for semantic diffusion guidance, which allows either language or ima…
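
The abstract describes steering a pretrained diffusion model with language or image guidance signals. As a rough illustration of that general idea (not the authors' released code), the sketch below shows one DDPM sampling step whose mean is shifted by the gradient of a generic matching score; `eps_model`, `guidance_score`, and the guidance scale are assumed placeholders.

```python
# Minimal sketch of gradient-based semantic guidance for one DDPM sampling step.
# `eps_model(x_t, t)` and `guidance_score(x, t)` are assumed interfaces, not the
# paper's released code; the scale factor and schedules are illustrative only.
import torch

def guided_ddpm_step(x_t, t, eps_model, guidance_score, alphas_cumprod, betas,
                     guidance_scale=100.0):
    """One reverse-diffusion step whose mean is nudged by the gradient of a
    semantic matching score (e.g. a CLIP-style image-text similarity)."""
    a_bar = alphas_cumprod[t]
    a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
    beta = betas[t]

    with torch.no_grad():
        eps = eps_model(x_t, t)                            # predicted noise
        x0_hat = (x_t - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()
        # Standard DDPM posterior mean and variance.
        mean = (a_prev.sqrt() * beta / (1 - a_bar)) * x0_hat + \
               ((1 - beta).sqrt() * (1 - a_prev) / (1 - a_bar)) * x_t
        var = beta * (1 - a_prev) / (1 - a_bar)

    # Gradient of the matching score w.r.t. the noisy image steers the sample.
    x_in = x_t.detach().requires_grad_(True)
    score = guidance_score(x_in, t)                        # scalar: higher = better match
    grad = torch.autograd.grad(score, x_in)[0]

    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + guidance_scale * var * grad + var.sqrt() * noise
```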

Cited by 17 publications (17 citation statements)
References 40 publications
“…It was also suggested to edit images by synthesizing data in user-provided masks, while keeping the rest of the image intact [6,33]. Liu et al [31] guide a diffusion process with a text and an image, synthesising images similar to the given one, and aligned with the given text. Hertz et al [16] alter a text-to-image diffusion process by manipulating cross-attention layers, providing more fine-grained control over generated images, and can edit real images in cases where DDIM inversion provides meaningful attention maps.…”
Section: Related Work
confidence: 99%
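
The statement above notes that real images can be edited when DDIM inversion yields meaningful attention maps. For context, here is a minimal sketch of deterministic DDIM inversion under an assumed `eps_model` noise-prediction interface; it is illustrative and not taken from any of the cited works.

```python
# Minimal sketch of deterministic DDIM inversion. `eps_model` is an assumed
# noise-prediction interface, not any specific released model.
import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alphas_cumprod, num_steps):
    """Run the deterministic DDIM update forward in time, mapping a real image
    x0 to a latent x_T that reconstructs it when sampled back with eta = 0."""
    x = x0
    for t in range(num_steps - 1):
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t + 1]
        eps = eps_model(x, t)                                   # predicted noise at step t
        x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # current clean estimate
        x = a_next.sqrt() * x0_hat + (1 - a_next).sqrt() * eps  # move one step noisier
    return x
```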
“…Another line of works is based on CLIP, a vision-language model that learns a rich shared embedding space for images and text, by contrastive training on a dataset of 400 million (image, text) pairs collected from the internet. Some of them [Crowson et al 2022; Crowson 2021; Liu et al 2021; Kim and Ye 2021; Murdock 2021; Paiss et al 2022] combine a pretrained generative model [Brock et al 2018; Esser et al 2021a] with a CLIP model to steer the generative model to perform text-to-image synthesis. Utilizing CLIP along with a generative model was also used for text-based domain adaptation [Gal et al 2021] and text-to-image without training on text data [Zhou et al 2021; Wang et al 2022].…”
Section: Related Work
confidence: 99%
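
The works cited above steer a frozen pretrained generator with a CLIP model. A minimal sketch of that recipe, assuming generic `generator` and `encode_image` interfaces rather than any specific cited system, is given below.

```python
# Minimal sketch of CLIP-steered generation: optimize the latent of a frozen
# pretrained generator so its output matches a text prompt under a CLIP-like
# joint embedding. All callables are assumed placeholders.
import torch
import torch.nn.functional as F

def clip_steer(generator, encode_image, text_embedding, latent, steps=200, lr=0.05):
    """Gradient ascent on the cosine similarity between the generated image's
    embedding and a fixed text embedding; only the latent is updated."""
    latent = latent.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        image = generator(latent)                       # frozen generator, e.g. a GAN/VQGAN decoder
        image_emb = encode_image(image)                 # CLIP-style image embedding
        loss = -F.cosine_similarity(image_emb, text_embedding, dim=-1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return generator(latent).detach()
```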
“…Image-text matching models, typically CLIPs, have been extensively utilized to steer image generation models towards text conditions [Liu et al, 2021]. Based on CLIP, Ramesh et al [2022] propose a two-stage diffusion model DALL-E 2, where a prior generates a CLIP image embedding given a text caption, and a decoder generates an image conditioned on the image embedding.…”
Section: Related Work
confidence: 99%
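
To make the two-stage structure described in this statement concrete, here is a schematic sketch with placeholder `encode_text`, `prior`, and `decoder` callables; DALL-E 2's actual components are diffusion models whose sampling loops are abstracted away here.

```python
# Structural sketch of the prior-then-decoder pipeline described in the quote.
# All three callables are assumed placeholders, not DALL-E 2's real interfaces.
import torch

@torch.no_grad()
def two_stage_generate(caption, encode_text, prior, decoder):
    """Stage 1: predict a CLIP image embedding from the caption.
       Stage 2: decode an image conditioned on that image embedding."""
    text_emb = encode_text(caption)          # CLIP text embedding of the caption
    image_emb = prior(text_emb)              # prior: text embedding -> CLIP image embedding
    image = decoder(image_emb)               # decoder: image embedding -> pixels
    return image
```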