2022
DOI: 10.48550/arxiv.2208.01626
Preprint

Prompt-to-Prompt Image Editing with Cross Attention Control

Abstract: Recent large-scale text-driven synthesis models have attracted much attention thanks to their remarkable capabilities of generating highly diverse images that follow given text prompts. Such text-based synthesis methods are particularly appealing to humans who are used to verbally describe their intent. Therefore, it is only natural to extend the text-driven image synthesis to text-driven image editing. Editing is challenging for these generative models, since an innate property of an editing technique is to p…

Cited by 72 publications (154 citation statements) | References 31 publications

“…Another related line of work aims to introduce specific concepts to a pre-trained text-to-image model by learning to map a set of images to a "word" in the embedding space of the model [18,25,41]. Several works have also explored providing users with more control over the synthesis process solely through the use of the input text prompt [8,20,24,46].…”
Section: Related Work
confidence: 99%
“…We operate over the 16 × 16 attention units since they have been shown to contain the most semantic information [20].…”
Section: SD Generated Image
confidence: 99%
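The 16 × 16 cross-attention units mentioned in the statement above associate each spatial position of the diffusion U-Net's feature map with each text token. A minimal NumPy sketch of turning such a layer's output into per-token spatial maps — all names and shapes here are illustrative assumptions, not an official API:

```python
import numpy as np

def token_attention_maps(cross_attn, h=16, w=16):
    """Average multi-head cross-attention over heads and reshape each
    text token's attention into a spatial (h, w) map.

    cross_attn: array of shape (heads, h*w, n_tokens), as a
    cross-attention layer operating at 16x16 resolution might produce.
    """
    avg = cross_attn.mean(axis=0)     # (h*w, n_tokens): average over heads
    maps = avg.T.reshape(-1, h, w)    # (n_tokens, h, w): one map per token
    return maps

# Toy example: 8 heads, 16x16 spatial positions, 5 prompt tokens.
attn = np.random.rand(8, 16 * 16, 5)
maps = token_attention_maps(attn)
print(maps.shape)  # (5, 16, 16)
```

Each resulting 16 × 16 map can be inspected (or upsampled) to see which image regions a given prompt token attends to, which is why these units are said to carry the most semantic information.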
“…Recently, a line of works [33, 28] focuses on improving the sampling speed of diffusion models, by either altering the Markovian noising process or embedding the diffusion steps into a learned latent space. Another group [15, 3, 10] studies applications of diffusion models such as text-guided image manipulation.…”
Section: Related Work
confidence: 99%
“…Blended-diffusion [3] uses a user-provided mask and a textual prompt during the diffusion process to blend the target and the existing background iteratively. A concurrent work of ours, prompt-to-prompt [10], captures the text cross-attention structure to enable purely prompt-based scene editing without any explicit masks. In our work, we take blended-diffusion as the starting point, and incorporate a domain-specific classifier and its attention structure for mask-free multi-attribute fashion image manipulation.…”
Section: Related Work
confidence: 99%
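The mask-free editing described in the statement above works by injecting the source prompt's cross-attention maps into the edited generation. A simplified sketch of that injection schedule — the function name, shapes, and the `tau` cutoff are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def inject_cross_attention(attn_src, attn_tgt, step, n_steps, tau=0.8):
    """Prompt-to-prompt-style cross-attention injection (sketch).

    For the first `tau` fraction of diffusion steps, the edited branch
    reuses the source prompt's cross-attention maps, which preserves
    the spatial layout of the scene; in the remaining steps the target
    prompt's own attention takes over so the edited content can form.
    """
    if step < tau * n_steps:
        return attn_src
    return attn_tgt

src = np.zeros((16 * 16, 5))   # source-prompt attention (positions x tokens)
tgt = np.ones((16 * 16, 5))    # target-prompt attention
early = inject_cross_attention(src, tgt, step=10, n_steps=50)
late = inject_cross_attention(src, tgt, step=45, n_steps=50)
print(early is src, late is tgt)  # True True
```

Because layout is fixed in the early denoising steps, swapping a word in the prompt changes only the corresponding object while the rest of the scene stays put, without any user-drawn mask.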