2023
DOI: 10.48550/arxiv.2301.13826
Preprint

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models

Abstract: Figure 1. Given a pre-trained text-to-image diffusion model (e.g., Stable Diffusion [39]), our method, Attend-and-Excite, guides the generative model to modify the cross-attention values during the image synthesis process to generate images that more faithfully depict the input text prompt. Stable Diffusion alone (top row) struggles to generate multiple objects (e.g., a horse and a dog). However, by incorporating Attend-and-Excite (bottom row) to strengthen the subject tokens (marked in blue), we achieve images t…
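The attention-strengthening idea described in the caption can be sketched roughly as follows. This is a minimal, hypothetical NumPy illustration of an Attend-and-Excite-style loss, not the paper's implementation: the real method operates on Stable Diffusion's cross-attention maps and updates the noisy latent by gradient descent at each denoising step, and all function and variable names here are our own.

```python
import numpy as np

def attend_and_excite_loss(attn, subject_token_ids):
    """Sketch of an Attend-and-Excite-style loss: for each subject token,
    take its maximum spatial attention; the loss is determined by the most
    neglected subject token, i.e. the one with the smallest peak activation."""
    # attn: (num_pixels, num_tokens) cross-attention map, values in [0, 1]
    per_token_loss = [1.0 - attn[:, t].max() for t in subject_token_ids]
    return max(per_token_loss)

# Toy example: 4 spatial positions, 3 tokens; token 2 is barely attended.
attn = np.array([
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.6, 0.1],
    [0.2, 0.7, 0.1],
])
print(attend_and_excite_loss(attn, [1]))     # well-attended token -> small loss
print(attend_and_excite_loss(attn, [1, 2]))  # neglected token 2 dominates the loss
```

In the method the caption describes, a loss of this shape would be backpropagated to the latent during synthesis, "exciting" the attention of whichever subject token is currently most neglected so that every subject in the prompt ends up depicted.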

Cited by 6 publications (12 citation statements)
References 34 publications (56 reference statements)
“…As shown in Fig. A14, when Cones incorporates the Attend-and-Excite (Feng et al., 2022; Chefer et al., 2023) method to address this issue, it generates better results.…”
Section: C5 More Results On Multi Subjects
mentioning confidence: 99%
“…The remarkable advances in this area are driven by the application of state-of-the-art image-generative models, such as auto-regressive (Ramesh et al. 2021; Wang et al. 2022) and diffusion models (Ramesh et al. 2022; Saharia et al. 2022; Rombach et al. 2022), as well as the availability of large-scale language-image datasets (Sharma et al. 2018; Schuhmann et al. 2022). However, existing methods face challenges in synthesizing or editing multiple subjects with specific relational and attributive constraints from textual prompts (Chefer et al. 2023). The typical defects that occur in the synthesis results are missing entities and inaccurate inter-object relations, as shown in ??.…”
Section: Introduction
mentioning confidence: 99%
“…The typical defects that occur in the synthesis results are missing entities and inaccurate inter-object relations, as shown in ??. Existing work improves the compositional skills of text-to-image synthesis models by incorporating linguistic structures (Feng et al. 2022) and attention controls (Hertz et al. 2022; Chefer et al. 2023) within the diffusion guidance process. Notably, Structured Diffusion (Feng et al. 2022) parses a text prompt to extract its noun phrases, while Attend-and-Excite (Chefer et al. 2023) strengthens the attention activations associated with the most marginalized subject token.…”
Section: Introduction
mentioning confidence: 99%
“…text-guided solutions have emerged in the field of image editing and produced impressive results [21,25,40,7,17]. The powerful generative capabilities of diffusion models enable the generation of numerous high-quality images.…”
Section: Layered Controlled Optimization Fine-tuning
mentioning confidence: 99%
“…These models can generate high-quality synthetic images based on text prompts, enabling text-guided image editing and producing impressive results. As a result, numerous text-based image editing methods [36,13,10,7,28,8,35] have emerged and evolved. However, such models cannot mimic specific subject characteristics.…”
Section: Introduction
mentioning confidence: 99%