2023
DOI: 10.48550/arxiv.2303.05125
Preprint
Cones: Concept Neurons in Diffusion Models for Customized Generation

Abstract (excerpt): …parameters, which reduces storage consumption by 90% compared with previous subject-driven generation methods. Extensive qualitative and quantitative studies on diverse scenarios show the superiority of our method in interpreting and manipulating diffusion models. Our code is available at https://github.com/Johanan528/Cones.


Cited by 6 publications (6 citation statements)
References 22 publications
“…Later, non-fine-tuning methods for customized generation emerged. Cones (Liu et al. 2023) focuses on identifying the effective concept neurons related to the target concept, while ViCo (Hao et al. 2023) proposes a plug-in image attention module to adjust the diffusion process. Other works (Wei et al. 2023; Shi et al. 2023; Li, Hou, and Loy 2023) explore achieving customized generation without fine-tuning.…”
Section: Related Work (mentioning)
confidence: 99%
“…Text-to-image diffusion models. Diffusion models [10, 19, 21, 41, 58–62] have proven to be highly effective in learning data distributions and have shown impressive results in image synthesis, leading to various applications [8, 26, 27, 29, 31, 32, 36, 46, 56, 74]. Recent advancements have also explored transformer-based architectures [6, 45, 67].…”
Section: Related Work (mentioning)
confidence: 99%
“…Based on diffusion models, [7, 17] have led to techniques like using placeholder words for object representation, enabling high-fidelity customizations. Subsequent works [19, 26, 34, 35] extend this by fine-tuning pretrained text-to-image models for new concept learning. These advancements have facilitated diverse applications, such as subject swapping [8], open-world generation [22], and non-rigid image editing [2].…”
Section: Subject-driven Image Generation (mentioning)
confidence: 99%