2022
DOI: 10.1007/978-3-031-19778-9_40

Image-Based CLIP-Guided Essence Transfer

Cited by 25 publications (9 citation statements) | References 37 publications

“…Leveraging these powerful generative models, many have attempted to utilize such models for downstream editing tasks [9,18,21,25,29,47]. Most text-guided generation techniques condition the diffusion model directly on embeddings extracted from a pretrained text encoder [3,5,6,18,31]. In this work, we utilize a Latent Diffusion Model [35] paired with a Diffusion Prior model [33,39] and show its benefits in the context of creative generation.…”
Section: Related Work
confidence: 99%
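The statement above describes the common recipe of conditioning a diffusion denoiser directly on a pretrained text encoder's embedding, optionally routed through a diffusion prior that maps the text embedding into an image-like embedding before conditioning. The sketch below is a minimal illustration of that loop under assumed interfaces; `denoiser`, `scheduler`, `text_encoder`, and `prior` are hypothetical caller-supplied wrappers, not a specific library's API.

```python
# Minimal sketch (hypothetical interfaces): a text-conditioned reverse-diffusion
# loop, optionally passing the text embedding through a diffusion prior first.
import torch

def text_guided_sample(denoiser, scheduler, text_encoder, prompt,
                       prior=None, steps=50, latent_shape=(1, 4, 64, 64)):
    """Toy sampling loop conditioned on a frozen text encoder's embedding."""
    cond = text_encoder(prompt)           # text embedding from a pretrained encoder
    if prior is not None:
        cond = prior(cond)                # diffusion prior: text -> image embedding
    x = torch.randn(latent_shape)         # start from Gaussian noise (latent space in an LDM)
    for t in scheduler.timesteps(steps):  # iterate from high to low noise levels
        eps = denoiser(x, t, cond)        # predict noise, conditioned on `cond`
        x = scheduler.step(eps, t, x)     # one denoising update
    return x                              # an LDM would decode this with its VAE
```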
“…To provide users with more control over the synthesis process, several works employ a segmentation map or spatial conditioning [4,17,54]. In the context of image editing, while most methods are generally limited to global edits [9,14,19,26], several works introduce a user-provided mask to specify the region that should be altered [3,7,13,34].…”
Section: Related Work
confidence: 99%
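The mask-based editing mentioned above amounts to restricting an otherwise global edit to a user-specified region. A minimal sketch of that blending step, assuming plain image tensors rather than any particular method:

```python
# Minimal sketch (assumed tensors, not a specific method): restrict an edit to a
# user-provided region by blending the edited image with the original.
import torch

def apply_masked_edit(original, edited, mask):
    """Keep `edited` pixels inside the mask and `original` pixels outside it.

    original, edited: (C, H, W) tensors in [0, 1]
    mask:             (1, H, W) tensor in [0, 1], where 1 marks the region to alter
    """
    return mask * edited + (1.0 - mask) * original
```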
“…Several previous works have attempted to combine GAN and CLIP to achieve text-to-image generation [4,46,65]. Specifically, StyleGAN [22,23,21,51] focuses on the latent space to enable better control over generated images.…”
Section: Text-to-image Manipulation/Generation
confidence: 99%
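The GAN-plus-CLIP combination referenced in this statement typically means optimizing a StyleGAN-style latent code under a CLIP similarity loss so that the generated image moves toward a target embedding. The sketch below illustrates that generic recipe, not the cited paper's exact method; `generator` and `clip_model` are assumed wrappers, and the regularization weight is arbitrary.

```python
# Minimal sketch (hypothetical `generator` and `clip_model` wrappers): optimize a
# latent offset so the generated image's CLIP embedding matches a target embedding.
import torch
import torch.nn.functional as F

def clip_guided_edit(generator, clip_model, w_init, target_embed,
                     steps=200, lr=0.05):
    """Optimize an offset in the generator's latent space under a CLIP loss."""
    delta = torch.zeros_like(w_init, requires_grad=True)   # latent edit direction
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        image = generator(w_init + delta)                  # synthesize from the edited latent
        img_embed = clip_model.encode_image(image)         # CLIP image embedding
        loss = 1.0 - F.cosine_similarity(img_embed, target_embed, dim=-1).mean()
        loss = loss + 0.01 * delta.pow(2).mean()           # keep the edit small
        opt.zero_grad(); loss.backward(); opt.step()
    return w_init + delta
```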