2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022
DOI: 10.1109/cvpr52688.2022.01048
Splicing ViT Features for Semantic Appearance Transfer

Cited by 87 publications (111 citation statements)
References 20 publications
“…We design our encoder as a set of feature-refinement blocks built on top of a pre-trained OpenCLIP [Ilharco et al 2021] ViT-H/14 [Dosovitskiy et al 2021] feature-extraction backbone. Specifically, we extract the features of the [CLS] token of every 2nd CLIP layer as a hierarchical feature representation [Tumanyan et al 2022; Vinker et al 2022]. Each such feature vector is fed through a linear layer, followed by average-pooling over the hierarchy and a LeakyReLU activation.…”
Section: Inversion and Encoder Design
confidence: 99%
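The excerpt above describes a concrete head architecture: a per-layer linear projection of each [CLS] feature, average-pooled across the hierarchy and passed through a LeakyReLU. A minimal PyTorch sketch of that refinement block, assuming the per-layer [CLS] features have already been extracted from the backbone (module and parameter names here are illustrative, not taken from the cited paper):

```python
import torch
import torch.nn as nn

class HierarchicalFeatureHead(nn.Module):
    """Sketch of the described block: one linear layer per hierarchy
    level, average-pooling over the hierarchy, then LeakyReLU."""

    def __init__(self, num_levels: int, in_dim: int, out_dim: int):
        super().__init__()
        # One linear projection per [CLS] feature in the hierarchy.
        self.projections = nn.ModuleList(
            [nn.Linear(in_dim, out_dim) for _ in range(num_levels)]
        )
        self.act = nn.LeakyReLU()

    def forward(self, cls_features: list[torch.Tensor]) -> torch.Tensor:
        # cls_features: one (batch, in_dim) [CLS] tensor per selected layer.
        projected = torch.stack(
            [proj(f) for proj, f in zip(self.projections, cls_features)],
            dim=0,
        )
        # Average-pool over the hierarchy dimension, then activate.
        return self.act(projected.mean(dim=0))

# ViT-H/14 has 32 transformer layers of width 1280, so taking the [CLS]
# token of every 2nd layer yields 16 feature vectors of size 1280.
features = [torch.randn(4, 1280) for _ in range(16)]
head = HierarchicalFeatureHead(num_levels=16, in_dim=1280, out_dim=768)
out = head(features)  # shape: (4, 768)
```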
“…However, these works require costly extensive training on curated datasets. (ii) On the other side of the spectrum, numerous methods have proposed to implicitly control the generated content by manipulating the generation process of a pre-trained model (Meng et al., 2021; Tumanyan et al., 2022; Hertz et al., 2022). [Figure 2 caption: MultiDiffusion: a new generation process, Ψ, is defined over a pre-trained reference model Φ.]…”
Section: Related Work
confidence: 99%
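The MultiDiffusion idea referenced in the recovered caption is that Ψ applies the frozen reference denoiser Φ to several overlapping crops of a larger latent and reconciles their predictions, which reduces to per-pixel averaging of the overlapping outputs. A minimal sketch of one such fused step, with a stand-in `phi` function (the crop layout and all names are illustrative assumptions, not the paper's code):

```python
import torch

def multidiffusion_step(latent, phi, crop_size=64, stride=32):
    """One fused denoising step: apply the reference model `phi` to
    overlapping crops and average overlapping predictions per pixel."""
    _, _, h, w = latent.shape
    fused = torch.zeros_like(latent)
    counts = torch.zeros_like(latent)
    for top in range(0, h - crop_size + 1, stride):
        for left in range(0, w - crop_size + 1, stride):
            crop = latent[:, :, top:top + crop_size, left:left + crop_size]
            pred = phi(crop)  # the frozen reference model's prediction
            fused[:, :, top:top + crop_size, left:left + crop_size] += pred
            counts[:, :, top:top + crop_size, left:left + crop_size] += 1
    # Per-pixel average over all crops that covered each location.
    return fused / counts.clamp(min=1)

# Toy usage with an identity "denoiser" standing in for the frozen model.
latent = torch.randn(1, 4, 128, 128)
out = multidiffusion_step(latent, phi=lambda x: x)
```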
“…Avrahami et al. designed image inpainting methods (Avrahami et al., 2022a) that do not require finetuning. Recent works (Tumanyan et al., 2022; Hertz et al., 2022) rely on architectural properties and insights about the internal features of the pretrained model, and tailor image editing techniques accordingly. Our work also manipulates the generation process of a pretrained diffusion model, and does not require any training or finetuning.…”
Section: Related Work
confidence: 99%
“…A few methods [22,46,47,48] have exploited self-similarity-based feature descriptors to obtain structure representations. Unlike these methods, to reveal clear background structures we use deep spatial features obtained from DINO-ViT [49], which has been shown to learn meaningful visual representations [50]. Moreover, these powerful representations are shared across different object classes.…”
Section: Structure Representation Network
confidence: 99%
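Extracting such deep spatial features from DINO-ViT is straightforward with the public checkpoints. A small sketch using the official torch.hub DINO release (the choice of dino_vits16 and of the last block is an assumption for illustration; the cited method may use a different variant or layer):

```python
import torch

# Load a self-supervised DINO ViT-S/16 from the official torch.hub release.
model = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
model.eval()

image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image

with torch.no_grad():
    # get_intermediate_layers returns token embeddings from the last n
    # blocks: shape (batch, 1 + num_patches, dim), [CLS] token first.
    tokens = model.get_intermediate_layers(image, n=1)[0]

patch_tokens = tokens[:, 1:, :]      # drop [CLS], keep the spatial tokens
h = w = 224 // 16                    # 14x14 patch grid for ViT-S/16
spatial_features = patch_tokens.transpose(1, 2).reshape(1, -1, h, w)
print(spatial_features.shape)        # torch.Size([1, 384, 14, 14])
```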