2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022
DOI: 10.1109/cvpr52688.2022.01048
Splicing ViT Features for Semantic Appearance Transfer

Cited by 87 publications (111 citation statements)
References 20 publications
“…We design our encoder as a set of feature-refinement blocks built on top of a pre-trained OpenCLIP [Ilharco et al 2021] ViT-H/14 [Dosovitskiy et al 2021] feature-extraction backbone. Specifically, we extract the features of the [CLS] token of every 2nd CLIP layer as a hierarchical feature representation [Tumanyan et al 2022; Vinker et al 2022]. Each such feature vector is fed through a linear layer, followed by average-pooling over the hierarchy and a LeakyReLU activation.…”
Section: Inversion and Encoder Design
confidence: 99%
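The excerpt above describes a concrete head architecture: a per-layer linear projection of each [CLS] feature, average-pooled across the hierarchy and passed through a LeakyReLU. A minimal PyTorch sketch of that refinement block, assuming the per-layer [CLS] features have already been extracted from the backbone (module and parameter names here are illustrative, not taken from the cited paper):

```python
import torch
import torch.nn as nn

class HierarchicalFeatureHead(nn.Module):
    """Sketch of the described block: one linear layer per hierarchy
    level, average-pooling over the hierarchy, then LeakyReLU."""

    def __init__(self, num_levels: int, in_dim: int, out_dim: int):
        super().__init__()
        # One linear projection per [CLS] feature in the hierarchy.
        self.projections = nn.ModuleList(
            [nn.Linear(in_dim, out_dim) for _ in range(num_levels)]
        )
        self.act = nn.LeakyReLU()

    def forward(self, cls_features: list[torch.Tensor]) -> torch.Tensor:
        # cls_features: one (batch, in_dim) [CLS] tensor per selected layer.
        projected = torch.stack(
            [proj(f) for proj, f in zip(self.projections, cls_features)],
            dim=0,
        )
        # Average-pool over the hierarchy dimension, then activate.
        return self.act(projected.mean(dim=0))

# ViT-H/14 has 32 transformer layers of width 1280, so taking the [CLS]
# token of every 2nd layer yields 16 feature vectors of size 1280.
features = [torch.randn(4, 1280) for _ in range(16)]
head = HierarchicalFeatureHead(num_levels=16, in_dim=1280, out_dim=768)
out = head(features)  # shape: (4, 768)
```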
“…However, these works require costly extensive training on curated datasets. (ii) On the other side of the spectrum, numerous methods have proposed to implicitly control the generated content by manipulating the generation process of a pre-trained model (Meng et al., 2021; Tumanyan et al., 2022; Hertz et al., 2022). [Figure 2 caption: MultiDiffusion: a new generation process, Ψ, is defined over a pre-trained reference model Φ.]…”
Section: Related Work
confidence: 99%
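The MultiDiffusion idea referenced in the recovered caption is that Ψ applies the frozen reference denoiser Φ to several overlapping crops of a larger latent and reconciles their predictions, which reduces to per-pixel averaging of the overlapping outputs. A minimal sketch of one such fused step, with a stand-in `phi` function (the crop layout and all names are illustrative assumptions, not the paper's code):

```python
import torch

def multidiffusion_step(latent, phi, crop_size=64, stride=32):
    """One fused denoising step: apply the reference model `phi` to
    overlapping crops and average overlapping predictions per pixel."""
    _, _, h, w = latent.shape
    fused = torch.zeros_like(latent)
    counts = torch.zeros_like(latent)
    for top in range(0, h - crop_size + 1, stride):
        for left in range(0, w - crop_size + 1, stride):
            crop = latent[:, :, top:top + crop_size, left:left + crop_size]
            pred = phi(crop)  # the frozen reference model's prediction
            fused[:, :, top:top + crop_size, left:left + crop_size] += pred
            counts[:, :, top:top + crop_size, left:left + crop_size] += 1
    # Per-pixel average over all crops that covered each location.
    return fused / counts.clamp(min=1)

# Toy usage with an identity "denoiser" standing in for the frozen model.
latent = torch.randn(1, 4, 128, 128)
out = multidiffusion_step(latent, phi=lambda x: x)
```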
“…Avrahami et al. designed image inpainting methods (Avrahami et al., 2022a) that do not require finetuning. Recent works (Tumanyan et al., 2022; Hertz et al., 2022) rely on architectural properties and insights about the internal features of the pretrained model, and tailor image editing techniques accordingly. Our work also manipulates the generation process of a pretrained diffusion model, and does not require any training or finetuning.…”
Section: Related Work
confidence: 99%
“…A few methods [22,46,47,48] have exploited self-similarity-based feature descriptors to obtain structure representations. Unlike these methods, to reveal clear background structures we use deep spatial features obtained from DINO-ViT [49], which has been shown to learn meaningful visual representations [50]. Moreover, these powerful representations are shared across different object classes.…”
Section: Structure Representation Network
confidence: 99%
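Extracting such deep spatial features from DINO-ViT is straightforward with the public checkpoints. A small sketch using the official torch.hub DINO release (the choice of dino_vits16 and of the last block is an assumption for illustration; the cited method may use a different variant or layer):

```python
import torch

# Load a self-supervised DINO ViT-S/16 from the official torch.hub release.
model = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
model.eval()

image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image

with torch.no_grad():
    # get_intermediate_layers returns token embeddings from the last n
    # blocks: shape (batch, 1 + num_patches, dim), [CLS] token first.
    tokens = model.get_intermediate_layers(image, n=1)[0]

patch_tokens = tokens[:, 1:, :]      # drop [CLS], keep the spatial tokens
h = w = 224 // 16                    # 14x14 patch grid for ViT-S/16
spatial_features = patch_tokens.transpose(1, 2).reshape(1, -1, h, w)
print(spatial_features.shape)        # torch.Size([1, 384, 14, 14])
```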