HairCLIP: Design Your Hair by Text and Reference Image
Preprint, 2021
DOI: 10.48550/arxiv.2112.05142

Cited by 4 publications (6 citation statements, citing years 2021–2023); references 33 publications (73 reference statements).
“…With the development of the powerful cross-modal visual and language model CLIP [32], many recent efforts have started to study CLIP-guided image generation [8,15,19,30,36,37]. CLIPStyler [19] utilizes the well-aligned CLIP latent space for high-quality style transfer.…”
Section: Related Work
confidence: 99%
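CLIPStyler-style methods typically steer an image toward a textual style by comparing directions in CLIP's joint embedding space rather than raw embeddings. Below is a minimal sketch of that directional-loss idea, not any paper's exact implementation; it assumes OpenAI's `clip` package, and the input images are assumed to be already CLIP-preprocessed (N, 3, 224, 224) tensors.

```python
# Sketch of a directional CLIP loss for text-guided style transfer.
# Assumption: images are already CLIP-preprocessed tensors on `device`.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _preprocess = clip.load("ViT-B/32", device=device)

def directional_clip_loss(content_img, styled_img, source_text, target_text):
    """1 - cosine similarity between the image-space edit direction and the
    text-space edit direction, both measured in CLIP's joint embedding space."""
    with torch.no_grad():
        tokens = clip.tokenize([source_text, target_text]).to(device)
        t_src, t_tgt = model.encode_text(tokens).float().chunk(2)
    e_content = model.encode_image(content_img).float()
    e_styled = model.encode_image(styled_img).float()
    d_img = e_styled - e_content   # direction of the visual change
    d_txt = t_tgt - t_src          # direction of the textual change
    d_img = d_img / d_img.norm(dim=-1, keepdim=True)
    d_txt = d_txt / d_txt.norm(dim=-1, keepdim=True)
    return 1.0 - (d_img * d_txt).sum(dim=-1).mean()
```

Matching directions rather than absolute embeddings keeps content-specific features out of the loss, which is one reason this formulation transfers style without collapsing the image toward a single prototype of the target text.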
“…Otherwise, the similarity is minimized so that the model finds the most suitable paired images and texts. Multimodal learning has a promising future, as the innovation of CLIP has benefited many downstream tasks [10,43]. Other multimodal schemes are represented by Glide [30] and VilBERT [28], which target text-to-image generation and multimodal representation learning, respectively.…”
Section: Multimodal Learning
confidence: 99%
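For readers unfamiliar with how CLIP "finds the most suitable paired images and texts", the snippet below sketches zero-shot matching via cosine similarity in the joint embedding space, using OpenAI's `clip` package; the image file name and candidate captions are hypothetical.

```python
# Sketch of CLIP zero-shot image-text matching: embed one image and several
# candidate captions, then rank captions by cosine similarity.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("portrait.jpg")).unsqueeze(0).to(device)  # hypothetical file
texts = clip.tokenize(["a photo of curly hair",
                       "a photo of straight hair",
                       "a photo of a bowl cut"]).to(device)

with torch.no_grad():
    img_emb = model.encode_image(image).float()
    txt_emb = model.encode_text(texts).float()
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    sims = (img_emb @ txt_emb.T).squeeze(0)  # higher = better text match

print("best caption index:", sims.argmax().item())
print("similarities:", sims.tolist())
```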
“…To achieve zero-shot and open-vocabulary editing, the latest works set their sights on using pretrained multi-modality models as guidance. With the aligned image-text representation learned by CLIP, a few works, e.g., Wei et al. [2021], use text to extract latent edit directions with textually defined semantic meanings for separate input images. These works focus on extracting latent directions using a contrastive CLIP loss to conduct image manipulation tasks such as face editing (Wei et al. [2021]) and car editing. On the other hand, rather than editing the latent code, and observing the smoothness of the StyleGAN feature space, Gal et al. [2021] focus on fine-tuning the generator to transfer the feature domain.…”
Section: StyleGAN-based Editing
confidence: 99%
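The latent-direction extraction described above can be sketched as a small optimization loop: a residual offset in the generator's latent space is tuned under a CLIP loss, with an L2 penalty keeping the edit local. `G`, `clip_loss`, and all hyperparameters below are assumptions for illustration, not a specific paper's method; `clip_loss` could be the directional loss sketched earlier.

```python
# Sketch of extracting a text-guided latent edit direction for a pretrained
# generator. Assumptions: G maps latent codes w to images, and clip_loss(img,
# text) returns a differentiable CLIP-space alignment loss.
import torch

def find_edit_direction(G, w, clip_loss, target_text,
                        steps=200, lr=0.05, l2_weight=0.005):
    """Optimize a residual direction `delta` so that G(w + delta) moves toward
    `target_text` in CLIP space while staying close to the original G(w)."""
    delta = torch.zeros_like(w, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        img = G(w + delta)                            # regenerate with the shifted code
        loss = clip_loss(img, target_text)            # semantic alignment with the text
        loss = loss + l2_weight * delta.pow(2).sum()  # keep the edit small and local
        opt.zero_grad()
        loss.backward()
        opt.step()
    return delta.detach()
```

The L2 regularizer is what separates "editing" from "regeneration": without it, the optimizer is free to drift anywhere in latent space that satisfies the text, losing the input's identity. Fine-tuning approaches in the spirit of Gal et al. [2021] instead hold the latent code fixed and update the generator's weights under a similar CLIP objective.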