2021
DOI: 10.48550/arxiv.2106.00178
Preprint
Language-Driven Image Style Transfer

Abstract: Despite its promising results, style transfer, which requires preparing style images in advance, can limit creativity and accessibility. Following human instructions, on the other hand, is the most natural way to perform artistic style transfer and can significantly improve controllability for visual-effect applications. We introduce a new task, language-driven image style transfer (LDIST), to manipulate the style of a content image guided by a text. We propose contrastive language visual artist …

Cited by 3 publications (3 citation statements)
References 100 publications (90 reference statements)
“…Multimodal learning has come into prominence recently, with text-to-image synthesis [53,12,57] and image-text contrastive learning [49,31,74] at the forefront. These models have transformed the research community and captured widespread public attention with creative image generation [22,54] and editing applications [21,41,34]. To pursue this research direction further, we introduce Imagen, a text-to-image diffusion model that combines the power of transformer language models (LMs) [15,52] with high-fidelity diffusion models [28,29,16,41] to deliver an unprecedented degree of photorealism and a deep level of language understanding in text-to-image synthesis.…”
Section: Introduction
confidence: 99%
“…Style Transfer. Without requiring training or inversion of generative models, CLVA [196] manipulates the style of a content image through text prompts, comparing contrastive pairs of content images and style instructions to capture their mutual relativeness. However, CLVA is constrained in that it requires style images accompanied by text prompts during training.…”
Section: Other Methods
confidence: 99%
“…Style Transfer. CLVA [165] proposes to manipulate the style of a content image through text guidance, comparing contrastive pairs of content images and style instructions to capture their mutual relativeness. CLIPstyler [166] proposes text-guided style transfer by training a lightweight network that transforms a content image to follow the text condition, matching the similarity between the CLIP embeddings of the stylized output and the text.…”
Section: Other Methods
confidence: 99%
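The citation statements above describe text-guided style transfer losses built on joint image-text embeddings: the stylized image is pushed toward the text instruction in embedding space. The following is a minimal illustrative sketch of such a similarity-based loss, not the authors' actual method; the toy list embeddings stand in for outputs of a pretrained encoder such as CLIP.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors given as lists.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def text_style_loss(image_emb, text_emb):
    # Loss is 0 when the stylized image's embedding points in the same
    # direction as the text instruction's embedding, and grows as the
    # two diverge; minimizing it steers the stylization toward the text.
    return 1.0 - cosine_similarity(image_emb, text_emb)

# Toy stand-in embeddings (hypothetical values, not real encoder outputs).
image_emb = [0.8, 0.1, 0.3]
text_emb = [0.7, 0.2, 0.4]
loss = text_style_loss(image_emb, text_emb)
```

In a real pipeline, a lightweight stylization network would be trained by backpropagating a loss of this form through differentiable image and text encoders, optionally combined with a content-preservation term.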