2021
DOI: 10.48550/arxiv.2112.08493
Preprint

StyleMC: Multi-Channel Based Fast Text-Guided Image Generation and Manipulation

Cited by 2 publications (2 citation statements)
References 0 publications
“…Aiming for text-guided image inpainting, Bau et al. [130] define a CLIP-based semantic consistency loss that optimizes the latent codes inside the inpainting region so that the result is semantically consistent with the given text. StyleCLIP [29] and StyleMC [131] use a pre-trained CLIP model as loss supervision to match the manipulated results to the text condition, as illustrated in Fig. 8.…”
Section: GAN Inversion
confidence: 99%
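
The shared recipe in these works can be made concrete: freeze a pre-trained generator, keep CLIP fixed, and optimize only a latent code so that the generated image's CLIP embedding matches the text prompt's. Below is a minimal sketch under those assumptions, using PyTorch and OpenAI's clip package; the ToyGenerator, the prompt, and all hyperparameters are illustrative stand-ins (in StyleCLIP/StyleMC the generator would be a frozen, pre-trained StyleGAN2 synthesis network), not the cited papers' actual implementations.

import torch
import torch.nn.functional as F
import clip  # OpenAI's CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cpu"  # kept on CPU so the CLIP weights stay in float32
model, _ = clip.load("ViT-B/32", device=device)

# Illustrative stand-in for a frozen, pre-trained generator.
class ToyGenerator(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(512, 3 * 64 * 64)
    def forward(self, w):
        return torch.tanh(self.fc(w)).view(-1, 3, 64, 64)  # image in [-1, 1]

G = ToyGenerator().to(device)
for p in G.parameters():  # the generator is frozen; only the latent moves
    p.requires_grad_(False)

w = torch.randn(1, 512, device=device, requires_grad=True)  # latent to optimize
text_tokens = clip.tokenize(["a smiling face"]).to(device)  # illustrative prompt
optimizer = torch.optim.Adam([w], lr=0.01)

# Normalization statistics expected by CLIP's visual encoder
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)

for step in range(100):
    img = G(w)
    img = F.interpolate(img, size=224, mode="bilinear")  # CLIP input resolution
    img = ((img + 1) / 2 - mean) / std                   # map to CLIP's range
    img_feat = model.encode_image(img)
    txt_feat = model.encode_text(text_tokens)
    # CLIP loss supervision: 1 - cosine similarity of the two embeddings
    loss = 1 - F.cosine_similarity(img_feat, txt_feat).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Because only w receives gradients, the generator's learned image prior constrains the edit; the text prompt steers the result through CLIP alone, with no task-specific training.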
“…Similarly, StyleCLIP [6] (illustrated in the third row of Fig. 2.2) and StyleMC [135] use the cosine similarity between CLIP representations of texts and images to supervise text-guided manipulation. A known issue with the standard CLIP loss is the adversarial solution [136], in which the model fools the CLIP classifier by adding meaningless pixel-level perturbations to the image.…”
Section: 3D-aware Generative Models
confidence: 99%
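
To make that failure mode concrete, here is a hedged sketch of the standard (global) CLIP loss the statement refers to, plus one common mitigation: penalizing the latent offset so the optimizer cannot drift far from the original code just to game CLIP similarity. The function names and the l2_lambda value are illustrative assumptions (StyleCLIP's latent-optimization setup uses a comparable L2 penalty on the latent offset), not the exact formulation of any cited paper.

import torch
import torch.nn.functional as F

def global_clip_loss(img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
    # Standard CLIP loss: 1 - cos(E_I(image), E_T(text)),
    # minimized when the image and text embeddings align.
    return 1 - F.cosine_similarity(img_feat, txt_feat).mean()

def regularized_clip_loss(img_feat, txt_feat, delta_w, l2_lambda=0.008):
    # Penalizing ||delta_w||^2 discourages the adversarial solution:
    # an edit that stays close to the original latent cannot afford the
    # meaningless pixel-level perturbations that would otherwise fool CLIP.
    return global_clip_loss(img_feat, txt_feat) + l2_lambda * delta_w.pow(2).sum()

In practice such regularizers (latent L2 penalties, identity-preservation terms) are what keep CLIP-supervised edits semantic rather than adversarial.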