2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.00556
CookGAN: Causality Based Text-to-Image Synthesis

Cited by 59 publications (26 citation statements)
References 16 publications
“…Our proposed unified framework is implemented based on StyleGAN2-Ada [12]. As is shown in Table 1, we find our proposed CI-GAN 4.54 ± 0.07 -StackGAN++ [37] 5.03 ± 0.09 -CookGAN [38] 5.41 ± 0.11 -CI-GAN (Ours)…”
Section: Implementation Details
confidence: 95%
“…This domain gap can be bridged by additionally learning or fine-tuning the text encoder (or parts of it) during the generative model training; however, due to the complexity of learning an effective decoder, this might result in sub-optimal cross-modal textual representations. Recently, cross-domain retrieval and synthesis frameworks have attempted to alleviate this, particularly for complex cooking recipe descriptions [10,37,45,46]. These last methods can be split into joint [37,46] and separate [10,45] embedding and synthesis.…”
Section: Cross-modal Synthesis
confidence: 99%
“…Recently, cross-domain retrieval and synthesis frameworks have attempted to alleviate this, particularly for complex cooking recipe descriptions [10,37,45,46]. These last methods can be split into joint [37,46] and separate [10,45] embedding and synthesis. The method proposed here is closely related to these methods; however, it significantly differs in how the conditional information is generated, as stated above.…”
Section: Cross-modal Synthesis
confidence: 99%