2019
DOI: 10.1609/aaai.v33i01.33013272

Adversarial Learning of Semantic Relevance in Text to Image Synthesis

Abstract: We describe a new approach that improves the training of generative adversarial nets (GANs) for synthesizing diverse images from a text input. Our approach is based on the conditional version of GANs and expands on previous work leveraging an auxiliary task in the discriminator. Our generated images are not limited to certain classes and do not suffer from mode collapse while semantically matching the text input. A key to our training methods is how to form positive and negative training examples with respect …
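The abstract's emphasis on forming positive and negative training examples can be illustrated with a short sketch. The following is a hypothetical example, not the authors' code: it assembles the three kinds of discriminator inputs commonly used in matching-aware text-to-image GANs (real image with its matching caption, real image with a mismatching caption, and a generated image with its conditioning caption). All names, shapes, and the roll-based mismatching scheme are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of forming positive and negative
# caption-image training examples for a matching-aware discriminator.
# `images` and `captions` are assumed to be aligned arrays of real images
# and caption embeddings; names and shapes are illustrative.
import numpy as np

def make_discriminator_batches(images, captions, generated):
    """Return (images, captions, targets) triples for one training step.

    images    : (B, ...) real images, aligned with `captions`
    captions  : (B, D)   embeddings of the matching captions
    generated : (B, ...) images synthesized from `captions`
    """
    batch = len(images)
    # Positive examples: real image paired with its own caption.
    pos = (images, captions, np.ones(batch))
    # Negative examples (type 1): real image with a mismatching caption,
    # obtained here by rolling the caption batch by one position.
    mismatched = np.roll(captions, shift=1, axis=0)
    neg_mismatch = (images, mismatched, np.zeros(batch))
    # Negative examples (type 2): generated image with the caption it was
    # conditioned on.
    neg_generated = (generated, captions, np.zeros(batch))
    return [pos, neg_mismatch, neg_generated]

# Toy usage with random data, just to show the shapes involved.
if __name__ == "__main__":
    imgs = np.random.rand(4, 64, 64, 3)
    caps = np.random.rand(4, 128)
    fakes = np.random.rand(4, 64, 64, 3)
    for x, c, y in make_discriminator_batches(imgs, caps, fakes):
        print(x.shape, c.shape, y.shape)
```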

Cited by 52 publications (33 citation statements)
References 17 publications
“…The matching-aware discriminator is trained to distinguish between real and matching caption-image pairs ("real"), real but mismatching caption-image pairs ("fake"), and matching captions with generated images ("fake"). [17] modify the sampling procedure during training to obtain a curriculum of mismatching caption-image pairs and introduce an auxiliary classifier that specifically predicts the semantic consistency of a given caption-image pair. [9], [18] use multiple generators and discriminators and are among the first to achieve good image quality at resolutions of 256 × 256 on complex data sets.…”
Section: Related Work
confidence: 99%
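The curriculum of mismatching caption-image pairs mentioned in the statement above can be sketched roughly as follows. This is an assumption-laden illustration rather than the paper's actual sampling procedure: negatives are drawn from captions far away in embedding space early in training and from progressively closer (harder) ones later; the function name and the linear difficulty schedule are hypothetical.

```python
# Illustrative curriculum over mismatching caption-image pairs (not the
# paper's exact procedure).  `progress` is the fraction of training done.
import numpy as np

def sample_mismatching_captions(captions, progress):
    """captions: (B, D) embeddings of the matching captions; progress in [0, 1]."""
    b = len(captions)
    # Pairwise Euclidean distances between caption embeddings.
    dists = np.linalg.norm(captions[:, None, :] - captions[None, :, :], axis=-1)
    negatives = np.empty_like(captions)
    for i in range(b):
        others = np.delete(np.arange(b), i)           # candidate indices != i
        order = others[np.argsort(dists[i, others])]  # nearest ... farthest
        # progress 0 -> pick the farthest caption (easy negative),
        # progress 1 -> pick the nearest caption (hard negative).
        rank = int(round((1.0 - progress) * (len(order) - 1)))
        negatives[i] = captions[order[rank]]
    return negatives

# Toy usage: early negatives are the most dissimilar captions, late ones the closest.
caps = np.random.rand(6, 128)
easy = sample_mismatching_captions(caps, progress=0.0)
hard = sample_mismatching_captions(caps, progress=1.0)
```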
“…The final output is a discriminator similar to that of a generic GAN; (b) the manifold interpolation matching-aware discriminator GAN (GAN-INT-CLS) (Reed, Akata, Yan, et al.) feeds the text input to both the generator and the discriminator (texts are preprocessed into embedding features using a function φ(·) and concatenated with the other inputs before being fed to both networks). The final output is a discriminator similar to that of a generic GAN; (c) the auxiliary classifier GAN (AC-GAN) (Odena, Olah, & Shlens) uses an auxiliary classifier layer to predict the class of the image, ensuring that the output consists of images from different classes and thus diversifying the synthesized images; (d) the text-conditioned auxiliary classifier GAN (TAC-GAN) (Dash, Gamboa, Ahmed, Afzal, & Liwicki) shares a similar design with GAN-INT-CLS, but its output includes both a discriminator and a classifier (which can be used for classification); and (e) the text-conditioned semantic classifier GAN (Text-SeGAN) (Cha, Gwon, & Kung) uses a regression layer to estimate the semantic relevance between the image and the text, so the generated images are not limited to certain classes and semantically match the text input…”
Section: Preliminaries and Framework
confidence: 99%
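A minimal PyTorch sketch of the architectural contrast described above may help: an AC-GAN-style discriminator adds an auxiliary classification head over image classes, whereas a Text-SeGAN-style discriminator replaces it with a regression head that scores the semantic relevance between the image and the caption embedding. The layer sizes, the concatenation-based text-image fusion, and all names are illustrative assumptions, not the published architectures.

```python
# Illustrative discriminator with a shared image encoder and either an
# AC-GAN-style class head or a Text-SeGAN-style relevance head.
import torch
import torch.nn as nn

class TextImageDiscriminator(nn.Module):
    def __init__(self, img_channels=3, text_dim=128, num_classes=None):
        super().__init__()
        self.features = nn.Sequential(           # shared image encoder
            nn.Conv2d(img_channels, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        fused = 128 + text_dim
        self.real_fake = nn.Linear(fused, 1)      # standard GAN real/fake output
        if num_classes is not None:               # AC-GAN-style class prediction
            self.aux = nn.Linear(fused, num_classes)
        else:                                     # Text-SeGAN-style relevance regression
            self.aux = nn.Linear(fused, 1)

    def forward(self, image, text_embedding):
        h = torch.cat([self.features(image), text_embedding], dim=1)
        return self.real_fake(h), self.aux(h)

# Toy usage: score random images against random caption embeddings.
disc = TextImageDiscriminator(text_dim=128, num_classes=None)
rf, relevance = disc(torch.randn(2, 3, 64, 64), torch.randn(2, 128))
print(rf.shape, relevance.shape)  # (2, 1) and (2, 1)
```

In this sketch the only difference between the two variants is the auxiliary head, which mirrors how Text-SeGAN is described relative to TAC-GAN in the statement above.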
“…For example, a recent work (Gao et al.) proposes to use a pyramid generator and three independent discriminators, each focusing on a different aspect of the images, to lead the generator toward creating images that are photorealistic on multiple levels. Another recent publication (Cha, Gwon, & Kung) proposes to use the discriminator to measure the semantic relevance between image and text instead of predicting a class (as most GAN discriminators do), resulting in a new GAN structure that outperforms the text-conditioned auxiliary classifier GAN (TAC-GAN) (Dash, Gamboa, Ahmed, Afzal, & Liwicki) and generates images that are diverse, realistic, and relevant to the input text regardless of class.…”
Section: Preliminaries and Framework
confidence: 99%
“…Nguyen et al. [17] introduced the PPGN, which is similar to TAC-GAN and contains a conditional network, to generate images from captions. Furthermore, based on conditional GANs, Cha et al. [18] improved the adversarial training process by forming positive and negative label pairs and employing an auxiliary classifier to predict the semantic consistency of a given image-caption pair.…”
Section: A. Single-stage Text-to-image Generation
confidence: 99%
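A hedged sketch of how such positive and negative image-caption pairs could enter one discriminator update: matching pairs are pushed toward "consistent", while mismatching and generated pairs are pushed toward "inconsistent". The binary cross-entropy formulation, the equal loss weighting, and the `disc` interface (any callable returning a real/fake logit and a relevance logit) are assumptions for illustration, not the paper's exact objective.

```python
# Illustrative discriminator loss over the three pair types; not the
# published training objective.
import torch
import torch.nn.functional as F

def discriminator_loss(disc, real_imgs, fake_imgs, captions, wrong_captions):
    ones = torch.ones(real_imgs.size(0), 1)
    zeros = torch.zeros(real_imgs.size(0), 1)
    _, rel_pos = disc(real_imgs, captions)           # real image + matching caption
    _, rel_mis = disc(real_imgs, wrong_captions)     # real image + mismatching caption
    _, rel_gen = disc(fake_imgs.detach(), captions)  # generated image + its caption
    return (F.binary_cross_entropy_with_logits(rel_pos, ones)
            + F.binary_cross_entropy_with_logits(rel_mis, zeros)
            + F.binary_cross_entropy_with_logits(rel_gen, zeros))

# Toy usage with a stand-in discriminator that returns random logits.
toy_disc = lambda img, cap: (torch.randn(img.size(0), 1), torch.randn(img.size(0), 1))
loss = discriminator_loss(toy_disc,
                          torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64),
                          torch.randn(4, 128), torch.randn(4, 128))
print(float(loss))
```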