2020
DOI: 10.1007/978-3-030-58604-1_18
Visual-Relation Conscious Image Generation from Structured-Text

Cited by 16 publications
(9 citation statements)
references
References 17 publications
“…In [122], a scene graph is used to predict initial bounding boxes for objects. Using the initial bounding boxes, relation units consisting of two bounding boxes are predicted for each individual subject-predicate-object relation.…”
Section: Scene Graphs
confidence: 99%
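The excerpt above describes building one relation unit (a pair of bounding boxes) per subject-predicate-object relation from initial per-object boxes. A minimal sketch of that data structure, with made-up object names and box coordinates (not from [122]):

```python
# Initial bounding boxes per object in (x, y, w, h) normalized
# coordinates. Objects and values are illustrative only.
boxes = {
    "sheep": (0.30, 0.55, 0.25, 0.30),
    "grass": (0.00, 0.60, 1.00, 0.40),
    "sky":   (0.00, 0.00, 1.00, 0.45),
}

# Scene-graph relations as (subject, predicate, object) triples.
triples = [
    ("sheep", "standing on", "grass"),
    ("sky", "above", "sheep"),
]

# One relation unit per individual relation: the subject box and
# object box tied to that specific predicate.
relation_units = [
    {"subject": (s, boxes[s]), "predicate": p, "object": (o, boxes[o])}
    for (s, p, o) in triples
]

for unit in relation_units:
    print(unit["subject"][0], unit["predicate"], unit["object"][0])
```

In the cited approach the boxes of a relation unit would then be refined jointly per relation rather than per object, so the same object can get a different box in each relation it participates in.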
“…This consideration is crucial for better image generation to support the story visualization task. Indeed, recent works [10,14,17,18,34] employ different techniques, such as prediction networks, to estimate a scene layout that yields either an initial and refined layout [34] or predictive values for the object appearances [10,14,17,18]. However, in [14], the authors used only object embeddings to represent objects in their layout, without using other details from the scene graph.…”
Section: Object Layout Module
confidence: 99%
“…Johnson et al [10] first proposed to generate images from scene graphs; they implemented the sg2im method to reason about related objects and relationships. Then Vo et al [36] adopted the scene structure in a conditional GAN network and put forward stacking-GANs to infer visual-relation layouts. With the same input form, Li et al [19] proposed PasteGAN to crop objects from an external memory tank and paste them into the correct locations of the final images.…”
Section: 3
confidence: 99%
“…The graph convolutional network (GCN) can directly operate on graphs. Following [10,36,19], we also take the scene graph as input and calculate new embedding vectors for each node and edge. Additionally, we apply the same function on each graph convolutional layer, which ensures a single layer can work with arbitrarily shaped graphs.…”
Section: 3
confidence: 99%
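The excerpt above describes a graph convolution that updates node and edge embeddings while applying the same function at every layer, so one layer works on arbitrarily shaped graphs. A minimal NumPy sketch of that idea, with toy dimensions and randomly initialized weights (the actual networks in [10,36,19] use learned MLPs and different aggregation details):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy scene graph: object nodes and (subject_idx, predicate, object_idx) edges.
objects = ["sheep", "grass", "sky"]
triples = [(0, "standing on", 1), (2, "above", 0)]

D = 8  # embedding dimension (illustrative)
node_emb = rng.normal(size=(len(objects), D))
edge_emb = rng.normal(size=(len(triples), D))

# One shared weight matrix per role (subject, predicate, object);
# reusing them at every layer is what makes the layer graph-shape agnostic.
W_s, W_p, W_o = (rng.normal(size=(3 * D, D)) for _ in range(3))

def gcn_layer(node_emb, edge_emb, triples):
    """One graph-convolution step: each triple concatenates its subject,
    predicate, and object vectors, transforms them, and scatters the
    results back to the corresponding nodes (averaged) and edge."""
    new_nodes = np.zeros_like(node_emb)
    counts = np.zeros(len(node_emb))
    new_edges = np.zeros_like(edge_emb)
    for k, (s, _, o) in enumerate(triples):
        h = np.concatenate([node_emb[s], edge_emb[k], node_emb[o]])
        new_nodes[s] += np.tanh(h @ W_s)
        new_edges[k] = np.tanh(h @ W_p)
        new_nodes[o] += np.tanh(h @ W_o)
        counts[s] += 1
        counts[o] += 1
    # Average contributions; isolated nodes keep their old embedding.
    mask = counts > 0
    new_nodes[mask] /= counts[mask, None]
    new_nodes[~mask] = node_emb[~mask]
    return new_nodes, new_edges

# The same layer can be stacked any number of times.
for _ in range(2):
    node_emb, edge_emb = gcn_layer(node_emb, edge_emb, triples)
print(node_emb.shape, edge_emb.shape)  # (3, 8) (2, 8)
```

Because the layer only loops over whatever triples it is given, the identical weights apply to a graph with three objects or three hundred, which is the property the excerpt highlights.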