2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
DOI: 10.1109/cvpr.2018.00833

Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis

Abstract: We propose a novel hierarchical approach for text-to-image synthesis by inferring semantic layout. Instead of learning a direct mapping from text to image, our algorithm decomposes the generation process into multiple steps: it first constructs a semantic layout from the text with a layout generator, then converts the layout into an image with an image generator. The proposed layout generator progressively constructs a semantic layout in a coarse-to-fine manner by generating object bounding boxes and refini…

Cited by 333 publications (297 citation statements)
References 25 publications (56 reference statements)
“…[212] includes an extra object pathway in both the generator and the discriminator to explicitly control object locations. [213] employs a two-stage procedure that first builds a semantic layout automatically from the input sentence with LSTM-based box and shape generators, and then synthesizes the image using an image generator and discriminators. Since fine-grained word/object-level information is not explicitly used for generation, such synthesized images do not contain enough details to make them look realistic.…”
Section: Semantic Layout Control For Complex Scenes
confidence: 99%
“…Reed et al. [24] perform image generation from sentence input along with additional information in the form of keypoints or bounding boxes. Hong et al. [11] break down the process of generating an image from a sentence into multiple stages. The input sentence is first used to predict the objects that are present in the scene, followed by prediction of bounding boxes, then semantic segmentation masks, and finally the image.…”
Section: Related Work
confidence: 99%
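The multi-stage pipeline this citation statement describes (objects → bounding boxes → masks → image) can be sketched as a chain of toy functions. Everything below is a hypothetical stand-in for illustration only: the actual method uses learned LSTM box/mask generators and a convolutional image generator, not the rule-based stubs shown here.

```python
"""Toy sketch of hierarchical text-to-image generation as a staged pipeline.

All stage bodies are hypothetical placeholders; only the staging itself
mirrors the approach described above.
"""
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def predict_objects(sentence: str) -> List[str]:
    """Stage 1 (stub): decide which object classes the sentence mentions."""
    known = {"sheep", "grass", "sky", "person", "dog"}  # toy vocabulary
    return [w.strip(".,") for w in sentence.lower().split()
            if w.strip(".,") in known]


def predict_boxes(objects: List[str], canvas: int = 64) -> Dict[str, Box]:
    """Stage 2 (stub): place one coarse bounding box per object."""
    step = canvas // max(len(objects), 1)
    return {obj: (i * step, canvas // 4, step, canvas // 2)
            for i, obj in enumerate(objects)}


def predict_masks(boxes: Dict[str, Box]) -> Dict[str, List[List[int]]]:
    """Stage 3 (stub): refine each box into a binary shape mask (here: filled)."""
    return {obj: [[1] * w for _ in range(h)]
            for obj, (_, _, w, h) in boxes.items()}


def synthesize_label_map(boxes: Dict[str, Box],
                         masks: Dict[str, List[List[int]]],
                         canvas: int = 64) -> List[List[int]]:
    """Stage 4 (stub): composite the masks into a semantic label map.

    A real system would feed this layout to an image generator network
    to produce pixels; here we just paint integer class labels.
    """
    label = {obj: i + 1 for i, obj in enumerate(boxes)}
    img = [[0] * canvas for _ in range(canvas)]
    for obj, (x, y, w, h) in boxes.items():
        mask = masks[obj]
        for r in range(h):
            for c in range(w):
                if mask[r][c]:
                    img[y + r][x + c] = label[obj]
    return img
```

Chaining the stages, e.g. on the sentence "A sheep standing on the grass.", yields a label map in which the sheep and grass regions carry distinct labels — the layout that a downstream image generator would condition on.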
“…Despite the rapid progress and recent successes in object generation (e.g., celebrity face, animals, etc.) [1,9,13] and scene generation [4,11,12,19,22,30,31], little attention has been paid to frameworks designed for stochastic semantic layout generation. Having a robust model for layout generation will not only allow us to generate reliable scene layouts, but also provide priors and means to infer latent relationships between objects, advancing progress in the scene understanding domain.…”
Section: Introduction
confidence: 99%
“…As clip arts in an abstract scene can be easily generalized to object bounding boxes in a semantic layout, this concept extends to real images [20]. Predicting a semantic layout from text is usually posed as an intermediate step for complex image generation [7] [9]. A complex image refers to one containing multiple interacting objects.…”
Section: Related Work
confidence: 99%
“…Figure 1: The Seq-SG2SL framework for inferring semantic layout from a scene graph. A scene graph [11] [13] serves as the semantic description and a semantic layout [9] [26] as the image representation. Therefore, our goal in this work is to solve the underlying task of inferring semantic layout from a scene graph, connecting text to image.…”
Section: Introduction
confidence: 99%