MONet: Unsupervised Scene Decomposition and Representation

Burgess, Christopher; Matthey, Loïc; Watters, Nicholas; Kabra, Rishabh; Higgins, Irina; Botvinick, Matt; Lerchner, Alexander

doi:10.48550/arxiv.1901.11390

Cited by 86 publications

(135 citation statements)

References 13 publications

Supporting

Mentioning

125

Contrasting


“…There is a rich literature on this topic, for example, in [29], the IODINE model uses iterative variational inference to infer a set of latent variables recurrently, with each representing one object in an image. Similarly, MONet [11] and GENESIS [21] also adopt the multiple encode-decode steps. In contrast, [46] proposes Slot Attention, which enables single step encoding-decoding with iterative attention.…”

Section: Related Work

mentioning

confidence: 99%

Self-supervised Video Object Segmentation by Motion Grouping

Yang

Lamdouar

et al. 2021

Preprint


https://charigyang.github.io/motiongroup/

Figure 1. Segmenting camouflaged animals. Motion plays a critical role in augmenting the capability of our visual system for perceptual grouping in complex scenes. For example, in these sequences (MoCA dataset [41]), the visual appearance (RGB images) is clearly uninformative. We propose a self-supervised approach to segment objects using only motion, i.e. optical flow. From top to bottom rows, we show the video frames, optical flow between consecutive frames, and the segmentation produced by our approach.


“…object-centric) scene learning as that of computing the posterior $p(z_1, z_2, \ldots, z_K \mid x^1)$, where $K$ is the number of 3D objects in the scene including the background "object", $z_k \in \mathbb{R}^D$ is a $D$-dimensional latent representation of object $k$, and $x^1 \in \mathbb{R}^M$ is a 2D image with $M$ pixels. As $K$ is unknown, recent "multi-object" works [4,9] have fixed $K$ globally to be a number that is sufficiently large (greater than actual number of objects) to capture all the objects in a scene and allowing for empty slots. Thus, we will use $K$ to represent the number of object slots hereafter.…”

Section: Multi-object Representations

mentioning

confidence: 99%

“…We model an image $x^t$ with a spatial Gaussian mixture model [24,10], similar to MONet [4] and IODINE [9], and additionally we take as input (condition on) the viewpoint $v^t$. We can then write the generative likelihood as:…”

Section: Generative Model

mentioning

confidence: 99%
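The statement above names a spatial Gaussian mixture model, but the excerpt is cut off before the likelihood itself. As a rough illustration of the general technique only (not the equation from the citing paper, which the excerpt elides), the sketch below scores an image as a per-pixel mixture over K slots: each pixel's mixing weights come from a softmax over slot mask logits, and each slot contributes a Gaussian around its predicted intensity. The function names, the scalar pixel intensities, the fixed `sigma`, and the toy data are all assumptions made for this example, and the viewpoint conditioning mentioned in the quote is left out.

```python
import math

def softmax(logits):
    """Normalise one pixel's K mask logits into mixing weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gaussian_pdf(x, mean, sigma):
    """Density of a scalar Gaussian N(x; mean, sigma^2)."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def spatial_mixture_log_likelihood(pixels, mask_logits, means, sigma=0.1):
    """Sum over pixels i of log sum_k m_ik * N(x_i; mu_ik, sigma^2).

    pixels:      M scalar intensities (hypothetical grayscale image, flattened)
    mask_logits: M lists of K unnormalised slot scores
    means:       M lists of K per-slot predicted intensities
    sigma:       shared, fixed standard deviation (an assumption of this sketch)
    """
    total = 0.0
    for x, logits, mus in zip(pixels, mask_logits, means):
        weights = softmax(logits)
        density = sum(w * gaussian_pdf(x, mu, sigma) for w, mu in zip(weights, mus))
        total += math.log(density)
    return total

# Toy two-pixel, two-slot image: slot 0 "owns" pixel 0, slot 1 "owns" pixel 1.
pixels = [0.0, 1.0]
mask_logits = [[5.0, -5.0], [-5.0, 5.0]]

# Slots that predict the right intensity where their mask is active ...
ll_good = spatial_mixture_log_likelihood(pixels, mask_logits, [[0.0, 1.0], [0.0, 1.0]])
# ... versus slots whose predictions are swapped relative to their masks.
ll_bad = spatial_mixture_log_likelihood(pixels, mask_logits, [[1.0, 0.0], [1.0, 0.0]])
print(ll_good > ll_bad)
```

In this toy setup the well-matched decomposition scores higher because each pixel is explained by the slot whose mask is active there; a learned model would produce the mask logits and means from slot latents rather than take them as fixed inputs.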

“…In addition, these works struggle in a multi-view scenario where pre-segmented images require consistent multi-frame object registration and tracking, since the segmentation and representation models work independently. More recently, several works [4,9,18,17,1] have succeeded in approximating the factorized posterior $p(z_1, z_2, \ldots, z_K \mid x)$ within the VAE framework, achieving impressive unsupervised object-level scene factorization. However, being single-view models, they fall victim to single-view spatial ambiguities.…”

Section: Related Work

mentioning

confidence: 99%

“…Model State Space Specifications. Model architecture: We show our model configurations in Tables 3, 4, and 5. …”

mentioning

confidence: 99%


Learning Object-Centric Representations of Multi-Object Scenes from Multiple Views

Li

Cian

Fisher

2021

Preprint


Learning object-centric representations of multi-object scenes is a promising approach towards machine intelligence, facilitating high-level reasoning and control from visual sensory data. However, current approaches for unsupervised object-centric scene representation are incapable of aggregating information from multiple observations of a scene. As a result, these "single-view" methods form their representations of a 3D scene based only on a single 2D observation (view). Naturally, this leads to several inaccuracies, with these methods falling victim to single-view spatial ambiguities. To address this, we propose the Multi-View and Multi-Object Network (MulMON), a method for learning accurate, object-centric representations of multi-object scenes by leveraging multiple views. To sidestep the main technical difficulty of the multi-object, multi-view scenario (maintaining object correspondences across views), MulMON iteratively updates the latent object representations for a scene over multiple views. To ensure that these iterative updates do indeed aggregate spatial information to form a complete 3D scene understanding, MulMON is asked to predict the appearance of the scene from novel viewpoints during training. Through experiments we show that MulMON resolves spatial ambiguities better than single-view methods, learning more accurate and disentangled object representations, and also achieves new functionality in predicting object segmentations for novel viewpoints. Our implementation and pretrained models are available on GitHub.


KG-to-Text Generation with Slot-Attention and Link-Attention

Wang

Zhang

Liu

et al. 2019

Lecture Notes in Computer Science


Slot attention has shown remarkable object-centric representation learning performance in computer vision tasks without requiring any supervision. Despite its object-centric binding ability brought by compositional modelling, as a deterministic module, slot attention lacks the ability to generate novel scenes. In this paper, we propose the Slot-VAE, a generative model that integrates slot attention with the hierarchical VAE framework for object-centric structured scene generation. For each image, the model simultaneously infers a global scene representation to capture high-level scene structure and object-centric slot representations to embed individual object components. During generation, slot representations are generated from the global scene representation to ensure coherent scene structures. Our extensive evaluation of the scene generation ability indicates that Slot-VAE outperforms slot representation-based generative baselines in terms of sample quality and scene structure accuracy.

