2019
DOI: 10.48550/arxiv.1901.11390
Preprint
MONet: Unsupervised Scene Decomposition and Representation

Abstract: The ability to decompose scenes in terms of abstract building blocks is crucial for general intelligence. Where those basic building blocks share meaningful properties, interactions and other regularities across scenes, such decompositions can simplify reasoning and facilitate imagination of novel scenarios. In particular, representing perceptual observations in terms of entities should improve data efficiency and transfer performance on a wide range of tasks. Thus we need models capable of discovering useful …

Cited by 86 publications
(135 citation statements)
References 13 publications
“…There is a rich literature on this topic, for example, in [29], the IODINE model uses iterative variational inference to infer a set of latent variables recurrently, with each representing one object in an image. Similarly, MONet [11] and GENESIS [21] also adopt multiple encode-decode steps. In contrast, [46] proposes Slot Attention, which enables single-step encoding-decoding with iterative attention.…”
Section: Related Work
confidence: 99%
“…object-centric) scene learning as that of computing the posterior p(z_1, z_2, ..., z_K | x_1), where K is the number of 3D objects in the scene including the background "object", z_k ∈ R^D is a D-dimensional latent representation of object k, and x_1 ∈ R^M is a 2D image with M pixels. As K is unknown, recent "multi-object" works [4,9] have fixed K globally to a number that is sufficiently large (greater than the actual number of objects) to capture all the objects in a scene, allowing for empty slots. Thus, we will use K to represent the number of object slots hereafter.…”
Section: Multi-object Representations
confidence: 99%
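The fixed-K slot scheme described in the statement above can be sketched as follows. This is an illustrative assumption, not code from the cited works: the helper `pad_to_slots`, the slot count K = 5, and the latent dimension D = 8 are all hypothetical.

```python
import numpy as np

def pad_to_slots(object_latents, K):
    """Pad a variable number of inferred object latents up to a fixed
    slot count K, leaving unused slots empty (all-zero vectors).

    object_latents: (n_objects, D) array with n_objects <= K.
    Returns a (K, D) array of slot representations.
    """
    n, D = object_latents.shape
    assert n <= K, "K must be at least the number of objects in the scene"
    slots = np.zeros((K, D))
    slots[:n] = object_latents  # occupied slots; the rest stay empty
    return slots

# A scene with 3 objects, represented with K = 5 slots of dimension D = 8:
z = pad_to_slots(np.ones((3, 8)), K=5)
```

Choosing K larger than any expected object count is what lets a single model handle scenes with varying numbers of objects at the cost of carrying empty slots.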
“…We model an image x_t with a spatial Gaussian mixture model [24,10], similar to MONet [4] and IODINE [9], and additionally we take as input (condition on) the viewpoint v_t. We can then write the generative likelihood as:…”
Section: Generative Model
confidence: 99%
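A per-pixel spatial Gaussian mixture likelihood of the kind cited above can be sketched as follows. This is a minimal numpy version under stated assumptions: the function name, the shared scalar sigma, and the flat pixel layout are illustrative, not the cited papers' exact formulation.

```python
import numpy as np

def mixture_log_likelihood(x, masks, means, sigma):
    """Log-likelihood of an image under a spatial Gaussian mixture:
        p(x_i) = sum_k m_ik * N(x_i; mu_ik, sigma^2)
    x:     (M,)   pixels
    masks: (K, M) mixing weights per pixel, with masks.sum(axis=0) == 1
    means: (K, M) per-component predicted pixel means
    sigma: scalar standard deviation shared across components
    """
    log_norm = -0.5 * np.log(2 * np.pi * sigma ** 2)
    log_gauss = log_norm - 0.5 * ((x[None] - means) ** 2) / sigma ** 2  # (K, M)
    log_weighted = np.log(masks + 1e-12) + log_gauss
    # log-sum-exp over the K components for numerical stability
    m = log_weighted.max(axis=0)
    per_pixel = m + np.log(np.exp(log_weighted - m).sum(axis=0))
    return per_pixel.sum()
```

The masks act as per-pixel mixing weights, so each pixel is explained by a soft assignment over the K components rather than a hard segmentation.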