2022
DOI: 10.48550/arxiv.2210.05861
Preprint

SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models

Abstract: Understanding dynamics from visual observations is a challenging problem that requires disentangling individual objects from the scene and learning their interactions. While recent object-centric models can successfully decompose a scene into objects, modeling their dynamics effectively remains a challenge. We address this problem by introducing SlotFormer, a Transformer-based autoregressive model operating on learned object-centric representations. Given a video clip, our approach reasons over object fe…

Citations: cited by 3 publications (6 citation statements)
References: 24 publications
“…We evaluate our object-centric prediction framework using different predictor modules, namely LSTM [29], Transformer [24], and our two proposed OCVP modules, for different prediction horizons (NumPreds). We compare our object-centric approach with two existing object-agnostic prediction models: ConvLSTM [27] and PhyDNet [28].…”
Section: Discussion (mentioning)
confidence: 99%
“…The approaches presented in [20,21,22,23] employ structured or object-centric representations to perform video prediction, at the cost of requiring explicit human supervision, error-prone Hungarian alignment operations, or being only applicable to very simple 2D datasets. The most similar approach to ours, developed concurrently with this work, is SlotFormer [24], which also combines SAVi [5] with an autoregressive transformer to perform object-centric video prediction. In this work, we further investigate the role of the predictor and propose two novel object-centric transformers that decouple object dynamics and interactions.…”
Section: Related Work (mentioning)
confidence: 99%
“…It was shown that increasing the decoder capacity is the key to dealing with complex and naturalistic scenes in this framework. Following this, several works have demonstrated the effectiveness of this transformer-based slot decoding approach in various settings [8,7,61,74,63]. This success of transformer-based image generative modeling in object-centric learning naturally raises a question: can the other pillar of modern deep generative models, the diffusion models, also be beneficial for object-centric learning?…”
Section: Introduction (mentioning)
confidence: 93%
“…Approaches on causal reasoning benchmarks [6,32] can be divided into two paradigms: neural networks [7,15,27,30] and neuro-symbolic models [5,6,8,32]. Neuro-symbolic models leverage various independently-learned modules and perform better than neural network baselines.…”
Section: Video Understanding and Causal Reasoning (mentioning)
confidence: 99%