SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models

Wu, Ziyi; Dvornik, Nikita; Greff, Klaus; Kipf, Thomas; Garg, Animesh

doi:10.48550/arxiv.2210.05861

Cited by 3 publications

(6 citation statements)

References 24 publications


“…We evaluate our object-centric prediction framework using different predictor modules, namely LSTM [29], Transformer [24], and our two proposed OCVP modules, for different prediction horizons (NumPreds). We compare our object-centric approach with two existing object-agnostic prediction models: ConvLSTM [27] and PhyDNet [28].…”

Section: Discussion (mentioning)

confidence: 99%

“…The approaches presented in [20,21,22,23] employ structured or object-centric representations to perform video prediction, at the cost of requiring explicit human supervision, error-prone Hungarian alignment operations, or being only applicable to very simple 2D datasets. The most similar approach to ours, developed concurrently with this work, is SlotFormer [24], which also combines SAVi [5] with an autoregressive transformer to perform object-centric video prediction. In this work, we further investigate the role of the predictor and propose two novel object-centric transformers that decouple object dynamics and interactions.…”

Section: Related Work (mentioning)

confidence: 99%
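The statement above describes pairing slot representations with an autoregressive predictor. The following is a minimal sketch of such a rollout loop, assuming the slots have already been extracted by some encoder. The names `rollout`, `predict_next`, `context_len`, and `toy_predictor` are hypothetical and are not taken from either paper; the toy predictor is a constant-velocity stand-in, not a transformer.

```python
import numpy as np

def rollout(burn_in_slots, predict_next, num_preds, context_len):
    """Autoregressively roll out future slots.

    burn_in_slots: array of shape (T, N, D), slots for T observed frames.
    predict_next:  callable mapping a (t, N, D) context to (N, D) next slots
                   (hypothetical stand-in for a learned predictor).
    """
    if num_preds < 1:
        raise ValueError("num_preds must be at least 1")
    history = list(burn_in_slots)
    preds = []
    for _ in range(num_preds):
        # Condition only on the most recent `context_len` steps.
        context = np.stack(history[-context_len:])
        nxt = predict_next(context)
        preds.append(nxt)
        history.append(nxt)  # feed the prediction back in as input
    return np.stack(preds)

def toy_predictor(context):
    # Assumption for illustration: constant-velocity extrapolation per slot.
    if len(context) < 2:
        return context[-1]
    return context[-1] + (context[-1] - context[-2])

# Three observed frames, two slots of dimension four, values 0, 1, 2.
burn_in = np.arange(3)[:, None, None] * np.ones((3, 2, 4))
future = rollout(burn_in, toy_predictor, num_preds=3, context_len=2)
print(future.shape)
```

Because each prediction is appended to the history, errors can compound over long horizons, which is one reason the quoted works report results for several prediction horizons.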

“…Following previous works [24,4], we evaluate the visual quality of predicted video frames using video prediction metrics (PSNR, SSIM and LPIPS), and evaluate the ability to model object dynamics by measuring the ARI and mIoU between ground-truth instance segmentation and the forecasted object masks.…”

Section: Datasets and Experimental Details (mentioning)

confidence: 99%
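Two of the metrics named in the statement above can be sketched in a few lines. This is an illustrative sketch only, assuming images scaled to [0, 1] and predicted masks that are already matched one-to-one with ground-truth masks; the citing paper's exact evaluation protocol is not given here, and the function names `psnr` and `mean_iou` are hypothetical.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, max_val]."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def mean_iou(pred_masks, gt_masks):
    """Mean IoU over boolean masks of shape (K, H, W).

    Assumption: mask k in pred_masks corresponds to mask k in gt_masks.
    """
    ious = []
    for p, g in zip(pred_masks, gt_masks):
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # object absent in both masks; skip it
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")

target = np.zeros((4, 4))
pred = np.full((4, 4), 0.1)
score = psnr(pred, target)

pm = np.array([[[1, 1], [0, 0]], [[0, 0], [1, 1]]], dtype=bool)
gm = np.array([[[1, 0], [0, 0]], [[0, 0], [1, 1]]], dtype=bool)
miou = mean_iou(pm, gm)
```

SSIM, LPIPS, and ARI are omitted because they depend on windowed statistics, a pretrained network, and pair counting respectively, and are usually taken from an existing library rather than reimplemented.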


Object-Centric Video Prediction via Decoupling of Object Dynamics and Interactions

Villar-Corrales, Wahdan, Behnke

2023

Preprint


We present a framework for object-centric video prediction, i.e., parsing a video sequence into objects, and modeling their dynamics and interactions in order to predict the future object states from which video frames are rendered. To facilitate the learning of meaningful spatio-temporal object representations and forecasting of their states, we propose two novel object-centric video prediction (OCVP) transformer modules, which decouple the processing of temporal dynamics and object interactions. We show how OCVP predictors outperform object-agnostic video prediction models on two different datasets. Furthermore, we observe that OCVP modules learn consistent and interpretable object representations. Animations and code to reproduce our results can be found on our project website.


“…It was shown that increasing the decoder capacity is the key to dealing with complex and naturalistic scenes in this framework. Following this, several works have demonstrated the effectiveness of this transformer-based slot decoding approach in various settings [8,7,61,74,63]. This success of transformer-based image generative modeling in object-centric learning naturally raises a question: can the other pillar of modern deep generative models, the diffusion models, also be beneficial for object-centric learning?…”

Section: Introduction (mentioning)

confidence: 93%

Object-Centric Slot Diffusion

Jiang, Deng, Singh, et al.

2023

Preprint


Despite remarkable recent advances, making object-centric learning work for complex natural scenes remains the main challenge. The recent success of adopting the transformer-based image generative model in object-centric learning suggests that having a highly expressive image generator is crucial for dealing with complex scenes. In this paper, inspired by this observation, we aim to answer the following question: can we benefit from the other pillar of modern deep generative models, i.e., the diffusion models, for object-centric learning and what are the pros and cons of such a model? To this end, we propose a new object-centric learning model, Latent Slot Diffusion (LSD). LSD can be seen from two perspectives. From the perspective of object-centric learning, it replaces the conventional slot decoders with a latent diffusion model conditioned on the object slots. Conversely, from the perspective of diffusion models, it is the first unsupervised compositional conditional diffusion model which, unlike traditional diffusion models, does not require supervised annotation such as a text description to learn to compose. In experiments on various object-centric tasks, including the FFHQ dataset for the first time in this line of research, we demonstrate that LSD significantly outperforms the state-of-the-art transformer-based decoder, particularly when the scene is more complex. We also show a superior quality in unsupervised compositional generation.


“…Approaches on causal reasoning benchmarks [6,32] can be divided into two paradigms: neural networks [7,15,27,30] and neuro-symbolic models [5,6,8,32]. Neuro-symbolic models leverage various independently-learned modules and perform better than neural network baselines.…”

Section: Video Understanding and Causal Reasoning (mentioning)

confidence: 99%

Intrinsic Physical Concepts Discovery with Object-Centric Predictive Models

Qu, Zhu, Lei, et al.

2023

Preprint


The ability to discover abstract physical concepts and understand how they work in the world through observing lies at the core of human intelligence. The acquisition of this ability is based on compositionally perceiving the environment in terms of objects and relations in an unsupervised manner. Recent approaches learn object-centric representations and capture visually observable concepts of objects, e.g., shape, size, and location. In this paper, we take a step forward and try to discover and represent intrinsic physical concepts such as mass and charge. We introduce the PHYsical Concepts Inference NEtwork (PHYCINE), a system that infers physical concepts in different abstract levels without supervision. The key insights underlining PHYCINE are two-fold, commonsense knowledge emerges with prediction, and physical concepts of different abstract levels should be reasoned in a bottom-up fashion. Empirical evaluation demonstrates that variables inferred by our system work in accordance with the properties of the corresponding physical concepts. We also show that object representations containing the discovered physical concepts variables could help achieve better performance in causal reasoning tasks, i.e., ComPhy.
