2022
DOI: 10.48550/arxiv.2203.09457
Preprint

Look Outside the Room: Synthesizing A Consistent Long-Term 3D Scene Video from A Single Image

Cited by 2 publications (6 citation statements)
References 35 publications

“…Nevertheless, it achieves a clear advantage at longer horizons, demonstrating superior long-term modeling ability. Although VQFormer is also able to generate sharp images, it fails to predict correct dynamics and object attributes, as also observed in previous works (Yan et al., 2021; Ren & Wang, 2022). This shows that only a strong decoder (i.e.…”
Section: Evaluation On Video Prediction (supporting)
confidence: 74%

“…Therefore, it still underperforms RNN-based baselines in the video […]. Transformers for sequential modeling: Inspired by the success of autoregressive Transformers in language modeling (Radford et al., 2018; Brown et al., 2020), they were adapted to video generation tasks (Yan et al., 2021; Ren & Wang, 2022; Micheli et al., 2022; Nash et al., 2022). To handle the high dimensionality of images, these methods often adopt a two-stage training strategy: first mapping images to discrete tokens (Esser et al., 2021), and then learning a Transformer over the tokens.…”
Section: Related Work (mentioning)
confidence: 99%
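
The two-stage strategy quoted above (discrete tokenization followed by an autoregressive Transformer over the tokens) can be sketched in a few dozen lines. The sketch below uses PyTorch; `ToyVQTokenizer` is a stand-in for a pretrained VQGAN-style tokenizer (Esser et al., 2021), and all class names, sizes, and hyperparameters are illustrative assumptions, not the implementation used by the cited papers.

```python
import torch
import torch.nn as nn

class ToyVQTokenizer(nn.Module):
    """Stage 1 (illustrative): map frames to grids of discrete token ids
    by nearest-neighbor lookup in a small learned codebook."""
    def __init__(self, vocab_size=512, dim=64, patch=8):
        super().__init__()
        self.encoder = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.codebook = nn.Embedding(vocab_size, dim)

    @torch.no_grad()
    def encode(self, frames):                      # frames: (B, 3, H, W)
        z = self.encoder(frames)                   # (B, dim, H/p, W/p)
        z = z.flatten(2).transpose(1, 2)           # (B, N, dim)
        dists = torch.cdist(z, self.codebook.weight)
        return dists.argmin(dim=-1)                # token ids: (B, N)

class TokenTransformer(nn.Module):
    """Stage 2: autoregressive Transformer that predicts the next token
    id from all previous ones via a causal attention mask."""
    def __init__(self, vocab_size=512, dim=256, layers=4, heads=4, max_len=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.pos = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, ids):                        # ids: (B, T)
        T = ids.size(1)
        x = self.tok(ids) + self.pos(torch.arange(T, device=ids.device))
        causal = torch.triu(                       # mask out future positions
            torch.full((T, T), float("-inf"), device=ids.device), diagonal=1)
        return self.head(self.blocks(x, mask=causal))  # logits: (B, T, vocab)

# Usage sketch: tokenize a frame, then train with teacher forcing so each
# token is predicted from its predecessors.
tokenizer, model = ToyVQTokenizer(), TokenTransformer()
ids = tokenizer.encode(torch.rand(1, 3, 64, 64))   # (1, 64) token ids
logits = model(ids[:, :-1])                        # predict tokens 1..N-1
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), ids[:, 1:].reshape(-1))
# In a real pipeline: loss.backward() inside a training loop.
```

In the actual two-stage recipe the tokenizer is trained first with a VQ reconstruction objective and then frozen, and generation samples tokens autoregressively before decoding them back to pixels; both steps are omitted here for brevity.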