2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.00355
Look Outside the Room: Synthesizing A Consistent Long-Term 3D Scene Video from A Single Image

Cited by 29 publications (17 citation statements) | References 38 publications
“…By conditioning an autoregressive transformer with camera translation and rotation, [38] showed that a transformer-based model can learn the 3D relationship between images without the explicit depth maps or warping used in prior attempts at single-view NVS such as [37,48]. To improve consistency between frames, [36] suggests a camera-aware bias for self-attention that encodes the similarity between consecutive image frames. Our task requires a similar 3D understanding between different viewpoints as in NVS, but lacks the conditioning information provided by source view(s) and requires consistency not only between frames but also with an HD map.…”
Section: Related Work (mentioning, confidence: 99%)
“…In addition to providing the model with aligned embeddings, we add a bias to our self-attention layers that provides both an intramodal (image to image) and intermodal (image to BEV) similarity constraint. This draws inspiration from [36], but instead of providing a blockwise similarity matrix composed of encoded poses between frames, we provide a per-token similarity based on their relative direction vectors. Our approach also encodes the relationship between image and BEV tokens.…”
Section: Camera Bias (mentioning, confidence: 99%)
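As a rough illustration of the mechanism this statement describes (not the authors' code), a per-token similarity matrix can be added to the self-attention logits before the softmax. The shapes, the cosine-similarity construction from direction vectors, and the function names below are all assumptions for the sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def biased_self_attention(q, k, v, sim_bias):
    """Scaled dot-product attention with an additive per-token bias.

    q, k, v:   (num_tokens, d) query/key/value matrices
    sim_bias:  (num_tokens, num_tokens) similarity matrix added to the logits
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + sim_bias
    return softmax(logits, axis=-1) @ v

# Toy example: bias from cosine similarity of per-token direction vectors
# (hypothetical stand-in for the "relative direction vectors" in the quote).
rng = np.random.default_rng(0)
n, d = 6, 8
dirs = rng.normal(size=(n, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
sim_bias = dirs @ dirs.T  # entries in [-1, 1]

out = biased_self_attention(rng.normal(size=(n, d)),
                            rng.normal(size=(n, d)),
                            rng.normal(size=(n, d)),
                            sim_bias)
print(out.shape)  # (6, 8)
```

Because the bias is additive in logit space, tokens whose direction vectors align receive proportionally more attention weight, without any change to the learned query/key projections.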
“…On the other hand, a number of generative view synthesis methods have recently been proposed utilizing neural volumetric representations [6, 15, 21, 56-58, 64, 71, 80]. These methods can learn to generate 3D representations from 2D supervision, and have demonstrated impressive results on generating novel objects [61], faces [6,13,21,60], or indoor environments [15,65]. However, none of these methods can generate unbounded outdoor scenes, due to the lack of multi-view data for supervision and the larger, more complex scene geometry and appearance that is difficult to model with prior representations.…”
Section: Related Work (mentioning, confidence: 99%)
“…World/Environment Modeling. [37,25] learn environment models and generate future frames. Their work is conceptually similar to our future frame prediction.…”
Section: Related Work (mentioning, confidence: 99%)
“…Various abstractions of this problem, such as the prediction of top-down maps [48] or graph-based representations [17], have been studied, but none have yielded a strong universal abstraction. More recently, methods such as [36] have explored frame prediction in RGB space conditioned on camera pose using a buffer of a few frames. These produce visually compelling predictions, but fail to capture the movement dynamics of an embodied agent in the scene because they condition their models on camera pose instead of agent actions.…”
Section: Introduction (mentioning, confidence: 99%)