2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.00355
Look Outside the Room: Synthesizing A Consistent Long-Term 3D Scene Video from A Single Image

Cited by 29 publications (17 citation statements) | References 38 publications
“…By conditioning an autoregressive transformer with camera translation and rotation, [38] showed that a transformer-based model can learn the 3D relationship between images without the explicit depth maps or warping used in prior attempts at single-view NVS such as [37,48]. To improve consistency between frames, [36] suggests a camera-aware bias for self-attention that encodes the similarity between consecutive image frames. Our task requires a similar 3D understanding between different viewpoints as in NVS, but lacks the conditioning information provided by source view(s) and requires consistency not only between frames but also with an HD map.…”
Section: Related Work (mentioning, confidence: 99%)
“…In addition to providing the model with aligned embeddings, we add a bias to our self-attention layers that provides both an intramodal (image to image) and intermodal (image to BEV) similarity constraint. This draws inspiration from [36], but instead of providing a blockwise similarity matrix composed of encoded poses between frames, we provide a per-token similarity based on their relative direction vectors. Our approach also encodes the relationship between image and BEV tokens.…”
Section: Camera Bias (mentioning, confidence: 99%)
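As a rough illustration of the mechanism this statement describes (not the authors' code), a per-token similarity matrix can be added to the self-attention logits before the softmax. The shapes, the cosine-similarity construction from direction vectors, and the function names below are all assumptions for the sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def biased_self_attention(q, k, v, sim_bias):
    """Scaled dot-product attention with an additive per-token bias.

    q, k, v:   (num_tokens, d) query/key/value matrices
    sim_bias:  (num_tokens, num_tokens) similarity matrix added to the logits
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + sim_bias
    return softmax(logits, axis=-1) @ v

# Toy example: bias from cosine similarity of per-token direction vectors
# (hypothetical stand-in for the "relative direction vectors" in the quote).
rng = np.random.default_rng(0)
n, d = 6, 8
dirs = rng.normal(size=(n, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
sim_bias = dirs @ dirs.T  # entries in [-1, 1]

out = biased_self_attention(rng.normal(size=(n, d)),
                            rng.normal(size=(n, d)),
                            rng.normal(size=(n, d)),
                            sim_bias)
print(out.shape)  # (6, 8)
```

Because the bias is additive in logit space, tokens whose direction vectors align receive proportionally more attention weight, without any change to the learned query/key projections.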
“…On the other hand, a number of generative view synthesis methods have recently been proposed utilizing neural volumetric representations [6, 15, 21, 56-58, 64, 71, 80]. These methods can learn to generate 3D representations from 2D supervision, and have demonstrated impressive results on generating novel objects [61], faces [6,13,21,60], or indoor environments [15,65]. However, none of these methods can generate unbounded outdoor scenes, due to the lack of multi-view data for supervision and the larger, more complex scene geometry and appearance that is difficult to model with prior representations.…”
Section: Related Work (mentioning, confidence: 99%)
“…World/Environment Modeling. [37,25] learn environment models and generate future frames. Their work is conceptually similar to our future frame prediction.…”
Section: Related Work (mentioning, confidence: 99%)
“…Various abstractions of this problem, such as the prediction of top-down maps [48] or graph-based representations [17], have been studied, but none have yielded a strong universal abstraction. More recently, methods such as [36] have explored frame prediction in RGB space conditioned on camera pose using a buffer of a few frames. These produce visually compelling predictions, but fail to capture the movement dynamics of an embodied agent in the scene because they condition their models on camera pose instead of agent actions.…”
Section: Introduction (mentioning, confidence: 99%)