2022 IEEE International Conference on Image Processing (ICIP)
DOI: 10.1109/icip46576.2022.9897982

HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator

Abstract: Video prediction is an important yet challenging problem, burdened with the dual tasks of generating future frames and learning environment dynamics. Recently, autoregressive latent video models have proved to be a powerful video prediction tool, separating the problem into two sub-problems: pre-training an image generator model, followed by learning an autoregressive prediction model in the latent space of the image generator. However, successfully generating high-fidelity and high-resolution videos ha…
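
The two-stage recipe the abstract describes (pre-train an image generator, then model its latent codes autoregressively) can be illustrated with a short sketch. This is a minimal, hypothetical PyTorch sketch of the second stage only, assuming a VQ-GAN-style discrete autoencoder whose per-frame code grids are flattened and concatenated over time; the class name, layer sizes, and training details below are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (stage 2 only): an autoregressive transformer over the
# discrete latent codes of a frozen, pre-trained image generator.
# All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class LatentVideoPredictor(nn.Module):
    def __init__(self, codebook_size=1024, d_model=512,
                 n_layers=8, n_heads=8, max_len=4096):
        super().__init__()
        self.token_emb = nn.Embedding(codebook_size, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, codebook_size)

    def forward(self, tokens):
        # tokens: (batch, seq) codes from the frozen image encoder,
        # flattened per frame and concatenated over time
        seq = tokens.size(1)
        x = self.token_emb(tokens) + self.pos_emb[:, :seq]
        # causal mask: each position attends only to earlier codes
        mask = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
        return self.head(self.transformer(x, mask=mask))

# Training sketch: next-token cross-entropy over the latent sequence.
model = LatentVideoPredictor()
codes = torch.randint(0, 1024, (2, 256))  # stand-in for encoder output
logits = model(codes[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 1024), codes[:, 1:].reshape(-1))
```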

Cited by 10 publications (19 citation statements)
References 8 publications
“…These setups include (i) a multi-view control where agents operate on multi-view data, (ii) a single-view control where agents operate on single-view data but use auxiliary cameras for representation learning, and (iii) a viewpoint-robust control where agents operate on a single randomized viewpoint but use multiple randomized viewpoints for representation learning, as illustrated in Figure 1. Our experiments on RLBench [10] show that our method outperforms single-view representation learning baselines [11,12,13] and a multi-view representation learning baseline [6]. • We show MV-MWM can solve real-world robotic manipulation tasks by transferring a policy trained in simulation to a real robot without camera calibration.…”
Section: Introduction
confidence: 76%
“…We present Multi-View Masked World Models (MV-MWM), a reinforcement learning framework that learns multi-view representations and utilizes them for visual robotic manipulation. Our method builds on top of the Masked World Models (MWM; [13]) framework, which learns a world model on frozen masked autoencoder features. We first introduce how to learn multi-view representations in Section 3.1.…”
Section: Multi-View Masked World Models for Visual Robotic Manipulation
confidence: 99%
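
The MWM recipe this excerpt describes (a world model learned on top of frozen masked-autoencoder features) can be sketched as follows. The encoder below is a stand-in for a pre-trained masked autoencoder, and the one-step dynamics loss is a simplification of the cited framework; all module names and dimensions are illustrative assumptions, not the MV-MWM implementation.

```python
# Minimal sketch: learn a latent dynamics ("world") model on features from
# a frozen, pre-trained encoder standing in for a masked autoencoder.
# Module names, dimensions, and the one-step loss are illustrative.
import torch
import torch.nn as nn

class FrozenEncoder(nn.Module):
    """Stand-in for a pre-trained masked-autoencoder feature extractor."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=4), nn.GELU(),
            nn.Conv2d(64, feat_dim, 4, stride=4), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        for p in self.parameters():  # freeze: only the world model trains
            p.requires_grad_(False)

    def forward(self, obs):          # obs: (batch, 3, H, W)
        return self.backbone(obs)    # (batch, feat_dim)

class WorldModel(nn.Module):
    """Predicts the next frozen feature from the current feature and action."""
    def __init__(self, feat_dim=256, act_dim=7, hidden=512):
        super().__init__()
        self.dynamics = nn.Sequential(
            nn.Linear(feat_dim + act_dim, hidden), nn.ELU(),
            nn.Linear(hidden, feat_dim))

    def forward(self, feat, action):
        return self.dynamics(torch.cat([feat, action], dim=-1))

# Training sketch: the encoder stays frozen; only the dynamics model learns.
enc, wm = FrozenEncoder(), WorldModel()
obs, next_obs = torch.rand(8, 3, 64, 64), torch.rand(8, 3, 64, 64)
action = torch.rand(8, 7)
with torch.no_grad():
    f_t, f_next = enc(obs), enc(next_obs)
loss = nn.functional.mse_loss(wm(f_t, action), f_next)
```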