2021
DOI: 10.48550/arxiv.2110.11586
Preprint
Wide and Narrow: Video Prediction from Context and Motion

Abstract: Video prediction, forecasting the future frames from a sequence of input frames, is a challenging task, since view changes are influenced by various factors, such as the global context surrounding the scene and local motion dynamics. In this paper, we propose a new framework to integrate these complementary attributes to predict complex pixel dynamics through deep networks. To capture the local motion pattern of objects, we devise local filter memory networks that generate adaptive filter kernels by storing…
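The abstract mentions generating adaptive filter kernels to capture local motion. As a rough illustration of the general idea of dynamic local filtering (not the paper's actual local filter memory module, whose details are truncated above), the sketch below applies a separately predicted 3x3 kernel at each output pixel instead of one shared convolution kernel; the function and array names are hypothetical:

```python
import numpy as np

def apply_adaptive_filters(frame, kernels):
    """Filter a single-channel frame with one 3x3 kernel per pixel.

    frame:   (H, W) input frame
    kernels: (H, W, 3, 3) per-pixel filter kernels, e.g. predicted
             by a network conditioned on observed motion
    """
    H, W = frame.shape
    padded = np.pad(frame, 1, mode="edge")  # replicate border pixels
    out = np.empty((H, W), dtype=float)
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + 3, j:j + 3]
            out[i, j] = np.sum(patch * kernels[i, j])  # local weighted sum
    return out

# Identity kernels (1 at the center) reproduce the input frame exactly.
frame = np.arange(12.0).reshape(3, 4)
kernels = np.zeros((3, 4, 3, 3))
kernels[:, :, 1, 1] = 1.0
assert np.allclose(apply_adaptive_filters(frame, kernels), frame)
```

Because each output pixel gets its own kernel, the predicted filters can encode spatially varying motion (e.g. shifting different objects in different directions), which a single shared convolution kernel cannot.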

Cited by 1 publication (7 citation statements)
References: 44 publications
“…Input Encoding Methods in video prediction typically consist of CNNs with a Unet (Ronneberger et al, 2015) type architecture (e.g., Cho et al, 2021;Ho et al, 2019;Bhattacharjee & Das, 2019;Kwon & Park, 2019;Ying et al, 2019). CNNs are an obvious choice for encoding the sequences of frames within a video due to their effectiveness on image-based data (Russakovsky et al, 2015), and the U-net architecture is an especially versatile form of CNN which can be applied to any application where the 'labels' are of the same format (e.g.…”
Section: Video Prediction
confidence: 99%
“…an image with the same height and width) as the inputs (Ronneberger et al, 2015). Since this is the case for video prediction, U-net CNNs can either be used straightforwardly for encoding purposes (e.g., Cho et al, 2021;Ho et al, 2019) or as the generator within a GAN framework (Isola et al, 2018) to perform adversarial learning (e.g., Bhattacharjee & Das, 2019;Kwon & Park, 2019;Ying et al, 2019).…”
Section: Video Prediction
confidence: 99%
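The citation statements above note that U-net CNNs fit video prediction because the output ("label") has the same height and width as the input frame. A minimal sketch of that shape-preserving encoder-decoder structure, reduced to pooling, upsampling, and a skip connection (no learned weights — this is an illustration of the architecture's shape behavior, not an implementation from the cited papers):

```python
import numpy as np

def down(x):
    """Contracting path step: 2x2 average pooling halves each spatial dim."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).mean(axis=(1, 3))

def up(x):
    """Expanding path step: nearest-neighbour upsampling doubles each dim."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def tiny_unet(x):
    """One-level U-net skeleton: downsample, upsample, add the skip.

    The skip connection reinjects full-resolution detail lost by pooling,
    and the output always matches the input's height and width.
    """
    skip = x
    bottleneck = down(x)
    return up(bottleneck) + skip

frame = np.random.rand(8, 8)
assert tiny_unet(frame).shape == frame.shape  # output format == input format
```

This input-output shape match is exactly why such encoders can either predict frames directly or serve as the generator inside a GAN framework, as the cited works do.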