Predicting the Future: A Jointly Learnt Model for Action Anticipation

Gammulle, Harshala; Denman, Simon; Sridharan, Sridha; Fookes, Clinton

doi:10.1109/iccv.2019.00566

Cited by 73 publications

(44 citation statements)

References 54 publications

(99 reference statements)

“…The average and the median of the gap are 21 seconds and 14 seconds, respectively. Thus, the forecasting gaps in this benchmarks are substantially longer than those used in other action anticipation tasks [19,23,45]. This makes this benchmark particularly challenging as the model is asked to predict the step of segments far away in the future compared to the observed history.…”

Section: F Further Details About Step Forecasting (mentioning)

confidence: 99%

Learning To Recognize Procedural Activities with Distant Supervision

Lin, Petroni, Bertasius et al. 2022

Preprint

In this paper we consider the problem of classifying fine-grained, multi-step activities (e.g., cooking different recipes, making disparate home improvements, creating various forms of arts and crafts) from long videos spanning up to several minutes. Accurately categorizing these activities requires not only recognizing the individual steps that compose the task but also capturing their temporal dependencies. This problem is dramatically different from traditional action classification, where models are typically optimized on videos that span only a few seconds and that are manually trimmed to contain simple atomic actions. While step annotations could enable the training of models to recognize the individual steps of procedural activities, existing large-scale datasets in this area do not include such segment labels due to the prohibitive cost of manually annotating temporal boundaries in long videos. To address this issue, we propose to automatically identify steps in instructional videos by leveraging the distant supervision of a textual knowledge base (wikiHow) that includes detailed descriptions of the steps needed for the execution of a wide variety of complex activities. Our method uses a language model to match noisy, automatically-transcribed speech from the video to step descriptions in the knowledge base. We demonstrate that video models trained to recognize these automatically-labeled steps (without manual supervision) yield a representation that achieves superior generalization performance on four downstream tasks: recognition of procedural activities, step classification, step forecasting and egocentric video classification.

“…Video prediction is an emerging research field of computer vision [49]-[51]. It has been successfully applied in various applications such as action anticipation [52], prediction of object locations [53], trajectory prediction [54], anomaly detection [55] and many more. Given a sequence of previous frames, the target of video prediction is to reason and predict about the subsequent frame(s) based on the analysis of rich spatio-temporal features in a video, e.g., object/background information or regularity of pixel changes [51].…”

Section: Video Frame Prediction (mentioning)

confidence: 99%

Two-stage Rule-induction Visual Reasoning on RPMs with an Application to Video Prediction

Wang, Ren, Bai et al. 2021

Preprint

“…Misra et al [40] introduce the idea of learning such visual representations by estimating the order of shuffled video frames. Inspired by the success of this approach, several recent papers focused on designing a novel pretext task using temporal information, such as predicting future frames [13,49,54] or their embeddings [21,27]; estimating the order of frames [10,20,36,40,57] or the direction of video [56]. Another line of research focuses on using temporal coherence [6,24,26,41,62,63] as supervision signal.…”

Section: Related Work (mentioning)

confidence: 99%

Learning to Align Sequential Actions in the Wild

Liu, Tekin, Coskun et al. 2021

Preprint

State-of-the-art methods for self-supervised sequential action alignment rely on deep networks that find correspondences across videos in time. They either learn frame-to-frame mapping across sequences, which does not leverage temporal information, or assume monotonic alignment between each video pair, which ignores variations in the order of actions. As such, these methods are not able to deal with common real-world scenarios that involve background frames or videos that contain non-monotonic sequences of actions. In this paper, we propose an approach to align sequential actions in the wild that involve diverse temporal variations. To this end, we propose an approach to enforce temporal priors on the optimal transport matrix, which leverages temporal consistency, while allowing for variations in the order of actions. Our model accounts for both monotonic and non-monotonic sequences and handles background frames that should not be aligned. We demonstrate that our approach consistently outperforms the state-of-the-art in self-supervised sequential action representation learning on four different benchmark datasets.
