2019
DOI: 10.48550/arxiv.1901.03728
Preprint

Anticipation and next action forecasting in video: an end-to-end model with memory

Abstract: Action anticipation and forecasting in videos do not require a hat-trick, as far as there are signs in the context to foresee how actions are going to be deployed. Capturing these signs is hard because the context includes the past. We propose an end-to-end network for action anticipation and forecasting with memory, to both anticipate the current action and foresee the next one. Experiments on action sequence datasets show excellent results indicating that training on histories with a dynamic memory can signi…

Cited by 7 publications (14 citation statements)
References 38 publications
“…Recent work to recognize or anticipate actions in egocentric video adopts state-of-the-art video models from third-person video, like two-stream networks [42,47], 3DConv models [6,54,49], or recurrent networks [15,16,62,66]. In contrast, our model grounds first-person activity in a persistent topological encoding of the environment.…”
Section: Related Work
Confidence: 99%
“…Graph-based methods encode relationships between detected objects: nodes are objects or actors, and edges specify their spatio-temporal layout or semantic relationships (e.g., is-holding) [68,4,46,72]. Architectures for composite activity aggregate action primitives across the video [17,30,31], memory-based models record a recurrent network's state [54], and 3D convnets augmented with long-term feature banks provide temporal context [69]. Unlike any of the above, our approach encodes video in a human-centric manner according to how people use a space.…”
Section: Structured Video Representations
Confidence: 99%
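The graph-based encodings described in the statement above can be pictured as a small typed graph. The following toy sketch is purely illustrative (all labels are made up, not taken from any of the cited works): detected objects and actors become nodes, and typed edges record semantic relations such as is-holding.

```python
# Hypothetical toy illustration of a graph-based scene encoding:
# detected objects/actors are nodes, typed edges record semantic
# relations (e.g. is-holding). All labels here are invented.
nodes = {0: "person", 1: "cup", 2: "table"}
edges = [
    (0, "is-holding", 1),  # person is-holding cup
    (1, "is-on", 2),       # cup is-on table
]

def relations_of(node_id):
    """Return (relation, target-label) pairs leaving a node."""
    return [(rel, nodes[dst]) for src, rel, dst in edges if src == node_id]

print(relations_of(0))  # → [('is-holding', 'cup')]
```

A real system would extend such edges with spatio-temporal attributes (frame index, bounding boxes) rather than bare labels.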
“…The task in next action anticipation is to predict the upcoming action τ seconds before it occurs. Various architectures have been proposed, ranging from recurrent neural networks (RNNs) [27,28,29,30] and convolutional networks combined with RNNs [2] to transformers [31]. The main focus of these works is to extract relevant information from the observations to predict the label of the action starting in τ seconds, with τ ranging from zero [32] to tens of seconds [33].…”
Section: Related Work
Confidence: 99%
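The RNN-based anticipation setup summarized in the statement above can be sketched in a few lines. This is a minimal pure-Python illustration, not any cited model: all names and sizes are hypothetical, and the weights are random, whereas a real model would be trained on pairs of (observed history, label of the action starting τ seconds later).

```python
import math
import random

# Hypothetical sizes: 8-d frame features, 16-d hidden state, 5 action classes.
rng = random.Random(0)
FEAT, HID, CLASSES = 8, 16, 5

def mat(rows, cols):
    """Random weight matrix (stand-in for trained parameters)."""
    return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

W_in, W_h, W_out = mat(HID, FEAT), mat(HID, HID), mat(CLASSES, HID)

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def anticipate(frames):
    """Fold observed frame features into a recurrent state, then score
    the label of the action assumed to start tau seconds later."""
    h = [0.0] * HID
    for x in frames:  # one feature vector per observed frame
        pre = [a + b for a, b in zip(matvec(W_in, x), matvec(W_h, h))]
        h = [math.tanh(p) for p in pre]  # state carries the whole history
    return matvec(W_out, h)

frames = [[rng.gauss(0, 1) for _ in range(FEAT)] for _ in range(12)]
scores = anticipate(frames)
pred = max(range(CLASSES), key=lambda c: scores[c])
print(len(scores), pred)
```

The recurrent state is what makes the history available at prediction time; transformer variants replace the recurrence with attention over the observed frames.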
“…In the last few years, due to its importance for effective interaction, action anticipation has been addressed by many researchers [19,20,21,22,23,24].…”
Section: Related Work
Confidence: 99%