2019
DOI: 10.48550/arxiv.1901.03728
Preprint

Anticipation and next action forecasting in video: an end-to-end model with memory

Abstract: Action anticipation and forecasting in videos do not require a hat-trick, as far as there are signs in the context to foresee how actions are going to be deployed. Capturing these signs is hard because the context includes the past. We propose an end-to-end network for action anticipation and forecasting with memory, to both anticipate the current action and foresee the next one. Experiments on action sequence datasets show excellent results indicating that training on histories with a dynamic memory can signi…

Cited by 7 publications (14 citation statements)
References 38 publications
“…Recent work to recognize or anticipate actions in egocentric video adopts state-of-the-art video models from third-person video, like two-stream networks [42,47], 3DConv models [6,54,49], or recurrent networks [15,16,62,66]. In contrast, our model grounds first-person activity in a persistent topological encoding of the environment.…”
Section: Related Work
Confidence: 99%
“…Graph-based methods encode relationships between detected objects: nodes are objects or actors, and edges specify their spatio-temporal layout or semantic relationships (e.g., is-holding) [68,4,46,72]. Architectures for composite activity aggregate action primitives across the video [17,30,31], memory-based models record a recurrent network's state [54], and 3D convnets augmented with long-term feature banks provide temporal context [69]. Unlike any of the above, our approach encodes video in a human-centric manner according to how people use a space.…”
Section: Structured Video Representations
Confidence: 99%
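The graph-based encodings described in the statement above can be pictured as a small typed graph. The following toy sketch is purely illustrative (all labels are made up, not taken from any of the cited works): detected objects and actors become nodes, and typed edges record semantic relations such as is-holding.

```python
# Hypothetical toy illustration of a graph-based scene encoding:
# detected objects/actors are nodes, typed edges record semantic
# relations (e.g. is-holding). All labels here are invented.
nodes = {0: "person", 1: "cup", 2: "table"}
edges = [
    (0, "is-holding", 1),  # person is-holding cup
    (1, "is-on", 2),       # cup is-on table
]

def relations_of(node_id):
    """Return (relation, target-label) pairs leaving a node."""
    return [(rel, nodes[dst]) for src, rel, dst in edges if src == node_id]

print(relations_of(0))  # → [('is-holding', 'cup')]
```

A real system would extend such edges with spatio-temporal attributes (frame index, bounding boxes) rather than bare labels.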
“…The task in next action anticipation is to predict the upcoming action τ seconds before it occurs. Various architectures have been proposed, ranging from recurrent neural networks (RNNs) [27,28,29,30] and convolutional networks combined with RNNs [2] to transformers [31]. The main focus of these works is to extract relevant information from the observations to predict the label of the action starting in τ seconds, with τ ranging from zero [32] to tens of seconds [33].…”
Section: Related Work
Confidence: 99%
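The RNN-based anticipation setup summarized in the statement above can be sketched in a few lines. This is a minimal pure-Python illustration, not any cited model: all names and sizes are hypothetical, and the weights are random, whereas a real model would be trained on pairs of (observed history, label of the action starting τ seconds later).

```python
import math
import random

# Hypothetical sizes: 8-d frame features, 16-d hidden state, 5 action classes.
rng = random.Random(0)
FEAT, HID, CLASSES = 8, 16, 5

def mat(rows, cols):
    """Random weight matrix (stand-in for trained parameters)."""
    return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

W_in, W_h, W_out = mat(HID, FEAT), mat(HID, HID), mat(CLASSES, HID)

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def anticipate(frames):
    """Fold observed frame features into a recurrent state, then score
    the label of the action assumed to start tau seconds later."""
    h = [0.0] * HID
    for x in frames:  # one feature vector per observed frame
        pre = [a + b for a, b in zip(matvec(W_in, x), matvec(W_h, h))]
        h = [math.tanh(p) for p in pre]  # state carries the whole history
    return matvec(W_out, h)

frames = [[rng.gauss(0, 1) for _ in range(FEAT)] for _ in range(12)]
scores = anticipate(frames)
pred = max(range(CLASSES), key=lambda c: scores[c])
print(len(scores), pred)
```

The recurrent state is what makes the history available at prediction time; transformer variants replace the recurrence with attention over the observed frames.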
“…In the last few years, due to its importance for effective interaction, action anticipation has been addressed by many researchers [19,20,21,22,23,24].…”
Section: Related Work
Confidence: 99%