2018
DOI: 10.1007/978-3-030-01216-8_13
Egocentric Activity Prediction via Event Modulated Attention

Cited by 38 publications (40 citation statements)
References 26 publications
“…Authors of [10,33,32] use top-down attention generated from the prior information encoded in a CNN pretrained for object recognition, while [15] uses gaze information for generating attention. The works of [23,26] use attention for weighting relevant frames, thereby adding temporal attention. This is based on the idea that not all frames present in a video are equally important for understanding the action being carried out.…”
Section: Attention
confidence: 99%
“…In [23], a series of temporal attention filters is learned that weight frame-level features depending on their relevance for identifying actions. [26] uses changes in gaze for generating the temporal attention. [17,5] apply attention along both the spatial and temporal dimensions to select relevant frames and the regions present in them.…”
Section: Attention
confidence: 99%
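As a rough illustration of the temporal-attention idea summarized above (weighting frame-level features by their relevance), the following sketch shows one common way such a module can be written; the class name, dimensions, and scoring MLP are illustrative assumptions, not the architecture of [23] or [26].

# Minimal temporal-attention sketch (hypothetical names, not a cited architecture).
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int = 128):
        super().__init__()
        # Small scoring MLP producing one scalar relevance score per frame.
        self.score = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim)
        scores = self.score(frame_feats)           # (batch, num_frames, 1)
        weights = torch.softmax(scores, dim=1)     # normalize over the frame axis
        # Relevance-weighted sum over frames -> one clip-level descriptor.
        return (weights * frame_feats).sum(dim=1)  # (batch, feat_dim)

# Example: 8 clips, 16 frames each, 2048-d per-frame CNN features.
clip_desc = TemporalAttention(2048)(torch.randn(8, 16, 2048))  # (8, 2048)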
“…Recurrent with attention: The temporal aspect of videos is further studied with recurrent attention mechanisms [3,39,48,47,25,11,56] that act to find the most informative parts in images (spatial attention) or the most informative frames throughout videos (temporal attention). An encoder-decoder scheme is described in [3] for the textual description of videos.…”
Section: Advances in First-Person Activity Recognition
confidence: 99%
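The step-wise attention used in such encoder-decoder schemes can be sketched as follows, assuming an additive (Bahdanau-style) scoring function; all names and sizes are hypothetical and not taken from [3].

# Sketch of step-wise (recurrent) temporal attention in an encoder-decoder;
# names and dimensions are illustrative, not the API of any cited paper.
import torch
import torch.nn as nn

class StepAttention(nn.Module):
    def __init__(self, enc_dim: int, dec_dim: int, attn_dim: int = 256):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, attn_dim)
        self.dec_proj = nn.Linear(dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, enc_feats: torch.Tensor, dec_state: torch.Tensor) -> torch.Tensor:
        # enc_feats: (batch, num_frames, enc_dim); dec_state: (batch, dec_dim)
        e = torch.tanh(self.enc_proj(enc_feats) + self.dec_proj(dec_state).unsqueeze(1))
        alpha = torch.softmax(self.v(e), dim=1)    # per-frame weights for this decoding step
        return (alpha * enc_feats).sum(dim=1)      # context vector fed to the decoder

# Example: 4 videos, 20 frames of 512-d features, 256-d decoder state.
ctx = StepAttention(512, 256)(torch.randn(4, 20, 512), torch.randn(4, 256))  # (4, 512)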
“…From the current and previous steps' embeddings, an attention mechanism selects the features that will be decoded as the optimal textual description of the current activity. The attention mechanism in [39] focuses on the frames that carry the action-specific information by learning the associations between the input gaze, the detected objects, and the segmented hands. The combined focus on these regions allows the network to discard redundant frames of the input video segment that would otherwise obfuscate the prediction task.…”
Section: Advances in First-Person Activity Recognition
confidence: 99%
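A minimal sketch of how gaze, object, and hand cues could be fused into a single spatial attention map is given below; the 1x1-convolution fusion and all names are assumptions made for illustration, not the actual mechanism of [39].

# Illustrative fusion of gaze, object, and hand cue maps into one spatial
# attention map; the 1x1-convolution mixing is an assumption, not the method of [39].
import torch
import torch.nn as nn

class CueFusionAttention(nn.Module):
    def __init__(self, num_cues: int = 3):
        super().__init__()
        # 1x1 convolution learns how to mix the cue maps into a single attention map.
        self.mix = nn.Conv2d(num_cues, 1, kernel_size=1)

    def forward(self, feats, gaze_map, obj_map, hand_map):
        # feats: (B, C, H, W) CNN features; each cue map: (B, 1, H, W) in [0, 1].
        cues = torch.cat([gaze_map, obj_map, hand_map], dim=1)  # (B, 3, H, W)
        attn = torch.sigmoid(self.mix(cues))                    # (B, 1, H, W)
        return feats * attn                                     # cue-modulated features

# Example: 2 frames with 512x14x14 features and three 14x14 cue maps.
B, C, H, W = 2, 512, 14, 14
out = CueFusionAttention()(torch.randn(B, C, H, W), torch.rand(B, 1, H, W),
                           torch.rand(B, 1, H, W), torch.rand(B, 1, H, W))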
“…(1) The action recognition of a camera wearer [7,28,29]. (2) The interaction recognition between a camera wearer's hand and objects [32,33,34,35,36]. This research is focused on actions related to "How do I interact with what type of objects?"…”
Section: Related Work
confidence: 99%