2018
DOI: 10.1007/978-3-030-01216-8_13
Egocentric Activity Prediction via Event Modulated Attention

Cited by 38 publications (40 citation statements)
References 26 publications
“…Authors of [10,33,32] use top-down attention generated from the prior information encoded in a CNN pretrained for object recognition, while [15] uses gaze information for generating attention. The works of [23,26] use attention for weighting relevant frames, thereby adding temporal attention. This is based on the idea that not all frames present in a video are equally important for understanding the action being carried out.…”
Section: Attention
confidence: 99%
“…In [23], a series of temporal attention filters is learned that weight frame-level features depending on their relevance for identifying actions. [26] uses changes in gaze for generating the temporal attention. [17,5] apply attention along both the spatial and temporal dimensions to select relevant frames and the regions present in them.…”
Section: Attention
confidence: 99%
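As a rough illustration of the temporal-attention idea summarized above (weighting frame-level features by their relevance), the following sketch shows one common way such a module can be written; the class name, dimensions, and scoring MLP are illustrative assumptions, not the architecture of [23] or [26].

# Minimal temporal-attention sketch (hypothetical names, not a cited architecture).
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int = 128):
        super().__init__()
        # Small scoring MLP producing one scalar relevance score per frame.
        self.score = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim)
        scores = self.score(frame_feats)           # (batch, num_frames, 1)
        weights = torch.softmax(scores, dim=1)     # normalize over the frame axis
        # Relevance-weighted sum over frames -> one clip-level descriptor.
        return (weights * frame_feats).sum(dim=1)  # (batch, feat_dim)

# Example: 8 clips, 16 frames each, 2048-d per-frame CNN features.
clip_desc = TemporalAttention(2048)(torch.randn(8, 16, 2048))  # (8, 2048)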
“…Recurrent with attention: The temporal aspect of videos is further studied with recurrent attention mechanisms [3,39,48,47,25,11,56] that act to find the most informative parts in images (spatial attention) or the most informative frames throughout videos (temporal attention). An encoder-decoder scheme is described in [3] for the textual description of videos.…”
Section: Advances in First-Person Activity Recognition
confidence: 99%
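The step-wise attention used in such encoder-decoder schemes can be sketched as follows, assuming an additive (Bahdanau-style) scoring function; all names and sizes are hypothetical and not taken from [3].

# Sketch of step-wise (recurrent) temporal attention in an encoder-decoder;
# names and dimensions are illustrative, not the API of any cited paper.
import torch
import torch.nn as nn

class StepAttention(nn.Module):
    def __init__(self, enc_dim: int, dec_dim: int, attn_dim: int = 256):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, attn_dim)
        self.dec_proj = nn.Linear(dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, enc_feats: torch.Tensor, dec_state: torch.Tensor) -> torch.Tensor:
        # enc_feats: (batch, num_frames, enc_dim); dec_state: (batch, dec_dim)
        e = torch.tanh(self.enc_proj(enc_feats) + self.dec_proj(dec_state).unsqueeze(1))
        alpha = torch.softmax(self.v(e), dim=1)    # per-frame weights for this decoding step
        return (alpha * enc_feats).sum(dim=1)      # context vector fed to the decoder

# Example: 4 videos, 20 frames of 512-d features, 256-d decoder state.
ctx = StepAttention(512, 256)(torch.randn(4, 20, 512), torch.randn(4, 256))  # (4, 512)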
“…From the current and previous steps' embeddings, an attention mechanism selects the features that will be decoded as the optimal textual description of the current activity. The attention mechanism in [39] focuses on the frames that carry the action-specific information by learning the associations between the input gaze, the detected objects, and the segmented hands. The combined focus on these regions allows the network to discard redundant frames of the input video segment that would otherwise obfuscate the prediction task.…”
Section: Advances in First-Person Activity Recognition
confidence: 99%
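A minimal sketch of how gaze, object, and hand cues could be fused into a single spatial attention map is given below; the 1x1-convolution fusion and all names are assumptions made for illustration, not the actual mechanism of [39].

# Illustrative fusion of gaze, object, and hand cue maps into one spatial
# attention map; the 1x1-convolution mixing is an assumption, not the method of [39].
import torch
import torch.nn as nn

class CueFusionAttention(nn.Module):
    def __init__(self, num_cues: int = 3):
        super().__init__()
        # 1x1 convolution learns how to mix the cue maps into a single attention map.
        self.mix = nn.Conv2d(num_cues, 1, kernel_size=1)

    def forward(self, feats, gaze_map, obj_map, hand_map):
        # feats: (B, C, H, W) CNN features; each cue map: (B, 1, H, W) in [0, 1].
        cues = torch.cat([gaze_map, obj_map, hand_map], dim=1)  # (B, 3, H, W)
        attn = torch.sigmoid(self.mix(cues))                    # (B, 1, H, W)
        return feats * attn                                     # cue-modulated features

# Example: 2 frames with 512x14x14 features and three 14x14 cue maps.
B, C, H, W = 2, 512, 14, 14
out = CueFusionAttention()(torch.randn(B, C, H, W), torch.rand(B, 1, H, W),
                           torch.rand(B, 1, H, W), torch.rand(B, 1, H, W))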
“…(1) The action recognition of a camera wearer [7,28,29]. (2) The interaction recognition between a camera wearer's hand and objects [32,33,34,35,36]. This research is focused on actions related to "How do I interact with what type of objects?"…”
Section: Related Work
confidence: 99%