2023
DOI: 10.1109/tpami.2021.3058649
Learning to Recognize Actions on Objects in Egocentric Video With Attention Dictionaries

Abstract: We present EgoACO, a deep neural architecture for video action recognition that learns to pool action-context-object descriptors from frame-level features by leveraging the verb-noun structure of action labels in egocentric video datasets. The core component of EgoACO is class activation pooling (CAP), a differentiable pooling operation that combines ideas from bilinear pooling for fine-grained recognition and from feature learning for discriminative localization. CAP uses self-attention with a dictionary of l…
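The abstract is truncated above, but the pooling mechanism it describes can be sketched. Below is a minimal, assumed PyTorch illustration of attention pooling with a learned dictionary; the class name, dimensions, scoring function, and the final averaging step are all hypothetical choices, not the authors' implementation of CAP:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DictionaryAttentionPool(nn.Module):
    """Hypothetical sketch of class activation pooling (CAP):
    a learned dictionary of key vectors attends over frame-level
    spatial features and pools them into a single descriptor."""
    def __init__(self, feat_dim: int, dict_size: int):
        super().__init__()
        # Learned dictionary: one key vector per latent concept.
        self.dictionary = nn.Parameter(torch.randn(dict_size, feat_dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_regions, feat_dim) frame-level features.
        # Attention of every dictionary entry over spatial regions.
        scores = torch.einsum('kd,bnd->bkn', self.dictionary, feats)
        attn = F.softmax(scores, dim=-1)           # (batch, k, n)
        pooled = torch.einsum('bkn,bnd->bkd', attn, feats)
        # Collapse per-concept descriptors into one vector (assumed).
        return pooled.mean(dim=1)                  # (batch, feat_dim)

# Usage: pool a 7x7 grid of 2048-d features per frame.
pool = DictionaryAttentionPool(feat_dim=2048, dict_size=16)
x = torch.randn(4, 49, 2048)
print(pool(x).shape)  # torch.Size([4, 2048])
```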

Cited by 14 publications (7 citation statements)
References 70 publications
“…In EK55 each action is described by the composition of a noun and a verb. To take this label structure into account both in prediction and for learning, we follow the "multihead prediction" three-branch design of [57].…”
Section: A Details On Multi-label Prediction
Mentioning, confidence: 99%
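The three-branch design this statement refers to can be illustrated with a short sketch. Here is an assumed PyTorch version of a multi-head verb/noun/action classifier; the head names, the shared-descriptor layout, and the class counts (chosen to resemble EPIC-Kitchens-style label sets) are illustrative assumptions, not taken from [57]:

```python
import torch
import torch.nn as nn

class MultiHeadActionClassifier(nn.Module):
    """Hypothetical three-branch head: one shared clip descriptor
    feeds separate verb, noun, and action classifiers, mirroring
    the verb-noun structure of egocentric action labels."""
    def __init__(self, feat_dim=2048, n_verbs=125, n_nouns=352, n_actions=2513):
        super().__init__()
        self.verb_head = nn.Linear(feat_dim, n_verbs)
        self.noun_head = nn.Linear(feat_dim, n_nouns)
        self.action_head = nn.Linear(feat_dim, n_actions)

    def forward(self, feats):
        # feats: (batch, feat_dim) pooled clip descriptor.
        return (self.verb_head(feats),
                self.noun_head(feats),
                self.action_head(feats))

# Training would typically sum a cross-entropy loss per branch.
model = MultiHeadActionClassifier()
v, n, a = model(torch.randn(2, 2048))
print(v.shape, n.shape, a.shape)
```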
“…Building on the seminal works [24,25], deep neural networks are used to extract features with high representational capacity. Sudhakaran et al. [26] presented EgoACO for video action recognition, which learns to pool action-context-object descriptors from frame-level features by leveraging the verb-noun structure of action labels. With the aid of the multiscale feature maps output by a fully convolutional network, the features of each individual across consecutive frames are fused by a recurrent network [17].…”
Section: Descriptor Learning Without Interaction
Mentioning, confidence: 99%
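The recurrent fusion mentioned in this excerpt can be sketched in a few lines. This is an assumed GRU-based version (the cited work [17] may use a different recurrent unit or readout); feature and frame dimensions are illustrative:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: per-frame feature vectors for one person
# are fused over time by a GRU; the final hidden state serves as
# the fused temporal descriptor.
gru = nn.GRU(input_size=512, hidden_size=512, batch_first=True)
frames = torch.randn(8, 16, 512)   # (batch, num_frames, feat_dim)
_, h_n = gru(frames)
fused = h_n.squeeze(0)             # (batch, 512) fused descriptor
print(fused.shape)
```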
“…Also, a framework using Long Short-Term Memory (LSTM) is presented where no scene labelling is necessary, although, as Figure 2.9 shows, objects are filtered depending on the scene. Finally, before introducing the object detection methods, it is worth mentioning EgoACO [29], whose authors present a deep learning architecture for recognizing actions on objects in egocentric videos. Through class activation pooling with attention dictionaries, the most relevant feature regions are used to decode object and scene descriptors.…”
Section: Object and Hand Recognition Approaches
Mentioning, confidence: 99%