First-Person Action Decomposition and Zero-Shot Learning

Zhang, Yun C.; Li, Yin; Rehg, James M.

doi:10.1109/wacv.2017.21

Cited by 7 publications

(8 citation statements)

References 39 publications


“…Zero-shot learning models (Zhang, Li, and Rehg 2017; Jain et al 2015; Liu, Kuipers, and Savarese 2011) do not require as much supervision and learn semantic correspondences that extend beyond training classes to unseen test classes. The common approaches are to either use an attribute space or embedding space that captures the semantics of a scene and helps extend beyond the training label by exploiting the semantic correspondences across classes.…”

Section: Related Work (mentioning)

confidence: 99%

“…GTEA Gaze contains 10 different verbs and 38 different nouns, while GTEA Gaze+ contains 15 verbs and 27 nouns. We report results averaged over all subjects for a fair comparison with prior works (Ma, Fan, and Kitani 2016; Zhang, Li, and Rehg 2017), which use leave-one-out cross-validation. We also test our approach's generalization capability to scenes beyond egocentric videos for object detection with zero supervision.…”

Section: Experimental Evaluation Data (mentioning)

confidence: 99%

“…We use a two-stream CNN (Ma, Fan, and Kitani 2016) as our fully supervised baseline, given its tremendous success in action recognition. We also establish a zero-shot baseline based on (Zhang, Li, and Rehg 2017) for comparison on the object recognition and activity recognition tasks.…”

Section: Metrics and Baselines (mentioning)

confidence: 99%

“…Advances in deep learning have enabled the development of models that have exhibited a remarkable tendency to recognize (Fathi, Li, and Rehg 2012; Liu, Kuipers, and Savarese 2011; Singh, Arora, and Jawahar 2016; Sudhakaran, Escalera, and Lanz 2019; Zhang, Li, and Rehg 2017) and localize actions (Aakur and Sarkar 2020; Jain et al 2015) in videos. However, they tend to experience errors when faced with scenes or examples beyond their initial training environment.…”

Section: Introduction (mentioning)

confidence: 99%


Knowledge guided learning: Open world egocentric action recognition with zero supervision

Aakur, Kundu, Gunti. 2022. Pattern Recognition Letters



“…Also the recipes are not treated as a strictly ordered set since recipe steps can be done out of order. First Person Vision: Our system is created for first person (FP) videos which have become more prevalent in the computer vision community in recent years [18,16,22,13,33]. We utilize the egocentric cues proposed by [16] in our method for action proposal generation.…”

Section: Related Work (mentioning)

confidence: 99%

Learning to Localize and Align Fine-Grained Actions to Sparse Instructions

Hahn, Ruiz, Alayrac et al. 2018. Preprint

Self Cite


Automatically generating textual video descriptions that are time-aligned with the video content is a long-standing goal in computer vision. The task is challenging due to the difficulty of bridging the semantic gap between the visual and natural language domains. This paper addresses the task of automatically generating an alignment between a set of instructions and a first person video demonstrating an activity. The sparse descriptions and ambiguity of written instructions create significant alignment challenges. The key to our approach is the use of egocentric cues to generate a concise set of action proposals, which are then matched to recipe steps using object recognition and computational linguistic techniques. We obtain promising results on both the Extended GTEA Gaze+ dataset and the Bristol Egocentric Object Interactions Dataset.
