2022
DOI: 10.1007/978-3-031-19778-9_38

My View is the Best View: Procedure Learning from Egocentric Videos

Abstract: Given multiple videos of the same task, procedure learning addresses identifying the key-steps and determining their order to perform the task. For this purpose, existing approaches use the signal generated from a pair of videos. This makes key-steps discovery challenging as the algorithms lack inter-videos perspective. Instead, we propose an unsupervised Graph-based Procedure Learning (GPL) framework. GPL consists of the novel UnityGraph that represents all the videos of a task as a graph to obtain both intra…
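The abstract describes the central design choice of GPL: instead of reasoning over pairs of videos, every video of a task is placed into one graph (UnityGraph) so that both intra-video and inter-video relations are available for key-step discovery. The excerpt does not give the actual construction, so the sketch below is only a hypothetical illustration of that idea: it assumes pre-computed frame embeddings, temporal edges between consecutive frames of the same video, and k-nearest-neighbour cosine-similarity edges across videos. The function name, parameters, and edge rules are assumptions for illustration, not the paper's method.

```python
# Hypothetical sketch of "all videos of a task in one graph" (not the paper's
# actual UnityGraph construction): temporal edges within each video plus
# k-NN similarity edges across videos, over pre-computed frame embeddings.
import numpy as np

def build_task_graph(video_embeddings, k=5):
    """video_embeddings: list of (T_i, D) arrays, one per video of the task.

    Returns node features (N, D) and an edge list over all frames, where
    intra-video edges link temporally adjacent frames and inter-video edges
    link each frame to its k most similar frames in the other videos.
    """
    nodes = np.concatenate(video_embeddings, axis=0)           # (N, D)
    offsets = np.cumsum([0] + [v.shape[0] for v in video_embeddings])
    edges = []

    # Intra-video (temporal) edges: frame t <-> frame t+1 within each video.
    for vid, emb in enumerate(video_embeddings):
        start = offsets[vid]
        for t in range(emb.shape[0] - 1):
            edges.append((start + t, start + t + 1))

    # Inter-video (semantic) edges: connect each frame to its k nearest
    # neighbours (cosine similarity) among frames of the *other* videos.
    normed = nodes / (np.linalg.norm(nodes, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T                                     # (N, N)
    for vid in range(len(video_embeddings)):
        lo, hi = offsets[vid], offsets[vid + 1]
        sim[lo:hi, lo:hi] = -np.inf                             # mask own video
    for i in range(nodes.shape[0]):
        for j in np.argsort(-sim[i])[:k]:
            edges.append((i, int(j)))

    return nodes, edges

# Example with two short videos of 512-D frame embeddings (random, for illustration):
# nodes, edges = build_task_graph([np.random.randn(40, 512), np.random.randn(55, 512)])
```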

Cited by 19 publications (8 citation statements)
References 86 publications

Citation statements, ordered by relevance:
“…It has 11 actions and on average 33 segments per video. EgoProceL [3] is an egocentric dataset featuring diverse tasks, such as repairing cars, assembling toys and cooking. It has 1055 videos, 130 actions and on average 21 segments per video.…”
Section: Methods (mentioning)
confidence: 99%
“…The result of UVAST is not reported on EPIC-Kitchen as we found its sequence decoder has difficulty learning the large number of segments in the videos and thus cannot converge well. We include more implementation details in the supplementary materials.…”
Section: Methods (mentioning)
confidence: 99%
“…Another approach to egocentric action recognition is to consider it as a procedural problem and learn the key steps required to perform a task upon observing multiple egocentric videos, as done in Bansal et al. (2022). This work is restricted to procedural tasks but is an avenue for exploration as opposed to recognising isolated actions.…”
Section: Action Recognition (mentioning)
confidence: 99%
“…Procedure Learning from Instructional Videos. Recent works have attempted to learn procedures from instructional videos [2,5,13,19,27]. Most notably, [5] generates a sequence of actions given a start and a goal image.…”
Section: Related Work (mentioning)
confidence: 99%
“…Most notably, [5] generates a sequence of actions given a start and a goal image. [2] finds temporal correspondences between key steps across multiple videos while [19] distinguishes pairs of videos performing the same sequence of actions from negative ones. [13] uses distant supervision from WikiHow to localize steps in instructional videos.…”
Section: Related Work (mentioning)
confidence: 99%