Hierarchical Modeling for Task Recognition and Action Segmentation in Weakly-Labeled Instructional Videos

Ghoddoosian, Reza; Sayed, Saif; Athitsos, Vassilis

doi:10.48550/arxiv.2110.05697

Cited by 1 publication

(1 citation statement)

References 46 publications

(121 reference statements)

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Instructional videos are generally accompanied with explanations such as audio or narrations matching the timestamps of sequential actions, which has attracted the research interest in the video understanding community. For instance, step localization [59,86,106] as well as action segmentation [27,29,62,86] in instructional videos have been widely studied in the early stage. With the growing attention paid to this research topic, various kinds of tasks related to instructional videos have been proposed, e.g., video captioning [37,57,87,106] which generates the description of a video based on the actions and events, visual grounding [35,77] which locates the target in an image according to the language description, and procedure learning [2,22,27,72,73,106] which extracts key-steps in videos.…”

Section: Related Workmentioning

confidence: 99%

SVIP: Sequence VerIfication for Procedures in Videos

Qian¹,

Luo²,

Lian³

et al. 2021

Preprint

View full text Add to dashboard Cite

In this paper, we propose a novel sequence verification task that aims to distinguish positive video pairs performing the same action sequence from negative ones with step-level transformations but still conducting the same task. Such a challenging task resides in an open-set setting without prior action detection or segmentation that requires event-level or even frame-level annotations. To that end, we carefully reorganize two publicly available action-related datasets with step-procedure-task structure. To fully investigate the effectiveness of any method, we collect a scripted video dataset enumerating all kinds of step-level transformations in chemical experiments. Besides, a novel evaluation metric Weighted Distance Ratio is introduced to ensure equivalence for different step-level transformations during evaluation. In the end, a simple but effective baseline based on the transformer with a novel sequence alignment loss is introduced to better characterize long-term dependency between steps, which outperforms other action recognition methods. Codes and data will be released.

show abstract