2021
DOI: 10.48550/arxiv.2110.05697
Preprint

Hierarchical Modeling for Task Recognition and Action Segmentation in Weakly-Labeled Instructional Videos

Abstract: This paper focuses on task recognition and action segmentation in weakly-labeled instructional videos, where only the ordered sequence of video-level actions is available during training. We propose a two-stream framework, which exploits semantic and temporal hierarchies to recognize top-level tasks in instructional videos. Further, we present a novel top-down weakly-supervised action segmentation approach, where the predicted task is used to constrain the inference of fine-grained action sequences. Experime…
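The abstract's top-down idea, a recognized task constraining which fine-grained actions may appear during segmentation, can be illustrated with a rough sketch. The code below is not the authors' model: TASK_GRAMMARS, predict_task, frame_scores, and the greedy per-frame decoding are simplified placeholders, assuming only that each top-level task is associated with a subset of the action vocabulary.

```python
# Minimal sketch (not the paper's implementation) of top-down constrained
# action segmentation: a predicted task label masks out actions that do not
# belong to that task before frame-wise decoding.

import numpy as np

# Hypothetical mapping from each top-level task to the action indices it may contain.
TASK_GRAMMARS = {
    "make_coffee": [0, 3, 5, 7],
    "change_tire": [1, 2, 4, 6],
}

def predict_task(video_features: np.ndarray) -> str:
    """Stand-in for the task-recognition stream; a real system would classify the whole video."""
    return "make_coffee"  # placeholder prediction, input is ignored here

def constrained_action_decoding(frame_scores: np.ndarray, task: str) -> np.ndarray:
    """Pick the best allowed action per frame, masking actions outside the predicted task."""
    allowed = TASK_GRAMMARS[task]
    masked = np.full_like(frame_scores, -np.inf)
    masked[:, allowed] = frame_scores[:, allowed]  # keep only in-task action scores
    return masked.argmax(axis=1)                   # greedy per-frame action labels

if __name__ == "__main__":
    num_frames, num_actions = 100, 8
    rng = np.random.default_rng(0)
    frame_scores = rng.standard_normal((num_frames, num_actions))  # e.g. classifier logits
    task = predict_task(frame_scores)
    labels = constrained_action_decoding(frame_scores, task)
    print(task, labels[:10])
```

In the paper the inference is over ordered video-level action sequences rather than independent frames; the masking step here only shows how a task prediction can prune the space of admissible actions.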

Cited by 1 publication (1 citation statement)
References 46 publications (121 reference statements)
“…Instructional videos are generally accompanied by explanations such as audio or narrations matching the timestamps of sequential actions, which has attracted research interest in the video understanding community. For instance, step localization [59,86,106] as well as action segmentation [27,29,62,86] in instructional videos were widely studied at an early stage. With the growing attention paid to this research topic, various tasks related to instructional videos have been proposed, e.g., video captioning [37,57,87,106], which generates a description of a video based on its actions and events; visual grounding [35,77], which locates a target in an image according to a language description; and procedure learning [2,22,27,72,73,106], which extracts key steps in videos.…”
Section: Related Work (mentioning)
confidence: 99%