2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00129
A Perceptual Prediction Framework for Self Supervised Event Segmentation

Abstract: Temporal segmentation of long videos is an important problem that has largely been tackled through supervised learning, often requiring large amounts of annotated training data. In this paper, we tackle the problem of self-supervised temporal segmentation, which alleviates the need for any supervision in the form of labels (full supervision) or temporal ordering (weak supervision). We introduce a self-supervised, predictive learning framework that draws inspiration from cognitive psychology to segment long, visu…

Cited by 55 publications (21 citation statements)
References 39 publications
“…However, these narration or subtitles may be inaccurate [82] or even irrelevant to the video as we mention above. For the action segmentation task, Aakur et al [5] presented a self-supervised and predictive learning framework to explore the spatial-temporal dynamics of the videos, while Sener et al [58] proposed a Generalized Mallows Model (GMM) to model the distribution over sub-activity permutations. More recently, Kukleva et al [37] first learned a continuous temporal embedding of frame-based features, and then decoded the videos into coherent action segments according to an ordered clustering of these features.…”
Section: Methods For Instructional Video Analysis
confidence: 99%
“…(2) SSN [80]. This is an effective model for action detection, which outputs the same type of results (interval and label for each action). We present a table to clarify the goal, metric, and evaluated methods for each task in supplementary material.…”
Section: Evaluation On Step Localization
confidence: 99%
“…Unsupervised learning-based approaches recently received attention [15], [16], [17], [35]. One line of work targets keyframe localization in videos [19], [20].…”
Section: Related Work
confidence: 99%
“…First, video-level matching [35], [23] only matches the labeled actions with respect to the ground truth actions of that given video; this granularity of matching produces very high performance since it somehow simplifies this task in the sense that it is reluctant to associate actions in any pair of videos even within the same complex activity.…”
Section: Hungarian Matching Hierarchies
confidence: 99%
“…Previous work mainly divides the problem of video object detection on key frames into sub-problems of key frame recognition [13]- [17] and object detection. They propose techniques such as self-supervised learning [18], semi-supervised learning [19], label propagation [20], registration [21], and temporal cycle-consistency [22]. However, the inherent predictive uncertainties in video landmark measurements are generally ignored in the existing literature.…”
Section: Introduction
confidence: 99%