2020
DOI: 10.1007/978-3-030-58548-8_25
SF-Net: Single-Frame Supervision for Temporal Action Localization

Cited by 89 publications (91 citation statements)
References 39 publications
“…Besides, our iterative approach takes around 4.6 minutes to train even on CPU. Our method can also be used to improve other single frame methods [10]. Compared to fully supervised methods, our method gives good performance while utilizing significantly less annotation effort.…”
Section: Methods (mentioning)
confidence: 97%
“…to optimize the segment length and recognize human actions with fewer frames [8,9]. Using a single timestamp instead of start and end time for action recognition has been shown to be a reasonable compromise between performance and annotation effort [10]. In this paper, we question the need for more complex methods, and evaluate an extremely simple idea: We propose labeling a single action frame as "key frame" inside an action's temporal window (Figure 1) and evaluate the simplest approach we could find: Positive Unlabeled (PU) learning to detect action frames.…”
Section: Introduction (mentioning)
confidence: 99%
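The citing work above describes detecting action frames via Positive-Unlabeled (PU) learning from a single labelled "key frame" per action. As a minimal sketch of how such a per-frame PU objective can look, the snippet below uses the non-negative PU risk estimator (nnPU); the choice of estimator, the class prior `prior`, and the toy linear classifier are assumptions for illustration, not the citing paper's implementation.

```python
# Minimal sketch of PU learning for per-frame action detection.
# NOT the cited paper's code: the nnPU estimator, the class prior
# `prior`, and the toy classifier below are illustrative assumptions.
import torch


def nnpu_loss(pos_scores, unl_scores, prior):
    """Non-negative PU risk with the logistic surrogate loss.

    pos_scores: classifier scores for the labelled key frames (positives).
    unl_scores: scores for all remaining, unlabelled frames.
    prior:      assumed fraction of action frames among the unlabelled.
    """
    softplus = torch.nn.functional.softplus
    # Risk on the labelled positives, weighted by the class prior.
    risk_pos = prior * softplus(-pos_scores).mean()
    # Negative-class risk estimated from unlabelled data, clamped at
    # zero so it cannot go negative (the nnPU correction).
    risk_neg = softplus(unl_scores).mean() - prior * softplus(pos_scores).mean()
    return risk_pos + torch.clamp(risk_neg, min=0.0)


# Toy usage: one score per frame from a hypothetical linear classifier.
feats = torch.randn(1000, 64)               # 1000 frames, 64-d features
clf = torch.nn.Linear(64, 1)
scores = clf(feats).squeeze(-1)
key = torch.zeros(1000, dtype=torch.bool)
key[[12, 407, 850]] = True                  # one annotated frame per action
loss = nnpu_loss(scores[key], scores[~key], prior=0.3)
loss.backward()                             # gradients flow into `clf`
```

The clamp on the estimated negative risk is what keeps the unlabelled term from drifting negative when the labelled positives are as scarce as a single frame per action.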
“…TSS requires one frame label for each action. While the percentage of overall labelled frames is very small (0.03%), the annotation effort should not be underestimated: annotators must still watch all the videos, and labelling timestamp frames gives only a 6X speedup compared to densely labelling all frames (Ma et al. 2020).…”
Section: Introduction (mentioning)
confidence: 99%
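To make the 0.03% figure concrete, a rough back-of-the-envelope check (hypothetical video length and action count, not from the citing paper): a 30-minute video at 25 fps contains 30 × 60 × 25 = 45,000 frames, so labelling one timestamp for each of roughly 14 action instances marks 14 / 45,000 ≈ 0.03% of all frames.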
“…In practice, this type of weak label is akin to the time-stamp annotations used in weakly-supervised temporal action segmentation, in which an arbitrary frame from each action segment is labelled [8,9,10]. When annotating timestamps,…”
[Figure 1 of the citing paper: dense anticipation with full supervision vs. weak supervision]
Section: Introduction (mentioning)
confidence: 99%
“…annotators quickly go through a video and press a button when an action is occurring. This is ∼6x faster than marking the exact start and end frames of action segments [10] and still provides strong cues to learn effective models for action segmentation.…”
Section: Introduction (mentioning)
confidence: 99%