Fast Weakly Supervised Action Segmentation Using Mutual Consistency

Souri, Yaser; Fayyaz, Mohsen; Minciullo, Luca; Francesca, Gianpiero; Gall, Jürgen

doi:10.48550/arxiv.1904.03116

Cited by 4 publications

(10 citation statements)

References 25 publications


“…Finally, we apply the k-Means algorithm combined with the Silhouette Score to find the optimal number of clusters in which each cluster corresponds to…”

[Results-table fragment captured with the quote: … Unsupervised, 52.2 | 05 CDFL [22], Weakly Sup., 50.2 | 06 MuCon [37], Weakly Sup., 49.7 | 07 D3TW [6], Weakly Sup., …]

Section: Final Remarks (mentioning)

confidence: 99%

“…These subactions allow their model to learn fine-grained movements but still capture mid- and long-range temporal information frames. Another very recent proposal, by Souri et al. [37], utilizes a two-branch network where both try to predict the segmentation and to train it. They propose a novel mutual consistency loss (MuCon) to enforce consistency between the two predictions.…”

Section: Temporal Action Segmentation (mentioning)

confidence: 99%

“…Nevertheless, these solutions require frame-level or scene-level annotations that are incredibly laborious. For this reason, researchers started focusing on methods with less supervision, such as weakly-supervised [4,22,29,37] and unsupervised methods [23,33,40].…”

Section: Introduction (mentioning)

confidence: 99%


A Cluster-Based Method for Action Segmentation Using Spatio-Temporal and Positional Encoded Embeddings

Marques, Busson, Guedes, et al. 2021

Proceedings of the Brazilian Symposium on Multimedia and the Web


A crucial task to overall video understanding is the recognition and localisation in time of different actions or events that are present along the scenes. To address this problem, action segmentation must be achieved. Action segmentation consists of temporally segmenting a video by labeling each frame with a specific action. In this work, we propose a novel action segmentation method that requires no prior video analysis and no annotated data. Our method involves extracting spatio-temporal features from videos in samples of 0.5s using a pre-trained deep network. Data is then transformed using a positional encoder and finally a clustering algorithm is applied with the use of a silhouette score to find the optimal number of clusters where each cluster presumably corresponds to a different single and distinguishable action. In experiments, we show that our method produces competitive results on Breakfast and Inria Instructional Videos dataset benchmarks.

CCS Concepts: Computing methodologies → Neural networks; Cost-sensitive learning.
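
The cluster-count selection described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy 2-D points stand in for the encoded video features, the k-means uses a deterministic farthest-point initialisation chosen here for reproducibility, and the names `kmeans`, `silhouette` and `best_k` are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def init_centers(points, k):
    """Farthest-point initialisation (an assumption made for determinism)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, iters=50):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centers = init_centers(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the centre of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def best_k(points, k_values):
    """Pick the k whose clustering has the highest mean silhouette."""
    scores = {k: silhouette(points, kmeans(points, k)) for k in k_values}
    return max(scores, key=scores.get), scores


# Toy data: three well-separated groups standing in for three actions.
offsets = [(0, 0), (1, 0), (0, 1), (1, 1)]
bases = [(0, 0), (10, 0), (0, 10)]
points = [(bx + ox, by + oy) for bx, by in bases for ox, oy in offsets]
k, scores = best_k(points, range(2, 6))
```

On this toy data the three-cluster solution scores highest, because splitting or merging any group lowers the mean silhouette.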


“…Weakly supervised methods bypass per-frame annotations and use labels such as ordered lists of actions (Ding and Xu 2018; Richard et al. 2018; Chang et al. 2019; Li, Lei, and Todorovic 2019; Souri et al. 2019) or a small percentage of action time-stamps (Kuehne, Richard, and Gall 2018; Li, Farha, and Gall 2021; Chen et al. 2020a) for all videos.…”

Section: Related Work (mentioning)

confidence: 99%

Iterative Contrast-Classify for Semi-supervised Temporal Action Segmentation

Singhania, Rahaman, Yao 2022

AAAI


Temporal action segmentation classifies the action of each frame in (long) video sequences. Due to the high cost of frame-wise labeling, we propose the first semi-supervised method for temporal action segmentation. Our method hinges on unsupervised representation learning, which, for temporal action segmentation, poses unique challenges. Actions in untrimmed videos vary in length and have unknown labels and start/end times. Ordering of actions across videos may also vary. We propose a novel way to learn frame-wise representations from temporal convolutional networks (TCNs) by clustering input features with added time-proximity conditions and multi-resolution similarity. By merging representation learning with conventional supervised learning, we develop an "Iterative Contrast-Classify (ICC)" semi-supervised learning scheme. With more labelled data, ICC progressively improves in performance; ICC semi-supervised learning, with 40% labelled videos, performs similarly to fully-supervised counterparts. Our ICC improves MoF by {+1.8, +5.6, +2.5}% on Breakfast, 50Salads, and GTEA respectively for 100% labelled videos.


“…While these approaches have been very successful, they suffer from a slow inference time as they iterate over all the training transcripts and select the one with the highest score. Souri et al. [37] addressed this issue by predicting the transcript besides the frame-wise scores at inference time. While these approaches rely on a cheap transcript supervision, their performance is much worse than fully supervised approaches.…”

Section: Related Work (mentioning)

confidence: 99%

Temporal Action Segmentation from Timestamp Supervision

Li, Farha, Gall 2021

Preprint

Self Cite


Temporal action segmentation approaches have been very successful recently. However, annotating videos with frame-wise labels to train such models is very expensive and time-consuming. While weakly supervised methods trained using only ordered action lists require less annotation effort, the performance is still worse than fully supervised approaches. In this paper, we propose to use timestamp supervision for the temporal action segmentation task. Timestamps require a comparable annotation effort to weakly supervised approaches, and yet provide a more supervisory signal. To demonstrate the effectiveness of timestamp supervision, we propose an approach to train a segmentation model using only timestamp annotations. Our approach uses the model output and the annotated timestamps to generate frame-wise labels by detecting the action changes. We further introduce a confidence loss that forces the predicted probabilities to monotonically decrease as the distance to the timestamps increases. This ensures that all and not only the most distinctive frames of an action are learned during training. The evaluation on four datasets shows that models trained with timestamp annotations achieve comparable performance to the fully supervised approaches.
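
The monotonicity idea behind the confidence loss can be illustrated with a small sketch. The abstract does not give the loss formula, so the hinge-style penalty below is an assumption made for illustration only, and the function name `monotonicity_penalty` is hypothetical. It charges every increase in the predicted probability of the annotated action when moving away from the timestamp frame.

```python
def monotonicity_penalty(probs, timestamp):
    """Sum of probability increases when walking away from `timestamp`.

    probs: per-frame probabilities of the action annotated at `timestamp`.
    Returns 0.0 when the probabilities never rise with distance from the
    timestamp. This is an illustrative penalty, not the paper's exact loss.
    """
    penalty = 0.0
    # Walk to the right of the timestamp: each frame should not exceed
    # the frame before it.
    for t in range(timestamp + 1, len(probs)):
        penalty += max(0.0, probs[t] - probs[t - 1])
    # Walk to the left of the timestamp: each frame should not exceed
    # the frame after it.
    for t in range(timestamp - 1, -1, -1):
        penalty += max(0.0, probs[t] - probs[t + 1])
    return penalty


# A profile that peaks at the timestamp incurs no penalty.
peaked = monotonicity_penalty([0.1, 0.4, 0.9, 0.6, 0.2], timestamp=2)
# A second bump far from the timestamp is penalised by its rise (0.7 - 0.3).
bumpy = monotonicity_penalty([0.1, 0.5, 0.9, 0.3, 0.7], timestamp=2)
```

Penalising only the increases leaves the shape of the decay unconstrained, which matches the abstract's stated goal of covering all frames of an action and not just the most distinctive ones.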

