Fast Weakly Supervised Action Segmentation Using Mutual Consistency

Souri, Yaser; Fayyaz, Mohsen; Minciullo, Luca; Francesca, Gianpiero; Gall, Jürgen

doi:10.1109/tpami.2021.3089127

Cited by 37 publications

(20 citation statements)

References 33 publications


“…Specifically, [9] encodes the entire video first before decoding it to frame-level action scores. The work in [4,6,27,38,47] use Dynamic Programming (DP) to infer the most likely actions and their duration given the entire video. Our method also uses a DP-based framework, but to our knowledge, we are the first to introduce a weakly-supervised method to segment a streaming video in an online manner.…”

Section: Related Work (mentioning)

confidence: 99%

“…Another important consideration in action understanding relates to requirements for processing the videos online versus offline, which is not addressed in existing weakly-supervised segmentation methods [6,27,47]. Online processing with low latency is an increasingly important part of interactive applications where real-time, or near real-time feedback is critical.…”

Section: Introduction (mentioning)

confidence: 99%


Weakly-Supervised Online Action Segmentation in Multi-View Instructional Videos

Ghoddoosian, Dwivedi, Agarwal, et al. 2022

Preprint


This paper addresses a new problem of weakly-supervised online action segmentation in instructional videos. We present a framework to segment streaming videos online at test time using Dynamic Programming and show its advantages over greedy sliding window approach. We improve our framework by introducing the Online-Offline Discrepancy Loss (OODL) to encourage the segmentation results to have a higher temporal consistency. Furthermore, only during training, we exploit frame-wise correspondence between multiple views as supervision for training weakly-labeled instructional videos. In particular, we investigate three different multi-view inference techniques to generate more accurate frame-wise pseudo ground-truth with no additional annotation cost. We present results and ablation studies on two benchmark multi-view datasets, Breakfast and IKEA ASM. Experimental results show efficacy of the proposed methods both qualitatively and quantitatively in two domains of cooking and assembly.


“…Two types of human activity recognition can be distinguished: (1) video data based, e.g. [44] and [39] and (2) inertial sensor data based activity recognition. An inertial sensor consists of at least an accelerometer and a gyroscope, but is often supplemented by a magnetometer.…”

Section: Related Work (mentioning)

confidence: 99%

Tutorial on Deep Learning for Human Activity Recognition

Bock, Hoelzemann, Moeller, et al. 2021

Preprint


Activity recognition systems that are capable of estimating human activities from wearable inertial sensors have come a long way in the past decades. Not only have state-of-the-art methods moved away from feature engineering and have fully adopted end-to-end deep learning approaches, best practices for setting up experiments, preparing datasets, and validating activity recognition approaches have similarly evolved. This tutorial was first held at the 2021 ACM International Symposium on Wearable Computers (ISWC'21) and International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp'21). The tutorial, after a short introduction in the research field of activity recognition, provides a hands-on and interactive walk-through of the most important steps in the data pipeline for the deep learning of human activities. All presentation slides shown during the tutorial, which also contain links to all code exercises, as well as the link of the GitHub page of the tutorial can be found on: https://mariusbock.github.io/dl-for-har


“…They use a global length model for actions, which is updated during training. Souri et al. [34] introduce an end-to-end method which does not use any decoding during training. They use a combination of a sequence-to-sequence model on top of a temporal convolutional network to learn the given transcript of actions while learning to temporally segment the video.…”

Section: Related Work (mentioning)

confidence: 99%

“…Since acquiring such annotations is very expensive, several works investigated methods to learn the models with less supervision. An example of weakly annotated training data are videos where only transcripts are provided [20,12,27,29,8,4,34,24]. While transcripts of videos can be obtained from scripts or subtitles, they are still costly to obtain.…”

Section: Introduction (mentioning)

confidence: 99%

SCT: Set Constrained Temporal Transformer for Set Supervised Action Segmentation

Fayyaz, Gall 2020

Preprint

Self Cite


Temporal action segmentation is a topic of increasing interest; however, annotating each frame in a video is cumbersome and costly. Weakly supervised approaches therefore aim at learning temporal action segmentation from videos that are only weakly labeled. In this work, we assume that for each training video only the list of actions that occur in the video is given, but not when, how often, and in which order they occur. In order to address this task, we propose an approach that can be trained end-to-end on such data. The approach divides the video into smaller temporal regions and predicts for each region the action label and its length. In addition, the network estimates the action labels for each frame. By measuring how consistent the frame-wise predictions are with respect to the temporal regions and the annotated action labels, the network learns to divide a video into class-consistent regions. We evaluate our approach on three datasets where the approach achieves state-of-the-art results.
