2021
DOI: 10.1016/j.neucom.2021.04.121

Efficient Two-Step Networks for Temporal Action Segmentation

Cited by 28 publications (8 citation statements) · References 14 publications
“…Some TCN works [17,18,8] mainly focus on enlarging receptive fields to model long-term dependencies, using encoder structures, dilated convolutions, or deformable convolutions. [12,21] build their architectures on a two-branch approach: one branch exploits wide, long-term temporal receptive fields based on TCNs; the second exploits frame-level action boundaries via boundary regression.…”
Section: Action Segmentation (mentioning; confidence: 99%)
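The two ideas quoted above, a dilated-TCN branch for wide temporal context and a separate branch that scores action boundaries, can be made concrete with a short sketch. This is a minimal illustration in PyTorch (an assumption; the quote names no framework), and the class names, layer count, and single-stage layout are hypothetical simplifications rather than the cited papers' exact architectures.

```python
import torch
import torch.nn as nn

class DilatedResidualLayer(nn.Module):
    """One TCN layer: a dilated 1-D conv with a residual connection."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.out = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):                  # x: (batch, channels, frames)
        return x + self.out(torch.relu(self.conv(x)))

class TwoBranchSegmenter(nn.Module):
    """Hypothetical two-branch model: frame labels + boundary confidence."""
    def __init__(self, in_dim, channels, num_classes, num_layers=10):
        super().__init__()
        self.embed = nn.Conv1d(in_dim, channels, kernel_size=1)
        # Dilation doubles per layer, widening the receptive field
        self.tcn = nn.Sequential(*[DilatedResidualLayer(channels, 2 ** i)
                                   for i in range(num_layers)])
        self.cls_head = nn.Conv1d(channels, num_classes, kernel_size=1)
        self.boundary_head = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, x):                  # x: (batch, in_dim, frames)
        h = self.tcn(self.embed(x))
        return self.cls_head(h), torch.sigmoid(self.boundary_head(h))

# Usage: frame features in, per-frame class logits and boundary scores out
logits, bounds = TwoBranchSegmenter(2048, 64, 19)(torch.randn(1, 2048, 500))
```

The classification branch supplies frame-wise labels, while the boundary branch, trained against annotated action boundaries, can be used to refine or snap segment transitions.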
“…It has been a hot topic in human action analysis and is widely used in video surveillance [6], action teaching, and robotics [34]. Recently, some works [17,8,10,35,21] have studied the long-range dependencies between correlated actions in action segmentation, using temporal convolutional networks (TCNs) as their models. TCNs enlarge the long-term receptive field by stacking convolutional layers.…”
Section: Introduction (mentioning; confidence: 99%)
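The last sentence of that quote is easy to verify with arithmetic. Assuming kernel size 3 and a dilation factor that doubles per layer (a common MS-TCN-style configuration; the quote itself fixes no values), the receptive field after L layers is 2^(L+1) − 1 frames:

```python
def receptive_field(num_layers, kernel=3):
    """Receptive field of stacked dilated 1-D convs with doubling dilation."""
    rf = 1
    for i in range(num_layers):
        rf += (kernel - 1) * (2 ** i)   # each layer adds (k-1)*dilation frames
    return rf                           # equals 2**(num_layers + 1) - 1 for k=3

for layers in (4, 8, 10):
    print(layers, receptive_field(layers))   # -> 31, 511, 2047 frames
```

Ten layers already cover more than 2,000 frames, which is why deep dilated TCNs can model long-range dependencies without pooling away temporal resolution.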
“…If the model is applied to the full video, our model could use post-processing (Li et al. 2021) to improve its discrimination of action completeness. Evaluation Metrics: To evaluate results on the SVTAS task, we adopt several metrics: frame-wise accuracy (Acc) (Farha and Gall 2019); mean average precision at a temporal IoU of 0.5 (mAP@0.5) (Wang et al. 2022a); the area under the AR vs. AN curve (AUC), where AR is the average recall over temporal IoU thresholds [0.5:0.05:1.0] and AN is the average number of proposals per video, limited to 100 (Alwassel, Giancola, and Ghanem 2021); and the F1 score at temporal IoU threshold 0.1 (F1@0.1) (Li et al. 2022).…”
Section: Transeger (mentioning; confidence: 99%)
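Two of the quoted metrics are simple enough to sketch directly. The snippet below illustrates frame-wise accuracy and the segmental F1 at a temporal IoU threshold, following the widely used definitions from the action-segmentation literature; the helper names are ours, not from the cited papers. mAP@0.5 and the AR vs. AN AUC require ranked proposals and are omitted for brevity.

```python
import numpy as np

def frame_accuracy(pred, gt):
    """Fraction of frames whose predicted label matches the ground truth."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    return float((pred == gt).mean())

def to_segments(labels):
    """Collapse a frame-wise label sequence into (label, start, end) runs."""
    segs, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            segs.append((labels[start], start, t))
            start = t
    return segs

def f1_at_iou(pred, gt, iou=0.1):
    """Segmental F1: greedily match predicted to ground-truth segments."""
    p_segs, g_segs = to_segments(list(pred)), to_segments(list(gt))
    matched = [False] * len(g_segs)
    tp = 0
    for lbl, s, e in p_segs:
        best, best_j = 0.0, -1
        for j, (gl, gs, ge) in enumerate(g_segs):
            if gl != lbl or matched[j]:
                continue
            inter = max(0, min(e, ge) - max(s, gs))
            union = max(e, ge) - min(s, gs)
            if union and inter / union > best:
                best, best_j = inter / union, j
        if best_j >= 0 and best >= iou:   # at most one match per GT segment
            tp += 1
            matched[best_j] = True
    fp, fn = len(p_segs) - tp, len(g_segs) - tp
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

print(frame_accuracy([0, 0, 1, 1], [0, 1, 1, 1]))        # 0.75
print(f1_at_iou([0, 0, 1, 1], [0, 1, 1, 1], iou=0.1))    # 1.0
```

The example shows why the two metrics diverge: one misclassified frame costs 25% of frame accuracy, but both predicted segments still overlap their ground-truth counterparts above the 0.1 IoU threshold, so F1@0.1 stays at 1.0.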
“…The literature is rich in action recognition methodologies successfully applied to short-video analysis. In recent years, the focus has moved to the temporal segmentation of actions in long untrimmed videos [28]. In the Industry 4.0 domain, where collaborative tasks are performed by humans and robots under highly varying conditions, it is imperative to recognize the exact beginning and end of an action.…”
Section: Technical Validation (mentioning; confidence: 99%)