2023
DOI: 10.1109/tpami.2023.3284080

C2F-TCN: A Framework for Semi- and Fully-Supervised Temporal Action Segmentation

Abstract: Temporal action segmentation tags action labels for every frame in an input untrimmed video containing multiple actions in a sequence. For this task, we propose an encoder-decoder style architecture named C2F-TCN featuring a "coarse-to-fine" ensemble of decoder outputs. The C2F-TCN framework is enhanced with a novel, model-agnostic temporal feature augmentation strategy based on the computationally inexpensive stochastic max-pooling of segments. It produces more a…
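
The abstract only names the augmentation; as a purely illustrative aid, here is a minimal PyTorch sketch of what a segment-wise stochastic max-pooling of frame features could look like. The function name, the (C, T) feature shape, and the random-boundary sampling scheme are assumptions for illustration, not the paper's exact procedure.

```python
import torch

def stochastic_segment_maxpool(feats: torch.Tensor, num_segments: int) -> torch.Tensor:
    """Hypothetical sketch: max-pool frame features over randomly drawn
    temporal segments, giving a different down-sampled view per call.

    feats: (C, T) frame-wise features, assuming T >= num_segments.
    Returns: (C, num_segments).
    """
    C, T = feats.shape
    # Draw (num_segments - 1) distinct interior cut points and sort them.
    cuts = (torch.randperm(T - 1)[: num_segments - 1] + 1).sort().values
    bounds = torch.cat([torch.tensor([0]), cuts, torch.tensor([T])])
    # Max-pool each [start, end) segment along the temporal axis.
    pooled = [feats[:, s:e].max(dim=1).values
              for s, e in zip(bounds[:-1], bounds[1:])]
    return torch.stack(pooled, dim=1)
```

Because the boundaries are re-drawn on every call, two invocations on the same video yield two distinct pooled views, which is the kind of cheap, model-agnostic temporal augmentation the abstract describes.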

Cited by 7 publications (3 citation statements)
References 80 publications

“…Essentially, the output of each BottleNeck must be preserved and utilized as input for the following BottleNeck. Eventually, all corresponding feature channels are fused, resulting in dimensions of B × (N + 2)C/2 × H × W. After the input of the C2F module undergoes processing via a CBS, the output dimensions are transformed into B, C, H, and W, where B denotes the number of images, C indicates the number of channels, and H and W represent the height and width of the feature map, respectively [20]. The detailed structure is depicted in Figure 5.…”
Section: YOLOv8 Model
confidence: 99%
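
The quoted description fixes the channel bookkeeping of the C2F block but not its internals. The PyTorch sketch below reproduces that bookkeeping (split into two C/2 branches, preserve every BottleNeck output, concatenate (N + 2) · C/2 channels, fuse back to C with a CBS); the CBS and BottleNeck internals (kernel sizes, SiLU activation, residual add) are assumptions, not taken from the cited paper.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + SiLU, the 'CBS' unit named in the quote."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class BottleNeck(nn.Module):
    """Assumed form: two 3x3 CBS blocks with a residual connection."""
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(CBS(c, c, 3), CBS(c, c, 3))

    def forward(self, x):
        return x + self.block(x)

class C2F(nn.Module):
    """Sketch of a C2F block consistent with the quoted dimensions."""
    def __init__(self, c, n=2):  # c must be even
        super().__init__()
        self.stem = CBS(c, c)
        self.bottlenecks = nn.ModuleList(BottleNeck(c // 2) for _ in range(n))
        self.fuse = CBS((n + 2) * (c // 2), c)  # back to B x C x H x W

    def forward(self, x):
        a, b = self.stem(x).chunk(2, dim=1)   # two C/2-channel branches
        outs = [a, b]
        for m in self.bottlenecks:
            outs.append(m(outs[-1]))          # each output preserved and reused
        # (n + 2) tensors of C/2 channels -> B x (N + 2)C/2 x H x W, then fuse.
        return self.fuse(torch.cat(outs, dim=1))
```

For example, C2F(c=64, n=2) concatenates four 32-channel maps into a 128-channel tensor before the final CBS restores the 64-channel output the quote describes.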
“…Recently, semi-supervised approaches [16]- [18] for this task have attracted increasing attention, with a small percentage of labelled videos in the training set. Iterative-Contrast-Classify (ICC) [16] is the first attempt to explore semi-supervised learning for human action segmentation, which consists of two steps, i.e., unsupervised representation learning based on contrastive learning [19] (Fig.…”
Section: Introduction
confidence: 99%
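
The quote outlines ICC only at the level of its two alternating steps. Below is a schematic Python sketch of that loop, under the assumption that a shared backbone feeds both a contrastive head and a frame classifier; all names and the training structure are illustrative, not the authors' code.

```python
import torch

def icc_style_loop(backbone, classifier, unlabeled_loader, labeled_loader,
                   contrastive_loss, optimizer, num_rounds=3):
    """Hypothetical outline of an Iterative-Contrast-Classify style scheme:
    alternate unsupervised contrastive representation learning on all videos
    with supervised frame classification on the small labeled subset."""
    ce = torch.nn.CrossEntropyLoss()
    for _ in range(num_rounds):
        # Step 1: unsupervised representation learning (contrastive).
        for feats, aug_feats in unlabeled_loader:   # two views per video
            z1, z2 = backbone(feats), backbone(aug_feats)
            loss = contrastive_loss(z1, z2)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Step 2: frame-wise classification on the labeled fraction.
        for feats, frame_labels in labeled_loader:
            logits = classifier(backbone(feats))    # (T, num_classes)
            loss = ce(logits, frame_labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```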