2020 25th International Conference on Pattern Recognition (ICPR), 2021
DOI: 10.1109/icpr48806.2021.9412091
Temporal Attention-Augmented Graph Convolutional Network for Efficient Skeleton-Based Human Action Recognition

Cited by 21 publications (19 citation statements); references 23 publications.
“…(2) and as bilinear mapping of the form in Eq. (4), or if it is adequate to employ linear mappings, effectively leading to neural layers combining a Multilayer Perceptron block with a Temporal Convolution block, we conducted a second set of experiments. In this set of experiments, we used a 10-layer spatio-temporal bilinear network formed by the same data transformation sizes as in the first set of our experiments, but instead of using bilinear mappings with V^(l) = V for all 10 layers of the model, we use V^(l) = V, l = 1, …”

The results table embedded in this excerpt:

Method                   CS (%)   CV (%)   #Streams
HBRNN [8]                59.1     64.0     5
Deep LSTM [10]           60.7     67.3     1
ST-LSTM [9]              69.2     77.7     1
STA-LSTM [11]            73.4     81.2     1
VA-LSTM [12]             79.2     87.7     1
ARRN-LSTM [13]           80.7     88.8     2
2s-3DCNN [14]            66.8     72.6     2
TCN [15]                 74.3     83.1     1
Clips+CNN+MTLN [16]      79.6     84.8     1
Synthesized CNN [17]     80.0     87.2     1
3scale ResNet152 [18]    85.0     92.3     1
CNN+Motion+Trans [19]    83.2     89.3     2
ST-GCN [20]              81.5     88.3     1
DPRL+GCNN [25]           83.5     89.8     1
TA-GCN [26]              87.97    94.2     1
AS-GCN [24]              86.8     94.2     2
2s-AGCN [22]             88.5     95.1     2
2s-TA-GCN [26]           88.5     95.1     2
GCN-NAS [23]             89       …        …
Section: Methods (mentioning; confidence: 99%)
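The bilinear mapping discussed in this excerpt can be sketched in code. This is a minimal illustration assuming the layer form X^(l+1) = σ(W^(l) X^(l) V^(l)), with W^(l) acting on the joint axis and V^(l) on the temporal axis; the class name, shapes, and initialization are illustrative assumptions, not the cited paper's implementation. The design choice the experiment above compares is sharing one V across all 10 layers versus using layer-wise mappings V^(l).

```python
import torch

# One spatio-temporal bilinear layer: W mixes the joint axis on the
# left, V mixes the temporal axis on the right (hypothetical sketch).
class BilinearLayer(torch.nn.Module):
    def __init__(self, in_nodes, out_nodes, in_frames, out_frames):
        super().__init__()
        self.W = torch.nn.Parameter(0.01 * torch.randn(out_nodes, in_nodes))
        self.V = torch.nn.Parameter(0.01 * torch.randn(in_frames, out_frames))

    def forward(self, x):                       # x: (batch, in_nodes, in_frames)
        return torch.relu(self.W @ x @ self.V)  # (batch, out_nodes, out_frames)

layer = BilinearLayer(in_nodes=25, out_nodes=64, in_frames=300, out_frames=150)
print(layer(torch.randn(8, 25, 300)).shape)     # torch.Size([8, 64, 150])
```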
“…AS-GCN [24] extended the skeleton graphs to represent both structural links and actional links, and proposed an actional-structural graph convolutional network with an encoder-decoder structure to capture richer dependencies from actions. DPRL+GCNN [25] and TA-GCN [26] select the most informative skeletons in a sequence to make the inference process more efficient.…”
Section: Introduction (mentioning; confidence: 99%)
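A hedged sketch of the frame-selection idea attributed to DPRL+GCNN and TA-GCN above: score each frame with a small learned function, keep the top-k highest-scoring frames, and pass only those to the downstream GCN. The scoring layer, the value of k, and the tensor layout are assumptions for illustration, not the papers' exact modules.

```python
import torch

class TemporalFrameSelector(torch.nn.Module):
    """Keep the k most informative frames of a skeleton sequence (sketch)."""
    def __init__(self, channels, k):
        super().__init__()
        self.k = k
        self.score = torch.nn.Linear(channels, 1)   # per-frame relevance score

    def forward(self, x):                  # x: (batch, frames, channels)
        s = self.score(x).squeeze(-1)      # (batch, frames)
        idx = torch.topk(s, self.k, dim=1).indices
        idx = idx.sort(dim=1).values       # keep temporal order of chosen frames
        idx = idx.unsqueeze(-1).expand(-1, -1, x.size(-1))
        return torch.gather(x, 1, idx)     # (batch, k, channels)

sel = TemporalFrameSelector(channels=75, k=50)      # e.g. 25 joints x 3 coords
print(sel(torch.randn(4, 300, 75)).shape)           # torch.Size([4, 50, 75])
```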
“…GCN-based models for skeleton-based action recognition [15,16,18,22,23,27,28] operate on sequences of skeleton graphs. The spatio-temporal graph of skeletons G = (V, E) has the human body joint coordinates as nodes V and the spatial and temporal connections between them as edges E. Figure 2 (right) illustrates such a spatio-temporal graph where the spatial graph edges encode the human bones and the temporal edges connect the same joints in subsequent time-steps.…”
Section: A Spatio-temporal Graph Convolutional Network (mentioning; confidence: 99%)
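The spatio-temporal graph defined above can be made concrete with a small adjacency-construction sketch: joints are replicated over T frames, spatial edges follow a bone list, and temporal edges connect each joint to itself in the next frame. The 5-joint bone list below is a toy assumption, not an actual skeleton layout.

```python
import torch

def st_adjacency(bones, num_joints, T):
    """Adjacency of the spatio-temporal graph G = (V, E)."""
    N = num_joints * T                   # one node per joint per frame
    A = torch.zeros(N, N)
    for t in range(T):
        o = t * num_joints
        for i, j in bones:               # spatial edges: the human bones
            A[o + i, o + j] = A[o + j, o + i] = 1.0
        if t + 1 < T:                    # temporal edges: same joint,
            for v in range(num_joints):  # consecutive time-steps
                A[o + v, o + num_joints + v] = 1.0
                A[o + num_joints + v, o + v] = 1.0
    return A

A = st_adjacency(bones=[(0, 1), (1, 2), (1, 3), (1, 4)], num_joints=5, T=3)
print(A.shape)   # torch.Size([15, 15])
```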
“…Unfortunately, the high computational complexity of these GCN-based methods makes them infeasible in real-time applications and resource-constrained online inference settings. Multiple approaches have recently been explored to increase the efficiency of skeleton-based action recognition: GCN-NAS [22] and PST-GCN [23] are neural-architecture-search-based methods that try to find an optimized ST-GCN architecture to increase the efficiency of the classification task; ShiftGCN [24] replaces graph and temporal convolutions with a zero-FLOPs shift graph operation and pointwise convolutions as an efficient alternative to the feature-propagation rule for GCNs [25]; ShiftGCN++ [26] boosts the efficiency of ShiftGCN further via progressive architecture search, knowledge distillation, explicit spatial positional encodings, and a Dynamic Shift Graph Convolution; SGN [27] utilizes semantic information such as joint type and frame index as side information to design a compact semantics-guided neural network (SGN) that captures both spatial and temporal correlations at the joint and frame levels; TA-GCN [28] tries to make inference more efficient by selecting from a sequence a subset of key skeletons, which hold the most important features for action recognition, to be processed by the spatio-temporal convolutions.…”
Section: Introduction (mentioning; confidence: 99%)
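The shift idea attributed to ShiftGCN above can be illustrated on the temporal axis: re-index channel groups one frame forward or backward (pure data movement, zero FLOPs) and let a 1x1 pointwise convolution mix the shifted features. Splitting channels into three equal groups is an assumption for illustration, not the paper's exact shift scheme.

```python
import torch

def temporal_shift(x):                        # x: (batch, channels, frames, joints)
    """Shift one channel group forward in time, one backward, keep the rest."""
    c = x.size(1) // 3
    out = torch.zeros_like(x)
    out[:, :c, 1:] = x[:, :c, :-1]            # group 1: shifted forward in time
    out[:, c:2 * c, :-1] = x[:, c:2 * c, 1:]  # group 2: shifted backward
    out[:, 2 * c:] = x[:, 2 * c:]             # remaining channels unchanged
    return out

pointwise = torch.nn.Conv2d(64, 64, kernel_size=1)  # mixes shifted channels
y = pointwise(temporal_shift(torch.randn(2, 64, 300, 25)))
print(y.shape)    # torch.Size([2, 64, 300, 25])
```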
“…More directly related to the problem of human activity recognition, state-of-the-art results are based on Graph Convolutional Networks (GCNs), which treat the human skeleton joints as graph nodes and their connections (bones) as graph edges. For instance, the authors in [13] proposed a module to select the most informative frames in a skeleton sequence and fused it with a GCN module; the authors in [14] presented a GCN architecture that fuses information from both nodes and skeleton edges; and the authors in [15] introduced the spatial-temporal GCN, which applies graph convolutions in the spatial domain and regular convolutions in the temporal domain.…”
Section: Introduction (mentioning; confidence: 99%)
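The spatial/temporal split described for the spatial-temporal GCN in [15] can be sketched as a single block: propagate features over a degree-normalized adjacency, apply a 1x1 convolution per node, then run a regular convolution along the temporal axis. The normalization, the 9-frame temporal kernel, and the toy adjacency are assumptions for illustration, not the paper's exact block.

```python
import torch

class STGCNBlock(torch.nn.Module):
    def __init__(self, in_ch, out_ch, A):
        super().__init__()
        deg = A.sum(dim=1).clamp(min=1)
        self.register_buffer("A_hat", A / deg.unsqueeze(1))  # row-normalized
        self.spatial = torch.nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.temporal = torch.nn.Conv2d(out_ch, out_ch, kernel_size=(9, 1),
                                        padding=(4, 0))

    def forward(self, x):                                 # x: (batch, C, T, V)
        x = torch.einsum("nctv,vw->nctw", x, self.A_hat)  # graph propagation
        x = torch.relu(self.spatial(x))                   # per-node features
        return torch.relu(self.temporal(x))               # convolve over time

A = torch.eye(25) + torch.rand(25, 25).round()   # toy adjacency, 25 joints
block = STGCNBlock(in_ch=3, out_ch=64, A=A)
print(block(torch.randn(2, 3, 300, 25)).shape)   # torch.Size([2, 64, 300, 25])
```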