2022
DOI: 10.3390/e24030368
A Spatio-Temporal Motion Network for Action Recognition Based on Spatial Attention

Abstract: Temporal modeling is the key for action recognition in videos, but traditional 2D CNNs do not capture temporal relationships well. 3D CNNs can achieve good performance but are computationally intensive and not well supported on existing devices. To address these problems, we design a generic and effective module called the spatio-temporal motion network (SMNet). SMNet maintains the complexity of 2D CNNs and reduces the computational cost of the algorithm while achieving performance comparable to 3D CNNs. SMNet contain…

Cited by 11 publications (9 citation statements)
References 57 publications
“…The nearest score to ours is TEA, where we obtain a substantially higher margin (52.1% vs. 51.7%), except top-5 accuracy is 0.3% lower (80.2% vs. 80.5%) when employing 10 clips. For comparison with SMNet [ 36 ], a more recent work, we noticeably outperform their work by big margins of 2.3% and 0.6% for top-1 and top-5 accuracy, respectively. This definitely demonstrates our superior submodules of MvE and DCTA combined with ME, considering SMNet also equipped their network with motion encoding.…”
Section: Experiments and Evaluation
confidence: 84%
“…Furthermore, Li et al [ 33 ] introduced a new block termed as TEA to explore the benefits of the attention mechanism added to the motion calculation previously mentioned. Later, this attentive motion features module was adopted by [ 34 , 35 , 36 ]. In addition, the authors of TEA suggested overcoming the limitation of long-range temporal representation by introducing multiple temporal aggregations in a hierarchical design.…”
Section: Related Work
confidence: 99%
“…Vision-based HCI: gesture interaction [2] (look at human actions and analyze human intentions), human action recognition [3,4], face detection [5,6]. Hearing-based HCI: speech recognition [13] (listen to human language and analyze human intent). Table 2 lists technical fields that may be covered by the auxiliary text reading task.…”
Section: Vision-based HCI
confidence: 99%
“…The popularization of multimedia vision sensors has led to the development of various human-centered computer vision technologies, which have been gradually integrated into and changed our lives. Vision-based human-computer interaction tasks are mostly used in text reading comprehension [1], gesture interaction [2], human action recognition [3,4], face detection [5,6], and other fields. However, unfamiliar or forgotten words make the reading and learning experience negative for both children and adults.…”
Section: Introduction
confidence: 99%