2022
DOI: 10.3390/e24030368
A Spatio-Temporal Motion Network for Action Recognition Based on Spatial Attention

Abstract: Temporal modeling is the key for action recognition in videos, but traditional 2D CNNs do not capture temporal relationships well. 3D CNNs can achieve good performance but are computationally intensive and not well supported on existing devices. To address these problems, we design a generic and effective module called the spatio-temporal motion network (SMNet). SMNet maintains the complexity of 2D CNNs and reduces the computational cost of the algorithm while achieving performance comparable to 3D CNNs. SMNet contain…

Cited by 11 publications (9 citation statements)
References 57 publications
“…The nearest score to ours is TEA, where we obtain a substantially higher margin (52.1% vs. 51.7%), except top-5 accuracy is 0.3% lower (80.2% vs. 80.5%) when employing 10 clips. For comparison with SMNet [ 36 ], a more recent work, we noticeably outperform their work by big margins of 2.3% and 0.6% for top-1 and top-5 accuracy, respectively. This definitely demonstrates our superior submodules of MvE and DCTA combined with ME, considering SMNet also equipped their network with motion encoding.…”
Section: Experiments and Evaluation
confidence: 84%
“…Furthermore, Li et al [ 33 ] introduced a new block termed as TEA to explore the benefits of the attention mechanism added to the motion calculation previously mentioned. Later, this attentive motion features module was adopted by [ 34 , 35 , 36 ]. In addition, the authors of TEA suggested overcoming the limitation of long-range temporal representation by introducing multiple temporal aggregations in a hierarchical design.…”
Section: Related Work
confidence: 99%
“…Vision-based HCI: gesture interaction [2] (look at human actions and analyze human intentions), human action recognition [3,4], face detection [5,6]. Hearing-based HCI: speech recognition [13] (listen to human language and analyze human intent). Table 2 lists technical fields that may be covered by the auxiliary text reading task.…”
Section: Vision-based HCI
confidence: 99%
“…The popularization of multimedia vision sensors has led to the development of various human-centered computer vision technologies, which have been gradually integrated into and changed our lives. Vision-based human-computer interaction tasks are mostly used in text reading comprehension [1], gesture interaction [2], human action recognition [3,4], face detection [5,6], and other fields. However, unfamiliar or forgotten words make the reading and learning experience negative for both children and adults.…”
Section: Introduction
confidence: 99%