Attentional Fused Temporal Transformation Network for Video Action Recognition

Yang, Ke; Wang, Zhiyuan; Dai, Huadong; Shen, Tianlong; Qiao, Peng; Niu, Xiamu; Jiang, Jie; Li, Dongsheng; Dou, Yong

doi:10.1109/icassp40776.2020.9053394

Cited by 9 publications

(4 citation statements)

References 36 publications

“…More information, e.g., sound can be added via new streams [166, 167, 168]. The architecture was further investigated by trying different ways of fusing the layers and deeper networks [169, 170, 171, 172]. To facilitate the high computational costs of 3D convolutional layers, Lin et al [173] introduced the Temporal Shift Module (TSM) that can be incorporated into 2D CNNs to model the exchanges among neighboring frames while maintaining the lower computational costs of 2D CNNs.…”

Section: Machine Learning Algorithms For Human Motion Analysis (mentioning)

confidence: 99%
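
The temporal shift idea named in the statement above can be illustrated with a minimal sketch. It is not taken from the cited work: the function name `temporal_shift`, the `fold_div` parameter, zero padding at the clip boundaries, and the simplification of each channel to a single number per frame (spatial dimensions omitted) are all assumptions made for illustration. A fraction of the channels takes its value from the previous frame, an equal fraction from the next frame, and the rest stay in place, so neighboring frames exchange information without any extra multiplications.

```python
def temporal_shift(frames, fold_div=8):
    """Shift a fraction of channels along the time axis.

    frames: list of T frames, each a list of C channel values
            (spatial dimensions are omitted in this sketch).
    fold_div: hypothetical parameter; C // fold_div channels are shifted
              in each direction, the remaining channels are left unchanged.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // fold_div

    shifted = []
    for t in range(num_frames):
        row = []
        for ch in range(num_channels):
            if ch < fold:
                src = t - 1      # first group: value comes from the previous frame
            elif ch < 2 * fold:
                src = t + 1      # second group: value comes from the next frame
            else:
                src = t          # remaining channels: not shifted
            # Zero padding where the source frame falls outside the clip.
            row.append(frames[src][ch] if 0 <= src < num_frames else 0.0)
        shifted.append(row)
    return shifted


if __name__ == "__main__":
    clip = [[10 * t + ch for ch in range(4)] for t in range(3)]
    for row in temporal_shift(clip, fold_div=4):
        print(row)
```

In a real network the shifted tensor would be passed to an ordinary 2D convolution, which is where the mixed temporal information gets used.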

Using Artificial Intelligence for Assistance Systems to Bring Motor Learning Principles into Real World Motor Tasks

Vandevoorde, Vollenkemper, Schwan, et al. 2022

Sensors

Humans learn movements naturally, but it takes a lot of time and training to achieve expert performance in motor skills. In this review, we show how modern technologies can support people in learning new motor skills. First, we introduce important concepts in motor control, motor learning and motor skill learning. We also give an overview about the rapid expansion of machine learning algorithms and sensor technologies for human motion analysis. The integration between motor learning principles, machine learning algorithms and recent sensor technologies has the potential to develop AI-guided assistance systems for motor skill training. We give our perspective on this integration of different fields to transition from motor learning research in laboratory settings to real world environments and real world motor tasks and propose a stepwise approach to facilitate this transition.

“…In order to integrate spatial and temporal information comprehensively, we concatenate spatial feature F_s, motion feature F_t and the output F_st of FCL. The fusion and concatenation process is depicted in (7).…”

Section: LSTM and Spatiotemporal Fusion (mentioning)

confidence: 99%
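
The concatenation step described in the statement above can be sketched as follows. Equation (7) of the citing paper is not reproduced in this excerpt, so this is only an illustration under stated assumptions: the function name `fuse_features` is hypothetical, each feature is treated as a flat vector, and F_st is taken as an already computed input because the excerpt does not say how the FCL produces it.

```python
def fuse_features(f_s, f_t, f_st):
    """Concatenate three feature vectors into one fused vector.

    f_s:  spatial feature vector (F_s)
    f_t:  motion feature vector (F_t)
    f_st: output of the FCL (F_st), assumed to be computed elsewhere
    """
    # Plain concatenation: the fused length is the sum of the input lengths.
    return list(f_s) + list(f_t) + list(f_st)


if __name__ == "__main__":
    fused = fuse_features([0.1, 0.2], [0.3], [0.4, 0.5])
    print(len(fused), fused)
```

Concatenation keeps every input dimension intact, so a downstream classifier sees the spatial, motion, and fused features side by side.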

“…In this paper, to validate the performance of the proposed iCBAM-based method, we compare our proposed method to recent popular and related approaches, including IDT [25], two-stream [3], TSN [6], C3D [28], two-stream + IDT [4], IF-TTN [7], DTPP [5] and attention-based models [32]-[34]. Since the proposed iCBAM-based spatiotemporal-stream network is trained from scratch, we do not compare it with those are pre-trained on the large dataset.…”

Section: Experiments Analysis On HMDB51 (mentioning)

confidence: 99%

“…Zhu et al [5] suggest a Deep network with Temporal Pyramid Pooling (DTPP) to realize an end-to-end video-level representation learning approach. In order to realize the long-range temporal structure modeling, Wang et al [6] propose Temporal Segment Networks (TSN) to recognize video actions and on top of TSN, an Information Fused Temporal Transformation Network (IF-TTN) is reported to learn spatiotemporal feature representation [7]. Additionally, Zang et al [8] suggest utilizing attention-based temporal weighted CNN to learn action features.…”

Section: Introduction (mentioning)

confidence: 99%

An Improved Attention-Based Spatiotemporal-Stream Model for Action Recognition in Videos

Liu et al. 2020

IEEE Access

Action recognition is an important yet challenging task in computer vision. Attention mechanism not only tells where to focus but when to focus. It plays a key role in extracting discriminative spatial and temporal features for solving the task. In this paper, we propose an improved spatiotemporal attention model based on the two-stream structure to recognize the different actions in videos. Specifically, we first extract the intra-frame spatial features and inter-frame optical flow features for each video data. Then we implement an effective attention module, which sequentially infers attention maps along three separate dimensions: channel, spatial and temporal. After adaptive feature refinement based on the attention maps, we perform a temporal pooling process to squeeze the temporal dimension. Then, these achieved spatial and temporal features are fed into the spatial LSTM and temporal LSTM, respectively. Finally, we fuse the spatial feature, temporal feature and two-stream fusion feature to classify the actions in videos. Additionally, we also collect and construct a new Ping-Pong action dataset for subsequent human-robot interaction task from YouTube. It contains 2400 labeled videos for 4 categories. We compare with other action recognition algorithms and validate the feasibility and effectiveness of the proposed method on Ping-Pong action dataset and HMDB51 dataset.
