Spatiotemporal Multimodal Learning With 3D CNNs for Video Action Recognition

Wu, Hanbo; Ma, Xin; Li, Yibin

doi:10.1109/tcsvt.2021.3077512

Cited by 52 publications

(12 citation statements)

References 57 publications

“…The 3D-CNN model is an extension based on 2D-CNN, which introduces the temporal dimension as an additional input dimension (Wu et al., 2021). Similar to 2D-CNN, the 3D-CNN model consists of multiple convolutional, pooling and fully connected layers.…”

Section: Methods (mentioning)

confidence: 99%

Multimodal audio-visual robot fusing 3D CNN and CRNN for player behavior recognition and prediction in basketball matches

Wang

2024

Front. Neurorobot.

Introduction: Intelligent robots play a crucial role in enhancing efficiency, reducing costs, and improving safety in the logistics industry. However, traditional path planning methods often struggle to adapt to dynamic environments, leading to issues such as collisions and conflicts. This study aims to address the challenges of path planning and control for logistics robots in complex environments. Methods: The proposed method integrates information from different perception modalities to achieve more accurate path planning and obstacle avoidance control, thereby enhancing the autonomy and reliability of logistics robots. Firstly, a 3D convolutional neural network (CNN) is employed to learn the feature representation of objects in the environment for object recognition. Next, long short-term memory (LSTM) is used to model spatio-temporal features and predict the behavior and trajectory of dynamic obstacles. This enables the robot to accurately predict the future position of obstacles in complex environments, reducing collision risks. Finally, the Dijkstra algorithm is applied for path planning and control decisions to ensure the robot selects the optimal path in various scenarios. Results: Experimental results demonstrate the effectiveness of the proposed method in terms of path planning accuracy and obstacle avoidance performance. The method outperforms traditional approaches, showing significant improvements in both aspects. Discussion: The intelligent path planning and control scheme presented in this paper enhances the practicality of logistics robots in complex environments, thereby promoting efficiency and safety in the logistics industry.
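The first citation statement above describes a 3D-CNN as a 2D-CNN extended with a temporal input dimension. The sketch below illustrates only that idea and is not the architecture of Wu et al. or of any citing paper: it is a naive, pure-Python "valid" 3D cross-correlation over a (frames, height, width) clip. The names `conv3d_valid`, `shape`, `clip`, and both kernels are hypothetical and chosen for illustration.

```python
def conv3d_valid(video, kernel):
    """Naive 'valid' 3D cross-correlation over a (T, H, W) nested list."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kT, kH, kW = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kT + 1):            # slide along time
        plane = []
        for y in range(H - kH + 1):        # slide along height
            row = []
            for x in range(W - kW + 1):    # slide along width
                acc = 0.0
                for dt in range(kT):
                    for dy in range(kH):
                        for dx in range(kW):
                            acc += video[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def shape(a):
    """Return the (T, H, W) shape of a nested list."""
    return (len(a), len(a[0]), len(a[0][0]))


# Hypothetical toy clip: 4 frames of 5x5 pixels, each pixel equal to its frame index.
clip = [[[float(t)] * 5 for _ in range(5)] for t in range(4)]

# A 1x3x3 kernel spans one frame only, so it acts like a per-frame 2D convolution.
spatial_kernel = [[[1.0] * 3 for _ in range(3)]]
spatial = conv3d_valid(clip, spatial_kernel)

# A 2x1x1 kernel spans two frames: next frame minus current frame (a motion cue).
temporal_kernel = [[[-1.0]], [[1.0]]]
temporal = conv3d_valid(clip, temporal_kernel)

print(shape(spatial), shape(temporal))
```

With a temporal extent of 1 the kernel sees each frame in isolation, which is the 2D case. A temporal extent greater than 1 lets a single filter respond to change across frames, which is what the added dimension contributes.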

“…Appearance information in terms of multi-temporal RGB data is utilized to emphasize the underlying appearance information that would otherwise be lost with depth data alone, which helps to enhance sensitivity to interactions with tiny objects. Wu et al applied 3D CNNs with multimodal inputs to improve spatio-temporal features [232]. This method suggests two distinct video presentations; depth residual dynamic image sequence (DRDIS) and pose estimation map sequence (PEMS).…”

Section: RGB and Depth (mentioning)

confidence: 99%

Transformers in Action Recognition: A Review on Temporal Modeling

Shabaninia, Nezamabadi-pour, Shafizadegan

2023

Preprint

In vision-based action recognition, spatio-temporal features from different modalities are used for recognizing activities. Temporal modeling is a long-standing challenge in action recognition. However, only a limited set of methods, such as pre-computed motion features, three-dimensional (3D) filters, and recurrent neural networks (RNN), is available for modeling motion information in deep learning-based approaches. Recently, transformers' success in modeling long-range dependencies in natural language processing (NLP) tasks has attracted great attention from other domains, including speech, image, and video, prompting models that rely entirely on self-attention without sequence-aligned RNNs or convolutions. Although the application of transformers to action recognition is relatively new, the amount of research proposed on this topic within the last few years is astounding. This paper reviews recent progress in deep learning methods for modeling temporal variations. It focuses on action recognition methods that use transformers for temporal modeling, discussing their main features and the modalities they use, and identifying opportunities and challenges for future research.

“…In the early years, Convolutional Neural Networks (CNNs) and Long-Short Term Memory (LSTMs) have been applied to explore the spatial and temporal features of skeleton sequences [9]-[18]. However, these models fail to capture the structural connections in the human skeleton.…”

Section: Introduction (mentioning)

confidence: 99%

Distinct Motion-Preserving Graph Convolutional Network for Two-Person Interaction Recognition

Phuong, Tran, Lee

2023

IEEE Access

Graph Convolutional Networks (GCNs) have gained widespread adoption in modeling human skeleton sequences for two-person interaction recognition. Most GCN-based models achieve state-of-the-art results by leveraging either intra-body or inter-body connections. However, using only intra-body relations may ignore important interactive features between two individuals, whereas relying on inter-body relations may weaken the specific motion dynamics of each skeleton. To address these shortcomings, we propose a Distinct Motion-Preserving GCN (DMP-GCN) that utilizes intra-body and inter-body graphs to extract interactive features from two human bodies while preserving the distinct motion characteristics of each skeleton. Specifically, two motion-specific streams are adopted to capture specific motion features of each human skeleton and an interactive stream is applied to model the interactive dynamics of two bodies. In addition, we introduce a new graph labeling strategy called Distance Variation Labeling, which is a data-driven approach for defining the edge strength in the skeleton graph. Extensive experiments show that our proposed approaches outperform state-of-the-art methods on two large-scale human interaction datasets, NTU RGB+D (mutual) and NTU RGB+D 120 (mutual).
