2020
DOI: 10.1109/access.2020.2968024

Learning Attention-Enhanced Spatiotemporal Representation for Action Recognition

Abstract: Learning spatiotemporal features via 3D-CNN (3D Convolutional Neural Network) models has been regarded as an effective approach for action recognition. In this paper, we explore visual attention mechanism for video analysis and propose a novel 3D-CNN model, dubbed AE-I3D (Attention-Enhanced Inflated-3D Network), for learning attention-enhanced spatiotemporal representation. The contribution of our AE-I3D is threefold: First, we inflate soft attention in spatiotemporal scope for 3D videos, and adopt softmax to …
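To make the abstract's mechanism concrete, below is a minimal PyTorch sketch of soft attention inflated over the spatiotemporal scope of a 3D feature map, with softmax producing a probability distribution over locations. The module name, tensor shapes, and the 1x1x1 scoring convolution are illustrative assumptions, not the authors' exact AE-I3D block.

```python
import torch
import torch.nn as nn


class SpatioTemporalSoftAttention(nn.Module):
    """Illustrative soft attention over a 3D feature map (C, T, H, W).

    A 1x1x1 convolution scores every spatiotemporal location, softmax turns
    the scores into a probability distribution, and the features are
    re-weighted by that distribution (hypothetical layout, not the exact
    AE-I3D design).
    """

    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Conv3d(channels, 1, kernel_size=1)  # one score per location

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, H, W) features from an inflated 3D backbone
        n, c, t, h, w = x.shape
        logits = self.score(x).view(n, 1, -1)                # (N, 1, T*H*W)
        attn = torch.softmax(logits, dim=-1).view(n, 1, t, h, w)
        return x * attn                                      # attention-enhanced features


if __name__ == "__main__":
    feats = torch.randn(2, 64, 8, 14, 14)      # dummy clip features
    out = SpatioTemporalSoftAttention(64)(feats)
    print(out.shape)                           # torch.Size([2, 64, 8, 14, 14])
```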

Citations: Cited by 10 publications (6 citation statements)
References: 50 publications
“…For supervised feature extraction, CNNs are by far the more reliable choice; however, the case presented in this study is an unsupervised one, which is implied by the multiple unlabelled IMFs from the pressure signals. This presents an opportunity for the SAE to flourish, since they are efficient for learning deep feature representations from multiple inputs [20]. The deep feature learning capabilities of SAEs have been recorded for many purposes, including epileptic seizure detection [21], rotating machinery prognostics [6], anomaly detection [14], and a host of other applications.…”
Section: Related Work
confidence: 99%
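The citing passage credits stacked autoencoders (SAEs) with learning deep feature representations from multiple unlabelled inputs. As a rough illustration of that idea only, the sketch below pretrains a single autoencoder layer by reconstruction on unlabelled vectors; the layer sizes, optimizer, and training loop are assumptions for illustration, not the cited study's configuration.

```python
import torch
import torch.nn as nn


class AutoencoderLayer(nn.Module):
    """One layer of a stacked autoencoder: encode, decode, reconstruct."""

    def __init__(self, in_dim: int, hidden_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.decoder = nn.Linear(hidden_dim, in_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))


def pretrain_layer(layer, data, epochs=10, lr=1e-3):
    """Unsupervised reconstruction training on unlabelled vectors (e.g. IMF segments)."""
    opt = torch.optim.Adam(layer.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(layer(data), data)   # reconstruction error, no labels needed
        loss.backward()
        opt.step()
    return layer.encoder(data).detach()      # features fed to the next stacked layer


if __name__ == "__main__":
    x = torch.randn(256, 128)                # dummy unlabelled inputs
    feats = pretrain_layer(AutoencoderLayer(128, 32), x)
    print(feats.shape)                       # torch.Size([256, 32])
```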
“…TS+LST [1]: UCF101/HMDB51 94.8%/70.2%; AE-I3D [19]: UCF101/HMDB51 95.9%/74.7%; KF+SAMA [41]: UCF101 95.9%…”
Section: Trajectory
confidence: 99%
“…[15], [17], [18] use attention-based multi-layered Recurrent Neural Network (RNN) models with Long Short-Term Memory (LSTM) to improve the performance of their algorithms. Shi, Z. et al. [19] proposed the AE-I3D (Attention-Enhanced I3D) network for action recognition; the idea of AE-I3D is to enhance the spatiotemporal representation by inflating soft attention in the spatiotemporal scope and adopting softmax to generate a probability distribution over the attentional features.…”
Section: Introduction
confidence: 99%
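The passage above contrasts AE-I3D with attention-equipped multi-layer LSTM models. Purely as an illustration of the latter family, the sketch below applies softmax attention over the hidden states of a stacked LSTM to pool per-frame features into a clip-level descriptor; the dimensions and the linear scoring layer are assumptions, not any cited paper's design.

```python
import torch
import torch.nn as nn


class AttentiveLSTM(nn.Module):
    """Stacked LSTM whose per-frame outputs are pooled by softmax attention."""

    def __init__(self, feat_dim: int, hidden_dim: int, num_layers: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)    # one attention score per time step

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (N, T, feat_dim) per-frame features
        outputs, _ = self.lstm(frames)                         # (N, T, hidden_dim)
        weights = torch.softmax(self.score(outputs), dim=1)    # (N, T, 1)
        return (weights * outputs).sum(dim=1)                  # (N, hidden_dim) clip descriptor


if __name__ == "__main__":
    clip = torch.randn(4, 16, 512)             # 4 clips, 16 frames, 512-d features
    pooled = AttentiveLSTM(512, 256)(clip)
    print(pooled.shape)                        # torch.Size([4, 256])
```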
“…As reported in [1], the test classification accuracy on ImageNet was improved substantially compared with other methods at that time. Furthermore, besides classification tasks, the attention mechanism has also been used in many other tasks such as object detection [3], [4], semantic segmentation [5], [6], super resolution [7], [8], action recognition [9], [10], etc. As the most popular attention method, the SE technique used pooling operators to obtain the invariant feature of each channel, bringing in nonlinearity at the same time.…”
Section: Introduction
confidence: 99%
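The passage summarizes SE (Squeeze-and-Excitation) as pooling a per-channel statistic and applying a nonlinear gate to reweight channels. Below is a compact sketch of that standard SE block for 2D features; the reduction ratio and layer sizes are conventional defaults, not values taken from the cited work.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global-pool each channel, then gate channels."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # "squeeze": one value per channel
        self.fc = nn.Sequential(                      # "excitation": nonlinear channel gating
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        gate = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)
        return x * gate                               # channel-reweighted features


if __name__ == "__main__":
    feats = torch.randn(2, 64, 32, 32)
    print(SEBlock(64)(feats).shape)                   # torch.Size([2, 64, 32, 32])
```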