2019
DOI: 10.1109/access.2019.2923651
R-STAN: Residual Spatial-Temporal Attention Network for Action Recognition

Abstract: Two-stream network architecture has the ability to capture temporal and spatial features from videos simultaneously and has achieved excellent performance on video action recognition tasks. However, there is a fair amount of redundant information in both temporal and spatial dimensions in videos, which increases the complexity of network learning. To solve this problem, we propose residual spatial-temporal attention network (R-STAN), a feed-forward convolutional neural network using residual learning and spati…
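The abstract describes re-weighting video features with a spatial-temporal attention mask combined through a residual connection. Below is a minimal NumPy sketch of that idea, assuming the `out = x * (1 + mask)` residual form popularized by residual attention networks; the energy-based mask and all shapes here are illustrative stand-ins for R-STAN's learned attention, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def residual_attention_block(features):
    """Residual spatial-temporal attention re-weighting (illustrative).

    features: array of shape (T, H, W, C) -- frames, spatial dims, channels.
    A per-location energy stands in for a learned attention branch; the
    residual form out = x * (1 + mask) lets the block emphasize salient
    regions while never zeroing out the identity signal.
    """
    # Per-location energy summed over channels -> shape (T, H, W)
    energy = (features ** 2).sum(axis=-1)
    # Normalize jointly over the whole spatial-temporal volume
    mask = softmax(energy.reshape(-1)).reshape(energy.shape)
    # Residual re-weighting: identity plus attention-modulated features
    return features * (1.0 + mask[..., None])

x = np.random.rand(4, 8, 8, 16)   # 4 frames, 8x8 spatial grid, 16 channels
y = residual_attention_block(x)
print(y.shape)  # (4, 8, 8, 16)
```

Because the mask is non-negative, the residual form only amplifies features relative to the identity path, which is what makes such blocks easy to stack without degrading the backbone signal.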

Cited by 38 publications (15 citation statements)
References 39 publications
“…We compared STRM using closed-set based methods and open-set based methods. The closed-set methods were: iDT [18], Two-stream [31], FstCN [71], MoFAP [72], MIFS [8], LTC [34], R-STAN [73], ST-Pyramid Network [74], ATW [75], DOVF [76], Four-Stream [77], TLE [78], and DTPP [79]. The open-set methods were: ODN [43], P-ODN [44], SDMM [48], and Mishra et al [47].…”
Section: Comparison With State-of-the-art Methods (mentioning)
confidence: 99%
“…In the image translation task, Sun et al. [39] designed a channel attention network that better integrates the original function in the encoder with the conversion function in the decoder. In addition, Liu et al. [40] proposed a spatiotemporal attention module for video action recognition. Gao et al. [41] introduced a residual attention mechanism into a single-convolutional-layer object-tracking network to avoid data imbalance.…”
Section: Attention Mechanism (mentioning)
confidence: 99%
“…Previously, 2D convolutional neural networks [27], [28] trained on ImageNet [29] were usually exploited for RGB image classification. However, for the task of video classification, appearance information alone is not enough, and dynamic feature representations play a vital role in recognition [9], [30]. To capture motion information, K. Simonyan et al. proposed a two-stream ConvNet architecture that incorporates spatial and temporal networks [8], where the temporal stream is trained to recognize actions from motion in the form of dense optical flow.…”
Section: Related Work (mentioning)
confidence: 99%
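The passage above describes the two-stream architecture: a spatial (RGB) stream and a temporal (optical-flow) stream whose class scores are combined. A common combination is late score fusion, sketched below with hypothetical logits and a hypothetical temporal weight; neither value comes from the cited papers.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax for a 1-D logit vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def two_stream_fusion(spatial_logits, temporal_logits, w_temporal=1.5):
    """Late fusion of two-stream action scores.

    Combines per-class probabilities from a spatial (RGB) stream and a
    temporal (optical-flow) stream. Weighting the temporal stream above
    1.0 follows the common practice of trusting motion cues more; the
    exact weight is a tunable hyperparameter, not a published value.
    """
    fused = softmax(spatial_logits) + w_temporal * softmax(temporal_logits)
    return int(np.argmax(fused))

spatial = np.array([2.0, 1.0, 0.1])    # hypothetical RGB-stream logits
temporal = np.array([0.5, 3.0, 0.2])   # hypothetical flow-stream logits
print(two_stream_fusion(spatial, temporal))  # class index 1
```

Here the flow stream's strong vote for class 1 outweighs the RGB stream's preference for class 0, illustrating why the temporal stream often dominates on motion-heavy actions.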