Compressed Video Action Recognition

Wu, Chao-Yuan; Zaheer, Manzil; Hu, Hexiang; Manmatha, R.; Smola, Alexander J.; Krähenbühl, Philipp

doi:10.1109/cvpr.2018.00631

Cited by 271 publications

(173 citation statements)

References 41 publications

Supporting

Mentioning

173

Contrasting


“…Motion Vector (MV) was a coarse representation of motion, but it can be obtained directly from compressed video streams without extra calculation. Therefore, Enhanced Motion Vectors CNN (EMV-CNN) [32] used motion vector as the input of temporal CNN to improve inference speed and CoViAR [29] adopted an accumulated motion vector for real-time action recognition. Suffered from the lack of fine detailed motion information in MV, recognition performance was degraded dramatically.…”

Section: Related Work (mentioning)

confidence: 99%

“…Experiment results prove that our network is highly tolerant to the quality of motion input thanks to the combination of short-term spatiotemporal feature fusion, sequentially middle-term temporal modeling and long-term temporal consensus. EMV-CNN and CoViAR [32,29] also used motion vectors but the simple replacement without consideration of more effective spatiotemporal representation results in a significant performance degradation than optical-flow-based Two-Stream CNN.…”

Section: Exploration Study (mentioning)

confidence: 99%

“…
Method                                  UCF-101  HMDB-51
iDT [24]                                86.4     61.7
Two stream CNN [15]                     88.0     59.4
TDD [26]                                91.5     65.9
Long Term Convolution [22]              91.7     64.8
Spatiotemporal Pyramid Network [28]     94.6     68.9
Spatiotemporal Multiplier Network [6]   94.2     68.9
Two stream TSN [27]                     94.0     68.5
ST-VLMPF [4]                            93.6     69.5
Two-Stream I3D [2]                      93.4     66.4
Lattice LSTM [18]                       93.6     66.2
Full OFF [19]                           96.0     74.2
Full IF-TTN                             96.2     74.8
C3D [20]                                82.3     -
TSN(RGB) [27]                           85.7     51.0
TSN(RGB+RGB Difference) [27]            91.0     -
RGB+EMV-CNN                             86.4     53.0
CoViAR [29]                             90.4     59.1
real-time OFF [19]                      93.3     -
MV-IF-TTN                               94.5     70.0
while the lower part presents real-time methods. Notice that for non-real-time methods we assemble the optical flow and motion vectors based IF-TTN scores to make final predictions (denoted as Full IF-TTN).…”

Section: Comparison With the State Of The Art (mentioning)

confidence: 99%


Attentional Fused Temporal Transformation Network for Video Action Recognition

Yang, Wang, Dai et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Effective spatiotemporal feature representation is crucial to the video-based action recognition task. Focusing on discriminative spatiotemporal feature learning, we propose Information Fused Temporal Transformation Network (IF-TTN) for action recognition on top of the popular Temporal Segment Network (TSN) framework. In the network, Information Fusion Module (IFM) is designed to fuse the appearance and motion features at multiple ConvNet levels for each video snippet, forming a short-term video descriptor. With fused features as inputs, Temporal Transformation Networks (TTN) are employed to model middle-term temporal transformation between the neighboring snippets following a sequential order. As TSN itself depicts long-term temporal structure by segmental consensus, the proposed network comprehensively considers multiple granularity temporal features. Our IF-TTN achieves state-of-the-art results on the two most popular action recognition datasets: UCF101 and HMDB51. Empirical investigation reveals that our architecture is robust to the input motion map quality. Replacing optical flow with the motion vectors from compressed video stream, the performance is still comparable to the flow-based methods while the testing speed is 10x faster.


“…Hallucination Since the computation of optical flow is time consuming and storage demanding, some attempts to learn other way to replace the flow to represent motion information. (Wu, C.Y. et al., 2017) [12] proposed that compressed video algorithms can decrease the redundant information, so that can be used accumulated motion vector and residuals to describe motion. Compared to traditional flow methods, motion vectors bring more than 20 times acceleration although a significant drop in accuracy.…”

Section: Related Work (mentioning)

confidence: 99%

Bypass Enhancement RGB Stream Model for Pedestrian Action Recognition of Autonomous Vehicles

Cao

2020

Communications in Computer and Information Science


Pedestrian action recognition and intention prediction is one of the core issues in the field of autonomous driving. In this research field, action recognition is one of the key technologies. A large number of scholars have done a lot of work to improve the accuracy of the algorithm for the task. However, there are relatively few studies and improvements in the computational complexity of algorithms and system real-time. In the autonomous driving application scenario, the real-time performance and ultra-low latency of the algorithm are extremely important evaluation indicators, which are directly related to the availability and safety of the autonomous driving system. To this end, we construct a bypass enhanced RGB flow model, which combines the previous two-branch algorithm to extract RGB feature information and optical flow feature information respectively. In the training phase, the two branches are merged by distillation method, and the bypass enhancement is combined in the inference phase to ensure accuracy. The real-time behavior of the behavior recognition algorithm is significantly improved on the premise that the accuracy does not decrease. Experiments confirm the superiority and effectiveness of our algorithm.


“…They jointly trained a compression network with an inference network and bring performance gain. On the video side, Wu et al [25] designed a compressed video action recognition system by using separate networks for I-frames and P-frames. Their approach is more efficient than the conventional 3D convolution structures.…”

Section: Related Work (mentioning)

confidence: 99%

Exploring Semantic Segmentation on the DCT Representation

Hang

2019

Proceedings of the ACM Multimedia Asia


Typical convolutional networks are trained and conducted on RGB images. However, images are often compressed for memory savings and efficient transmission in real-world applications. In this paper, we explore methods for performing semantic segmentation on the discrete cosine transform (DCT) representation defined by the JPEG standard. We first rearrange the DCT coefficients to form a preferred input type, then we tailor an existing network to the DCT inputs. The proposed method has an accuracy close to the RGB model at about the same network complexity. Moreover, we investigate the impact of selecting different DCT components on segmentation performance. With a proper selection, one can achieve the same level accuracy using only 36% of the DCT coefficients. We further show the robustness of our method under quantization errors. To our knowledge, this paper is the first to explore semantic segmentation on the DCT representation.

