2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
DOI: 10.1109/cvpr.2018.00630

End-to-End Learning of Motion Representation for Video Understanding

Abstract: Despite the recent success of end-to-end learned representations, hand-crafted optical flow features are still widely used in video analysis tasks. To fill this gap, we propose TVNet, a novel end-to-end trainable neural network, to learn optical-flow-like features from data. TVNet subsumes a specific optical flow solver, the TV-L1 method, and is initialized by unfolding its optimization iterations as neural layers. TVNet can therefore be used directly without any extra learning. Moreover, it can be naturally c…
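The core idea in the abstract — unfolding a solver's optimization iterations as neural layers — can be illustrated with a minimal sketch. This is not TVNet's actual TV-L1 update; it unrolls plain gradient descent on a toy least-squares objective, where each iteration becomes one feed-forward "layer" with its own step size (the quantities that, in TVNet, become trainable parameters). The function name and interface are illustrative only.

```python
import numpy as np

def unrolled_solver(A, b, taus, x0=None):
    """Unfold gradient-descent iterations for min_x 0.5*||Ax - b||^2.

    Each element of `taus` plays the role of one 'layer': the loop body
    is the layer's update rule, and tau is its (potentially learnable)
    step size. TVNet applies the same unrolling trick to the TV-L1
    optical flow iterations instead of this toy objective.
    """
    x = np.zeros(A.shape[1]) if x0 is None else x0.astype(float).copy()
    for tau in taus:                  # one pass of the loop == one layer
        grad = A.T @ (A @ x - b)      # gradient of the quadratic data term
        x = x - tau * grad            # the layer's fixed update rule
    return x
```

Because every layer is differentiable in `tau` (and in any filters hidden inside the update), the whole unrolled stack can be fine-tuned by backpropagation, which is why the network "can be used directly without any extra learning" yet still improves with training.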

Cited by 211 publications (140 citation statements)
References 46 publications
“…Understanding human actions in videos has been becoming a prominent research topic in computer vision, owing to its various applications in security surveillance, human behavior analysis and many other areas [10,35,38,12,13,14,15,16,42]. Despite the fruitful progress in this vein, there are still some challenging tasks demanding further exploration -temporal action localization is such an example.…”
Section: Introduction
confidence: 99%
“…In this subsection, we would like to see whether the performance can be further improved with the motion information added. We extract the optical flow using the initialized TVNet [7] without finetuning, and calculate the optical flow statistics as described in [22], then concatenate the statistics to the content-aware features. The performance comparison of our model with/without motion information on KoNViD-1k is shown in Figure 7.…”
Section: Motion Information
confidence: 99%
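The pipeline in the excerpt above — extract optical flow with the initialized TVNet, summarize it into statistics, and concatenate those onto content-aware features — can be sketched as follows. The specific statistics of the cited reference [22] are not reproduced here; as a hedged stand-in, this sketch uses the mean and standard deviation of the per-pixel flow magnitude. Function names are hypothetical.

```python
import numpy as np

def flow_statistics(flow):
    """Summarize a flow field into a small statistics vector.

    flow: (H, W, 2) array of per-pixel (dx, dy) displacements, e.g. the
    output of an (untrained) TVNet forward pass. The real statistics in
    [22] differ; mean/std of the magnitude is an illustrative stand-in.
    """
    mag = np.linalg.norm(flow, axis=-1)      # (H, W) flow magnitude
    return np.array([mag.mean(), mag.std()])

def fuse_features(content_features, flow):
    """Concatenate motion statistics onto the content-aware features."""
    return np.concatenate([content_features, flow_statistics(flow)])
```

The appeal of this design is that the flow network needs no fine-tuning: the unfolded TV-L1 initialization already produces usable flow, so motion becomes a cheap plug-in feature for the quality-assessment model.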
“…However, calculating optical flow with TV-L1 method [38] is expensive in both time and space. Recently many approaches have been proposed to estimate optical flow with CNN [5,14,6,21] or explored alternatives of optical flow [33,39,26,18]. TSN frameworks [33] involved RGB difference between two frames to represent motion in videos.…”
Section: Related Work
confidence: 99%
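The RGB-difference alternative mentioned in the excerpt above (used by TSN frameworks [33]) is simple enough to sketch directly: subtracting consecutive frames gives a crude but nearly free motion signal, avoiding the time and memory cost of TV-L1 flow. This is a minimal sketch of that idea, not TSN's full input pipeline.

```python
import numpy as np

def rgb_difference(frames):
    """Consecutive-frame RGB differences as a cheap motion proxy.

    frames: (T, H, W, 3) video clip (uint8 or float). Returns the
    (T-1, H, W, 3) stack of differences frames[t+1] - frames[t],
    which TSN-style models feed to the temporal stream in place of
    optical flow.
    """
    f = frames.astype(np.float32)    # avoid uint8 wrap-around on subtract
    return f[1:] - f[:-1]
```

The cast to float matters: subtracting uint8 arrays directly would wrap around modulo 256 for negative differences and silently corrupt the motion signal.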