Revisiting the Effectiveness of Off-the-shelf Temporal Modeling Approaches for Large-scale Video Classification

Bian, Yunlong; Gan, Chuang; Liu, Xiao; Fu, Li; Long, Xiang; Li, Yandong; Qi, Heng; Zhou, Jie; Wen, Shifeng; Lin, Yuanqing

doi:10.48550/arxiv.1708.03805

Cited by 34 publications

(28 citation statements)

References 8 publications


“…The change in both training and validation sets generates a small discrepancy between experiments conducted at different times. We explicitly denote results on the original Kinetics dataset with an asterisk (*) in all tables and provide the list of videos available at the time of our experiments to enable others to reproduce our results¹. HMDB-51 and UCF-101.…”

Section: Datasets (mentioning)

confidence: 99%

“…Kinetics-400
ARTNet [33] | RGB+Flow | 72.4*
TSN [30] | RGB+Flow | 73.9*
R(2+1)D [31] | RGB+Flow | 75.4*
NL I3D [34] | RGB | 77.7*
SAN [1] | RGB+Flow+Audio | 77.7*
I3D [3] | RGB | 70.6 / 71.1*
I3D [3] | Flow | 62.1 / 63.9*
I3D [3] | RGB+Flow | 72.6 / 74.1*
S3D-G [35] | RGB | 74.0 / 74.7*
S3D-G [35] | Flow | 67.3 / 68.0*
S3D-G [35] | RGB+Flow | 76.2 / 77.2*
D3D | RGB | 75.9
D3D+S3D-G | RGB+RGB | 76.5…”

Section: Modality (mentioning)

confidence: 99%


D3D: Distilled 3D Networks for Video Action Recognition

Stroud, Ross, Sun et al. 2020

2020 IEEE Winter Conference on Applications of Computer Vision (WACV)

State-of-the-art methods for video action recognition commonly use an ensemble of two networks: the spatial stream, which takes RGB frames as input, and the temporal stream, which takes optical flow as input. In recent work, both of these streams consist of 3D Convolutional Neural Networks, which apply spatiotemporal filters to the video clip before performing classification. Conceptually, the temporal filters should allow the spatial stream to learn motion representations, making the temporal stream redundant. However, we still see significant benefits in action recognition performance by including an entirely separate temporal stream, indicating that the spatial stream is "missing" some of the signal captured by the temporal stream. In this work, we first investigate whether motion representations are indeed missing in the spatial stream of 3D CNNs. Second, we demonstrate that these motion representations can be improved by distillation, by tuning the spatial stream to predict the outputs of the temporal stream, effectively combining both models into a single stream. Finally, we show that our Distilled 3D Network (D3D) achieves performance on par with two-stream approaches, using only a single model and with no need to compute optical flow.
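The distillation step described in this abstract, tuning the spatial stream to predict the outputs of the temporal stream, can be sketched as a two-term training loss. This is a minimal illustration, not the D3D authors' implementation: the function names, the use of mean squared error between logits, and the `weight` parameter are all assumptions made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Classification loss of the student (RGB stream) against the true label."""
    return -math.log(softmax(logits)[label])


def mean_squared_error(a, b):
    """Average squared difference between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student_logits, teacher_logits, label, weight=1.0):
    """Hypothetical combined loss: label supervision plus a term that pulls
    the student's outputs toward the (frozen) temporal-stream teacher.
    The MSE form and `weight` are assumptions, not taken from the source."""
    action_term = cross_entropy(student_logits, label)
    distill_term = mean_squared_error(student_logits, teacher_logits)
    return action_term + weight * distill_term


if __name__ == "__main__":
    rgb_logits = [2.0, 0.5, -1.0]   # student: spatial (RGB) stream
    flow_logits = [0.0, 3.0, 1.0]   # teacher: temporal (optical flow) stream
    print(distillation_loss(rgb_logits, flow_logits, label=0))
```

When the student already matches the teacher, the distillation term vanishes and only the classification loss remains, which is why a single distilled model needs no optical flow at inference time.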


“…A recent research topic is to estimate optical flow by CNNs [8,35,31,18,26,4]. These approaches cast the optical flow estimation as an optimization problem with respect to the CNN parameters.…”

Section: Related Work (mentioning)

confidence: 99%

End-to-End Learning of Motion Representation for Video Understanding

Fan, Huang, Gan et al. 2018

2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition

Self Cite

Despite the recent success of end-to-end learned representations, hand-crafted optical flow features are still widely used in video analysis tasks. To fill this gap, we propose TVNet, a novel end-to-end trainable neural network, to learn optical-flow-like features from data. TVNet subsumes a specific optical flow solver, the TV-L1 method, and is initialized by unfolding its optimization iterations as neural layers. TVNet can therefore be used directly without any extra learning. Moreover, it can be naturally concatenated with other task-specific networks to formulate an end-to-end architecture, thus making our method more efficient than current multi-stage approaches by avoiding the need to pre-compute and store features on disk. Finally, the parameters of the TVNet can be further fine-tuned by end-to-end training. This enables TVNet to learn richer and task-specific patterns beyond exact optical flow. Extensive experiments on two action recognition benchmarks verify the effectiveness of the proposed approach. Our TVNet achieves better accuracies than all compared methods, while being competitive with the fastest counterpart in terms of feature extraction time.

* indicates equal contributions. This work was conducted while Lijie Fan was a research intern at Tencent AI Lab.


“…Top-5
blVNet [18] | 73.5 | 91.2 | - | -
STM [31] | 73.7 | 91.6 | - | -
TEA [41] | 76.1 | 92.5 | - | -
TS S3D-G [60] | 77.2 | 93.0 | - | -
3-stream SATT [8] | 77.7 | 93.2 | - | -
AVSlowFast, R101 [59] | 78.8 | 93.6 | 85.0† | -
LGD-3D R101 [48] | 79.4 | 94.4 | - | -
SlowFast R101-NL [20] | 79.8 | 93.9 | - | -
ViViT-Base [6] | 80
[32] and Kinetics Sound [4]. We report top-1 and top-5 classification accuracy.…”

Section: Moments In Time (mentioning)

confidence: 99%

Attention Bottlenecks for Multimodal Fusion

Nagrani, Yang, Arnab et al. 2021

Preprint

Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality ('late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and only share what is necessary. We find that such a strategy improves fusion performance, at the same time reducing computational cost. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. All code and models will be released.
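The bottleneck idea in this abstract, where modalities exchange information only through a few shared latents, can be illustrated with a toy single layer. This is a heavily simplified sketch and not the released model: it uses single-head dot-product attention with identity projections (no learned weights), and averaging the two bottleneck updates is an assumption of the example. All function names are hypothetical.

```python
import math


def attend(queries, context):
    """Dot-product attention with identity Q/K/V projections (a simplification)."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        outputs.append(
            [sum(w / norm * v[d] for w, v in zip(weights, context)) for d in range(dim)]
        )
    return outputs


def bottleneck_fusion_layer(video, audio, bottleneck):
    """One fusion layer: each modality attends only to its own tokens plus the
    shared bottleneck tokens, never directly to the other modality."""
    video_out = attend(video + bottleneck, video + bottleneck)
    audio_out = attend(audio + bottleneck, audio + bottleneck)
    video_tokens, b_from_video = video_out[:len(video)], video_out[len(video):]
    audio_tokens, b_from_audio = audio_out[:len(audio)], audio_out[len(audio):]
    # Assumption: merge the two per-modality bottleneck updates by averaging.
    fused = [
        [(x + y) / 2 for x, y in zip(bv, ba)]
        for bv, ba in zip(b_from_video, b_from_audio)
    ]
    return video_tokens, audio_tokens, fused


if __name__ == "__main__":
    video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    audio = [[0.5, -0.5], [2.0, 0.0]]
    bottleneck = [[0.1, 0.2]]
    print(bottleneck_fusion_layer(video, audio, bottleneck))
```

Within one such layer the video tokens are unaffected by the audio input; audio information can reach them only in later layers, through the updated bottleneck tokens. With few bottleneck tokens, each attention call also sees a shorter context than full pairwise attention over both modalities would.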
