2015 IEEE International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2015.510

Learning Spatiotemporal Features with 3D Convolutional Networks

Abstract: We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large scale supervised video dataset. Our findings are three-fold: 1) 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets; 2) A homogeneous architecture with small 3 × 3 × 3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets; and 3) Our learned features, namely C3D (Convolutional…
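To make the abstract's second finding concrete, here is a minimal sketch of a homogeneous 3D ConvNet in which every convolutional layer uses 3 × 3 × 3 kernels. The framework (PyTorch), layer widths, pooling schedule, and the 101-class output are illustrative assumptions, not the paper's exact C3D configuration.

```python
# Minimal sketch (not the authors' release): a homogeneous 3D ConvNet in the
# spirit of C3D, where every convolutional layer uses 3x3x3 kernels.
# Framework (PyTorch), layer widths, and class count are assumptions.
import torch
import torch.nn as nn

class Tiny3DConvNet(nn.Module):
    def __init__(self, num_classes=101):  # 101 classes is illustrative
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1),   # 3x3x3 everywhere
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),           # keep early temporal extent
            nn.Conv3d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):               # x: (batch, 3, frames, height, width)
        x = self.features(x)
        x = x.mean(dim=(2, 3, 4))       # global spatiotemporal average pooling
        return self.classifier(x)

clip = torch.randn(2, 3, 16, 112, 112)   # 16-frame RGB clips, as used by C3D
logits = Tiny3DConvNet()(clip)
print(logits.shape)                       # torch.Size([2, 101])
```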

Cited by 7,556 publications (5,749 citation statements)
References 38 publications

Citation statements (ordered by relevance):
“…By jointly encoding spatio-temporal information in the learning process, 3D convolutional networks [50] have achieved good performance in semantic video short classification. Other works use 2D CNN structures to perform recognition and detection tasks in video by fine-tuning the networks using video frames [51], or by combining video frames and optical flow maps as the input layer [42], [52].…”
Section: Related Work (mentioning)
confidence: 99%
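As an illustration of the two input strategies this statement contrasts, the sketch below applies a 3D convolution directly to a clip (joint spatio-temporal encoding) and, separately, a 2D convolution to a single frame with optical-flow maps stacked as extra input channels. Shapes and channel counts are assumptions made for the example.

```python
# Hedged sketch of the two input strategies contrasted above:
# (a) a 3D convolution over a clip (joint spatio-temporal encoding),
# (b) a 2D convolution over a frame with optical-flow maps stacked as channels.
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)            # (batch, RGB, frames, H, W)
conv3d = nn.Conv3d(3, 32, kernel_size=3, padding=1)
spatiotemporal = conv3d(clip)                      # (1, 32, 16, 112, 112)

frame = torch.randn(1, 3, 112, 112)                # one RGB frame
flow = torch.randn(1, 2, 112, 112)                 # horizontal/vertical flow maps
stacked = torch.cat([frame, flow], dim=1)          # 5-channel 2D input
conv2d = nn.Conv2d(5, 32, kernel_size=3, padding=1)
spatial_only = conv2d(stacked)                     # (1, 32, 112, 112)
```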
“…Tran et al. [50] proposed a 3D convolutional network for video classification. However, this structure does not produce the pixel-level labeling that is required in our task.…”
Section: Pixel-level CNN (mentioning)
confidence: 99%
“…Here, we adopt the recent approach of Xu et al. [38], which encodes features learned by a conv-net model using VLAD. Here, we use the activations from the fc7 layer of a 3D conv-net [34] as our features. We first learn a codebook using k-means with k = 256.…”
Section: DAPs for Action Detection (mentioning)
confidence: 99%
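A minimal sketch of the VLAD encoding step this statement describes is given below: a k-means codebook is learned over descriptors and each set of fc7 activations is aggregated into residuals against the nearest centroids. The feature dimensionality (128-D instead of fc7's 4096-D), the reduced codebook size (16 instead of the quoted k = 256), and the signed-sqrt/L2 normalization are assumptions made to keep the toy example fast; they are not taken from the cited work.

```python
# Hedged sketch of VLAD encoding over per-clip conv-net activations.
# Dimensions and k are shrunk for a quick toy run; normalization follows
# common VLAD practice and is an assumption, not the cited recipe.
import numpy as np
from sklearn.cluster import KMeans

def vlad_encode(descriptors, kmeans):
    """Aggregate local descriptors into a single VLAD vector."""
    centers = kmeans.cluster_centers_                      # (k, d)
    assignments = kmeans.predict(descriptors)              # nearest centroid per descriptor
    k, d = centers.shape
    vlad = np.zeros((k, d), dtype=np.float64)
    for cluster_id in range(k):
        members = descriptors[assignments == cluster_id]
        if len(members):
            vlad[cluster_id] = (members - centers[cluster_id]).sum(axis=0)
    vlad = vlad.flatten()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))           # signed-sqrt normalization
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad

# Toy usage: random stand-ins for fc7 activations (one row per clip).
train_feats = np.random.randn(2000, 128)                   # 128-D for speed; fc7 is 4096-D
codebook = KMeans(n_clusters=16, n_init=4, random_state=0).fit(train_feats)
video_feats = np.random.randn(50, 128)
encoding = vlad_encode(video_feats, codebook)
print(encoding.shape)                                       # (2048,) = 16 clusters x 128 dims
```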
“…Our network integrates the following modules: Visual encoder: It encodes a small video volume into a meaningful low dimensional feature vector. In practice, we use activations from the top layer of a 3D convolutional network trained for action classification (C3D network [34]). Sequence encoder: It encodes the sequence of visual codes as a discriminative sequence of hidden states.…”
Section: Architecture (mentioning)
confidence: 99%
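The statement names two modules; the sketch below covers the sequence-encoder stage, assuming the visual encoder's output is a precomputed 4096-D C3D activation per 16-frame clip. The recurrent cell type (GRU) and the hidden size are assumptions; the cited architecture may use a different recurrent unit.

```python
# Hedged sketch: encode a sequence of per-clip visual codes (assumed to be
# precomputed C3D activations) into a sequence of hidden states.
# GRU choice and dimensions are assumptions for illustration.
import torch
import torch.nn as nn

class SequenceEncoder(nn.Module):
    def __init__(self, visual_dim=4096, hidden_dim=512):
        super().__init__()
        self.rnn = nn.GRU(input_size=visual_dim, hidden_size=hidden_dim,
                          batch_first=True)

    def forward(self, visual_codes):
        # visual_codes: (batch, num_clips, visual_dim), e.g. C3D features
        hidden_states, _ = self.rnn(visual_codes)
        return hidden_states                 # (batch, num_clips, hidden_dim)

# Toy usage: 8 clips per video, one 4096-D visual code per clip.
codes = torch.randn(2, 8, 4096)
states = SequenceEncoder()(codes)
print(states.shape)                           # torch.Size([2, 8, 512])
```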