2017 IEEE International Conference on Computer Vision (ICCV) 2017
DOI: 10.1109/iccv.2017.590

Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks

Abstract: Convolutional Neural Networks (CNN) have been regarded as a powerful class of models for image recognition problems. Nevertheless, it is not trivial when utilizing a CNN for learning spatio-temporal video representation. A few studies have shown that performing 3D convolutions is a rewarding approach to capture both spatial and temporal dimensions in videos. However, the development of a very deep 3D CNN from scratch results in expensive computational cost and memory demand. A valid question is why not recycle…

Cited by 1,550 publications (1,078 citation statements)
References 32 publications
“…However, 3D convolution increases computational complexity substantially and greatly affects inference efficiency. Some works [21,37,29] factorize the spatial-temporal 3D convolution into one 2D spatial convolution and one 1D temporal convolution, and then ensemble them in a sequential or parallel manner. However, the spatial and temporal information are modeled independently.…”
Section: Related Work (mentioning)
confidence: 99%
“…3D-Conv [13,27,2] directly applies 3D convolutional kernels (e.g., 3 × 3 × 3), resulting in a heavy computational cost. (2+1)D-Conv [41,39,29,37,21] decomposes the 3D kernels into 2D spatial kernels and 1D temporal kernels, which are sequentially stacked. Different from the above convolutions which regard all input channels as a whole, STH constructs spatio-temporal hybrid kernels by mixing the basic spatial and temporal kernels along the input channels in one convolutional layer, resulting in deeper integration of spatio-temporal information in one layer.…”
Section: The #Param Of An Augmented Kernel Of the 2d-conv Is (mentioning)
confidence: 99%
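
To make the cost argument in the statements above concrete, here is a minimal sketch that counts the weights of a full 3 × 3 × 3 convolution against a sequentially stacked 1 × 3 × 3 spatial plus 3 × 1 × 1 temporal pair. The function names, the channel width of 64, and the choice to keep the intermediate channel count equal to the output count are illustrative assumptions, not details taken from the cited papers; biases are ignored.

```python
def conv_params(c_in, c_out, kernel):
    """Weight count of a convolution with the given kernel shape (bias ignored)."""
    n = c_in * c_out
    for k in kernel:
        n *= k
    return n


def full_3d_params(c_in, c_out, t=3, s=3):
    """Weights of one t x s x s 3D convolution."""
    return conv_params(c_in, c_out, (t, s, s))


def factorized_params(c_in, c_out, t=3, s=3, c_mid=None):
    """Weights of a 1 x s x s spatial conv followed by a t x 1 x 1 temporal conv.

    c_mid is the hypothetical intermediate channel count; it defaults to c_out.
    """
    c_mid = c_out if c_mid is None else c_mid
    spatial = conv_params(c_in, c_mid, (1, s, s))
    temporal = conv_params(c_mid, c_out, (t, 1, 1))
    return spatial + temporal


if __name__ == "__main__":
    c = 64  # assumed channel width, chosen only for illustration
    print(full_3d_params(c, c))     # 64 * 64 * 27 = 110592
    print(factorized_params(c, c))  # 64 * 64 * (9 + 3) = 49152
```

Under these assumptions the factorized pair uses 12 weights per input-output channel pair instead of 27, which is the saving the citing papers allude to; a parallel arrangement of the two kernels would give the same count.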
“…Recently, 3D CNNs have shown promise in 3D shape recognition, 3D object detection, CAD classification, hand gesture recognition, and diagnostic imaging predictions [16, 25-29]. 3D CNNs can reach into 3D space and aggregate authentically three-dimensional information about a structure. 3D CNNs have the ability to train on a richer and more holistic learning environment, but are much more memory-intensive than 2D models.…”
Section: Introduction (mentioning)
confidence: 99%