2017
DOI: 10.48550/arxiv.1708.05038
Preprint

ConvNet Architecture Search for Spatiotemporal Feature Learning

Abstract: Learning image representations with ConvNets by pretraining on ImageNet has proven useful across many visual understanding tasks, including object detection, semantic segmentation, and image captioning. Although any image representation can be applied to video frames, a dedicated spatiotemporal representation is still vital in order to incorporate motion patterns that cannot be captured by appearance-based models alone. This paper presents an empirical ConvNet architecture search for spatiotemporal feature lear…


Cited by 99 publications (102 citation statements)
References 34 publications
“…Classical models for video action recognition [10,12,17,19,26,66,80,88,89,91] aim to predict action categories but largely ignore the order of actions, because they rely on simple frame-feature aggregation such as pooling. Our task, in contrast, requires verifying two videos that differ by both large and subtle step-level transformations.…”
Section: Methods
confidence: 99%
“…Traditional action-related tasks such as action recognition, action detection, and action segmentation have advanced greatly thanks to progress in CNNs. As a means of general video representation, deep-learning-based action recognition can broadly be grouped into stream-based methods [10,12,17,19,26,52,66,80,88,89,91,105] and skeleton-based methods [21,81,93,99]. Both kinds of methods produce a feature representation for each trimmed video, from which a video-level label over predefined action categories is predicted.…”
Section: Related Work
confidence: 99%
“…It is well known that optical flow is computed from the RGB frames, which is time-consuming and introduces a bottleneck. The second class is based on a family of 3D convolutional networks, such as C3D [16], I3D [7], T3D [38], and Res3D [39], which extend 2D networks into the spatiotemporal dimension. To reduce the computational cost of general 3D convolutional networks, Qiu et al. [40] proposed the Pseudo-3D residual network (P3D), which decomposes 3D convolutions into separate 2D spatial and 1D temporal filters.…”
Section: RGB-Based Action Recognition
confidence: 99%
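The 2D-spatial + 1D-temporal decomposition cited above can be sketched numerically. The helpers below are a minimal illustration (hypothetical function names, no deep-learning framework): a full t × d × d 3D kernel is replaced by a 1 × d × d spatial filter followed by a t × 1 × 1 temporal filter, with the intermediate width M chosen — following the parameter-matching rule used in the R(2+1)D paper — so the factorized pair has roughly the same parameter count as the full 3D kernel.

```python
def conv3d_params(c_in, c_out, t, d):
    """Parameter count of a full t x d x d 3D convolution (no bias)."""
    return c_in * c_out * t * d * d

def conv2plus1d_params(c_in, c_out, t, d):
    """Parameter count of the factorized 2D-spatial + 1D-temporal pair.

    The intermediate channel count M is chosen to approximately match
    the parameter budget of the full 3D kernel, as in R(2+1)D.
    """
    m = (t * d * d * c_in * c_out) // (d * d * c_in + t * c_out)
    spatial = c_in * m * d * d   # 1 x d x d spatial filter
    temporal = m * c_out * t     # t x 1 x 1 temporal filter
    return spatial + temporal, m

full = conv3d_params(64, 64, t=3, d=3)
factored, m = conv2plus1d_params(64, 64, t=3, d=3)
print(full, factored, m)  # → 110592 110592 144
```

With 64 input and output channels and a 3 × 3 × 3 kernel, M = 144 makes the factorized block match the full 3D kernel's 110,592 parameters exactly, while adding an extra nonlinearity between the spatial and temporal stages.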
“…C3D [30] uses 3D kernels to extract short-term information from the RGB frame input. R(2+1)D [31] adds skip connections to C3D and explores different combinations of 3D and 2D convolutions. I3D [2] inflates the 2D convolutional and pooling kernels of a 2D CNN trained on image datasets into 3D, reusing the well-trained 2D parameters.…”
Section: Action Recognition
confidence: 99%
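The I3D inflation step mentioned above can be sketched as follows. This is a minimal NumPy illustration with assumed tensor shapes, not the actual I3D code: a pretrained 2D kernel is replicated T times along a new temporal axis and divided by T, so that a video of T identical frames produces the same activations the image model did.

```python
import numpy as np

def inflate_2d_kernel(w2d, t):
    """Inflate a 2D kernel (c_out, c_in, d, d) into a 3D kernel
    (c_out, c_in, t, d, d) by replicating along time and rescaling by 1/t,
    preserving activations on a temporally constant ("boring") video."""
    return np.repeat(w2d[:, :, None, :, :], t, axis=2) / t

w2d = np.ones((8, 3, 7, 7), dtype=np.float32)
w3d = inflate_2d_kernel(w2d, t=5)
print(w3d.shape)  # → (8, 3, 5, 7, 7)
# Summing over the temporal axis recovers the original 2D kernel,
# which is what keeps the pretrained responses unchanged.
print(np.allclose(w3d.sum(axis=2), w2d))  # → True
```

The 1/t rescaling is the key detail: without it, each output of the inflated network on a static video would be t times larger than the 2D network's output, invalidating the transferred ImageNet weights.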