2018
DOI: 10.1109/tcsvt.2017.2682196

Pooling the Convolutional Layers in Deep ConvNets for Video Action Recognition

Abstract: Deep ConvNets have shown good performance in image classification tasks. However, learning deep video representations for action recognition remains a problem. The problem comes from two aspects: on one hand, current video ConvNets are relatively shallow compared with image ConvNets, which limits their capability of capturing complex video action information; on the other hand, temporal information in videos is not properly utilized to pool and encode the video sequences. Towards these issues, in this…


Cited by 110 publications (55 citation statements)
References 43 publications
“…ConvNets for Feature Extraction versus End-to-End Classification: Table 1(b) shows that treating the ConvNets as feature extractors performs significantly better than using them for end-to-end classification. This agrees with the observations of others [2,37,54]. We further observe that the VLAD encoded conv5 features perform better than fc6.…”
Section: Implementation Details (supporting)
confidence: 93%
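The statement above reports that VLAD-encoded conv5 features outperform fc6 features. For readers unfamiliar with the encoding, here is a minimal NumPy sketch of VLAD: local descriptors are assigned to their nearest cluster center and the residuals are accumulated per center, then flattened and normalized. The function name, dimensions, and the power-normalization step are illustrative; in practice the centers come from k-means on training descriptors.

```python
import numpy as np

def vlad_encode(descriptors, centers):
    """VLAD encoding sketch.

    descriptors: (N, D) local features (e.g. conv-layer activations).
    centers:     (K, D) codebook, typically from k-means.
    Returns a (K * D,) normalized vector.
    """
    # Squared distances (N, K); assign each descriptor to its nearest center.
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    assign = d2.argmin(axis=1)

    K, D = centers.shape
    vlad = np.zeros((K, D))
    for k in range(K):
        mask = assign == k
        if mask.any():
            # Accumulate residuals of the descriptors assigned to center k.
            vlad[k] = (descriptors[mask] - centers[k]).sum(axis=0)

    vlad = vlad.ravel()
    # Signed square-root (power) normalization followed by L2 normalization,
    # a common post-processing choice for VLAD.
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad
```

The resulting fixed-length vector can then be fed to a linear classifier, which is how conv-layer features are typically compared against fc-layer features in this setting.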
“…Another attractive property of using feature representations is that we can manipulate them in various ways to further improve the performance. For instance, we can employ different (i) encoding methods: Fisher vector [25], VideoDarwin [6]; (ii) normalization techniques: rank normalization [18]; and (iii) pooling methods: line pooling [54], trajectory pooling [43,54], etc. Early versus Late Fusion: Table 1(b) also shows that early fusion of features through concatenation performs better than late fusion of SVM probabilities.…”
Section: Implementation Details (mentioning)
confidence: 99%
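The early-versus-late fusion contrast in the statement above can be sketched in a few lines: early fusion concatenates per-sample feature vectors from two streams before training a single classifier, while late fusion trains a classifier per stream and combines their class probabilities. The array shapes and the averaging rule here are illustrative assumptions, not the cited paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_classes = 10, 5
feat_a = rng.normal(size=(n_samples, 64))   # e.g. features from one stream
feat_b = rng.normal(size=(n_samples, 128))  # e.g. features from another stream

# Early fusion: concatenate features per sample, then train ONE classifier
# on the joint representation.
early = np.concatenate([feat_a, feat_b], axis=1)  # shape (10, 192)

# Late fusion: each stream's classifier outputs class probabilities;
# combine them afterwards (here by simple averaging).
probs_a = rng.dirichlet(np.ones(n_classes), size=n_samples)  # stand-in SVM outputs
probs_b = rng.dirichlet(np.ones(n_classes), size=n_samples)
late = (probs_a + probs_b) / 2  # rows still sum to 1
```

The reported finding is that training on the concatenated representation (early fusion) beats averaging per-stream SVM probabilities (late fusion) in their experiments.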
“…Visual representations play an important role in visual recognition. In particular, the visual representations learned with deep Convolutional Neural Networks (CNNs) have improved the performances of object recognition, e.g., (Chatfield et al. 2014; He et al. 2016; Simonyan and Zisserman 2015; Szegedy et al. 2015), and human action recognition, e.g., (Wang et al. 2016; Wu et al. 2016; Zhao et al. 2015). Benefitting from deep learning, zero-shot visual recognition performances have also been boosted, e.g., (Akata et al. 2014; Al-Halah and Stiefelhagen 2015; Reed et al. 2016).…”
Section: Introduction (mentioning)
confidence: 99%
“…The task of video-based driving behaviour recognition facilitates intelligent surveillance and can be regarded as fine-grained video-based human behaviour recognition. Recognition of human behaviour in videos has received great attention in recent research [3,4,5,6]. In [3], Karpathy et al. used stacked video frames as input and trained a multi-resolution CNN to recognize human behaviour.…”
Section: Introduction (mentioning)
confidence: 99%