2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019
DOI: 10.1109/cvpr.2019.01233

Learning Spatio-Temporal Representation With Local and Global Diffusion

Abstract: Convolutional Neural Networks (CNN) have been regarded as a powerful class of models for visual recognition problems. Nevertheless, the convolutional filters in these networks are local operations while ignoring the large-range dependency. Such drawback becomes even worse particularly for video recognition, since video is an information-intensive media with complex temporal variations. In this paper, we present a novel framework to boost the spatio-temporal representation learning by Local and Global Diffusion…

Cited by 166 publications (91 citation statements)
References 46 publications (75 reference statements)
“…Table 6 shows comparison results with conventional methods on the UCF-101 dataset. LGD-3D Two-stream and PoTion + I3D showed similar accuracies to that of the proposed method, but the accuracy of the proposed method was higher on other datasets [6, 36].…”
Section: Results (mentioning)
confidence: 86%
“…uses a shared network of 2D CNNs over three orthogonal views of video to obtain spatial and temporal signals for action recognition. (Qiu et al, 2019) adopts a two-path network architecture that integrates global and local information of both temporal and spatial dimensions for video classification. Other research areas that investigate spatio-temporal learning include video captioning (Aafaq et al, 2019), video super-resolution (Li et al, 2019b), and video object segmentation (Xu et al, 2019).…”
Section: Related Work (mentioning)
confidence: 99%
“…$I_{n=c} \log(p_n)$, (6) where $I_{n=c}$ is an indicator function which equals to 1 if $n$ is the ground truth class label $c$, otherwise 0. For location regression, we employ the Smooth L1 loss ($S_{L1}$) to force the proposal $(\phi_c, \phi_w)$ to move towards its closest ground truth proposal $(g_c, g_w)$.…”
Section: Training and Inference (mentioning)
confidence: 99%
“…With the tremendous increase in online and personal media archives, people are generating, storing, and consuming a large collection of videos. This trend encourages the development of effective and efficient algorithms to intelligently parse video data [1,2,3,4,5,6] and discover semantic information [7,8]. One fundamental challenge underlying the success of these advances is action detection from videos in both temporal [9,10] and spatio-temporal aspects [11].…”
Section: Introduction (mentioning)
confidence: 99%