2021
DOI: 10.48550/arxiv.2105.03245
Preprint

Adaptive Focus for Efficient Video Recognition

Cited by 6 publications (16 citation statements)
References 47 publications
“…State Representation Motivated by (Wu et al. 2019, 2020; Wang et al. 2021), the FSA receives information from the state signal s_t to make the frame selection decision. To select appropriate frame pairs, the information fed to the FSA should contain: 1) the current performance of the pose estimator (to see how much room the pose estimator E has for improvement); and 2) global contextual information in the video (to see where the informative frames might be).…”
Section: Frame Selection Agent (FSA)
confidence: 99%
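The two-part state described in the excerpt above can be sketched as follows. This is a minimal illustration, not the cited papers' implementation: `build_state`, its inputs, and the feature dimensions are all hypothetical placeholders.

```python
import numpy as np

def build_state(frame_confidences, context_feature):
    """Illustrative state vector for a frame-selection agent (FSA).

    Concatenates (1) the estimator's current confidence scores, a
    proxy for how much room the estimator has for improvement, with
    (2) a global context feature summarizing the whole video, a proxy
    for where the informative frames might be.
    """
    perf = np.asarray(frame_confidences, dtype=np.float32)  # current performance signal
    ctx = np.asarray(context_feature, dtype=np.float32)     # global video context
    return np.concatenate([perf, ctx])

# Three per-frame confidences plus an 8-dim context feature
state = build_state([0.62, 0.71, 0.55], np.zeros(8))
print(state.shape)  # (11,)
```

An RL policy would then map this state to a frame-selection action; the key design point is that both the "room for improvement" and the "where to look" signals are visible to the agent in a single vector.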
“…Wu et al. [47] utilize multi-agent reinforcement learning to model parallel frame sampling, and Lin et al. [24] make a one-step decision with a holistic view. Meng et al. [27] and Wang et al. [42,44] focus their attention on spatial redundancy. Panda et al. adaptively decide modalities for video segments.…”
Section: Related Work
confidence: 99%
“…Results on ActivityNet. We compare the proposed method with recent SOTA methods on ActivityNet in Table 3: SCSampler [20], AR-Net [27], AdaMML [30], VideoIQ [36], AdaFocus [42], Dynamic-STE [19] and FrameExit [12]. Experimental results show that our method outperforms all existing methods with ResNet50 as the main recognition network.…”
Section: Comparison With Simple Baselines
confidence: 99%
“…Another type of dynamic CNNs skips redundant layers [38,42,48] or channels [28] conditioned on the inputs. Besides, the spatial adaptive paradigm [14,4,44,39,43] has been proposed for efficient image and video recognition. Although these works are related to DVT in the spirit of adaptive computation, they are developed based on CNNs, while DVT is tailored for vision Transformers.…”
Section: Introduction
confidence: 99%
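The input-conditioned layer skipping mentioned in the excerpt above can be sketched in a few lines. This is a toy NumPy sketch under assumed shapes, not the architecture of any cited paper: `gated_block` and its weights are illustrative, and a hard sigmoid threshold stands in for the learned gating used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

def gated_block(x, w_body, w_gate):
    """Input-conditioned layer skipping: a scalar gate decides per
    sample whether to execute the residual branch or pass the input
    through unchanged (identity), saving the branch's computation."""
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))  # (B, 1) sigmoid gate logits
    keep = (gate > 0.5).astype(x.dtype)         # hard per-sample skip decision
    return x + keep * np.tanh(x @ w_body)       # skipped rows: identity shortcut

x = rng.standard_normal((4, 16))                # batch of 4 feature vectors
w_body = rng.standard_normal((16, 16)) * 0.1    # residual branch weights
w_gate = rng.standard_normal((16, 1))           # gating weights
y = gated_block(x, w_body, w_gate)
print(y.shape)  # (4, 16)
```

In a real network the hard decision is trained with a relaxation (e.g. Gumbel-softmax) so gradients flow through the gate; the sketch only shows the inference-time behavior that yields the compute savings.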