2022
DOI: 10.1007/978-3-031-19772-7_14
AdaFocusV3: On Unified Spatial-Temporal Dynamic Video Recognition

Cited by 10 publications (5 citation statements) · References 54 publications
“…TSN [14], TRN [15], ECO [47], TSMNet [16, 17] and MSNet [48] are static with respect to the input and therefore require no policy model beyond the classification model. The other CNN-based methods we picked (Adafuse [35], Dynamic-STE [40], FrameExit [39], LiteEval [29], SCSampler [32], ARNet [36], VideoIQ [34] and Adafocus v1-3 [36, 37, 38]) optimize inference dynamically. Each of them optimizes the input size or the inference pipeline from its own viewpoint and, in addition to the classification model, requires a policy model (and additional loss functions to train it) to adjust inference.…”
Section: Methods
confidence: 99%
“…These approaches usually focus on dynamic optimization of the input size using additional policy models or modules. For example, they select salient frames/clips as input [29, 30, 31, 32]; adjust the resolution [33], the precision of network quantization [34] and channel fusion [35]; predict crop regions for the input [36, 37, 38]; or decide whether to exit early along the time axis [39] or to use the heavy teacher knowledge [40]. Our method, on the other hand, selects spatiotemporal patches as input.…”
Section: Related Work
confidence: 99%
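The salient-frame selection described in the quote above can be sketched as a cheap scorer followed by a top-k pick. This is a minimal illustration under assumptions: `select_salient_frames` and the linear `policy` are hypothetical stand-ins for the lightweight trained policy model such methods use, not any cited paper's implementation.

```python
import numpy as np

def select_salient_frames(frames, policy, k=4):
    """Score each frame with a cheap policy function and keep the
    k highest-scoring frames, restored to temporal order."""
    scores = np.array([policy(f) for f in frames])
    top_k = np.sort(np.argsort(scores)[-k:])  # top-k indices, time-ordered
    return [frames[i] for i in top_k], top_k

# Toy stand-in: 16 random "frame features" scored by a fixed linear policy.
rng = np.random.default_rng(0)
frames = [rng.standard_normal(32) for _ in range(16)]
w = rng.standard_normal(32)
selected, indices = select_salient_frames(frames, lambda f: float(w @ f), k=4)
```

Only the selected frames would then be passed to the expensive classification model, which is where the compute saving comes from.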
“…As a special type of efficient neural network, dynamic neural networks can adaptively change their inference complexity under different computational budgets, latency requirements, and prediction-confidence requirements. Existing dynamic neural networks are designed along different axes, including sample-wise (Teerapittayanon, McDanel, and Kung 2016; Wang et al. 2018; Veit and Belongie 2018; Wu et al. 2018; Yu et al. 2019; Guo et al. 2019), spatial-wise (Li et al. 2017; Wang et al. 2019a, 2020; Yang et al. 2020; Wang et al. 2022a), and temporal-wise dynamism (Shen et al. 2017; Yu, Lee, and Le 2017; Wu et al. 2019; Meng et al. 2020; Wang et al. 2022b), as categorized by (Han et al. 2021). Specifically, as one type of sample-wise method, depth-wise dynamic models with early exits adaptively exit at different layer depths for different inputs (Huang et al. 2017; Li et al. 2019; McGill and Perona 2017; Jie et al. 2019; Yang et al. 2020).…”
Section: Dynamic Neural Network
confidence: 99%
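The early-exit idea in the quoted passage can be sketched as a confidence-thresholded cascade of stages, each with its own classifier head. This is a generic illustration under assumptions (all names, including `early_exit_inference`, are hypothetical), not the implementation of any cited model.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def early_exit_inference(x, stages, heads, threshold=0.9):
    """Run stages in order; after each stage an attached classifier
    head predicts.  Stop as soon as the maximum softmax probability
    exceeds the threshold (a common early-exit criterion)."""
    h = x
    for depth, (stage, head) in enumerate(zip(stages, heads), start=1):
        h = stage(h)
        probs = softmax(head(h))
        if probs.max() >= threshold:
            break  # confident enough: skip the remaining depth
    return probs, depth

# Toy 3-stage "network": random tanh stages with linear classifier heads.
rng = np.random.default_rng(0)
d, n_classes = 8, 4
stages = [(lambda W: lambda h: np.tanh(W @ h))(rng.standard_normal((d, d)))
          for _ in range(3)]
heads = [(lambda W: lambda h: W @ h)(rng.standard_normal((n_classes, d)))
         for _ in range(3)]
probs, exit_depth = early_exit_inference(rng.standard_normal(d),
                                         stages, heads, threshold=0.5)
```

Easy inputs exit at shallow depths while hard inputs traverse the full network, so the average per-sample cost adapts to the data.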
“…Fortunately, images often have more spatial redundancy than language (Wang, Stuijk, and De Haan 2014), such as regions containing task-unrelated objects. Thus, many works (Wang et al. 2020; Yang et al. 2020; Wang et al. 2022a, 2021b, 2022b; Han et al. 2022b, 2021c) try to adaptively reduce the input resolution of convolutional neural networks. Also, great efforts have been made to excavate redundant tokens for ViTs.…”
Section: Introduction
confidence: 99%