2021
DOI: 10.48550/arxiv.2105.03245
Preprint

Adaptive Focus for Efficient Video Recognition

Cited by 6 publications (16 citation statements)
References 47 publications
“…State Representation Motivated by (Wu et al. 2019, 2020; Wang et al. 2021), the FSA receives information from the state signal s_t to make the frame selection decision. To select appropriate frame pairs, the information fed to the FSA should contain: 1) the current performance of the pose estimator (to see how much room the pose estimator E has for improvement); and 2) global contextual information in the video (to see where the informative frames might be).…”
Section: Frame Selection Agent (FSA)
confidence: 99%
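The two-part state described in the excerpt above can be sketched as follows. This is a minimal illustration, not the cited papers' implementation: `build_state`, its inputs, and the feature dimensions are all hypothetical placeholders.

```python
import numpy as np

def build_state(frame_confidences, context_feature):
    """Illustrative state vector for a frame-selection agent (FSA).

    Concatenates (1) the estimator's current confidence scores, a
    proxy for how much room the estimator has for improvement, with
    (2) a global context feature summarizing the whole video, a proxy
    for where the informative frames might be.
    """
    perf = np.asarray(frame_confidences, dtype=np.float32)  # current performance signal
    ctx = np.asarray(context_feature, dtype=np.float32)     # global video context
    return np.concatenate([perf, ctx])

# Three per-frame confidences plus an 8-dim context feature
state = build_state([0.62, 0.71, 0.55], np.zeros(8))
print(state.shape)  # (11,)
```

An RL policy would then map this state to a frame-selection action; the key design point is that both the "room for improvement" and the "where to look" signals are visible to the agent in a single vector.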
“…Wu et al. [47] utilize multi-agent reinforcement learning to model parallel frame sampling, and Lin et al. [24] make a one-step decision with a holistic view. Meng et al. [27] and Wang et al. [42,44] focus their attention on spatial redundancy. Panda et al. adaptively decide modalities for video segments.…”
Section: Related Work
confidence: 99%
“…Results on ActivityNet. We compare the proposed method with recent SOTA methods on ActivityNet in Table 3: SCSampler [20], AR-Net [27], AdaMML [30], VideoIQ [36], AdaFocus [42], Dynamic-STE [19] and FrameExit [12]. Experimental results show that our method outperforms all existing methods with ResNet50 as the main recognition network.…”
Section: Comparison With Simple Baselines
confidence: 99%
“…Another type of dynamic CNNs skips redundant layers [38,42,48] or channels [28] conditioned on the inputs. Besides, the spatial adaptive paradigm [14,4,44,39,43] has been proposed for efficient image and video recognition. Although these works are related to DVT in the spirit of adaptive computation, they are developed based on CNNs, while DVT is tailored for vision Transformers.…”
Section: Introduction
confidence: 99%
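The input-conditioned layer skipping mentioned in the excerpt above can be sketched in a few lines. This is a toy NumPy sketch under assumed shapes, not the architecture of any cited paper: `gated_block` and its weights are illustrative, and a hard sigmoid threshold stands in for the learned gating used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

def gated_block(x, w_body, w_gate):
    """Input-conditioned layer skipping: a scalar gate decides per
    sample whether to execute the residual branch or pass the input
    through unchanged (identity), saving the branch's computation."""
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))  # (B, 1) sigmoid gate logits
    keep = (gate > 0.5).astype(x.dtype)         # hard per-sample skip decision
    return x + keep * np.tanh(x @ w_body)       # skipped rows: identity shortcut

x = rng.standard_normal((4, 16))                # batch of 4 feature vectors
w_body = rng.standard_normal((16, 16)) * 0.1    # residual branch weights
w_gate = rng.standard_normal((16, 1))           # gating weights
y = gated_block(x, w_body, w_gate)
print(y.shape)  # (4, 16)
```

In a real network the hard decision is trained with a relaxation (e.g. Gumbel-softmax) so gradients flow through the gate; the sketch only shows the inference-time behavior that yields the compute savings.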