Spatially Adaptive Feature Refinement for Efficient Inference

Han, Yizeng; Huang, Gao; Song, Sejun; Yang, Le; Zhang, Yitian; Jiang, Haojun

doi:10.1109/tip.2021.3125263

Cited by 17 publications

(18 citation statements)

References 38 publications


“…In contrast, our deformable attention takes a powerful and yet simple design to learn a set of global keys shared among visual tokens, and can be adopted as a general backbone for various vision tasks. Our method can also be viewed as a spatial adaptive mechanism, which has been proved effective in various works [16,38].…”

Section: Related Work (mentioning)

confidence: 99%

Vision Transformer with Deformable Attention

Xia, Pan, Song, et al. 2022

Preprint

Self Cite


Transformers have recently shown superior performances on various vision tasks. The large, sometimes even global, receptive field endows Transformer models with higher representation power over their CNN counterparts. Nevertheless, simply enlarging receptive field also gives rise to several concerns. On the one hand, using dense attention e.g., in ViT, leads to excessive memory and computational cost, and features can be influenced by irrelevant parts which are beyond the region of interests. On the other hand, the sparse attention adopted in PVT or Swin Transformer is data agnostic and may limit the ability to model long range relations. To mitigate these issues, we propose a novel deformable self-attention module, where the positions of key and value pairs in self-attention are selected in a data-dependent way. This flexible scheme enables the self-attention module to focus on relevant regions and capture more informative features. On this basis, we present Deformable Attention Transformer, a general backbone model with deformable attention for both image classification and dense prediction tasks. Extensive experiments show that our models achieve consistently improved results on comprehensive benchmarks. Code is available at https://github.com/LeapLabTHU/DAT.


“…Dynamic inference results. We apply our training strategy on MSDNet with 5 and 7 exits and compare with three groups of competitive baseline methods: classic networks (ResNet [13], DenseNet [18]), pruning-based approaches (Sparse Structure Selection (SSS) [19], Transformable Architecture Search (TAS) [6]), and dynamic networks (Shallow-Deep Networks (SDN) [23], Dynamic Convolutions (DynConv) [44], and Spatially Adaptive Feature Refinement (SAR) [12]).…”

Section: Cifar Results (mentioning)

confidence: 99%

“…Improving the inference efficiency of deep learning has become a research trend. Popular solutions include lightweight architecture design [16,50], network pruning [28,29,35,48], weight quantization [20,10], and dynamic neural networks [11,37,17,47,32,1,46,44,12]. Dynamic networks have attracted considerable research interests due to their favorable efficiency and representation power [11].…”

Section: Introduction (mentioning)

confidence: 99%

Learning to Weight Samples for Dynamic Early-exiting Networks

Han, Pu, Lai, et al. 2022

Preprint

Self Cite


Early exiting is an effective paradigm for improving the inference efficiency of deep networks. By constructing classifiers with varying resource demands (the exits), such networks allow easy samples to be output at early exits, removing the need for executing deeper layers. While existing works mainly focus on the architectural design of multi-exit networks, the training strategies for such models are largely left unexplored. The current state-of-the-art models treat all samples the same during training. However, the early-exiting behavior during testing has been ignored, leading to a gap between training and testing. In this paper, we propose to bridge this gap by sample weighting. Intuitively, easy samples, which generally exit early in the network during inference, should contribute more to training early classifiers. The training of hard samples (mostly exit from deeper layers), however, should be emphasized by the late classifiers. Our work proposes to adopt a weight prediction network to weight the loss of different training samples at each exit. This weight prediction network and the backbone model are jointly optimized under a meta-learning framework with a novel optimization objective. By bringing the adaptive behavior during inference into the training phase, we show that the proposed weighting mechanism consistently improves the trade-off between classification accuracy and inference efficiency. Code is available at https://github.com/LeapLabTHU/L2W-DEN.


“…Visual grounding (VG) task [13,24,40,65] has achieved great progress in recent years, with the advances in both computer vision [16,20,21,25,26,46,56,57,59] and natural language processing [4,14,41,50,53]. It aims to localize the objects referred by natural language queries, which is essential for various vision-language tasks, e.g., visual question answering [2] and visual commonsense reasoning [67].…”

Section: Introduction (mentioning)

confidence: 99%

Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding

Jiang, Lin, Han, et al. 2022

Preprint

Self Cite


Visual grounding, i.e., localizing objects in images according to natural language queries, is an important topic in visual language understanding. The most effective approaches for this task are based on deep learning, which generally require expensive manually labeled image-query or patch-query pairs. To eliminate the heavy dependence on human annotations, we present a novel method, named Pseudo-Q, to automatically generate pseudo language queries for supervised training. Our method leverages an off-the-shelf object detector to identify visual objects from unlabeled images, and then language queries for these objects are obtained in an unsupervised fashion with a pseudo-query generation module. Then, we design a task-related query prompt module to specifically tailor generated pseudo language queries for visual grounding tasks. Further, in order to fully capture the contextual relationships between images and language queries, we develop a visual-language model equipped with multi-level cross-modality attention mechanism. Extensive experimental results demonstrate that our method has two notable benefits: (1) it can reduce human annotation costs significantly, e.g., 31% on RefCOCO [65] without degrading original model's performance under the fully supervised setting, and (2) without bells and whistles, it achieves superior or comparable performance compared to state-of-the-art weakly-supervised visual grounding methods on all the five datasets we have experimented. Code is available at https://github.com/LeapLabTHU/Pseudo-Q.

