Glance and Focus Networks for Dynamic Visual Recognition

Huang, Gao; Wang, Yulin; Lv, Kangchen; Jiang, Haojun; Huang, Wenhui; Pan, Qi; Song, Shiji

doi:10.48550/arxiv.2201.03014

Cited by 3 publications

(2 citation statements)

References 0 publications


“…Recently, Transformer [50] has attracted the attention of computer vision community due to its success in the field of natural language processing. A series of Transformer-based methods [13,27,56,51,36,18,12,6,57,60,25,42] have been developed for high-level vision tasks, including image classification [36,13,27,44,49], object detection [34,48,36,4,6], segmentation [55,51,16,2], etc. Although vision Transformer has shown its superiority on modeling long-range dependency [13,43], there are still many works demonstrating that the convolution can help Transformer achieve better visual representation [56,58,61,60,25].…”

Section: Vision Transformer (mentioning)

confidence: 99%

Activating More Pixels in Image Super-Resolution Transformer

Chen, Wang, Zhou, et al. 2022. Preprint.


Transformer-based methods have shown impressive performance in low-level vision tasks, such as image super-resolution. However, we find that these networks can only utilize a limited spatial range of input information through attribution analysis. This implies that the potential of Transformer is still not fully exploited in existing networks. In order to activate more input pixels for reconstruction, we propose a novel Hybrid Attention Transformer (HAT). It combines channel attention and self-attention schemes, thus making use of their complementary advantages. Moreover, to better aggregate the cross-window information, we introduce an overlapping cross-attention module to enhance the interaction between neighboring window features. In the training stage, we additionally propose a same-task pre-training strategy to bring further improvement. Extensive experiments show the effectiveness of the proposed modules, and the overall method significantly outperforms the state-of-the-art methods by more than 1dB. Codes and models will be available at https://github.com/chxy95/HAT.


“…Visual grounding (VG) task [13,24,40,65] has achieved great progress in recent years, with the advances in both computer vision [16,20,21,25,26,46,56,57,59] and natural language processing [4,14,41,50,53]. It aims to localize the objects referred by natural language queries, which is essential for various vision-language tasks, e.g., visual question answering [2] and visual commonsense reasoning [67].…”

Section: Introduction (mentioning)

confidence: 99%

Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding

Jiang, Lin, Han, et al. 2022. Preprint.

Self Cite


Visual grounding, i.e., localizing objects in images according to natural language queries, is an important topic in visual language understanding. The most effective approaches for this task are based on deep learning, which generally require expensive manually labeled image-query or patch-query pairs. To eliminate the heavy dependence on human annotations, we present a novel method, named Pseudo-Q, to automatically generate pseudo language queries for supervised training. Our method leverages an off-the-shelf object detector to identify visual objects from unlabeled images, and then language queries for these objects are obtained in an unsupervised fashion with a pseudo-query generation module. Then, we design a task-related query prompt module to specifically tailor generated pseudo language queries for visual grounding tasks. Further, in order to fully capture the contextual relationships between images and language queries, we develop a visual-language model equipped with multi-level cross-modality attention mechanism. Extensive experimental results demonstrate that our method has two notable benefits: (1) it can reduce human annotation costs significantly, e.g., 31% on RefCOCO [65] without degrading original model's performance under the fully supervised setting, and (2) without bells and whistles, it achieves superior or comparable performance compared to state-of-the-art weakly-supervised visual grounding methods on all the five datasets we have experimented. Code is available at https://github.com/LeapLabTHU/Pseudo-Q.


Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning

Sponner, Waschneck, Kumar. 2024. ACM Comput. Surv.


Adaptive optimization methods for deep learning adjust the inference task to the current circumstances at runtime to improve the resource footprint while maintaining the model’s performance. These methods are essential for the widespread adoption of deep learning, as they offer a way to reduce the resource footprint of the inference task while also having access to additional information about the current environment. This survey covers the state-of-the-art at-runtime optimization methods, provides guidance for readers to choose the best method for their specific use-case, and also highlights current research gaps in this field.
