2022
DOI: 10.48550/arxiv.2201.02767
Preprint
QuadTree Attention for Vision Transformers

Abstract: Transformers have been successful in many vision tasks, thanks to their capability of capturing long-range dependency. However, their quadratic computational complexity poses a major obstacle for applying them to vision tasks requiring dense predictions, such as object detection, feature matching, stereo, etc. We introduce QuadTree Attention, which reduces the computational complexity from quadratic to linear. Our quadtree transformer builds token pyramids and computes attention in a coarse-to-fine manner. At …
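The coarse-to-fine idea in the abstract can be illustrated with a minimal NumPy sketch: pool the key map into a coarse pyramid level, score each query against the coarse regions, then attend only over the fine tokens inside the top-K regions. This is an illustrative two-level sketch under assumed shapes, not the paper's implementation; all function names (`pool_tokens`, `quadtree_attention_sketch`) are hypothetical.

```python
import numpy as np

def pool_tokens(x, factor=2):
    # Average-pool an (H, W, C) token map by `factor` to build a coarser level.
    H, W, C = x.shape
    return x.reshape(H // factor, factor, W // factor, factor, C).mean(axis=(1, 3))

def attention(q, k, v):
    # Standard scaled dot-product attention over flattened tokens.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def quadtree_attention_sketch(q_map, k_map, v_map, topk=2):
    # Hypothetical two-level quadtree attention: coarse scoring, fine refinement.
    H, W, C = k_map.shape
    kc = pool_tokens(k_map)                        # coarse keys, (H/2, W/2, C)
    Wc = kc.shape[1]
    q_flat = q_map.reshape(-1, C)
    # Coarse pass: score each query against the pooled key regions.
    scores_c = q_flat @ kc.reshape(-1, C).T / np.sqrt(C)
    out = np.zeros_like(q_flat)
    for i, q in enumerate(q_flat):
        # Fine pass: keep only the top-K coarse cells and attend over the fine
        # tokens inside them, so per-query cost depends on topk, not on H*W.
        sel = np.argsort(scores_c[i])[-topk:]
        fine_k, fine_v = [], []
        for s in sel:
            r, c = divmod(int(s), Wc)
            rows, cols = slice(2 * r, 2 * r + 2), slice(2 * c, 2 * c + 2)
            fine_k.append(k_map[rows, cols].reshape(-1, C))
            fine_v.append(v_map[rows, cols].reshape(-1, C))
        out[i] = attention(q[None, :], np.concatenate(fine_k),
                           np.concatenate(fine_v))[0]
    return out.reshape(q_map.shape)
```

With `topk` fixed, each query attends to a constant number of fine tokens per level, which is the source of the quadratic-to-linear reduction the abstract claims; the paper's method recurses this selection over a deeper token pyramid.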

Cited by 4 publications (25 citation statements) | References 28 publications
“…To solve this issue, transformer-based detector-free methods have emerged as more robust alternatives, demonstrating impressive matching abilities in texture-less regions [43,18,47,57,4]. However, the high computational cost of attention limits transformer-based methods to 'semi-dense' matching, where source matching points are spaced apart at intervals of coarse feature space, as shown in Fig.…”
Section: Methods
confidence: 99%
“…Figure 1: QuadTree [47] (a,d) vs our CasMTR (b,c,e). Our method achieves more fine-grained matching pairs for both source and target images (b).…”
Section: Methods
confidence: 99%
“…We refer to (Fisher, 2012) for an overview of the main variants. Adaptive quadtrees have also been successfully introduced in Transformer architectures (Tang et al, 2022), suggesting that further techniques linked to Collages and fractal compression may be beneficial in this domain. Finally, images generated from IFSs have been used to construct artificial pretraining datasets for large vision models (Kataoka et al, 2020).…”
Section: Fractal Compression
confidence: 99%