COTR: Correspondence Transformer for Matching Across Images

Jiang, Wei; Trulls, Eduard; Hosang, Jan; Tagliasacchi, Andrea; Yi, Kwang Moo

doi:10.1109/iccv48922.2021.00615

Cited by 157 publications

(58 citation statements)

References 46 publications


“…For those works addressing visual correspondence, LoFTR [27] uses a cross and self-attention module to refine the feature maps conditioned on both input images, and formulate the hand-crafted aggregation layer with dual-softmax [19], [57], and Optimal Transport [25] to infer correspondences. In another work, COTR [58] takes coordinates as input and addresses dense correspondence tasks without the use of a correlation map. Unlike these, for the first time, we propose a transformer-based cost aggregation module.…”

Section: Transformers In Vision (mentioning)

confidence: 99%

CATs++: Boosting Cost Aggregation with Convolutions and Transformers

Cho, Hong, Kim. 2022. Preprint.


Cost aggregation is a crucial step in image matching tasks; it aims to disambiguate noisy matching scores. Existing methods generally tackle this with hand-crafted or CNN-based approaches, which either lack robustness to severe deformations or inherit the limitations of CNNs, which fail to discriminate incorrect matches because of their limited receptive fields and lack of adaptability. In this paper, we introduce Cost Aggregation with Transformers (CATs), which explores global consensus within the initial correlation map through architectural designs that exploit the global receptive field of the self-attention mechanism. To this end, we include appearance affinity modeling, which helps to disambiguate the noisy initial correlation maps. Furthermore, we introduce multi-level aggregation, which exploits the rich semantics present at different feature levels, and swapping self-attention, which yields reciprocal matching scores that act as a strong regularizer. Although CATs attains competitive performance, it has a limitation: the high computational cost of a standard transformer, whose complexity grows with the spatial and feature dimensions, restricts it to limited resolutions and in turn limits performance. To overcome this, we propose CATs++, an extension of CATs. Concretely, we introduce early convolutions before transformer-based cost aggregation to control the number of tokens and to inject a convolutional inductive bias, and we propose a novel transformer architecture for efficient and effective cost aggregation, which yields a clear performance boost and cost reduction. With the reduced cost, we build our network with a hierarchical structure to process higher-resolution inputs. We conduct experiments to demonstrate the effectiveness of the proposed model over the latest methods.
We evaluate the proposed method on standard benchmarks, including PF-WILLOW, PF-PASCAL, and SPair-71k. Our methods outperform the previous state of the art by large margins on all of these benchmarks. We also provide extensive ablation studies and analyses.
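
As a rough illustration of the "initial correlation map" that the abstract above refers to, the sketch below computes cosine-similarity matching scores between two small sets of feature vectors and reads off the highest-scoring target for each source feature. It is not the CATs or CATs++ implementation: the function names (`cosine`, `correlation_map`, `best_matches`) and the toy features are hypothetical, and no aggregation step is performed.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def correlation_map(src_feats, tgt_feats):
    """Score every source feature against every target feature.

    Returns a list of rows: corr[i][j] is the similarity between
    source feature i and target feature j.
    """
    return [[cosine(s, t) for t in tgt_feats] for s in src_feats]

def best_matches(corr):
    """For each source row, return the index of the highest-scoring target."""
    return [max(range(len(row)), key=row.__getitem__) for row in corr]

# Toy 2-D features (hypothetical values, for illustration only).
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.0, 2.0], [3.0, 3.0], [2.0, 0.1]]

corr = correlation_map(src, tgt)
matches = best_matches(corr)
print(matches)
```

Taking the per-row maximum of raw scores like this is exactly the kind of noisy decision that a cost aggregation stage is meant to improve on.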


“…[1] propose a sparse correspondence method for inter-class scenarios; leveraging pre-trained CNN features. Recent works employ transformers for dense correspondence in intra-class pairs [8,47,22]. However, those methods fail to find meaningful correspondences under significant pose, scale and appearance changes.…”

Section: Related Work (mentioning)

confidence: 99%

Deep ViT Features as Dense Visual Descriptors

Amir, Gandelsman, Bagon et al. 2021. Preprint.


Figure 1: Deep ViT features applied to vision tasks. Panels: (a) input images; (b) co-segmented objects and parts; (c) input image pair; (d) correspondences. We demonstrate the effectiveness of deep features extracted from a self-supervised, pre-trained ViT model (DINO-ViT) as dense patch descriptors via real-world vision tasks: (a-b) co-segmentation and part co-segmentation: given a set of input images (e.g., 4 input images), we automatically co-segment semantically common foreground objects (e.g., animals), and then further partition them into common parts; (c-d) point correspondence: given a pair of input images, we automatically extract a sparse set of corresponding points. We tackle these tasks by applying only lightweight, simple methodologies, such as clustering or binning, to deep ViT features.


“…They work well for continuous frames but are inadequate to handle image pairs with large displacements. Very recently, the concurrent works [39,16] involve global context between matches by using transformers [42] which achieve great success in many NLP and vision tasks [11,6,51] based on the attention mechanism. Different from these works, we propose to adopt sparse correspondence as prior and design lightweighted network layers to efficiently propagate the contextual information to all image points, allowing predicting dense correspondence for arbitrary points.…”

Section: Related Work (mentioning)

confidence: 99%

“…But the key difference is that we propose a more sophisticated graph to model multi-level contexts using sparse correspondence as prior and develop a general architecture to infuse the contextual information into local features. In our graph-structured network, the message passing layers are implemented with the attention-based mechanism of Transformer [42], which is also used by some recent works [39,16].…”

Section: Related Work (mentioning)

confidence: 99%

“…Image correspondence is the foundation of many computer vision tasks, such as geometric matching [28,41,40], pose estimation [39,37], visual localization [53], and optical flow [28,41,40,16]. Although being long explored, it remains an open question, especially for images under large appearance or view changes, or containing textureless or repetitive regions.…”

Section: Introduction (mentioning)

confidence: 99%


DenseGAP: Graph-Structured Dense Correspondence Learning with Anchor Points

Kuang, Li, He et al. 2021. Preprint.


Establishing dense correspondence between two images is a fundamental computer vision problem, which is typically tackled by matching local feature descriptors. However, without global awareness, such local features are often insufficient for disambiguating similar regions. Moreover, computing the pairwise feature correlation across images is both computation-expensive and memory-intensive. To make the local features aware of the global context and improve their matching accuracy, we introduce DenseGAP, a new solution for efficient Dense correspondence learning with a Graph-structured neural network conditioned on Anchor Points. Specifically, we first propose a graph structure that utilizes anchor points to provide a sparse but reliable prior on inter- and intra-image context and propagates it to all image points via directed edges. We also design a graph-structured network to broadcast multi-level contexts via light-weighted message-passing layers and generate high-resolution feature maps at low memory cost. Finally, based on the predicted feature maps, we introduce a coarse-to-fine framework for accurate correspondence prediction using cycle consistency. Our feature descriptors capture both local and global information, thus enabling a continuous feature field for querying arbitrary points at high resolution. Through comprehensive ablative experiments and evaluations on large-scale indoor and outdoor datasets, we demonstrate that our method advances the state-of-the-art of correspondence learning on most benchmarks. All of our training and evaluation codes are available at https://formyfamily.github.io/DenseGAP/.
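
The abstract above mentions cycle consistency as part of correspondence prediction. A minimal sketch of a generic forward-backward check is shown below, assuming two hypothetical point-mapping callables (`forward` and `backward`) and an arbitrary pixel threshold (`max_error`). It illustrates the general idea only and is not DenseGAP's coarse-to-fine procedure.

```python
import math

def cycle_consistent(points, forward, backward, max_error=1.0):
    """Keep correspondences whose round trip lands near the starting point.

    points    : iterable of (x, y) query points in image A
    forward   : hypothetical callable mapping a point in A to a point in B
    backward  : hypothetical callable mapping a point in B to a point in A
    max_error : largest allowed round-trip distance, in pixels (assumed value)
    """
    kept = []
    for p in points:
        q = forward(p)          # A -> B
        p_back = backward(q)    # B -> A
        err = math.hypot(p[0] - p_back[0], p[1] - p_back[1])
        if err <= max_error:
            kept.append((p, q))
    return kept

# Toy mappings: a pure 5-pixel shift, with the backward map drifting
# by 3 pixels for points far to the right (to simulate a bad match).
def forward(p):
    return (p[0] + 5, p[1])

def backward(q):
    drift = 3 if q[0] > 20 else 0
    return (q[0] - 5 + drift, q[1])

points = [(0, 0), (4, 2), (20, 1)]
kept = cycle_consistent(points, forward, backward)
print(kept)
```

In this toy setup the third point is rejected because its round trip ends 3 pixels from where it started, which exceeds the assumed 1-pixel threshold.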
