GiT: Graph Interactive Transformer for Vehicle Re-Identification

Shen, Fei; Xie, Yi; Zhu, Jianqing; Zhu, Xiaobin; Zeng, Huanqiang

doi:10.1109/tip.2023.3238642

Cited by 60 publications

(35 citation statements)

References 94 publications

“…Unlike graph transformers being applied to general node classification, relatively few studies have applied graphs to ViTs for vision applications. Shen et al. [17] proposed a graph interactive transformer (GiT) for vehicle reidentification. Using this method, the GiT is divided into two modules: the original transformer module for extracting powerful global patch features and a local correlation graph (LCG) module for extracting local features that are distinct within the patch.…”

Section: Related Studies (mentioning)

confidence: 99%

“…Unlike with other graph-based transformers [17, 18, 19], which apply graphs and attention in parallel and combine the outputs, this study is the first attempt to apply a graph inside the transformer head and replace MHA with a few GHA mechanisms. Moreover, there is no need for a class token in patch embedding, and thus the number of operations can be reduced.…”

Section: Introduction (mentioning)

confidence: 99%

Rethinking Attention Mechanisms in Vision Transformers with Graph Structures

Kim, 2024. Sensors

In this paper, we propose a new type of vision transformer (ViT) based on graph head attention (GHA). Because the multi-head attention (MHA) of a pure ViT requires multiple parameters and tends to lose the locality of an image, we replaced MHA with GHA by applying a graph to the attention head of the transformer. Consequently, the proposed GHA maintains both the locality and globality of the input patches and guarantees the diversity of the attention. The proposed GHA-ViT commonly outperforms pure ViT-based models using small-sized CIFAR-10/100, MNIST, and MNIST-F datasets and a medium-sized ImageNet-1K dataset in scratch training. A Top-1 accuracy of 81.7% was achieved for ImageNet-1K using GHA-B, which is a base model with approximately 29 M parameters. In addition, with CIFAR-10/100, the existing ViT and parameters are reduced 17-fold and the performance increased by 0.4/4.3%, respectively. The proposed GHA-ViT shows promising results in terms of the number of parameters and operations and the level of accuracy in comparison with other state-of-the-art ViT-lightweight models.

“…EMRN [29] proposes a multi-resolution features dimension uniform module to fix dimensional features from images of varying resolutions, thus solving the multi-scale problem. Besides, GiT [30] uses a graph network approach to propose a structure where graphs and transformers interact constantly, enabling close collaboration between global and local features for vehicle Re-ID. The dual-relational attention module (DRAM) [31] models the importance of feature points in the spatial dimension and the channel dimension to form a three-dimensional attention module to mine more detailed semantic information.…”

Section: Related Work On the Vehicle Re-id Task (mentioning)

confidence: 99%

A novel dual-pooling attention module for UAV vehicle re-identification

Guo, Yang, Jia et al. 2024. Sci Rep

Vehicle re-identification (Re-ID) involves identifying the same vehicle captured by other cameras, given a vehicle image. It plays a crucial role in the development of safe cities and smart cities. With the rapid growth and implementation of unmanned aerial vehicles (UAVs) technology, vehicle Re-ID in UAV aerial photography scenes has garnered significant attention from researchers. However, due to the high altitude of UAVs, the shooting angle of vehicle images sometimes approximates vertical, resulting in fewer local features for Re-ID. Therefore, this paper proposes a novel dual-pooling attention (DpA) module, which achieves the extraction and enhancement of locally important information about vehicles from both channel and spatial dimensions by constructing two branches of channel-pooling attention (CpA) and spatial-pooling attention (SpA), and employing multiple pooling operations to enhance the attention to fine-grained information of vehicles. Specifically, the CpA module operates between the channels of the feature map and splices features by combining four pooling operations so that vehicle regions containing discriminative information are given greater attention. The SpA module uses the same pooling operations strategy to identify discriminative representations and merge vehicle features in image regions in a weighted manner. The feature information of both dimensions is finally fused and trained jointly using label smoothing cross-entropy loss and hard mining triplet loss, thus solving the problem of missing detail information due to the high height of UAV shots. The proposed method’s effectiveness is demonstrated through extensive experiments on the UAV-based vehicle datasets VeRi-UAV and VRU.

“…For example, in the task of face recognition (Boutros et al., 2022), deep learning methods effectively capture the facial information under complex conditions, enabling accurate identification of individuals based on semantic attributes. Similarly, in vehicle re-identification (Shen et al., 2023), the metric learning framework facilitates reliable screening of complex multi-view positive samples, leading to precise consensus decision-making despite variations in multi-sensor data. A prominent network structure that implements the metric learning framework is the siamese neural network, exemplified by MatchNet (Han et al., 2015).…”

Section: Introduction (mentioning)

confidence: 99%

Metric networks for enhanced perception of non-local semantic information

Zhou, Zhang. 2023. Front. Neurorobot.

Introduction: Metric learning, as a fundamental research direction in the field of computer vision, has played a crucial role in image matching. Traditional metric learning methods aim at constructing two-branch siamese neural networks to address the challenge of image matching, but they often overlook cross-source and cross-view scenarios. Methods: In this article, a multi-branch metric learning model is proposed to address these limitations. The main contributions of this work are as follows: Firstly, we design a multi-branch siamese network model that enhances measurement reliability through information compensation among data points. Secondly, we construct a non-local information perception and fusion model, which accurately distinguishes positive and negative samples by fusing information at different scales. Thirdly, we enhance the model by integrating semantic information and establish an information consistency mapping between multiple branches, thereby improving the robustness in cross-source and cross-view scenarios. Results: Experimental tests which demonstrate the effectiveness of the proposed method are carried out under various conditions, including homologous, heterogeneous, multi-view, and cross-view scenarios. Compared to the state-of-the-art comparison algorithms, our proposed algorithm achieves an improvement of ~1, 2, 1, and 1% in terms of similarity measurement Recall@10, respectively, under these four conditions. Discussion: In addition, our work provides an idea for improving the cross-scene application ability of UAV positioning and navigation algorithms.
