IGARSS 2022 - 2022 IEEE International Geoscience and Remote Sensing Symposium
DOI: 10.1109/igarss46834.2022.9883252

Multi-Scale Interactive Transformer for Remote Sensing Cross-Modal Image-Text Retrieval

Cited by 8 publications (4 citation statements)
References 7 publications
“…In CV, transformers have demonstrated advantages in processing multimodal data due to their more general and flexible modeling space [ 74 ]. Consequently, researchers started employing transformers to address multimodal problems in RS image text retrieval [ 75 ] and RS visual question answering [ 76 ]. Currently, there exists a scarcity of large-scale multimodal datasets, leading to researchers’ need to collect multimodal data by themselves.…”
Section: Challenges (mentioning)
confidence: 99%
“…Furthermore, several studies address the challenges posed by the multi-scale features of remote sensing images, as the differences in target scales make the semantic alignment of cross-modal features more complex [30]. As documented in [18,24,30,31], two main challenges arise in cross-modal retrieval due to multiple scales: (1) effectively utilizing the diverse scale features of an image, including emphasizing salient features and preserving information related to small targets; (2) modeling the intricate relationships among multiscale targets. To address these challenges, Yuan et al [18] introduced a multi-scale vision self-attention module that comprehensively investigates multi-scale information and eliminates redundant features by merging cross-layer features of a convolutional neural network (CNN).…”
Section: Introduction (mentioning)
confidence: 99%
“…To address these challenges, Yuan et al [18] introduced a multi-scale vision self-attention module that comprehensively investigates multi-scale information and eliminates redundant features by merging cross-layer features of a convolutional neural network (CNN). Additionally, Wang et al [31] designed a lightweight sub-module for multi-scale feature exploration that utilizes parallel networks with distinct receptive fields to extract and integrate multi-scale features. Yao et al [30] focused on modeling the relationships among multi-scale targets by constructing hypergraph networks at different levels to depict the connections between objects of varying scales.…”
Section: Introduction (mentioning)
confidence: 99%
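The parallel-branch idea attributed to Wang et al. [31] above can be illustrated with a minimal sketch. This is not the authors' code: it stands in for "distinct receptive fields" with 1-D moving averages of different window sizes over a feature sequence, and "integration" is simply collecting the branch outputs side by side.

```python
# Hypothetical sketch of multi-scale feature extraction with parallel
# branches of distinct receptive fields (not the cited implementation).

def moving_average(seq, window):
    """Smooth a 1-D feature sequence with a centered sliding window,
    clamping the window at the sequence edges (output length == input)."""
    n = len(seq)
    out = []
    for i in range(n):
        lo = max(0, i - window // 2)
        hi = min(n, i + window // 2 + 1)
        out.append(sum(seq[lo:hi]) / (hi - lo))
    return out

def multi_scale_features(seq, windows=(1, 3, 5)):
    """Run parallel branches, one per receptive-field size, and return
    their outputs for fusion by a downstream module."""
    return [moving_average(seq, w) for w in windows]

feats = multi_scale_features([1.0, 2.0, 3.0, 4.0])
```

Each branch sees the same input but summarizes a different spatial extent; small windows preserve information about small targets while large windows emphasize salient, larger-scale structure, which matches the two challenges listed in the excerpt.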
“…Transformer model is popular in image-text matching due to the introduction of an attention mechanism with strong reasoning ability (Jie et al., 2021; Wang et al., 2022; Yang et al., 2023; Messina et al., 2021). For example, in (Yang et al., 2023), the authors employ a transformer encoder to extract intra-modality relationships present within both the image and text.…”
Section: Introduction (mentioning)
confidence: 99%
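The "intra-modality relationships" mentioned in the last excerpt come from a transformer encoder's self-attention, where every token of one modality (image region or word) attends to every other token of the same modality. The following is a minimal, dependency-free sketch of the scaled dot-product core, not the cited model: queries, keys, and values are the raw token embeddings, with no learned projections or multi-head structure.

```python
import math

# Hedged sketch of scaled dot-product self-attention, the mechanism a
# transformer encoder uses to model intra-modality relationships.

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(tokens):
    """Each token's output is a weighted average of all tokens,
    with weights from scaled dot-product similarity."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens))
                    for j in range(d)])
    return out

attended = self_attention([[1.0, 0.0], [0.0, 1.0]])
```

Because the attention weights form a convex combination, each output row lies inside the span of the input tokens; in a full encoder this step is wrapped with learned projections, residual connections, and feed-forward layers.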