IGARSS 2022 - 2022 IEEE International Geoscience and Remote Sensing Symposium
DOI: 10.1109/igarss46834.2022.9883252

Multi-Scale Interactive Transformer for Remote Sensing Cross-Modal Image-Text Retrieval

Cited by 8 publications (4 citation statements)
References 7 publications
“…In CV, transformers have demonstrated advantages in processing multimodal data due to their more general and flexible modeling space [ 74 ]. Consequently, researchers started employing transformers to address multimodal problems in RS image text retrieval [ 75 ] and RS visual question answering [ 76 ]. Currently, there exists a scarcity of large-scale multimodal datasets, leading to researchers’ need to collect multimodal data by themselves.…”
Section: Challenges (mentioning)
confidence: 99%
“…Furthermore, several studies address the challenges posed by the multi-scale features of remote sensing images, as the differences in target scales make the semantic alignment of cross-modal features more complex [30]. As documented in [18,24,30,31], two main challenges arise in cross-modal retrieval due to multiple scales: (1) effectively utilizing the diverse scale features of an image, including emphasizing salient features and preserving information related to small targets; (2) modeling the intricate relationships among multiscale targets. To address these challenges, Yuan et al [18] introduced a multi-scale vision self-attention module that comprehensively investigates multi-scale information and eliminates redundant features by merging cross-layer features of a convolutional neural network (CNN).…”
Section: Introduction (mentioning)
confidence: 99%
“…To address these challenges, Yuan et al [18] introduced a multi-scale vision self-attention module that comprehensively investigates multi-scale information and eliminates redundant features by merging cross-layer features of a convolutional neural network (CNN). Additionally, Wang et al [31] designed a lightweight sub-module for multi-scale feature exploration that utilizes parallel networks with distinct receptive fields to extract and integrate multi-scale features. Yao et al [30] focused on modeling the relationships among multi-scale targets by constructing hypergraph networks at different levels to depict the connections between objects of varying scales.…”
Section: Introduction (mentioning)
confidence: 99%
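The parallel-branch idea attributed to Wang et al. [31] above can be illustrated with a minimal sketch. This is not the authors' code: it stands in for "distinct receptive fields" with 1-D moving averages of different window sizes over a feature sequence, and "integration" is simply collecting the branch outputs side by side.

```python
# Hypothetical sketch of multi-scale feature extraction with parallel
# branches of distinct receptive fields (not the cited implementation).

def moving_average(seq, window):
    """Smooth a 1-D feature sequence with a centered sliding window,
    clamping the window at the sequence edges (output length == input)."""
    n = len(seq)
    out = []
    for i in range(n):
        lo = max(0, i - window // 2)
        hi = min(n, i + window // 2 + 1)
        out.append(sum(seq[lo:hi]) / (hi - lo))
    return out

def multi_scale_features(seq, windows=(1, 3, 5)):
    """Run parallel branches, one per receptive-field size, and return
    their outputs for fusion by a downstream module."""
    return [moving_average(seq, w) for w in windows]

feats = multi_scale_features([1.0, 2.0, 3.0, 4.0])
```

Each branch sees the same input but summarizes a different spatial extent; small windows preserve information about small targets while large windows emphasize salient, larger-scale structure, which matches the two challenges listed in the excerpt.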
“…Transformer model is popular in image-text matching due to the introduction of an attention mechanism with strong reasoning ability (Jie et al., 2021; Wang et al., 2022; Yang et al., 2023; Messina et al., 2021). For example, in (Yang et al., 2023), the authors employ a transformer encoder to extract intra-modality relationships present within both the image and text.…”
Section: Introduction (mentioning)
confidence: 99%
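The "intra-modality relationships" mentioned in the last excerpt come from a transformer encoder's self-attention, where every token of one modality (image region or word) attends to every other token of the same modality. The following is a minimal, dependency-free sketch of the scaled dot-product core, not the cited model: queries, keys, and values are the raw token embeddings, with no learned projections or multi-head structure.

```python
import math

# Hedged sketch of scaled dot-product self-attention, the mechanism a
# transformer encoder uses to model intra-modality relationships.

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(tokens):
    """Each token's output is a weighted average of all tokens,
    with weights from scaled dot-product similarity."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens))
                    for j in range(d)])
    return out

attended = self_attention([[1.0, 0.0], [0.0, 1.0]])
```

Because the attention weights form a convex combination, each output row lies inside the span of the input tokens; in a full encoder this step is wrapped with learned projections, residual connections, and feed-forward layers.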