2020
DOI: 10.1109/jstars.2019.2959208
Retrieval Topic Recurrent Memory Network for Remote Sensing Image Captioning

Abstract: Remote sensing image (RSI) captioning aims to generate sentences that describe the content of RSIs. In caption datasets, each RSI is typically annotated with five sentences. Because different annotators attend to different parts of an image, each sentence covers only part of the image content, and any single annotated sentence may be ambiguous compared with the other four. Previous methods, which treat the five sentences separately, may therefore generate an ambiguous sentence. In order to consider the five sentences together, a …

Citations: Cited by 48 publications (26 citation statements)
References: References 48 publications
“…To solve the problem that current caption generators require high computing power, Genc et al. proposed a decoder based on support vector machines [26], which is effective when only a limited amount of training samples is available. Despite the relative maturity of caption-based RSCTIR [12], [27]–[29], the model inevitably introduces noise due to the limitations of the dual-stage process, thus affecting retrieval accuracy.…”
Section: A. RS Cross-Modal Text-Image Retrieval (RSCTIR)
Mentioning confidence: 99%
“…Besides tailored attention mechanisms, other previous studies on remote sensing image captioning have explored alternative paths for improving the results of standard encoder-decoder neural models. These include studies (a) exploring multi-scale feature representations [15], [16]; (b) using novel loss functions [17] or training procedures based on reinforcement learning [18], improving on the standard cross-entropy loss [17]; (c) extending and combining the set of reference captions associated with each image through summarization [19] or retrieval [20] approaches; or (d) using decoder components based on the Transformer architecture [18].…”
Section: Related Work
Mentioning confidence: 99%
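The citation statements above all build on the standard encoder-decoder captioning loop: a visual encoder compresses the image into a feature vector, and a recurrent decoder emits one token at a time, feeding each predicted token back in. The toy sketch below illustrates only that generic loop; the weights are random, and every name in it (VOCAB, greedy_decode, the mean-pool "encoder") is an illustrative assumption, not the architecture of the paper or of any cited work.

```python
import numpy as np

# Illustrative sketch of a generic encoder-decoder captioning loop.
# All weights are random and all names are assumptions for illustration;
# this is not the model proposed in the paper or in any cited work.

VOCAB = ["<bos>", "<eos>", "a", "runway", "with", "planes"]
BOS, EOS = 0, 1
D = 8  # hidden / feature dimensionality

rng = np.random.default_rng(0)
W_h = rng.normal(size=(D, D)) * 0.1            # hidden-to-hidden weights
W_x = rng.normal(size=(D, len(VOCAB))) * 0.1   # token "embeddings" (columns)
W_o = rng.normal(size=(len(VOCAB), D)) * 0.1   # hidden-to-vocabulary weights

def encode(image: np.ndarray) -> np.ndarray:
    """Stand-in for a CNN encoder: mean-pool pixels into a D-dim feature."""
    return np.resize(image.mean(axis=0), D)

def greedy_decode(feat: np.ndarray, max_len: int = 10) -> list:
    """Greedy decoding: feed the previously predicted token back each step."""
    h = np.tanh(feat)  # initialise the hidden state from the image feature
    tok = BOS
    out = []
    for _ in range(max_len):
        x = W_x[:, tok]                 # embedding of the previous token
        h = np.tanh(W_h @ h + x)        # simple RNN cell update
        tok = int(np.argmax(W_o @ h))   # greedy pick of the next token
        if tok == EOS:
            break
        out.append(VOCAB[tok])
    return out

caption = greedy_decode(encode(rng.normal(size=(16, 4))))
print(caption)  # token list; content is meaningless with random weights
```

The alternative paths in the quoted survey (multi-scale features, reinforcement-learning objectives, Transformer decoders) modify the encoder, the training loss, or the decoder of exactly this pipeline while keeping its overall shape.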
“…Aerial image captioning: Aerial image captioning aims to accurately describe an image by generating concise and flexible sentences. Recently, many aerial image captioning methods [34]–[38] have been proposed for a better understanding of image content. This work is supported by three public datasets: the UCM-captions dataset [39], the Sydney-captions dataset [39], and RSICD [40].…”
Section: A. Aerial Image Processing Tasks
Mentioning confidence: 99%
“…A sound active attention framework is proposed in [34] to generate image descriptions with the guidance of sound. Furthermore, building on the five sentences available per image in existing datasets, a topic-word strategy is proposed in [35] to describe images with a memory network. In addition, a visual aligning attention model is proposed in [17] to provide accurate image captioning by focusing on regions of interest.…”
Section: A. Aerial Image Processing Tasks
Mentioning confidence: 99%