2020
DOI: 10.1109/jstars.2019.2959208
Retrieval Topic Recurrent Memory Network for Remote Sensing Image Captioning

Abstract: Remote sensing image (RSI) captioning aims to generate sentences that describe the content of RSIs. In caption datasets, each RSI is typically annotated with five sentences. Because different annotators attend to different parts of an image, each sentence covers only part of the image content, and any single annotated sentence may be ambiguous compared with the other four. Previous methods, which treat the five sentences separately, may therefore generate an ambiguous sentence. In order to consider the five sentences together, a …

Citations: Cited by 48 publications (26 citation statements)
References: References 48 publications
“…To solve the problem that current caption generators require high computing power, Genc et al. proposed a decoder based on support vector machines [26], which is effective when only a limited amount of training samples is available. Despite the relative maturity of caption-based RSCTIR [12], [27]–[29], the model inevitably introduces noise due to the limitations of the dual-stage process, thus affecting retrieval accuracy.…”
Section: A. RS Cross-Modal Text-Image Retrieval (RSCTIR)
Mentioning confidence: 99%
“…Besides tailored attention mechanisms, other previous studies on remote sensing image captioning have explored alternative paths for improving the results of standard encoder-decoder neural models. These include studies (a) exploring multi-scale feature representations [15], [16]; (b) using novel loss functions [17] or training procedures based on reinforcement learning [18], improving on the standard cross-entropy loss [17]; (c) extending and combining the set of reference captions associated with each image through summarization [19] or retrieval [20] approaches; or (d) using decoder components based on the Transformer architecture [18].…”
Section: Related Work
Mentioning confidence: 99%
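The citation statements above all build on the standard encoder-decoder captioning loop: a visual encoder compresses the image into a feature vector, and a recurrent decoder emits one token at a time, feeding each predicted token back in. The toy sketch below illustrates only that generic loop; the weights are random, and every name in it (VOCAB, greedy_decode, the mean-pool "encoder") is an illustrative assumption, not the architecture of the paper or of any cited work.

```python
import numpy as np

# Illustrative sketch of a generic encoder-decoder captioning loop.
# All weights are random and all names are assumptions for illustration;
# this is not the model proposed in the paper or in any cited work.

VOCAB = ["<bos>", "<eos>", "a", "runway", "with", "planes"]
BOS, EOS = 0, 1
D = 8  # hidden / feature dimensionality

rng = np.random.default_rng(0)
W_h = rng.normal(size=(D, D)) * 0.1            # hidden-to-hidden weights
W_x = rng.normal(size=(D, len(VOCAB))) * 0.1   # token "embeddings" (columns)
W_o = rng.normal(size=(len(VOCAB), D)) * 0.1   # hidden-to-vocabulary weights

def encode(image: np.ndarray) -> np.ndarray:
    """Stand-in for a CNN encoder: mean-pool pixels into a D-dim feature."""
    return np.resize(image.mean(axis=0), D)

def greedy_decode(feat: np.ndarray, max_len: int = 10) -> list:
    """Greedy decoding: feed the previously predicted token back each step."""
    h = np.tanh(feat)  # initialise the hidden state from the image feature
    tok = BOS
    out = []
    for _ in range(max_len):
        x = W_x[:, tok]                 # embedding of the previous token
        h = np.tanh(W_h @ h + x)        # simple RNN cell update
        tok = int(np.argmax(W_o @ h))   # greedy pick of the next token
        if tok == EOS:
            break
        out.append(VOCAB[tok])
    return out

caption = greedy_decode(encode(rng.normal(size=(16, 4))))
print(caption)  # token list; content is meaningless with random weights
```

The alternative paths in the quoted survey (multi-scale features, reinforcement-learning objectives, Transformer decoders) modify the encoder, the training loss, or the decoder of exactly this pipeline while keeping its overall shape.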
“…Aerial image captioning: Aerial image captioning aims to accurately describe an image by generating concise and flexible sentences. Recently, many aerial image captioning methods [34]–[38] have been proposed for a better understanding of image content. This work is supported by three public datasets: the UCM-captions dataset [39], the Sydney-captions dataset [39], and RSICD [40].…”
Section: A. Aerial Image Processing Tasks
Mentioning confidence: 99%
“…A sound active attention framework is proposed in [34] to generate image descriptions with the guidance of sound. Furthermore, building on the five sentences available per image in existing datasets, a topic-word strategy is proposed in [35] to describe images with a memory network. In addition, a visual aligning attention model is proposed in [17] to provide accurate image captioning by focusing on regions of interest.…”
Section: A. Aerial Image Processing Tasks
Mentioning confidence: 99%