Exploring Models and Data for Remote Sensing Image Caption Generation

Lu, Xiaoqiang; Wang, Binqiang; Zheng, Xiangtao; Li, Xuelong

doi:10.1109/tgrs.2017.2776321

Cited by 353 publications

(283 citation statements)

References 46 publications

Supporting

Mentioning

244

Contrasting

Unclassified

“…The compression methods for multisource image/video data are designed from the perspective of image features, which usually mine similarities between image blocks by matching feature points. Moreover, multiscale features for image representation are proposed to extend representation from single payload to multiple payloads, as being proposed in References [35][36][37][38], which is also a way to build relations between multiple data sources. However, computational complexity is high, and the actual correspondence between the selected image block and the coding object is often lacking, which is not conducive to large-area matching.…”

Section: Video Compression Of Multisource Image/video Data

Mentioning

confidence: 99%

Towards Real-Time Service from Remote Sensing: Compression of Earth Observatory Video Data via Long-Term Background Referencing

Xiao, Zhu, et al. 2018

Remote Sensing

City surveillance enables many innovative applications of smart cities. However, the real-time utilization of remotely sensed surveillance data via unmanned aerial vehicles (UAVs) or video satellites is hindered by the considerable gap between the high data collection rate and the limited transmission bandwidth. High efficiency compression of the data is in high demand. Long-term background redundancy (LBR) (in contrast to local spatial/temporal redundancies in a single video clip) is a new form of redundancy common in Earth observatory video data (EOVD). LBR is induced by the repetition of static landscapes across multiple video clips and becomes significant as the number of video clips shot of the same area increases. Eliminating LBR improves EOVD coding efficiency considerably. First, this study proposes eliminating LBR by creating a long-term background referencing library (LBRL) containing high-definition geographically registered images of an entire area. Then, it analyzes the factors affecting the variations in the image representations of the background. Next, it proposes a method of generating references for encoding current video and develops the encoding and decoding framework for EOVD compression. Experimental results show that encoding UAV video clips with the proposed method saved an average of more than 54% bits using references generated under the same conditions. Bitrate savings reached 25-35% when applied to satellite video data with arbitrarily collected reference images. Applying the proposed coding method to EOVD will facilitate remote surveillance, which can foster the development of online smart city applications.

“…To follow the direction of scene caption, a well-annotated scene caption dataset is also necessary. Researchers have presented a few exemplary works on remote sensing image caption [23,24], and have constructed a large-scale dataset under specific annotated instructions in consideration of characteristics of remote sensing images, e.g., not using words that represent the concept of "direction" and "vague". We believe that the scene caption will be a new chance to generate better description of scenes in remote sensing images and will receive more concerns from remote sensing community.…”

Section: Better Describing the Content Of Scenes

Mentioning

confidence: 99%

Recent Advances and Opportunities in Scene Classification of Aerial Images with Deep Models

Xia, Yang, et al. 2018

IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium

Scene classification is a fundamental task in interpretation of remote sensing images, and has become an active research topic in remote sensing community due to its important role in a wide range of applications. Over the past years, tremendous efforts have been made for developing powerful approaches for scene classification of remote sensing images, evolving from the traditional bag-of-visual-words model to the new generation deep convolutional neural networks (CNNs). The deep CNN based methods have exhibited remarkable breakthrough on performance, dramatically outperforming previous methods which strongly rely on hand-crafted features. However, performance with deep CNNs has gradually plateaued on existing public scene datasets, due to the notable drawbacks of these datasets, such as the small scale and low-diversity of training samples. Therefore, to promote the development of new methods and move the scene classification task a step further, we deeply discuss the existing problems in scene classification task, and accordingly present three open directions. We believe these potential directions will be instructive for the researchers in this field.

“…With the development of deep learning on computer vision, scene understanding [9,10,11,12,13,14] achieves a remarkable progress. At present, CNN-based methods [15,16] attain the significant performance for crowd counting.…”

Section: Introduction

Mentioning

confidence: 99%

SCAR: Spatial-/channel-wise attention regression networks for crowd counting

2019

Recently, crowd counting is a hot topic in crowd analysis. Many CNN-based counting algorithms attain good performance. However, these methods only focus on the local appearance features of crowd scenes but ignore the large-range pixel-wise contextual and crowd attention information. To remedy the above problems, in this paper, we introduce the Spatial-/Channel-wise Attention Models into the traditional Regression CNN to estimate the density map, which is named as "SCAR". It consists of two modules, namely Spatial-wise Attention Model (SAM) and Channel-wise Attention Model (CAM). The former can encode the pixel-wise context of the entire image to more accurately predict density maps at the pixel level. The latter attempts to extract more discriminative features among different channels, which aids model to pay attention to the head region, the core of crowd scenes. Intuitively, CAM alleviates the mistaken estimation for background regions. Finally, two types of attention information and traditional CNN's feature maps are integrated by a concatenation operation. Furthermore, the extensive experiments are conducted on four popular datasets, Shanghai Tech Part A/B, GCC, and UCF_CC_50 Dataset. The results show that the proposed method achieves state-of-the-art results.
