2021
DOI: 10.3390/rs13030516

Vision Transformers for Remote Sensing Image Classification

Abstract: In this paper, we propose a remote-sensing scene-classification method based on vision transformers. These networks, now recognized as state-of-the-art models in natural language processing, do not rely on the convolution layers used in standard convolutional neural networks (CNNs). Instead, they use multihead attention mechanisms as the main building block to derive long-range contextual relations between pixels in images. As a first step, the images under analysis are divided into patches, then c…
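The patch-splitting step named in the abstract can be illustrated with a short NumPy sketch. It is not the authors' implementation. The 224×224 input size, the 16×16 patch size, and the function name `image_to_patches` are all illustrative assumptions. Because the abstract is cut off right after the patching step, the sketch does not try to reproduce the rest of the pipeline. The final linear projection follows the general ViT design and uses random weights. It is an assumption, not something taken from this paper.

```python
import numpy as np

def image_to_patches(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping square patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C):
    one flattened patch per row, in row-major (raster) order.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, C): split each spatial axis into (grid index, offset in patch)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    # Move both grid axes to the front: (gh, gw, p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

# Hypothetical example: a 224x224 RGB scene cut into 16x16 patches.
rng = np.random.default_rng(0)
scene = rng.random((224, 224, 3), dtype=np.float32)
tokens = image_to_patches(scene, patch_size=16)
print(tokens.shape)  # (196, 768)

# Assumed linear patch embedding (as in the general ViT design): project each
# flattened patch to a d_model-dimensional token. Weights are random here.
d_model = 768
w_embed = rng.standard_normal((tokens.shape[1], d_model)).astype(np.float32) * 0.02
embeddings = tokens @ w_embed
print(embeddings.shape)  # (196, 768)
```

Each row of `tokens` is one patch. The transformer therefore sees a 196-element sequence rather than a pixel grid, and this is what allows attention to relate distant image regions directly.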

Cited by 288 publications (136 citation statements)
References: 46 publications (53 reference statements)

“…From this, the preceding material, the Butterworth filter is manually calibrated, and a spectrogram is used to ensure that the waveforms have been recovered [40]. While the accuracy is increased, a downside of this approach is that other birds might still be making the same signal, which is when used alongside it [41]. But in this case, bird species have frequencies that are close to one another and appear to possess identical characteristics separating the various recordings of birds' calls to increase the success of single-label methods results would be very impractical so doing so would only make the success of the results of each form of detection less likely [42].…”
Section: Pre-processing (mentioning)
confidence: 99%
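The quoted pre-processing step pairs a Butterworth filter with a spectrogram check. A minimal SciPy sketch of that combination follows. It is not the citing authors' code, and several values are assumptions: the 22,050 Hz sample rate, the 1 to 8 kHz passband, the filter order of 4, and the synthetic "call" (a 3 kHz tone plus 60 Hz hum and noise). In the quoted work the cut-offs were calibrated by hand. Here they are simply fixed.

```python
import numpy as np
from scipy import signal

fs = 22_050  # assumed sample rate in Hz
t = np.arange(0, 2.0, 1 / fs)
rng = np.random.default_rng(0)

# Synthetic recording: a 3 kHz "call" plus low-frequency hum and broadband noise.
call = np.sin(2 * np.pi * 3000 * t)
hum = 2.0 * np.sin(2 * np.pi * 60 * t)
x = call + hum + 0.3 * rng.standard_normal(t.size)

# 4th-order Butterworth band-pass (1-8 kHz, assumed), as second-order sections
# for numerical stability.
sos = signal.butter(4, [1000, 8000], btype="bandpass", fs=fs, output="sos")
y = signal.sosfiltfilt(sos, x)  # zero-phase filtering (forward and backward)

# Spectrogram of the filtered signal, used to check what survived the filter.
f, seg_t, sxx = signal.spectrogram(y, fs=fs, nperseg=1024)
mean_power = sxx.mean(axis=1)
print(f"strongest frequency after filtering: {f[np.argmax(mean_power)]:.0f} Hz")
```

`sosfiltfilt` is used here so that filtering does not shift call onsets in time. The cost is that the filter cannot run in real time.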
“…In the first set of experiments, we tested the effect of using data augmentation on the classification results. Experimental results on single-label scene classification have shown that using a combination of data augmentation techniques can improve the classification results [25]. Therefore, we augmented the dataset with additional samples using random flipping, rotations, and cutout during training.…”
Section: Experiments 1: The Effect of Data Augmentation (mentioning)
confidence: 99%
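The augmentation recipe quoted above (random flipping, rotations, and cutout) can be sketched in NumPy as follows. The snippet does not give the exact parameters, so the following choices are all assumptions: a flip probability of 0.5, rotations restricted to multiples of 90 degrees, a 16-pixel zero-filled cutout square, and the hypothetical function name `augment`. For non-square images, an odd number of 90-degree turns swaps height and width.

```python
import numpy as np

def augment(image: np.ndarray, rng: np.random.Generator, cutout_size: int = 16) -> np.ndarray:
    """Randomly flip, rotate by a multiple of 90 degrees, and cut out a square.

    image: (H, W, C) array. Returns a new array; the input is left unmodified.
    """
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1]  # horizontal flip
    if rng.random() < 0.5:
        out = out[::-1, :]  # vertical flip
    out = np.rot90(out, k=int(rng.integers(4)), axes=(0, 1))
    out = out.copy()  # detach from the input before writing the cutout

    # Cutout: zero a square centred at a random pixel, clipped at the borders.
    h, w = out.shape[:2]
    cy, cx = int(rng.integers(h)), int(rng.integers(w))
    half = cutout_size // 2
    y0, y1 = max(cy - half, 0), min(cy + half, h)
    x0, x1 = max(cx - half, 0), min(cx + half, w)
    out[y0:y1, x0:x1] = 0
    return out

# Usage on a hypothetical 256x256 RGB scene.
rng = np.random.default_rng(42)
scene = rng.random((256, 256, 3))
augmented = augment(scene, rng)
print(augmented.shape)  # (256, 256, 3)
```

Flips combined with 90-degree rotations never create new pixels. This suits overhead imagery, where a scene has no canonical "up". Cutout instead forces the model to rely on more than one region of the scene.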
“…Meanwhile, a new type of deep-learning model known as transformers has been developed for natural language processing (NLP) and has started to gain some popularity in computer vision [24] and remote sensing communities [25,26]. A transformer is an architecture that was first introduced by Vaswani et al. in 2017 for machine translation [27].…”
Section: Introduction (mentioning)
confidence: 99%
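In the transformer of Vaswani et al., multihead attention runs scaled dot-product attention, softmax(QKᵀ/√d_k)V, in parallel over several heads. It then concatenates the head outputs and applies an output projection. The NumPy sketch below shows this computation for a single self-attention layer. It is illustrative only. The token count, model width, head count, and random weights are assumptions, and the function names are hypothetical.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # Subtract the max along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Multi-head self-attention over a token sequence x of shape (n, d_model).

    Each head computes softmax(Q K^T / sqrt(d_k)) V on its slice of the
    projected queries, keys and values. The head outputs are concatenated and
    projected by w_o. Returns (output, attention_weights).
    """
    n, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_k = d_model // num_heads

    def split(t):
        # (n, d_model) -> (num_heads, n, d_k)
        return t.reshape(n, num_heads, d_k).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)  # (num_heads, n, n)
    weights = softmax(scores, axis=-1)  # every token attends to every token
    heads = weights @ v  # (num_heads, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
n_tokens, d_model, num_heads = 10, 64, 8
x = rng.standard_normal((n_tokens, d_model))
w_q, w_k, w_v, w_o = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4))
out, attn = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, attn.shape)  # (10, 64) (8, 10, 10)
```

The score matrix is n×n, so every token (for an image, every patch) interacts with every other token in a single layer. This is the long-range behaviour that convolutions only reach after many stacked layers.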
“…The experimental results showed that the vision transformer classification accuracy rate of remote sensing images exceeds the CNN model [15].…”
Section: Introduction (mentioning)
confidence: 96%