2022
DOI: 10.1109/tgrs.2022.3152566
Vision Transformer: An Excellent Teacher for Guiding Small Networks in Remote Sensing Image Scene Classification

Cited by 42 publications
(26 citation statements)
References 86 publications
“…The structure of the ViT differs fundamentally from that of the CNN: it treats a 2D image as a 1D ordered sequence and applies the self-attention mechanism for global dependency modelling, demonstrating stronger global feature extraction. Driven by this, many researchers in the field of remote sensing have introduced ViTs for segmentation-related tasks, such as land cover classification [63][64][65][66][67][68], urban scene parsing [69][70][71][72][73][74], change detection [75,76], road extraction [77] and especially building extraction [78]. For example, Chen et al. [79] proposed a sparse token Transformer to learn the global dependency of tokens in both spatial and channel dimensions, achieving state-of-the-art accuracy on benchmark building extraction datasets.…”
Section: B. ViT-Based Building Extraction Methods
confidence: 99%
“…In this framework, the ViT model is used to capture semantic features, while the CNN model is used to extract local structural information. In [47], the advantages of the two models are combined, without increasing the computational complexity, through knowledge distillation, in which the ViT serves as a teacher guiding the student model ResNet18. Beyond the classification task, the paper also shows that this method generalizes well to different tasks.…”
Section: Related Work
confidence: 99%
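The teacher–student setup described in the statement above (a ViT teacher guiding a ResNet18 student) can be sketched with the standard Hinton-style soft-label distillation loss. This is a generic illustration, not the exact objective of the cited work; the temperature `T` and the logit values are illustrative assumptions.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T yields softer distributions."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between softened teacher and student outputs.

    The T*T factor rescales gradients, as in Hinton-style distillation.
    Illustrative sketch only; not the cited paper's exact loss.
    """
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return float(np.sum(p_t * (np.log(p_t) - np.log(p_s)))) * T * T

# Identical logits give zero loss; diverging logits give a positive loss.
print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(distillation_loss([1.0, 0.0, 0.0], [0.0, 0.0, 1.0]) > 0)
```

In practice this soft-label term is typically combined with an ordinary cross-entropy loss on the ground-truth labels, weighted by a mixing coefficient.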
“…The detailed steps are as follows: given a 3D image of size H×W×C, it is first reshaped into a set of n non-overlapping, flattened 2D patches x_n ∈ ℝ^{n×(P²·C)}, where H, W, and C represent the height, width, and channels of the image, P is the resolution of each patch, and n is the number of embedded patches. Subsequently, the patches are projected to D dimensions by a learnable embedding matrix E ∈ ℝ^{(P²·C)×D} [56]. At this point, the image has been transformed into a sequence of n patches, also called tokens.…”
Section: Vision Transformer
confidence: 99%
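The tokenization steps quoted above can be sketched in a few lines of NumPy. The shapes H, W, C, P, and D below are illustrative choices, not values from the cited paper; a real ViT would learn E rather than sample it randomly.

```python
import numpy as np

# Illustrative shapes: 32x32 RGB image, 8x8 patches, embedding dim 64.
H, W, C, P, D = 32, 32, 3, 8, 64
n = (H // P) * (W // P)  # number of patches (tokens): 16

rng = np.random.default_rng(0)
image = rng.standard_normal((H, W, C))

# Reshape into n non-overlapping, flattened patches: x_n in R^{n x (P^2 * C)}
patches = (image.reshape(H // P, P, W // P, P, C)
                .transpose(0, 2, 1, 3, 4)   # group patch rows/cols together
                .reshape(n, P * P * C))

# Project to D dimensions with an embedding matrix E in R^{(P^2 * C) x D};
# here E is random for illustration, but it is learnable in a real ViT.
E = rng.standard_normal((P * P * C, D)) * 0.02
tokens = patches @ E

print(patches.shape)  # (16, 192)
print(tokens.shape)   # (16, 64)
```

After this projection, a ViT would prepend a class token and add positional embeddings before feeding the sequence to the transformer encoder.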
“…Horry et al. [55] introduced a highly scalable and accurate ViT model into a low-parameter CNN for land use classification, aiming to cope with the growing volume of satellite data. Similarly, Xu et al. [56] proposed an end-to-end network, called ET-GSNet, which improves generalization across different tasks via knowledge distillation. In addition, transformer- and convolution-based SAR image despeckling networks proposed by Perera et al. [57] also achieved superior performance compared to pure CNNs.…”
Section: Introduction
confidence: 99%