2022
DOI: 10.1609/aaai.v36i3.20150
Scaled ReLU Matters for Training Vision Transformers

Abstract: Vision transformers (ViTs) have emerged as an alternative design paradigm to convolutional neural networks (CNNs). However, ViTs are much harder to train than CNNs, as they are sensitive to training parameters such as the learning rate, optimizer, and number of warmup epochs. The reasons for this training difficulty are empirically analysed in the paper Early Convolutions Help Transformers See Better, whose authors conjecture that the issue lies with the patchify stem of ViT models. In this paper, we further investigate this p…

Cited by 23 publications (5 citation statements)
References 57 publications
“…Since then it has been widely used in natural language processing (NLP), e.g., BERT [42]. Owing to the Transformer's success in NLP, it has attracted growing attention in computer vision in recent years, e.g., Vision Transformer (ViT) [43] and Swin Transformer [44]. ViT divides the input image into non-overlapping image patches and linearly projects each patch into a d-dimensional feature vector using a learnable weight matrix [45]. Inspired by ViT, the spectrum is divided into several patches of the same sequence length as the input to the Transformer, which reduces the length of the input sequence and enables straightforward processing and analysis at lower computational complexity.…”
Section: A Transformer-Based Encoder For HTD
confidence: 99%
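The patchify-and-project step the statement above describes can be sketched in a few lines of numpy. This is a minimal illustration, not the cited implementation; `patchify_embed`, the image size, patch size, and embedding dimension are all hypothetical choices.

```python
import numpy as np

def patchify_embed(image, patch_size, weight):
    """Split an (H, W, C) image into non-overlapping patch_size x patch_size
    patches, flatten each, and linearly project it to a d-dim token."""
    H, W, C = image.shape
    p = patch_size
    # (H//p, p, W//p, p, C) -> (num_patches, p*p*C)
    patches = image.reshape(H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)
    return patches @ weight  # (num_patches, d)

# Hypothetical sizes: a 32x32 RGB image, 16x16 patches, d = 64
img = np.random.rand(32, 32, 3)
W_e = np.random.rand(16 * 16 * 3, 64)  # learnable projection matrix
tokens = patchify_embed(img, 16, W_e)
print(tokens.shape)  # (4, 64): four patches, each a 64-dim token
```

With a 224x224 image and 16x16 patches this yields the familiar 14 x 14 = 196 tokens of the standard ViT.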
“…As is known, the conventional Transformer was initially designed to handle sequential data in NLP, so how the image is mapped to a patch sequence is vital for a vision transformer. ViT [24] directly splits the input image into 16 × 16 non-overlapping patches, while more recent works [40] find that convolution in the patch embedding contributes significantly to mapping the image to a higher-quality token sequence. Following existing works [21,26] that adopt overlapped patch embedding, we first use a 7 × 7 convolution layer with a stride of 2 as the first layer of the patch embedding, followed by an extra 3 × 3 convolution layer with a stride of 1.…”
Section: Patch Embedding
confidence: 99%
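The overlapped convolutional patch embedding described above (7 × 7 stride-2 conv followed by a 3 × 3 stride-1 conv) can be sketched with a naive numpy convolution to make the output shapes concrete. This is an illustrative sketch under assumed sizes (56 × 56 input, 32 channels), not the cited model's code.

```python
import numpy as np

def conv2d(x, w, stride, pad):
    """Naive 2D convolution. x: (H, W, Cin); w: (k, k, Cin, Cout)."""
    x = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    k = w.shape[0]
    H_out = (x.shape[0] - k) // stride + 1
    W_out = (x.shape[1] - k) // stride + 1
    out = np.empty((H_out, W_out, w.shape[3]))
    for i in range(H_out):
        for j in range(W_out):
            win = x[i * stride:i * stride + k, j * stride:j * stride + k, :]
            out[i, j] = np.tensordot(win, w, axes=3)  # sum over k, k, Cin
    return out

x = np.random.rand(56, 56, 3)           # hypothetical input size
w1 = np.random.rand(7, 7, 3, 32)        # 7x7 conv, stride 2, pad 3
w2 = np.random.rand(3, 3, 32, 32)       # 3x3 conv, stride 1, pad 1
y = conv2d(conv2d(x, w1, stride=2, pad=3), w2, stride=1, pad=1)
print(y.shape)  # (28, 28, 32): spatial size halved by the stride-2 conv
```

Because the 7 × 7 kernel slides with stride 2, adjacent patches overlap, unlike ViT's non-overlapping patchify stem.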
“…We adopt the idea of [62] to parameterize the ReLU function, which is extended into the scaled ReLU (sReLU) [64]: sReLU(x) = W_a ReLU(x), where W_a is the scaling matrix. To preserve the gradient stability of the adaptation process, we follow two design choices from [59]: (1) unlike [64], we do not parameterize the negative values; (2) W_a is initialized as an identity matrix and restricted to be diagonal.…”
Section: B Structural Transformation On ReLU
confidence: 99%
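The statement above can be sketched as a diagonal-scaled ReLU, assuming the form sReLU(x) = W_a ReLU(x) with W_a diagonal and initialized to identity (so it starts as plain ReLU). The function and variable names here are hypothetical, not from the cited paper.

```python
import numpy as np

def srelu(x, w_a):
    """Scaled ReLU: standard ReLU followed by a learnable scaling matrix.
    Negative inputs are zeroed (not parameterized, per design choice (1))."""
    return np.maximum(x, 0.0) @ w_a

d = 4
w_a = np.diag(np.ones(d))  # identity init, restricted to diagonal (choice (2))
x = np.array([[-1.0, 0.5, -2.0, 3.0]])
print(srelu(x, w_a))  # at identity init this equals plain ReLU: [[0. 0.5 0. 3.]]
```

Restricting W_a to a diagonal keeps the scaling per-channel, and the identity initialization means the transformed network starts from the behaviour of the original ReLU network.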