2021
DOI: 10.48550/arxiv.2102.12871
Preprint

SparseBERT: Rethinking the Importance Analysis in Self-attention

Han Shi, Jiahui Gao, Xiaozhe Ren, et al.

Abstract: Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity. As the core component, the self-attention module has aroused widespread interest. Attention map visualization of a pre-trained model is one direct method for understanding the self-attention mechanism, and some common patterns are observed in these visualizations. Based on these patterns, a series of efficient transformers have been proposed with corresponding sparse attention masks. Besides the above empirical results, univers…
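As a hedged illustration of the mechanism the abstract describes (not the authors' released code), the sketch below applies a fixed sparse attention mask inside scaled dot-product self-attention: positions disallowed by the mask are set to -inf before the softmax, so they receive zero attention weight. The function name, tensor shapes, and the banded example mask are illustrative assumptions.

import math
import torch

def masked_self_attention(q, k, v, mask):
    # q, k, v: (seq_len, d); mask: (seq_len, seq_len) with 1 = keep, 0 = drop.
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)        # raw attention scores
    scores = scores.masked_fill(mask == 0, float("-inf"))  # disallow masked positions
    attn = torch.softmax(scores, dim=-1)                   # sparse attention map
    return attn @ v

seq_len, d = 8, 16
q = k = v = torch.randn(seq_len, d)
# One common sparse pattern: a local (banded) mask in which each token
# attends only to itself and its immediate neighbours.
band_mask = torch.ones(seq_len, seq_len).tril(1).triu(-1)
out = masked_self_attention(q, k, v, band_mask)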

Cited by 3 publications (4 citation statements)
References 22 publications

“…In this section, we compare our work with some existing theoretical results on the transformer model [13,14,15,10]. Since these works use similar methods to those in [13], we focus on the theoretical contributions of this paper.…”
Section: Comparison and Discussion
confidence: 99%
“…Later, [14] provides a unified framework to analyze sparse transformer models. Recently, [10] studies the significance of different positions in the attention matrix during pre-training and shows that diagonal elements in the attention map are the least important compared with other attention positions. From a statistical machine learning point of view, the authors in [4] propose a classifier based on a transformer model and show that this classifier can circumvent the curse of dimensionality.…”
Section: Introduction
confidence: 99%
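The statement above attributes to [10] (SparseBERT) the finding that diagonal elements of the attention map are the least important positions. A minimal, hedged sketch of the corresponding mask, reusing the illustrative masked_self_attention helper from the earlier snippet, might look like this:

import torch

seq_len = 8
# Allow attention everywhere except the diagonal: each token attends to all
# other tokens but not to itself.
diag_free_mask = 1 - torch.eye(seq_len)
# out = masked_self_attention(q, k, v, diag_free_mask)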
“…A number of efficient Transformer variants have been proposed to mitigate the quadratic complexity of self-attention (Child et al., 2019; Beltagy et al., 2020; Zaheer et al., 2020; Shi et al., 2021). One straightforward way to exploit the intrinsic redundancy in attention is forming sparse patterns as in…”
Section: Intrinsic Sparsity In Attention Weights
confidence: 99%
“…It greatly enhances the state-of-the-art (SOTA) for many tasks in natural language processing (NLP) and computer vision (CV), which are two major application fields of AI. Among them, transformer [34,9,19,29] has found its almost ubiquitous applications in the field of NLP due to its great advantage of long-range capture and parallelism capability compared to the previous prevalent recurrent neural network (RNN).…”
Section: Introduction
confidence: 99%