2021
DOI: 10.48550/arxiv.2102.12871
Preprint

SparseBERT: Rethinking the Importance Analysis in Self-attention

Han Shi, Jiahui Gao, Xiaozhe Ren, et al.

Abstract: Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity. As the core component, the self-attention module has aroused widespread interest. Attention map visualization of a pre-trained model is one direct method for understanding the self-attention mechanism, and some common patterns are observed in these visualizations. Based on these patterns, a series of efficient transformers have been proposed with corresponding sparse attention masks. Besides the above empirical results, univers…
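As a hedged illustration of the mechanism the abstract describes (not the authors' released code), the sketch below applies a fixed sparse attention mask inside scaled dot-product self-attention: positions disallowed by the mask are set to -inf before the softmax, so they receive zero attention weight. The function name, tensor shapes, and the banded example mask are illustrative assumptions.

import math
import torch

def masked_self_attention(q, k, v, mask):
    # q, k, v: (seq_len, d); mask: (seq_len, seq_len) with 1 = keep, 0 = drop.
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)        # raw attention scores
    scores = scores.masked_fill(mask == 0, float("-inf"))  # disallow masked positions
    attn = torch.softmax(scores, dim=-1)                   # sparse attention map
    return attn @ v

seq_len, d = 8, 16
q = k = v = torch.randn(seq_len, d)
# One common sparse pattern: a local (banded) mask in which each token
# attends only to itself and its immediate neighbours.
band_mask = torch.ones(seq_len, seq_len).tril(1).triu(-1)
out = masked_self_attention(q, k, v, band_mask)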

Cited by 3 publications (4 citation statements)
References 22 publications

“…In this section, we compare our work with some existing theoretical results on the transformer model [13,14,15,10]. Since these works use similar methods to those in [13], we focus on the theoretical contributions of this paper.…”
Section: Comparison and Discussion
confidence: 99%
“…Later, [14] provides a unified framework to analyze sparse transformer models. Recently, [10] studies the significance of different positions in the attention matrix during pre-training and shows that diagonal elements in the attention map are the least important compared with other attention positions. From a statistical machine learning point of view, the authors in [4] propose a classifier based on a transformer model and show that this classifier can circumvent the curse of dimensionality.…”
Section: Introduction
confidence: 99%
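The statement above attributes to [10] (SparseBERT) the finding that diagonal elements of the attention map are the least important positions. A minimal, hedged sketch of the corresponding mask, reusing the illustrative masked_self_attention helper from the earlier snippet, might look like this:

import torch

seq_len = 8
# Allow attention everywhere except the diagonal: each token attends to all
# other tokens but not to itself.
diag_free_mask = 1 - torch.eye(seq_len)
# out = masked_self_attention(q, k, v, diag_free_mask)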
“…A number of efficient Transformer variants have been proposed to mitigate the quadratic complexity of self-attention (Child et al., 2019; Beltagy et al., 2020; Zaheer et al., 2020; Shi et al., 2021). One straightforward way to exploit the intrinsic redundancy in attention is forming sparse patterns as in…”
Section: Intrinsic Sparsity In Attention Weights
confidence: 99%
“…It greatly enhances the state-of-the-art (SOTA) for many tasks in natural language processing (NLP) and computer vision (CV), which are two major application fields of AI. Among them, transformer [34,9,19,29] has found its almost ubiquitous applications in the field of NLP due to its great advantage of long-range capture and parallelism capability compared to the previous prevalent recurrent neural network (RNN).…”
Section: Introduction
confidence: 99%