ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9415117
AttentionLite: Towards Efficient Self-Attention Models for Vision

Abstract: We propose a novel framework for producing a class of parameter- and compute-efficient models, called AttentionLite, suitable for resource-constrained applications. Prior work has primarily focused on optimizing models via either knowledge distillation or pruning. In addition to fusing these two mechanisms, our joint optimization framework also leverages recent advances in self-attention as a substitute for convolutions. We can simultaneously distill knowledge from a compute-heavy teacher while also pruning the s…
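The joint optimization the abstract describes can be pictured as a single training loop that combines a distillation loss with a pruning mask. The sketch below is a minimal PyTorch illustration of that idea under generic assumptions (magnitude pruning, a Hinton-style KD loss, and the weighting `alpha` are illustrative choices, not the paper's exact recipe).

```python
# Minimal sketch of jointly pruning a student while distilling from a teacher.
# The pruning criterion, loss weighting, and helper names are assumptions.
import torch
import torch.nn.functional as F

def magnitude_masks(model, sparsity=0.5):
    """Per-layer binary masks that keep the largest-magnitude weights."""
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() > 1:  # prune only weight tensors, not biases
            k = max(1, int(p.numel() * sparsity))
            threshold = p.abs().flatten().kthvalue(k).values
            masks[name] = (p.abs() > threshold).float()
    return masks

def distill_step(student, teacher, masks, x, y, T=4.0, alpha=0.9):
    """One step: CE on labels + KL to the teacher, with pruned weights masked out."""
    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                  F.softmax(t_logits / T, dim=1),
                  reduction="batchmean") * T * T
    ce = F.cross_entropy(s_logits, y)
    loss = alpha * kd + (1 - alpha) * ce
    loss.backward()
    with torch.no_grad():
        for name, p in student.named_parameters():
            if name in masks:
                p.mul_(masks[name])          # keep pruned weights at zero
                if p.grad is not None:
                    p.grad.mul_(masks[name])  # and stop gradients reviving them
    return loss
```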

Cited by 19 publications (5 citation statements). References 8 publications.
“…In particular, we initialize a PR model with the weights and mask of the best PR model of stage 2 and allow only the parameters to train. We train the PR model with distillation via a KL-divergence loss (Hinton et al., 2015; Kundu & Sundaresan, 2021) from a pre-trained AR model along with a CE loss. Moreover, we introduce an AR-PR post-ReLU activation mismatch (PRAM) penalty into the loss function.…”
Section: Maximizing Activation Similarity Via Distillation
confidence: 99%
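The loss in the statement above combines three terms: a temperature-softened KL divergence to the pre-trained AR teacher, a standard CE loss on the labels, and a penalty on the mismatch between post-ReLU activations of the AR and PR models. A hedged PyTorch sketch follows; the L2 form of the PRAM term and the weights `lambda_kd` / `lambda_pram` are assumptions for illustration, not the citing paper's exact formulation.

```python
# Sketch of KD (KL) + CE + post-ReLU activation mismatch (PRAM) loss.
# The mean-squared form of PRAM and the loss weights are assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(pr_logits, ar_logits, labels, pr_acts, ar_acts,
                      T=4.0, lambda_kd=1.0, lambda_pram=1.0):
    # Hinton-style KD: soften both distributions with temperature T.
    kd = F.kl_div(F.log_softmax(pr_logits / T, dim=1),
                  F.softmax(ar_logits / T, dim=1),
                  reduction="batchmean") * T * T
    ce = F.cross_entropy(pr_logits, labels)
    # PRAM: penalize mismatch between post-ReLU feature maps of the two models.
    pram = sum(F.mse_loss(F.relu(p), F.relu(a))
               for p, a in zip(pr_acts, ar_acts))
    return ce + lambda_kd * kd + lambda_pram * pram
```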
“…Various attention mechanisms have been successfully used in computer vision, especially in the field of semantic segmentation. Broadly, there are two kinds of attention mechanism: the soft-attention mechanism [19][20][21][22] and the self-attention mechanism [23][24][25][26]. In the soft-attention mechanism, channel attention and spatial attention are often used for the task of semantic segmentation.…”
Section: Attention Mechanism
confidence: 99%
“…As a result, the computational load of the model is greatly reduced, and the efficiency of the model is improved without losing too much accuracy. Recently, various attention mechanisms [19][20][21][22][23][24][25][26] have been successfully applied in many computer vision tasks. For example, SENet [19] and CBAM [20] show that weighting along the spatial and channel dimensions helps improve feature extraction.…”
Section: Introduction
confidence: 99%
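The channel weighting that SENet-style soft attention performs, mentioned in the statements above, is easy to sketch: pool each channel to a scalar, pass the result through a small bottleneck MLP with a sigmoid gate, and rescale the feature map. The layer sizes and reduction ratio below are illustrative, not taken from any cited paper.

```python
# Minimal squeeze-and-excitation style channel attention (soft attention).
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)       # "squeeze": global context per channel
        self.fc = nn.Sequential(                   # "excitation": per-channel gate
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                               # reweight channels

# Usage: gate = ChannelAttention(64); y = gate(torch.randn(2, 64, 32, 32))
```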
“…Similarly, when processing information, the attention mechanism focuses only on the regional information that helps accomplish the task, which not only describes the focus of the model but also improves the representation of features. To address the position insensitivity of convolution, AttentionLite [41] uses a self-attention mechanism instead of convolution, generating attention weights from trainable queries, keys, and values. It requires only a small number of parameters and can outperform comparable models in accuracy.…”
Section: Related Work
confidence: 99%
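The substitution described in that last statement, attention weights computed from trainable query, key, and value projections in place of a convolution, can be sketched as follows. This version attends over all spatial positions for simplicity; AttentionLite-style layers typically restrict attention to local windows, which this sketch does not reproduce.

```python
# Hedged sketch: a self-attention layer standing in for a convolution,
# with attention weights produced from learned query/key/value projections.
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, kernel_size=1)
        self.k = nn.Conv2d(channels, channels, kernel_size=1)
        self.v = nn.Conv2d(channels, channels, kernel_size=1)
        self.scale = channels ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)          # (b, h*w, c)
        k = self.k(x).flatten(2)                           # (b, c, h*w)
        v = self.v(x).flatten(2).transpose(1, 2)           # (b, h*w, c)
        attn = torch.softmax(q @ k * self.scale, dim=-1)   # (b, h*w, h*w)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return out

# Usage: layer = SpatialSelfAttention(64); y = layer(torch.randn(1, 64, 16, 16))
```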