2020
DOI: 10.48550/arxiv.2010.01791
Preprint

Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior

Cited by 5 publications (5 citation statements)
References 0 publications
“…It is known that many parameters in a neural network are redundant and can be pruned (Li et al., 2021; Lai et al., 2021). This has also been shown for pre-trained Transformers (Chen et al., 2020a; Lin et al., 2020; Gao et al., 2021b; Michel et al., 2019; Voita et al., 2019). A popular pruning method is to discard the parameters with small absolute values (Han et al., 2015; Guo et al., 2016).…”
Section: Domain-Adaptive Pre-training (DA-training)
confidence: 81%
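The magnitude-based pruning that the statement above attributes to Han et al. (2015) can be illustrated with a short sketch. The snippet below zeroes out the smallest-magnitude weights in every linear layer of a toy Transformer encoder layer using PyTorch's pruning utilities; the 30% sparsity level and the toy model are assumptions made for illustration, not settings taken from the cited papers or from this paper.

```python
# Minimal sketch of magnitude pruning: discard the weights with the
# smallest absolute values (Han et al., 2015). The 30% sparsity level
# and the toy encoder layer are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.TransformerEncoderLayer(d_model=256, nhead=4)  # toy example

for module in model.modules():
    if isinstance(module, nn.Linear):
        # Zero out the 30% of weights with the smallest |w|.
        prune.l1_unstructured(module, name="weight", amount=0.3)
        # Make the pruning permanent (removes the re-parametrization).
        prune.remove(module, "weight")

# Fraction of linear-layer weights that are now exactly zero.
zeros = sum((m.weight == 0).sum().item()
            for m in model.modules() if isinstance(m, nn.Linear))
total = sum(m.weight.numel()
            for m in model.modules() if isinstance(m, nn.Linear))
print(f"sparsity: {zeros / total:.2%}")
```

Note that unstructured magnitude pruning of this kind mainly reduces the nonzero parameter count; realizing actual speed-ups typically requires sparse kernels or structured pruning of whole heads or layers.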
“…Therefore, in this experiment, we investigate whether Robustly Optimized BERT Pretraining Approach (RoBERTa) [34]-based pre-training on the symbolic representations is useful for improving activity recognition performance. While RoBERTa can increase the computational footprint of the recognition system, it can potentially be replaced with recent advancements in distilling and pruning BERT models, such as SNIP [90], ALBERT [91], and DistilBERT [92], while maintaining similar performance.…”
Section: Results
confidence: 99%
“…This is promising, as advancements in NLP can also result in tandem improvements in sensor-based HAR. In resource-constrained situations, however, work on miniaturizing and pruning language models [90, 91, 92] can be employed to reduce model size while maintaining similar performance.…”
Section: Discussion
confidence: 99%
“…Winata et al. [5] constructed a lightweight but effective end-to-end speech recognition model using low-rank decomposition in Speech-Transformer [29]. In addition, common model compression methods include knowledge distillation [30, 31] and pruning [32], but both mostly require retraining the model, which complicates the training process. To the best of our knowledge, most studies have focused on the feed-forward and convolutional layers of the network, and few have examined the multi-head attention module, especially the multi-head self-attention module of the currently more advanced Conformer end-to-end speech recognition model.…”
Section: Related Work
confidence: 99%
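As a rough illustration of the low-rank decomposition mentioned in the statement above, the sketch below factorizes a single dense projection matrix into two thin matrices via a truncated SVD; the dimensions (d = 512, rank r = 64) and the use of a plain linear layer are assumptions for illustration and are not taken from Winata et al. or Speech-Transformer.

```python
# Sketch: replace a dense d x d projection with a rank-r factorization
# obtained from a truncated SVD. The dimensions (d=512, r=64) are
# illustrative assumptions, not values from the cited work.
import torch
import torch.nn as nn

d, r = 512, 64
dense = nn.Linear(d, d, bias=False)

# Truncated SVD of the trained weight matrix: W ~= (U_r * S_r) @ V_r^T.
U, S, Vh = torch.linalg.svd(dense.weight.data, full_matrices=False)
A = U[:, :r] * S[:r]   # d x r
B = Vh[:r, :]          # r x d

# Two thin linear layers replace the single dense one:
# the parameter count drops from d*d to 2*d*r.
low_rank = nn.Sequential(
    nn.Linear(d, r, bias=False),
    nn.Linear(r, d, bias=False),
)
low_rank[0].weight.data = B   # first map:  x -> B x  (r-dimensional)
low_rank[1].weight.data = A   # second map: B x -> A (B x)

x = torch.randn(8, d)
err = (dense(x) - low_rank(x)).norm() / dense(x).norm()
print(f"params: {d * d} -> {2 * d * r}, relative approx. error: {err:.3f}")
```

The same factorization can in principle be applied to the query, key, and value projections of a multi-head attention block, which is the component the quoted statement highlights as under-explored; whether the approximation error is acceptable depends on the chosen rank and is not addressed by this sketch.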