2020
DOI: 10.48550/arxiv.2010.01791
Preprint

Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior

Cited by 5 publications (5 citation statements)
References 0 publications
“…It is known that many parameters in a neural network are redundant and can be pruned (Li et al., 2021; Lai et al., 2021). This has also been shown for pre-trained Transformers (Chen et al., 2020a; Lin et al., 2020; Gao et al., 2021b; Michel et al., 2019; Voita et al., 2019). A popular pruning method is to discard the parameters with small absolute values (Han et al., 2015; Guo et al., 2016).…”
Section: Domain-Adaptive Pre-training (DA-training)
confidence: 81%
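The magnitude-based pruning that the statement above attributes to Han et al. (2015) can be illustrated with a short sketch. The snippet below zeroes out the smallest-magnitude weights in every linear layer of a toy Transformer encoder layer using PyTorch's pruning utilities; the 30% sparsity level and the toy model are assumptions made for illustration, not settings taken from the cited papers or from this paper.

```python
# Minimal sketch of magnitude pruning: discard the weights with the
# smallest absolute values (Han et al., 2015). The 30% sparsity level
# and the toy encoder layer are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.TransformerEncoderLayer(d_model=256, nhead=4)  # toy example

for module in model.modules():
    if isinstance(module, nn.Linear):
        # Zero out the 30% of weights with the smallest |w|.
        prune.l1_unstructured(module, name="weight", amount=0.3)
        # Make the pruning permanent (removes the re-parametrization).
        prune.remove(module, "weight")

# Fraction of linear-layer weights that are now exactly zero.
zeros = sum((m.weight == 0).sum().item()
            for m in model.modules() if isinstance(m, nn.Linear))
total = sum(m.weight.numel()
            for m in model.modules() if isinstance(m, nn.Linear))
print(f"sparsity: {zeros / total:.2%}")
```

Note that unstructured magnitude pruning of this kind mainly reduces the nonzero parameter count; realizing actual speed-ups typically requires sparse kernels or structured pruning of whole heads or layers.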
“…Therefore, in this experiment, we investigate whether Robustly Optimized BERT Pretraining Approach (RoBERTa) [34]-based pre-training on the symbolic representations is useful for improving activity recognition performance. While RoBERTa can increase the computational footprint of the recognition system, it can potentially be replaced with recent advancements in distilling and pruning BERT models, such as SNIP [90], ALBERT [91], and DistilBERT [92], while maintaining similar performance.…”
Section: Results
confidence: 99%
“…This is promising, as advancements in NLP can also result in tandem improvements in sensor-based HAR. In resource-constrained situations, however, work on miniaturizing and pruning language models [90, 91, 92] can be employed to reduce model size while maintaining similar performance.…”
Section: Discussion
confidence: 99%
“…Winata et al. [5] constructed a lightweight but effective end-to-end speech recognition model using low-rank decomposition in Speech-Transformer [29]. In addition, common model compression methods include knowledge distillation [30, 31] and pruning [32], but both mostly require retraining the model, which complicates the training process. To the best of our knowledge, most studies have focused on the feed-forward and convolutional layers of the network, and few have examined the multi-head attention module, especially the multi-head self-attention module of the currently more advanced Conformer end-to-end speech recognition model.…”
Section: Related Work
confidence: 99%
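As a rough illustration of the low-rank decomposition mentioned in the statement above, the sketch below factorizes a single dense projection matrix into two thin matrices via a truncated SVD; the dimensions (d = 512, rank r = 64) and the use of a plain linear layer are assumptions for illustration and are not taken from Winata et al. or Speech-Transformer.

```python
# Sketch: replace a dense d x d projection with a rank-r factorization
# obtained from a truncated SVD. The dimensions (d=512, r=64) are
# illustrative assumptions, not values from the cited work.
import torch
import torch.nn as nn

d, r = 512, 64
dense = nn.Linear(d, d, bias=False)

# Truncated SVD of the trained weight matrix: W ~= (U_r * S_r) @ V_r^T.
U, S, Vh = torch.linalg.svd(dense.weight.data, full_matrices=False)
A = U[:, :r] * S[:r]   # d x r
B = Vh[:r, :]          # r x d

# Two thin linear layers replace the single dense one:
# the parameter count drops from d*d to 2*d*r.
low_rank = nn.Sequential(
    nn.Linear(d, r, bias=False),
    nn.Linear(r, d, bias=False),
)
low_rank[0].weight.data = B   # first map:  x -> B x  (r-dimensional)
low_rank[1].weight.data = A   # second map: B x -> A (B x)

x = torch.randn(8, d)
err = (dense(x) - low_rank(x)).norm() / dense(x).norm()
print(f"params: {d * d} -> {2 * d * r}, relative approx. error: {err:.3f}")
```

The same factorization can in principle be applied to the query, key, and value projections of a multi-head attention block, which is the component the quoted statement highlights as under-explored; whether the approximation error is acceptable depends on the chosen rank and is not addressed by this sketch.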