2022 · Preprint
DOI: 10.48550/arxiv.2204.00408

Structured Pruning Learns Compact and Accurate Models

Abstract: The growing size of neural language models has led to increased attention to model compression. The two predominant approaches are pruning, which gradually removes weights from a pre-trained model, and distillation, which trains a smaller compact model to match a larger one. Pruning methods can significantly reduce the model size but rarely achieve speedups as large as distillation does. Distillation methods, however, require large amounts of unlabeled data and are expensive to train. In this work, we propose a task-spe…
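Since the abstract contrasts the two compression routes only in prose, here is a minimal, illustrative sketch of each; the function names, the L2-norm importance heuristic, and the temperature value are assumptions for illustration, not the paper's method:

```python
import torch
import torch.nn.functional as F

def prune_rows(linear: torch.nn.Linear, keep_ratio: float) -> None:
    """Structured pruning sketch: zero out whole output units (rows) in place.

    Importance here is each row's L2 norm -- a simplistic stand-in for the
    learned masks used in the paper.
    """
    scores = linear.weight.norm(dim=1)            # one score per output unit
    k = max(1, int(keep_ratio * scores.numel()))
    mask = torch.zeros_like(scores)
    mask[scores.topk(k).indices] = 1.0
    with torch.no_grad():
        linear.weight.mul_(mask.unsqueeze(1))     # zero the pruned rows
        if linear.bias is not None:
            linear.bias.mul_(mask)

def distill_step(student, teacher, x, optimizer, T: float = 2.0) -> float:
    """Distillation sketch: one step matching the teacher's softened logits."""
    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)
    loss = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                    F.softmax(t_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The trade-off the abstract describes follows directly: pruning edits the trained model in place, while distillation must train the small model from the teacher's outputs, which is why it needs large amounts of data and compute.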

Cited by 8 publications (19 citation statements) · References 29 publications

“…To handle the corner case in which all structures in a module are pruned, we skip the module by passing its input through as its output. While we could switch to a quite recent pruning method [22] that exploits both coarse-grained and fine-grained strategies for state-of-the-art performance, we argue that our framework is agnostic to the pruning method, so we keep the pruning method simple.…”
Section: A. Technical Details of Pruning
confidence: 99%
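A minimal sketch of the corner case handled above (the wrapper class and flag name are hypothetical, not from the citing work):

```python
import torch

class PrunableSublayer(torch.nn.Module):
    """Wraps a module so it degenerates to the identity once fully pruned."""

    def __init__(self, inner: torch.nn.Module):
        super().__init__()
        self.inner = inner
        self.fully_pruned = False  # set True when every structure inside is removed

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.fully_pruned:
            return x  # skip the module: feed the input through as the output
        return self.inner(x)
```

This identity shortcut only makes sense for sub-layers whose input and output shapes match, e.g. residual Transformer blocks.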
“…A number of approaches have been proposed to identify a good structure at a given scale, including dynamic search [12], layer dropping [21], and pruning [11]. In this work, we adopt pruning to assign structures A_k to the candidates because of its known advantages in knowledge distillation [22]. Concretely, following previous work [11], pruning starts from the least important parameters/features, ranked by importance scores that are approximated by masking the parameterized structures.…”
Section: Specification
confidence: 99%
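The masking-based importance estimate mentioned above can be sketched as a gradient taken through a mask fixed at 1; the function name and the MSE loss below are illustrative stand-ins, not the cited paper's implementation:

```python
import torch
import torch.nn.functional as F

def mask_importance(structure_outputs: list, target: torch.Tensor) -> list:
    """Score each structure by |dL/dm|, where m is a mask initialized to 1."""
    masks = [torch.ones((), requires_grad=True) for _ in structure_outputs]
    combined = sum(m * out for m, out in zip(masks, structure_outputs))
    loss = F.mse_loss(combined, target)
    grads = torch.autograd.grad(loss, masks)
    return [g.abs().item() for g in grads]  # lower score = pruned earlier

# Example: three hypothetical structure outputs scored against a random target.
outs = [torch.randn(4, 8) for _ in range(3)]
scores = mask_importance(outs, target=torch.randn(4, 8))
print(scores)  # prune the structure with the smallest score first
```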