2022
DOI: 10.48550/arxiv.2204.09656
Preprint

A Fast Post-Training Pruning Framework for Transformers

Cited by 2 publications (2 citation statements)
References 0 publications
“…We study the performance of ZipLM when applied purely in one-shot, without any retraining. In this setting we compare against the recently proposed state-of-the-art method of Kwon et al. (2022), which is based on several heuristics: Fisher-based mask search, mask rearrangement, and mask tuning. More accurate versions of some of those aspects arise naturally in our pruning framework.…”
Section: Additional Validation (mentioning)
confidence: 99%
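The Kwon et al. (2022) method referenced above scores prunable units (such as attention heads) with a diagonal empirical Fisher approximation before searching for a mask. As a rough illustration only, the sketch below shows one way such Fisher-based importance scores could be computed in PyTorch; `model`, `loss_fn`, `dataloader`, and the `head_masks` keyword argument are hypothetical placeholders, not the paper's actual interface.

```python
import torch

def fisher_head_importance(model, loss_fn, dataloader, num_layers, num_heads, device="cpu"):
    """Hypothetical sketch: diagonal empirical Fisher importance for attention-head masks.

    Assumes the model accepts a (num_layers, num_heads) mask tensor that scales
    each head's output; this interface is an illustration, not the authors' API.
    """
    head_masks = torch.ones(num_layers, num_heads, device=device, requires_grad=True)
    importance = torch.zeros(num_layers, num_heads, device=device)
    for inputs, labels in dataloader:
        inputs, labels = inputs.to(device), labels.to(device)
        loss = loss_fn(model(inputs, head_masks=head_masks), labels)
        (grad,) = torch.autograd.grad(loss, head_masks)
        # Diagonal empirical Fisher: accumulate the squared gradient of the
        # mini-batch loss with respect to each mask variable.
        importance += grad.detach() ** 2
    return importance / max(len(dataloader), 1)
```

A mask search would then keep the highest-scoring heads under a latency or FLOP budget; this search, together with mask rearrangement and mask tuning, is the part the ZipLM comparison above characterizes as heuristic.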
“…Introducing sparsity can reduce memory consumption and accelerate inference [35]. Pruning has also been used as an approach to reduce inference cost [71]. Quantization and sparse updates can reduce the training cost [85].…”
Section: Introduction, 1.1 Background (mentioning)
confidence: 99%