2021
DOI: 10.48550/arxiv.2111.05754
Preprint

Prune Once for All: Sparse Pre-Trained Language Models

Cited by 11 publications (12 citation statements)
References 0 publications

“…Chen et al (2020a) show a 70%-sparsity model retains the MLM accuracy produced by iterative magnitude pruning. Zafrir et al (2021) show the potential advantage of upstream unstructured pruning against downstream pruning. We consider applying CoFi for upstream pruning as a promising future direction to produce task-agnostic models with flexible structures.…”
Section: Related Work
confidence: 97%
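The upstream pruning referenced in this statement builds on iterative magnitude pruning: the smallest-magnitude weights are zeroed out, the model is trained further, and the process repeats until a target sparsity is reached. Below is a minimal sketch of one pruning step in PyTorch; the function name, layer, and sparsity schedule are illustrative assumptions, not the Prune OFA or CoFi implementation.

```python
import torch
import torch.nn as nn

def magnitude_prune(layer: nn.Linear, sparsity: float) -> torch.Tensor:
    """Zero the smallest-magnitude weights of a linear layer.

    Returns the binary mask so pruned weights can be kept at zero
    during subsequent training. Illustrative sketch only.
    """
    weight = layer.weight.data
    k = int(sparsity * weight.numel())              # how many weights to drop
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = (weight.abs() > threshold).float()       # 1 = keep, 0 = prune
    weight.mul_(mask)
    return mask

# Iterative schedule: raise sparsity gradually with training in between,
# e.g. 50% -> 60% -> 70%, as in typical iterative magnitude pruning.
layer = nn.Linear(768, 768)
for target in (0.5, 0.6, 0.7):
    mask = magnitude_prune(layer, target)
    # ... continue (pre-)training here, masking gradients with `mask` ...
```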
“…Second, we implement a multi-segment variant of late interaction [19] when adapting M6-Rec to tasks that require low-latency real-time inference, where most of the heavy computation is pre-computed offline and cached. Finally, to make M6-Rec deployable on edge devices such as mobile phones, we further employ techniques such as parameter sharing [20], pruning [58], quantization [57], and early-exiting [16,48] to reduce the model size and accelerate the inference speed. In summary, our main contributions are:…”
Section: Attention Mask
confidence: 99%
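For context, late interaction defers the query-item interaction to a cheap final scoring step, so per-item token embeddings can be pre-computed offline and cached, as the statement above describes. The sketch below shows a ColBERT-style MaxSim scoring step under that assumption; the tensor shapes and the cache are placeholders, not the M6-Rec code.

```python
import torch

def late_interaction_score(query_emb: torch.Tensor,
                           cached_item_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim: heavy encoding happens offline, scoring is cheap.

    query_emb:       (query_tokens, dim), encoded online per request
    cached_item_emb: (item_tokens, dim), pre-computed offline and cached
    """
    sim = query_emb @ cached_item_emb.T    # token-level similarity matrix
    return sim.max(dim=1).values.sum()     # max over item tokens, sum over query tokens

# Offline: encode items once and cache the results (random placeholders here).
item_cache = {item_id: torch.randn(64, 128) for item_id in range(100)}

# Online: encode the query once, then score every cached candidate cheaply.
query_emb = torch.randn(16, 128)
scores = {i: late_interaction_score(query_emb, emb) for i, emb in item_cache.items()}
```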
“…Reducing the model size keeps hardware costs down and is mandated for resource-limited edge devices. Many strategies have been explored, e.g., parameter sharing [20], distillation [17,41,47,50], pruning [5,12,58], and quantization [57]. Still, the existing tiny language models usually have over 10M parameters, while we estimate that it needs to be around 2M to avoid degrading the user experience when deploying a model to our users' mobile phones.…”
Section: Efficient Language Foundations
confidence: 99%
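Among the size-reduction strategies listed in this statement, post-training quantization is the most mechanical to apply. The sketch below uses PyTorch dynamic quantization on an example BERT checkpoint; the checkpoint name and the size comparison are illustrative, and actual savings depend on the model.

```python
import io
import torch
from transformers import AutoModel

# Post-training dynamic quantization: nn.Linear weights are stored in int8
# and dequantized on the fly at inference time.
model = AutoModel.from_pretrained("bert-base-uncased")   # example checkpoint
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def serialized_mb(m: torch.nn.Module) -> float:
    """Size of the serialized state dict, a rough proxy for on-disk footprint."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"fp32 model: {serialized_mb(model):.0f} MB")
print(f"int8 linear layers: {serialized_mb(quantized):.0f} MB")
```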
“…Different architectures have been explored in this respect, choosing to use an extractive open-QA model (the answers come strictly from the context), Intel/bert-large-uncased-squadv1.1sparse-80-1x4-block-pruneofa [41], for the experiments (with an F1 score of 91.174 on SQuAD v1.1). Some significant tests have been carried out on this model to validate the possibilities of this new interaction.…”
Section: NLP Using Transformers With Questions and Answers
confidence: 99%
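The checkpoint named in this statement is distributed through the Hugging Face Hub, so the extractive question-answering setup it describes can be reproduced with the standard transformers pipeline. A minimal sketch follows; the model identifier is copied verbatim from the citation (the exact Hub name may be hyphenated differently), and the question and context are illustrative.

```python
from transformers import pipeline

# Extractive QA with the 80%-sparse Prune OFA BERT-large checkpoint cited above.
# Identifier copied from the citation; verify the exact Hub name before running.
qa = pipeline(
    "question-answering",
    model="Intel/bert-large-uncased-squadv1.1sparse-80-1x4-block-pruneofa",
)

result = qa(
    question="What does Prune Once for All produce?",   # illustrative question
    context=(
        "Prune Once for All produces sparse pre-trained language models that "
        "can be fine-tuned on downstream tasks while keeping the sparsity "
        "pattern fixed."
    ),
)
print(result["answer"], result["score"])
```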