Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.36

Contrastive Distillation on Intermediate Representations for Language Model Compression

Abstract: Existing language model compression methods mostly use a simple L2 loss to distill knowledge in the intermediate representations of a large BERT model to a smaller one. Although widely used, this objective by design assumes that all the dimensions of hidden representations are independent, failing to capture important structural knowledge in the intermediate layers of the teacher network. To achieve better distillation efficacy, we propose Contrastive Distillation on Intermediate Representations (CODIR), a pr…
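To make the contrast concrete, here is a minimal PyTorch sketch, not the paper's exact formulation, of the plain L2 intermediate-layer distillation loss versus an InfoNCE-style contrastive objective that pulls a student representation toward its teacher counterpart and pushes it away from teacher representations of other samples. The function names, pooling assumptions, and temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def l2_distill_loss(student_h, teacher_h):
    """Plain L2 distillation: match each hidden dimension independently."""
    return F.mse_loss(student_h, teacher_h)

def contrastive_distill_loss(student_h, teacher_h, neg_teacher_h, temperature=0.1):
    """InfoNCE-style sketch: the student representation is pulled toward the
    teacher representation of the same input (positive) and pushed away from
    teacher representations of other samples (negatives).

    student_h:     (batch, dim)        pooled student intermediate representation
    teacher_h:     (batch, dim)        pooled teacher representation of the same inputs
    neg_teacher_h: (batch, n_neg, dim) teacher representations of other samples
    """
    s = F.normalize(student_h, dim=-1)
    t_pos = F.normalize(teacher_h, dim=-1)
    t_neg = F.normalize(neg_teacher_h, dim=-1)

    pos_logit = (s * t_pos).sum(-1, keepdim=True) / temperature       # (batch, 1)
    neg_logits = torch.einsum("bd,bnd->bn", s, t_neg) / temperature   # (batch, n_neg)

    logits = torch.cat([pos_logit, neg_logits], dim=1)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)  # positive sits at index 0
```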

Cited by 32 publications (37 citation statements) | References 21 publications

“…As a variant of self-supervised representation learning, contrastive learning aims to explore the potential supervisory signals from the samples for model training, which is widely used in recent NLP tasks for representation learning [31][32][33][34][35][36][37][38]. In detail, the objective of the contrastive loss function is to pull neighbors together and push non-neighbors apart [39,40].…”
Section: Knowledge-aware Methods vs Contrastive Learning Methods for NLP
confidence: 99%
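As a rough illustration of the pull/push objective described in the quoted passage, the classic pairwise contrastive loss can be sketched as follows. This is a minimal sketch; the margin value and function name are assumptions, not taken from the cited works.

```python
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(z1, z2, is_neighbor, margin=1.0):
    """Pairwise contrastive loss: pull neighbor pairs together by shrinking
    their distance, and push non-neighbor pairs at least `margin` apart.

    z1, z2:      (batch, dim) representations of the two items in each pair
    is_neighbor: (batch,) 1.0 if the pair should be pulled together, else 0.0
    """
    dist = F.pairwise_distance(z1, z2)                          # (batch,)
    pull = is_neighbor * dist.pow(2)                            # neighbors: minimize distance
    push = (1.0 - is_neighbor) * F.relu(margin - dist).pow(2)   # non-neighbors: enforce margin
    return 0.5 * (pull + push).mean()
```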
“…While most existing methods create an injective mapping from the student encoder units to the teacher, [another work] instead proposed a way to build a many-to-many mapping for a better flow of information. One can also completely bypass the mapping by combining all outputs into one single representation vector (Sun et al, 2020a).…”
Section: Knowledge Distillation
confidence: 99%
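The "single representation vector" idea mentioned in the quote above can be illustrated roughly as below. The mean-pool-then-concatenate choice is an assumption made for illustration and is not necessarily how Sun et al. (2020a) combine the intermediate outputs.

```python
import torch

def pool_all_layers(hidden_states):
    """Combine per-layer hidden states into one summary vector, avoiding any
    explicit student-to-teacher layer mapping.

    hidden_states: list of (batch, seq_len, dim) tensors, one per encoder layer
    returns:       (batch, num_layers * dim) concatenation of mean-pooled layers
    """
    pooled = [h.mean(dim=1) for h in hidden_states]   # mean-pool over the token axis
    return torch.cat(pooled, dim=-1)                  # stack all layers side by side
```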
“…ALBERT (Lan et al, 2020) introduced cross-layer parameter sharing and low-rank approximation to reduce the number of parameters. More studies (Jiao et al, 2020;Hou et al, 2020;Khetan and Karnin, 2020;Pappas et al, 2020;Sun et al, 2020a) can be found in the comprehensive survey (Ganesh et al, 2020).…”
Section: Pre-trained Language Model Compression
confidence: 99%
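As a toy sketch of the two ALBERT techniques named in the quote, a factorized low-rank embedding and cross-layer parameter sharing, the following PyTorch module reuses one encoder layer across all depth steps. The class name and hyperparameter values are illustrative assumptions rather than ALBERT's actual configuration.

```python
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Toy ALBERT-style encoder: a factorized embedding (vocab -> small rank
    -> hidden size) plus a single transformer layer reused at every depth."""

    def __init__(self, vocab_size=30000, rank=128, hidden=768, depth=12, heads=12):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, rank)      # low-rank embedding table
        self.project = nn.Linear(rank, hidden)           # project up to the hidden size
        self.layer = nn.TransformerEncoderLayer(         # one set of layer weights...
            d_model=hidden, nhead=heads, batch_first=True)
        self.depth = depth                               # ...applied `depth` times

    def forward(self, input_ids):
        x = self.project(self.embed(input_ids))
        for _ in range(self.depth):                      # cross-layer parameter sharing
            x = self.layer(x)
        return x
```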