Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.36

Contrastive Distillation on Intermediate Representations for Language Model Compression

Abstract: Existing language model compression methods mostly use a simple L2 loss to distill knowledge in the intermediate representations of a large BERT model to a smaller one. Although widely used, this objective by design assumes that all the dimensions of hidden representations are independent, failing to capture important structural knowledge in the intermediate layers of the teacher network. To achieve better distillation efficacy, we propose Contrastive Distillation on Intermediate Representations (CODIR), a pr…
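To make the contrast concrete, here is a minimal PyTorch sketch, not the paper's exact formulation, of the plain L2 intermediate-layer distillation loss versus an InfoNCE-style contrastive objective that pulls a student representation toward its teacher counterpart and pushes it away from teacher representations of other samples. The function names, pooling assumptions, and temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def l2_distill_loss(student_h, teacher_h):
    """Plain L2 distillation: match each hidden dimension independently."""
    return F.mse_loss(student_h, teacher_h)

def contrastive_distill_loss(student_h, teacher_h, neg_teacher_h, temperature=0.1):
    """InfoNCE-style sketch: the student representation is pulled toward the
    teacher representation of the same input (positive) and pushed away from
    teacher representations of other samples (negatives).

    student_h:     (batch, dim)        pooled student intermediate representation
    teacher_h:     (batch, dim)        pooled teacher representation of the same inputs
    neg_teacher_h: (batch, n_neg, dim) teacher representations of other samples
    """
    s = F.normalize(student_h, dim=-1)
    t_pos = F.normalize(teacher_h, dim=-1)
    t_neg = F.normalize(neg_teacher_h, dim=-1)

    pos_logit = (s * t_pos).sum(-1, keepdim=True) / temperature       # (batch, 1)
    neg_logits = torch.einsum("bd,bnd->bn", s, t_neg) / temperature   # (batch, n_neg)

    logits = torch.cat([pos_logit, neg_logits], dim=1)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)  # positive sits at index 0
```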

Cited by 32 publications (37 citation statements) | References 21 publications

“…As a variant of self-supervised representation learning, contrastive learning aims to explore the potential supervisory signals from the samples for model training, which is widely used in recent NLP tasks for representation learning [31][32][33][34][35][36][37][38]. In detail, the objective of the contrastive loss function is to pull neighbors together and push non-neighbors apart [39,40].…”
Section: Knowledge-aware Methods vs Contrastive Learning Methods for NLP
confidence: 99%
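As a rough illustration of the pull/push objective described in the quoted passage, the classic pairwise contrastive loss can be sketched as follows. This is a minimal sketch; the margin value and function name are assumptions, not taken from the cited works.

```python
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(z1, z2, is_neighbor, margin=1.0):
    """Pairwise contrastive loss: pull neighbor pairs together by shrinking
    their distance, and push non-neighbor pairs at least `margin` apart.

    z1, z2:      (batch, dim) representations of the two items in each pair
    is_neighbor: (batch,) 1.0 if the pair should be pulled together, else 0.0
    """
    dist = F.pairwise_distance(z1, z2)                          # (batch,)
    pull = is_neighbor * dist.pow(2)                            # neighbors: minimize distance
    push = (1.0 - is_neighbor) * F.relu(margin - dist).pow(2)   # non-neighbors: enforce margin
    return 0.5 * (pull + push).mean()
```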
“…While most existing methods create an injective mapping from the student encoder units to the teacher, [another work] instead proposed a way to build a many-to-many mapping for a better flow of information. One can also completely bypass the mapping by combining all outputs into one single representation vector (Sun et al, 2020a).…”
Section: Knowledge Distillation
confidence: 99%
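The "single representation vector" idea mentioned in the quote above can be illustrated roughly as below. The mean-pool-then-concatenate choice is an assumption made for illustration and is not necessarily how Sun et al. (2020a) combine the intermediate outputs.

```python
import torch

def pool_all_layers(hidden_states):
    """Combine per-layer hidden states into one summary vector, avoiding any
    explicit student-to-teacher layer mapping.

    hidden_states: list of (batch, seq_len, dim) tensors, one per encoder layer
    returns:       (batch, num_layers * dim) concatenation of mean-pooled layers
    """
    pooled = [h.mean(dim=1) for h in hidden_states]   # mean-pool over the token axis
    return torch.cat(pooled, dim=-1)                  # stack all layers side by side
```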
“…ALBERT (Lan et al, 2020) introduced cross-layer parameter sharing and low-rank approximation to reduce the number of parameters. More studies (Jiao et al, 2020;Hou et al, 2020;Khetan and Karnin, 2020;Pappas et al, 2020;Sun et al, 2020a) can be found in the comprehensive survey (Ganesh et al, 2020).…”
Section: Pre-trained Language Model Compression
confidence: 99%
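As a toy sketch of the two ALBERT techniques named in the quote, a factorized low-rank embedding and cross-layer parameter sharing, the following PyTorch module reuses one encoder layer across all depth steps. The class name and hyperparameter values are illustrative assumptions rather than ALBERT's actual configuration.

```python
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Toy ALBERT-style encoder: a factorized embedding (vocab -> small rank
    -> hidden size) plus a single transformer layer reused at every depth."""

    def __init__(self, vocab_size=30000, rank=128, hidden=768, depth=12, heads=12):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, rank)      # low-rank embedding table
        self.project = nn.Linear(rank, hidden)           # project up to the hidden size
        self.layer = nn.TransformerEncoderLayer(         # one set of layer weights...
            d_model=hidden, nhead=heads, batch_first=True)
        self.depth = depth                               # ...applied `depth` times

    def forward(self, input_ids):
        x = self.project(self.embed(input_ids))
        for _ in range(self.depth):                      # cross-layer parameter sharing
            x = self.layer(x)
        return x
```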