Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.21
Calibration of Pre-trained Transformers

Abstract: Pre-trained Transformers are now ubiquitous in natural language processing, but despite their high end-task performance, little is known empirically about whether they are calibrated. Specifically, do these models' posterior probabilities provide an accurate empirical measure of how likely the model is to be correct on a given example? We focus on BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) in this work, and analyze their calibration across three tasks: natural language inference, paraphrase detection…
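The calibration question posed in the abstract is usually quantified with expected calibration error (ECE), the binned metric popularized by Guo et al. (2017). The sketch below is an illustrative implementation under that assumption, not code from the paper itself; all names are hypothetical.

```python
# Sketch: expected calibration error (ECE) over equal-width confidence bins.
# A model is well calibrated when, within each bin, its average confidence
# matches its empirical accuracy; ECE is the weighted gap between the two.
import numpy as np

def expected_calibration_error(confidences, correct, num_bins=10):
    """confidences: max softmax probability per example; correct: 0/1 per example."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        avg_conf = confidences[mask].mean()   # how confident the model was
        avg_acc = correct[mask].mean()        # how often it was actually right
        ece += mask.mean() * abs(avg_conf - avg_acc)
    return ece

# Toy usage: a perfectly calibrated model would score close to 0.
print(expected_calibration_error([0.9, 0.6, 0.8, 0.55], [1, 1, 0, 1]))
```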

Cited by 133 publications (155 citation statements). References 32 publications.
“…Previous works that purport to calibrate LMs (Desai and Durrett, 2020; Jagannatha and Yu, 2020) mainly focus on the former use case, using representations learned by LMs to predict target classes (for tasks such as natural language inference or part-of-speech tagging) or identify answer spans (for tasks such as extractive QA). In contrast, we focus on the latter case, calibrating LMs themselves by treating them as natural language generators that predict the next words given a particular input.…”
Section: LM-based Question Answering
confidence: 99%
“…Temperature-based scaling methods were first proposed on classification tasks (Guo et al., 2017; Desai and Durrett, 2020), where a positive scalar temperature hyperparameter τ is introduced in the final classification layer to make the probability distribution either more peaked or smoother: softmax(z/τ). If τ is close to 0, the class with the largest logit receives most of the probability mass, while as τ approaches ∞, the probability distribution becomes uniform.…”
Section: Post-hoc Calibration
confidence: 99%
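The temperature-scaled softmax described in the citation above can be written down directly. The following sketch (NumPy for convenience, illustrative names) shows how small and large values of τ sharpen or flatten the same logits.

```python
# Sketch of the temperature-scaled softmax: softmax(z / tau).
# Small tau concentrates mass on the largest logit; large tau flattens
# the distribution toward uniform.
import numpy as np

def softmax_with_temperature(logits, tau):
    z = np.asarray(logits, dtype=float) / tau
    z = z - z.max()                  # subtract max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])
for tau in (0.1, 1.0, 10.0):
    print(tau, softmax_with_temperature(logits, tau).round(3))
# tau = 0.1  -> almost all mass on the largest logit
# tau = 10.0 -> distribution approaches uniform
```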
“…One way to mitigate this concern is to use calibration, which encourages the confidence level to correspond to the probability that the model is correct (DeGroot and Fienberg, 1983). In this paper we use temperature calibration, a simple technique that has been shown to work well in practice (Guo et al., 2017), in particular for BERT fine-tuning (Desai and Durrett, 2020). The method learns a single parameter, denoted temperature or T, and divides each of the logits {z_i} by T before applying the softmax function:…”
Section: Premise: Models Vary in Size, Examples Vary in Complexity
confidence: 99%
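The single parameter T mentioned in this citation is fit on held-out data. A minimal sketch, assuming the common recipe from Guo et al. (2017) of choosing T to minimize validation negative log-likelihood; a grid search stands in for their optimizer, and all names below are hypothetical.

```python
# Sketch of learning a single temperature T on held-out logits by
# minimizing negative log-likelihood (NLL) of the gold labels.
import numpy as np

def nll_at_temperature(logits, labels, T):
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)                       # stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 91)):
    nlls = [nll_at_temperature(val_logits, val_labels, T) for T in grid]
    return grid[int(np.argmin(nlls))]

# Toy usage: artificially overconfident logits; the fitted T > 1 softens them.
rng = np.random.default_rng(0)
val_logits = rng.normal(size=(100, 3)) * 5.0
val_labels = rng.integers(0, 3, size=100)
print("fitted T:", fit_temperature(val_logits, val_labels))
```

At test time the learned T is applied to every example's logits before the softmax, which rescales confidences without changing the predicted class.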