Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.21
Calibration of Pre-trained Transformers

Abstract: Pre-trained Transformers are now ubiquitous in natural language processing, but despite their high end-task performance, little is known empirically about whether they are calibrated. Specifically, do these models' posterior probabilities provide an accurate empirical measure of how likely the model is to be correct on a given example? We focus on BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) in this work, and analyze their calibration across three tasks: natural language inference, paraphrase detection…
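The calibration question posed in the abstract is usually quantified with expected calibration error (ECE), the binned metric popularized by Guo et al. (2017). The sketch below is an illustrative implementation under that assumption, not code from the paper itself; all names are hypothetical.

```python
# Sketch: expected calibration error (ECE) over equal-width confidence bins.
# A model is well calibrated when, within each bin, its average confidence
# matches its empirical accuracy; ECE is the weighted gap between the two.
import numpy as np

def expected_calibration_error(confidences, correct, num_bins=10):
    """confidences: max softmax probability per example; correct: 0/1 per example."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        avg_conf = confidences[mask].mean()   # how confident the model was
        avg_acc = correct[mask].mean()        # how often it was actually right
        ece += mask.mean() * abs(avg_conf - avg_acc)
    return ece

# Toy usage: a perfectly calibrated model would score close to 0.
print(expected_calibration_error([0.9, 0.6, 0.8, 0.55], [1, 1, 0, 1]))
```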

Cited by 133 publications (155 citation statements). References 32 publications.
“…Previous works that purport to calibrate LMs (Desai and Durrett, 2020; Jagannatha and Yu, 2020) mainly focus on the former use case, using representations learned by LMs to predict target classes (for tasks such as natural language inference or part-of-speech tagging) or identify answer spans (for tasks such as extractive QA). In contrast, we focus on the latter case, calibrating LMs themselves by treating them as natural language generators that predict the next words given a particular input.…”
Section: LM-based Question Answering
confidence: 99%
“…Temperature-based scaling methods were first proposed on classification tasks (Guo et al., 2017; Desai and Durrett, 2020), where a positive scalar temperature hyperparameter τ is introduced in the final classification layer to make the probability distribution either more peaked or smoother: softmax(z/τ). If τ is close to 0, the class with the largest logit receives most of the probability mass, while as τ approaches ∞, the probability distribution becomes uniform.…”
Section: Post-hoc Calibration
confidence: 99%
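The temperature-scaled softmax described in the citation above can be written down directly. The following sketch (NumPy for convenience, illustrative names) shows how small and large values of τ sharpen or flatten the same logits.

```python
# Sketch of the temperature-scaled softmax: softmax(z / tau).
# Small tau concentrates mass on the largest logit; large tau flattens
# the distribution toward uniform.
import numpy as np

def softmax_with_temperature(logits, tau):
    z = np.asarray(logits, dtype=float) / tau
    z = z - z.max()                  # subtract max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])
for tau in (0.1, 1.0, 10.0):
    print(tau, softmax_with_temperature(logits, tau).round(3))
# tau = 0.1  -> almost all mass on the largest logit
# tau = 10.0 -> distribution approaches uniform
```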
“…One way to mitigate this concern is to use calibration, which encourages the confidence level to correspond to the probability that the model is correct (DeGroot and Fienberg, 1983). In this paper we use temperature calibration, a simple technique that has been shown to work well in practice (Guo et al., 2017), in particular for BERT fine-tuning (Desai and Durrett, 2020). The method learns a single parameter, denoted temperature or T, and divides each of the logits {z_i} by T before applying the softmax function:…”
Section: Premise: Models Vary in Size, Examples Vary in Complexity
confidence: 99%
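The single parameter T mentioned in this citation is fit on held-out data. A minimal sketch, assuming the common recipe from Guo et al. (2017) of choosing T to minimize validation negative log-likelihood; a grid search stands in for their optimizer, and all names below are hypothetical.

```python
# Sketch of learning a single temperature T on held-out logits by
# minimizing negative log-likelihood (NLL) of the gold labels.
import numpy as np

def nll_at_temperature(logits, labels, T):
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)                       # stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 91)):
    nlls = [nll_at_temperature(val_logits, val_labels, T) for T in grid]
    return grid[int(np.argmin(nlls))]

# Toy usage: artificially overconfident logits; the fitted T > 1 softens them.
rng = np.random.default_rng(0)
val_logits = rng.normal(size=(100, 3)) * 5.0
val_labels = rng.integers(0, 3, size=100)
print("fitted T:", fit_temperature(val_logits, val_labels))
```

At test time the learned T is applied to every example's logits before the softmax, which rescales confidences without changing the predicted class.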