Proceedings of the 55th Annual Meeting of the Association For Computational Linguistics (Volume 2: Short Papers) 2017
DOI: 10.18653/v1/p17-2059
A Deep Network with Visual Text Composition Behavior

Abstract: While natural languages are compositional, how state-of-the-art neural models achieve compositionality is still unclear. We propose a deep network, which not only achieves competitive accuracy for text classification, but also exhibits compositional behavior. That is, while creating hierarchical representations of a piece of text, such as a sentence, the lower layers of the network distribute their layer-specific attention weights to individual words. In contrast, the higher layers compose meaningful phrases a…

Cited by 3 publications (5 citation statements)
References 16 publications (20 reference statements)
“…As discussed in §1, our method is inspired by the approach of Schwartz et al. (2020) and Xin et al. (2020a), where they preempt computation if the softmax value of any early classifier is above a predefined threshold. Unlike our approach, however, their model is not guaranteed to be accurate, even after softmax calibration (Guo, 2017). Several approaches to early exiting also include fine-tuning stages to improve efficiency (Liu et al., 2020; Geng et al., 2021; …).…”
Section: Related Work
confidence: 99%
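The preempt-on-threshold rule this statement describes reduces to a short loop over per-layer classifier outputs. Below is a minimal sketch, assuming the layer-wise logits are already computed; the function names and the 0.9 default threshold are illustrative assumptions, not the cited authors' code.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax for a single logit vector."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def preempt_predict(per_layer_logits, threshold=0.9):
    """Exit at the first early classifier whose maximum softmax value
    clears `threshold`; otherwise fall back to the final classifier.
    Returns (predicted_class, exit_layer). The threshold is an
    illustrative assumption, not a value from the cited papers."""
    for k, logits in enumerate(per_layer_logits):
        probs = softmax(logits)
        if probs.max() >= threshold:
            return int(probs.argmax()), k
    return int(probs.argmax()), len(per_layer_logits) - 1
```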
“…Following Schwartz et al. (2020), we exit on the first layer where $p_k^{\max} \geq 1 - \epsilon$, where $p_k^{\max}$ denotes the maximum softmax response of our early classifier. Softmax values are calibrated using temperature scaling (Guo, 2017) on another held-out data split, $D_{\text{scale}}$.…”
Section: Baselines
confidence: 99%
“…Following Schwartz et al. (2020), we exit on the first layer where $p_k^{\max} \geq 1 - \epsilon$, where $p_k^{\max}$ denotes the maximum softmax response of our early classifier. Softmax values are calibrated using temperature scaling (Guo, 2017) on another held-out (labeled) data split, $D_{\text{scale}}$.…”
Section: Baselines
confidence: 99%
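Concretely, the exit criterion quoted above compares each early classifier's temperature-scaled maximum softmax response against $1 - \epsilon$. A minimal sketch, assuming a temperature already fitted on a held-out split such as $D_{\text{scale}}$; the epsilon default is illustrative, as the quoted statements do not give the papers' values:

```python
import numpy as np

def calibrated_exit_layer(per_layer_logits, temperature, epsilon=0.1):
    """Return the first layer k whose temperature-scaled maximum softmax
    response p_k^max reaches 1 - epsilon, else the last layer. The
    temperature is assumed to be pre-fitted on a held-out split
    (D_scale in the quoted statements); epsilon=0.1 is illustrative."""
    for k, logits in enumerate(per_layer_logits):
        z = np.asarray(logits, dtype=float) / temperature
        z = z - z.max()  # numerical stability
        probs = np.exp(z) / np.exp(z).sum()
        if probs.max() >= 1 - epsilon:
            return k
    return len(per_layer_logits) - 1
```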
“…As (1) involves population quantities, we usually adopt empirical approximations (Guo, 2017) to estimate the calibration error. Specifically, we partition all data points into M bins of equal size according to their prediction confidences.…”
Section: Preliminaries
confidence: 99%
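The binning estimate this statement describes can be computed directly: sort predictions by confidence, split them into M equal-size bins, and average the gap between per-bin accuracy and per-bin confidence. A minimal sketch with equal-mass bins, as the statement specifies; the function name is illustrative:

```python
import numpy as np

def ece_equal_size_bins(confidences, correct, num_bins=10):
    """Empirical calibration error with M equal-size bins: sort examples
    by confidence, split into bins holding the same number of points,
    and average |accuracy - confidence| weighted by bin mass."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(confidences)
    n = len(confidences)
    error = 0.0
    for idx in np.array_split(order, num_bins):
        if len(idx) == 0:
            continue
        gap = abs(correct[idx].mean() - confidences[idx].mean())
        error += (len(idx) / n) * gap
    return error
```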
“…• Temperature Scaling (TS) (Guo, 2017) is a postprocessing calibration method that learns a single parameter to rescale the logits on the development set after the model is fine-tuned.…”
Section: Baselines
confidence: 99%
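Temperature scaling as described fits one scalar on development-set logits by minimizing negative log-likelihood. A minimal sketch, assuming NumPy/SciPy; the bounded search interval is an illustrative choice, not a detail from the quoted statement:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(dev_logits, dev_labels):
    """Fit the single temperature parameter on development-set logits by
    minimizing negative log-likelihood, then return it. The search
    bounds are an illustrative assumption."""
    logits = np.asarray(dev_logits, dtype=float)
    labels = np.asarray(dev_labels, dtype=int)

    def nll(temperature):
        z = logits / temperature
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x
```

At inference time, logits are divided by the fitted temperature before the softmax; this rescales confidences but leaves the argmax prediction unchanged.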