How Certain is Your Transformer?

Shelmanov, Artem; Tsymbalov, Evgenii; Puzyrev, Dmitri; Fedyanin, Kirill; Panchenko, Alexander; Panov, Maxim

doi:10.18653/v1/2021.eacl-main.157

Cited by 23 publications

(15 citation statements)

References 23 publications

(20 reference statements)

“…(He et al 2020) combined mix-up, self-ensembling and dropout to achieve more accurate uncertainty score for text classification. (Shelmanov et al 2021) proposed to incorporate determinantal point process (DPP) to MC dropout to quantify the uncertainty of transformers. Different to the above-mentioned approaches, we inject stochasticity into the vanilla transformer with Gumbel-Softmax tricks.…”

Section: Related Work (mentioning)

confidence: 99%

Transformer Uncertainty Estimation with Hierarchical Stochastic Attention

Pei

Wang

Szarvas

2022

AAAI

Transformers are state-of-the-art in a wide range of NLP tasks and have also been applied to many real-world products. Understanding the reliability and certainty of transformer models is crucial for building trustable machine learning applications, e.g., medical diagnosis. Although many recent transformer extensions have been proposed, the study of the uncertainty estimation of transformer models is under-explored. In this work, we propose a novel way to enable transformers to have the capability of uncertainty estimation and, meanwhile, retain the original predictive performance. This is achieved by learning hierarchical stochastic self-attention that attends to values and a set of learnable centroids, respectively. Then new attention heads are formed with a mixture of sampled centroids using the Gumbel-Softmax trick. We theoretically show that the self-attention approximation by sampling from a Gumbel distribution is upper bounded. We empirically evaluate our model on two text classification tasks with both in-domain (ID) and out-of-domain (OOD) datasets. The experimental results demonstrate that our approach: (1) achieves the best predictive-uncertainty trade-off among compared methods; (2) exhibits very competitive (in most cases, better) predictive performance on ID datasets; (3) is on par with Monte Carlo dropout and ensemble methods in uncertainty estimation on OOD datasets.
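
The Gumbel-Softmax trick named in this abstract can be illustrated with a minimal, generic sampler. This is only a sketch of the standard trick, not the authors' hierarchical stochastic attention; the function name `gumbel_softmax`, the temperature parameter `tau`, and the example logits are assumptions made for illustration.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw one relaxed (soft) categorical sample over the given logits.

    Hypothetical helper for illustration; `tau` is the softmax temperature.
    """
    noisy = []
    for z in logits:
        u = rng.random()
        while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
            u = rng.random()
        # Gumbel(0, 1) noise via inverse transform sampling.
        g = -math.log(-math.log(u))
        noisy.append((z + g) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative logits, e.g. similarity scores against three centroids.
    print(gumbel_softmax([0.5, 1.5, -0.3], tau=0.5, rng=rng))
```

Lower values of `tau` push each sample toward a one-hot vector, while higher values give smoother mixtures; the index of the largest weight is distributed according to the softmax of the original logits.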

“…Although transformers show excellent capability in processing long sequences of data, one of their main drawbacks is that they are not able to provide mathematically-grounded estimates of their uncertainty for predictions. To address this issue, Bayesian transformers have been proposed [22,25,32] with the ability to quantify their uncertainty. Among various Bayesian approaches, Monte Carlo Dropout (MCD) [9] has become a wide-spread Bayesian inference scheme [7,22,25,27].…”

Section: Background and Related Work, 2.1 Bayesian Transformer (mentioning)

confidence: 99%

“…Invalid uncertainty estimates can result in overconfident and uncalibrated decisions, which present hazards for deploying NNs in safety-critical applications such as in healthcare or autonomous driving [12,16]. To overcome this drawback, Bayesian transformers [11,22,32] have been introduced with the mathematical grounding for reliable uncertainty estimation. An illustrative example is presented in Figure 1.…”

Section: Introduction (mentioning)

confidence: 99%

“…Among various Bayesian transformers, Monte Carlo Dropout (MCD)-based transformers have become the mainstream approach for providing reliable uncertainty estimation [9,22]. However, the repeated Monte Carlo (MC) sampling and the compute-intensive attention mechanism deteriorate their hardware performance, limiting their deployment in real-world applications.…”

Section: Introduction (mentioning)

confidence: 99%

Proceedings of the 59th ACM/IEEE Design Automation Conference

Shaklee

Newton

2022

Quantifying the uncertainty of neural networks (NNs) has been required by many safety-critical applications such as autonomous driving or medical diagnosis. Recently, Bayesian transformers have demonstrated their capabilities in providing high-quality uncertainty estimates paired with excellent accuracy. However, their real-time deployment is limited by the compute-intensive attention mechanism that is core to the transformer architecture, and the repeated Monte Carlo sampling to quantify the predictive uncertainty. To address these limitations, this paper accelerates Bayesian transformers via both algorithmic and hardware optimizations. On the algorithmic level, an evolutionary algorithm (EA)-based framework is proposed to exploit the sparsity in Bayesian transformers and ease their computational workload. On the hardware level, we demonstrate that the sparsity brings hardware performance improvement on our optimized CPU and GPU implementations. An adaptable hardware architecture is also proposed to accelerate Bayesian transformers on an FPGA. Extensive experiments demonstrate that the EA-based framework, together with hardware optimizations, reduces the latency of Bayesian transformers by up to 13, 12 and 20 times on CPU, GPU and FPGA platforms respectively, while achieving higher algorithmic performance.

“…The other popular alternative is dropout which adds stochasticity to a standard neural network via randomly setting some of the weights to zero. This technique leads to the regularization of training [22] and can provide uncertainty estimates if applied at prediction time [6,24,23,21].…”

Section: Introduction and Related Work (mentioning)

confidence: 99%
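
The idea in the statement above, keeping dropout active at prediction time to obtain uncertainty estimates, can be sketched on a toy model. This is a minimal illustration under stated assumptions, not any cited paper's implementation: the linear classifier, the weights, the function names (`forward`, `mc_dropout_predict`) and the choice to mask input features rather than individual weights are all hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, weights, drop_p, rng):
    """One stochastic pass of a toy linear classifier with dropout on its inputs."""
    keep = 1.0 - drop_p
    # Inverted dropout: zero each feature with probability drop_p, rescale the rest.
    masked = [xi / keep if rng.random() < keep else 0.0 for xi in x]
    logits = [sum(w * xi for w, xi in zip(row, masked)) for row in weights]
    return softmax(logits)


def mc_dropout_predict(x, weights, drop_p=0.5, n_samples=200, seed=0):
    """Average class probabilities over stochastic passes; return (mean, entropy)."""
    rng = random.Random(seed)
    samples = [forward(x, weights, drop_p, rng) for _ in range(n_samples)]
    n_classes = len(weights)
    mean = [sum(s[c] for s in samples) / n_samples for c in range(n_classes)]
    # Entropy of the averaged prediction serves as a simple uncertainty score.
    entropy = -sum(p * math.log(p) for p in mean if p > 0.0)
    return mean, entropy


if __name__ == "__main__":
    W = [[2.0, -1.0, 0.5], [-1.0, 1.5, 0.0]]  # illustrative weights, 2 classes
    mean, entropy = mc_dropout_predict([1.0, 0.5, -0.5], W)
    print(mean, entropy)
```

Setting `drop_p=0.0` recovers the ordinary deterministic prediction, so the spread introduced by a nonzero `drop_p` is what the entropy score reflects.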

Scalable computation of prediction intervals for neural networks via matrix sketching

Fishkov

Panov

2022

Preprint

Self Cite

Accounting for the uncertainty in the predictions of modern neural networks is a challenging and important task in many domains. Existing algorithms for uncertainty estimation require modifying the model architecture and training procedure (e.g., Bayesian neural networks) or dramatically increase the computational cost of predictions such as approaches based on ensembling. This work proposes a new algorithm that can be applied to a given trained neural network and produces approximate prediction intervals. The method is based on the classical delta method in statistics but achieves computational efficiency by using matrix sketching to approximate the Jacobian matrix. The resulting algorithm is competitive with state-of-the-art approaches for constructing predictive intervals on various regression datasets from the UCI repository.
