2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.00117
Predicting with Confidence on Unseen Distributions

Cited by 48 publications (42 citation statements)
References 15 publications
“…(1), when the model is well-calibrated, the average of the calibrated confidence is close to the average accuracy, i.e., Conf(D) ≈ Acc(D). Meanwhile, predicting model performance accurately is an essential ingredient in developing reliable machine learning systems, especially under distributional shifts [Guillory et al., 2021]. As shown in Table 1, we find that our proposed method produces well-calibrated confidence values on both InD and OOD domains.…”
Section: Predicting Generalization
confidence: 77%
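The statement above pairs the calibration identity Conf(D) ≈ Acc(D) with temperature scaling (TS) [Guo et al., 2017]. Below is a minimal, illustrative sketch, not the cited papers' code: the function names and toy data are our own. It fits a temperature on labelled validation logits, then compares average calibrated confidence with accuracy:

```python
# Sketch: temperature scaling (Guo et al., 2017) plus the Conf(D) ~ Acc(D) check.
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(val_logits, val_labels):
    """Find T > 0 minimising the negative log-likelihood of scaled logits."""
    def nll(t):
        p = softmax(val_logits / t)
        return -np.log(p[np.arange(len(val_labels)), val_labels] + 1e-12).mean()
    return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x

def conf_and_acc(logits, labels, t):
    p = softmax(logits / t)
    conf = p.max(axis=1).mean()                 # Conf(D): mean max probability
    acc = (p.argmax(axis=1) == labels).mean()   # Acc(D): mean accuracy
    return conf, acc

# Toy usage with random logits; for a well-calibrated model conf ~ acc.
rng = np.random.default_rng(0)
val_logits = rng.normal(size=(500, 10))
val_labels = rng.integers(0, 10, 500)
t = fit_temperature(val_logits, val_labels)
print(conf_and_acc(val_logits, val_labels, t))
```

On a well-calibrated model the two printed numbers nearly match; a persistent gap on shifted data is exactly the failure mode the cited work studies.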
“…A separate line of work departs from complexity measures altogether and directly predicts OOD generalization from unlabelled test data. These methods either predict the model's correctness on individual examples [14,32,15] or directly estimate the total error [19,24,9,10,68]. Although these methods work well in practice, they do not provide any insight into the underlying mechanism of generalization, since they act only on the output layer of the network.…”
Section: Related Work
confidence: 99%
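The "directly estimate the total error" family can be illustrated with an average-confidence estimator: with no target labels, predict accuracy as the mean max softmax probability over the unlabelled shifted set. This is a hedged sketch of the general idea, not the specific estimator of any one cited method:

```python
# Sketch: unlabelled-accuracy estimate via average confidence on target logits.
import numpy as np

def predicted_accuracy(target_logits, temperature=1.0):
    """Estimate accuracy on an unlabelled target set as the mean max
    softmax probability (optionally temperature-scaled)."""
    z = target_logits / temperature
    z = z - z.max(axis=1, keepdims=True)           # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return float(p.max(axis=1).mean())             # estimated Acc on shifted data
```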
“…Unsupervised model performance evaluation: model performance evaluation without labels has received relatively limited attention. The performance of domain-specific models can be estimated via certain statistics, such as confidence scores [17], rotation prediction [10], and feature statistics of datasets sampled from a meta-dataset [11] for image recognition. General model evaluation often relies on differing assumptions and levels of access [6,7,8,9,13,18,20,38]. For example, [8] assumes covariate shift and requires users to provide an approximation (slice) of the shifted features, while [6] needs white-box access to the ML models to train an ensemble as a reference.…”
Section: Introduction
confidence: 99%
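Rotation prediction [10] is one of the statistics named above: rotation labels are free (every image can be rotated by 0, 90, 180, or 270 degrees), so the accuracy of an auxiliary rotation head on unlabelled target data can serve as a proxy that correlates with classification accuracy. A minimal sketch, assuming a model that exposes a hypothetical 4-way rotation_head (that attribute name is our own, not an API from the cited work):

```python
# Sketch: score a model on unlabelled target images via rotation prediction.
import torch

@torch.no_grad()
def rotation_score(model, images):
    """images: float tensor of shape (N, C, H, W).
    Returns the accuracy of the model's (assumed) 4-way rotation head at
    recognising 0/90/180/270-degree rotations; no task labels are needed."""
    correct, total = 0, 0
    for k in range(4):                            # rotation class k = k * 90 deg
        rotated = torch.rot90(images, k, dims=(2, 3))
        logits = model.rotation_head(rotated)     # hypothetical auxiliary head
        correct += (logits.argmax(dim=1) == k).sum().item()
        total += images.shape[0]
    return correct / total
```

Higher rotation accuracy on a shifted set is read as evidence that the main classifier will also be more accurate there; the proxy requires no target labels, only the rotated copies.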