Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data 2020
DOI: 10.1145/3318464.3380604
Learning to Validate the Predictions of Black Box Classifiers on Unseen Data

Cited by 25 publications (11 citation statements)
References 13 publications
“…We here aim for realistic modeling of these missingness patterns inspired by observations in large-scale real-world datasets as investigated in the work of Biessmann et al (2018) . We use an implementation proposed in the work of Schelter et al (2020) and Schelter et al (2021) , which selects two random percentiles of the values in a column, one for the lower and the other for the upper bound of the value range considered. In the MAR condition, we discard values if values in a random other column fall in that percentile.…”
Section: Methods (citation type: mentioning)
confidence: 99%
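The percentile-based MAR injection quoted above can be sketched in Python. This is a hypothetical illustration of the described mechanism, not the implementation of Schelter et al.; the function name and the exact sampling choices are assumptions.

```python
import numpy as np
import pandas as pd

def inject_mar_missingness(df, target_col, seed=None):
    """Sketch of percentile-based MAR injection: pick a random *other*
    column, draw two random percentiles as lower/upper bounds of a value
    range, and blank out target_col on the rows whose values in that
    other column fall inside the range. (Hypothetical implementation.)"""
    rng = np.random.default_rng(seed)
    df = df.copy()
    other_cols = [c for c in df.columns if c != target_col]
    dep_col = rng.choice(other_cols)               # column the missingness depends on
    lo, hi = np.sort(rng.uniform(0, 100, size=2))  # two random percentiles
    lower, upper = np.percentile(df[dep_col], [lo, hi])
    mask = df[dep_col].between(lower, upper)
    df.loc[mask, target_col] = np.nan              # MAR: depends only on observed dep_col
    return df
```

Because the discarded entries depend only on the fully observed dependent column, the resulting pattern is missing-at-random rather than missing-completely-at-random.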
“…More recently, Gopakumar et al (2018) suggested searching for the worst-case model performance using limited labeled data; however, we posit that using the worst case to assess the goodness of a model-under-test is overkill, because the worst case is often just an outlier. The work in Schelter et al (2020) learns to validate the model without labeled data by generating a synthetic dataset representative of the deployment data. The restrictive assumption is that it requires domain experts to provide a set of data generators, a task usually infeasible in reality.…”
Section: A Related Work (citation type: mentioning)
confidence: 99%
“…Further, it does not differentiate between different types of uncertainty. Schelter et al [32] proposed a model-agnostic validation approach to detect data-related errors at serving time. However, this work focuses on errors arising from data-processing issues, such as missing values or incorrectly entered values, and relies on programmatic specification of typical data errors.…”
Section: Related Work (citation type: mentioning)
confidence: 99%