2022
DOI: 10.48550/arxiv.2202.01327
Preprint

Adaptive Sampling Strategies to Construct Equitable Training Datasets

Abstract: In domains ranging from computer vision to natural language processing, machine learning models have been shown to exhibit stark disparities, often performing worse for members of traditionally underserved groups. One factor contributing to these performance gaps is a lack of representation in the data the models are trained on. It is often unclear, however, how to operationalize representativeness in specific applications. Here we formalize the problem of creating equitable training datasets, and propose a st…

Cited by 4 publications (5 citation statements)
References 19 publications
“…Inspired by literature on active learning for fairness outcomes, [18,19], we also consider adaptively constructing the training dataset by increasing the sampling weights for the points from the demographic categories with the lowest accuracy (accuracy-weighted), highest mean-squared-error (MSE-weighted), or highest exclusion/inclusion error rates (disparity-weighted). In these group-based strategies, each member of the demographic category is assigned the same weight, but each group's weight differs by group-specific model performance in each round of simulated data collection.…”
Section: Methods (mentioning, confidence: 99%)
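
As a rough illustration of the group-weighted strategy quoted above, the sketch below grows a training set over several rounds, giving every member of a demographic group the same sampling weight, proportional to that group's current error rate. The function name, the error-proportional weighting rule, and the scikit-learn-style fit/predict interface are illustrative assumptions, not the exact procedure from the cited work.

import numpy as np

def error_weighted_sampling(pool_X, pool_y, pool_group, model,
                            n_rounds=10, batch_size=100, seed=0):
    """Grow a training set over n_rounds, upweighting the demographic
    groups on which the current model has the highest error rate."""
    rng = np.random.default_rng(seed)
    n = len(pool_X)
    groups = np.unique(pool_group)
    # Seed round: a uniform random batch.
    train_idx = list(rng.choice(n, size=batch_size, replace=False))

    for _ in range(n_rounds):
        model.fit(pool_X[train_idx], pool_y[train_idx])
        preds = model.predict(pool_X)

        # One weight per group (its error rate); every member of a
        # group shares that group's weight.
        group_err = {g: float(np.mean(preds[pool_group == g]
                                      != pool_y[pool_group == g]))
                     for g in groups}
        weights = np.array([group_err[g] for g in pool_group])
        weights[train_idx] = 0.0           # never resample chosen points
        total = weights.sum()
        p = weights / total if total > 0 else None  # fall back to uniform

        new_idx = rng.choice(n, size=batch_size, replace=False, p=p)
        train_idx.extend(int(i) for i in new_idx)
    return train_idx

Swapping the per-group error rate for a per-group MSE or exclusion/inclusion error rate yields the MSE-weighted and disparity-weighted variants the passage mentions.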
“…Specifically, there are often artifacts in the data, such as correlations between attributes like race or gender and the target variable, that we do not want our predictive model to learn. One way to mitigate this is to try to balance the data set to remove these correlations within the data by adding new data points or editing existing ones (data curation) [44,46,137].…”
Section: Mitigations (mentioning, confidence: 99%)
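
As a hedged sketch of the data-curation idea in that passage, the snippet below rebalances a dataset by resampling so that every (group, label) cell has equal size, which removes the marginal correlation between the group attribute and the target. The equal-cell target and all names are assumptions chosen for illustration; the cited works may curate data differently.

import numpy as np

def rebalance_cells(X, y, group, seed=0):
    """Upsample every (group, label) cell to the size of the largest
    cell, breaking the correlation between group and target."""
    rng = np.random.default_rng(seed)
    cells = [np.where((group == g) & (y == c))[0]
             for g in np.unique(group) for c in np.unique(y)]
    cells = [idx for idx in cells if len(idx) > 0]
    target = max(len(idx) for idx in cells)
    keep = np.concatenate([rng.choice(idx, size=target, replace=True)
                           for idx in cells])
    return X[keep], y[keep], group[keep]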
“…However, the obtained labels are still assumed to be the "gold standard". Similarly, other very recent works also incorporate the fairness notion in active learning strategy design (Abernethy et al 2020;Sharaf and Daumé III 2020;Cai et al 2022). All these new approaches inherit from classical active learning the assumption that the acquired label is a perfect match with the label of interest, but this assumption does not hold in a wide range of practical scenarios.…”
Section: Related Work (mentioning, confidence: 99%)