A Methodology for Controlling Bias and Fairness in Synthetic Data Generation

Barbierato, Enrico; Vedova, Marco L. Della; Tessera, Daniele; Toti, Daniele; Vanoli, Nicola

doi:10.3390/app12094619

Cited by 10 publications

(3 citation statements)

References 17 publications


“…In addition, considering the growing concerns around data privacy and fairness, it is imperative to thoroughly explore the ethical implications of synthetic data generation techniques. The development of algorithms based on machine learning techniques must take into account concepts such as data bias and fairness [44]. While the scientific literature proposes numerous techniques to detect and evaluate these problems in real datasets, less attention has been dedicated to methods generating intentionally biased datasets, which could be used by data scientists to develop and validate unbiased and fair decision-making algorithms [45].…”

Section: Discussion (mentioning)

confidence: 99%

Exploring Innovative Approaches to Synthetic Tabular Data Generation

Papadaki, Vrahatis, Kotsiantis

2024

Electronics


The rapid advancement of data generation techniques has spurred innovation across multiple domains. This comprehensive review delves into the realm of data generation methodologies, with a keen focus on statistical and machine learning-based approaches. Notably, novel strategies like the divide-and-conquer (DC) approach and cutting-edge models such as GANBLR have emerged to tackle a spectrum of challenges, spanning from preserving intricate data relationships to enhancing interpretability. Furthermore, the integration of generative adversarial networks (GANs) has sparked a revolution in data generation across sectors like healthcare, cybersecurity, and retail. This review meticulously examines how these techniques mitigate issues such as class imbalance, data scarcity, and privacy concerns. Through a meticulous analysis of evaluation metrics and diverse applications, it underscores the efficacy and potential of synthetic data in refining predictive models and decision-making software. Concluding with insights into prospective research trajectories and the evolving role of synthetic data in propelling machine learning and data-driven solutions across disciplines, this work provides a holistic understanding of the transformative power of contemporary data generation methodologies.


“…The use of balanced synthetic datasets created by GANs to augment classification training has demonstrated the benefits for reducing disparate impact due to minoritized subgroup imbalance [112][113][114]. [115] models bias using a probabilistic network exploiting structural equation modeling as the preprocessing to generate a fairness-aware synthetic dataset. Authors in [116] leverage GAN as the pre-processing for fair data generation that ensures the generated data is discrimination free while maintaining high data utility.…”

Section: Fairness (mentioning)

confidence: 99%

Machine Learning for Synthetic Data Generation: a Review

Lu, Wang, Wei

2023

Preprint


Data plays a crucial role in machine learning. However, in real-world applications, there are several problems with data, e.g., data are of low quality; a limited number of data points lead to under-fitting of the machine learning model; it is hard to access the data due to privacy, safety and regulatory concerns. Synthetic data generation offers a promising new avenue, as it can be shared and used in ways that real-world data cannot. This paper systematically reviews the existing works that leverage machine learning models for synthetic data generation. Specifically, we discuss the synthetic data generation works from several perspectives: (i) applications, including computer vision, speech, natural language, healthcare, and business; (ii) machine learning methods, particularly neural network architectures and deep generative models; (iii) privacy and fairness issue. In addition, we identify the challenges and opportunities in this emerging field and suggest future research directions.


“…Biased synthetic data, when it contains demographic biases, can exacerbate downstream equity concerns. Careful study, design, and testing are needed to determine if synthetic data are helpful in mitigating bias for each individual task and does not introduce new biases.…”

Section: Bias Category I: Data Collection (mentioning)

confidence: 99%

Toward fairness in artificial intelligence for medical image analysis: identification and mitigation of potential biases in the roadmap from data collection to model deployment

et al. 2023


Purpose: To recognize and address various sources of bias essential for algorithmic fairness and trustworthiness and to contribute to a just and equitable deployment of AI in medical imaging, there is an increasing interest in developing medical imaging-based machine learning methods, also known as medical imaging artificial intelligence (AI), for the detection, diagnosis, prognosis, and risk assessment of disease with the goal of clinical implementation. These tools are intended to help improve traditional human decision-making in medical imaging. However, biases introduced in the steps toward clinical deployment may impede their intended function, potentially exacerbating inequities. Specifically, medical imaging AI can propagate or amplify biases introduced in the many steps from model inception to deployment, resulting in a systematic difference in the treatment of different groups. Approach: Our multi-institutional team included medical physicists, medical imaging artificial intelligence/machine learning (AI/ML) researchers, experts in AI/ML bias, statisticians, physicians, and scientists from regulatory bodies. We identified sources of bias in AI/ML, mitigation strategies for these biases, and developed recommendations for best practices in medical imaging AI/ML development. Results: Five main steps along the roadmap of medical imaging AI/ML were identified: (1) data collection, (2) data preparation and annotation, (3) model development, (4) model evaluation, and (5) model deployment. Within these steps, or bias categories, we identified 29 sources of potential bias, many of which can impact multiple steps, as well as mitigation strategies. Conclusions: Our findings provide a valuable resource to researchers, clinicians, and the public at large.

