2023
DOI: 10.1145/3588433

Representation Bias in Data: A Survey on Identification and Resolution Techniques

Abstract: Data-driven algorithms are only as good as the data they work with, while data sets, especially social data, often fail to represent minorities adequately. Representation bias in data can arise for various reasons, ranging from historical discrimination to selection and sampling biases in the data acquisition and preparation methods. Given that “bias in, bias out”, one cannot expect AI-based solutions to have equitable outcomes for societal applications without addressing issues such as representation bias…


Cited by 21 publications (43 citation statements)
References 78 publications
“…Training data bias: Biases in training data are reflected in downstream models. Under-represented subgroups can suffer lower accuracy due to insufficient weight in the training data (Buolamwini & Gebru, 2018; Chen et al., 2018; Kleinberg et al., 2022; Shahbazi et al., 2023), and socially undesirable biases in data are often amplified by models (Bolukbasi et al., 2016; Caliskan et al., 2017; Taori & Hashimoto, 2023). Various papers have studied how re-weighting or curating datasets can mitigate these biases (Zhao et al., 2017; Ryu et al., 2017; Tschandl et al., 2018; Yang et al., 2020), even finding that overall performance is improved by over-weighting minority groups and actively increasing diversity in datasets (Gao et al., 2020; Rolf et al., 2021; Lee et al., 2022).…”
Section: Related Work
confidence: 99%
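The re-weighting idea summarized in the statement above can be made concrete with a short sketch. This is a generic inverse-group-frequency scheme on synthetic data, offered only as an illustration; it is not the specific method of Zhao et al. (2017) or any other paper cited in the snippet, and the group labels, features, and model choice are all hypothetical.

```python
# A minimal, hypothetical sketch of group re-weighting: samples from an
# under-represented group get proportionally larger weight so that each
# group contributes equally to the training loss. Data and groups are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1_000
X = rng.normal(size=(n, 5))                  # hypothetical features
y = rng.integers(0, 2, size=n)               # hypothetical binary labels
group = rng.choice(["majority", "minority"], size=n, p=[0.9, 0.1])

# Inverse-frequency ("balanced") weights: weight_i = n / (k * count(group_i)),
# where k is the number of groups.
_, idx = np.unique(group, return_inverse=True)
counts = np.bincount(idx)
weights = n / (len(counts) * counts[idx])

model = LogisticRegression()
model.fit(X, y, sample_weight=weights)       # scikit-learn accepts per-sample weights
```

Strategies in the cited literature go further than this baseline, e.g. actively curating datasets or deliberately over-weighting minority groups, rather than merely equalizing aggregate group weight.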
“…The underrepresentation of racial and ethnic minoritized groups in research can perpetuate representation bias in data collection, discrimination, and disparities. 25,26 Broadly, racial bias can be described as preconceptions, unconscious ideas, or experiences that make people think and act in a prejudiced manner. 27 Bias in data refers to errors that arise when certain elements of a database receive more attention or are overrepresented.…”
Section: Racial Bias In Survey Research
confidence: 99%
“…Health care organizations using such data to inform protocols, program screenings, models, or algorithms risk having inherent bias in generated results. 26 For example, incomplete risk scores used to inform resource allocation (i.e., before/during/after disasters) could perpetuate racial disparities rather than eliminate them. 24 Prioritizing the perspectives and contributions of minoritized groups who have been disproportionately harmed by disasters can help address representation bias in the data collection process and facilitate more equitable knowledge construction and survey tools.…”
Section: Racial Bias In Survey Research
confidence: 99%
“…The impact of Artificial Intelligence (AI) has been significant across nearly every application domain. However, the quality of AI models largely depends on the quality of the datasets used to train them [1-5]. Moreover, several past incidents highlight the devastating consequences of using biased and erroneous datasets for training AI models, such as discriminatory treatment of users based on demographic characteristics like gender, age, race, and religion by AI systems [1, 6-10].…”
confidence: 99%