Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/2021.emnlp-main.42
Mitigating Language-Dependent Ethnic Bias in BERT

Abstract: BERT and other large-scale language models (LMs) contain gender and racial bias. They also exhibit other dimensions of social bias, most of which have not been studied in depth, and some of which vary depending on the language. In this paper, we study ethnic bias and how it varies across languages by analyzing and mitigating ethnic bias in monolingual BERT for English, German, Spanish, Korean, Turkish, and Chinese. To observe and quantify ethnic bias, we develop a novel metric called Categorical Bias score. Th…
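The abstract's Categorical Bias score quantifies how unevenly a masked LM distributes probability across ethnicity terms slotted into a template. A minimal sketch of a variance-based score of this kind is below, assuming you already have the model's normalized probability for each ethnicity term per template; the function name and input layout are illustrative, not the paper's exact definition.

```python
import math
from statistics import variance

def categorical_bias(probs_per_template):
    """Variance-based bias sketch: for each template, take the variance
    of the log-probabilities assigned to the candidate ethnicity terms,
    then average across templates. A uniform distribution over terms
    yields 0; a skewed one yields a positive score."""
    per_template = [
        variance(math.log(p) for p in probs)
        for probs in probs_per_template
    ]
    return sum(per_template) / len(per_template)

# Uniform probabilities over four ethnicity terms: zero variance, no bias.
unbiased = categorical_bias([[0.25, 0.25, 0.25, 0.25]])

# One term heavily favored: positive score, indicating bias.
biased = categorical_bias([[0.70, 0.10, 0.10, 0.10]])
```

Using log probabilities (rather than raw probabilities) keeps the score sensitive to relative rather than absolute differences, which matters when template likelihoods vary widely.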

Cited by 36 publications (36 citation statements)
References 26 publications (46 reference statements)
“…It is important to note the difference between our work and other work that has been carried out on ethnic bias in NLP models, e.g., Ahn and Oh (2021) and Nadeem et al (2021). The concern of these studies is stereotypes that are expressed about members of ethnic minorities.…”
Section: Discussion
confidence: 80%
“…In prior work on MLMs, social biases in languages other than English have rarely been investigated. Ahn and Oh (2021) investigated ethnic bias in monolingual MLMs for six languages by extending the templates to those languages with machine translation. The biases of MLMs have been evaluated using templates for English and Chinese (Liang et al, 2020) and for English and German (Bartl et al, 2020).…”
Section: Related Work
confidence: 99%
“…Similar to any AI model, existing inequalities in big models may compound historical discrimination [1103] by producing unfair results, information cocoons, and disproportionately negative consequences for minorities [1104,1105,1106]. Since big models affect downstream applications, understanding how biases arise in big models and what harms they cause has recently attracted attention [1107,20,1108,1109,1110,1111,1112,1113].…”
Section: Fairness
confidence: 99%
“…However, even if social bias is eliminated at the word level, sentence-level bias can still exist due to imbalanced combinations of words. Recently, there have been several studies on how to measure sentence-level bias [1136,1137,1109]. Moreover, Xu et al [1111] showed that detoxification techniques, while useful in language models, may hurt equity.…”
Section: Fairness
confidence: 99%