Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/2021.emnlp-main.375

Sociolectal Analysis of Pretrained Language Models

Abstract: Using data from English cloze tests, in which subjects also self-reported their gender, age, education, and race, we examine performance differences of pretrained language models across demographic groups, defined by these (protected) attributes. We demonstrate wide performance gaps across demographic groups and show that pretrained language models systematically disfavor young non-white male speakers; i.e., not only do pretrained language models learn social biases (stereotypical associations) - pretrained lan…

Cited by 9 publications (9 citation statements). References 15 publications.
“…Additionally, many of the language models used today carry assumptions about the acoustic tones and rhythms found in “typical” conversational speech. This could negatively impact pain patients with speech impediments ( 164 ) or those whose language patterns do not match that of model creators (typically White, English-speaking men) ( 165 , 166 ) 9 , and could exclude or delegitimize the use of lyrics, rhymes, or singing, all of which can also be used to communicate pain ( 167 ).…”
Section: State-of-the-Art in Pain Methods
confidence: 99%
“…Due to these variations, treating "a language" as a homogeneous mass limits cultural adaptation, and runs the risk of privileging certain cultures over others. Zhang et al (2021) find that pretrained language models (PLMs; see §6) reflect certain sociolects more than others. For example, there are considerable morphosyntactic variations between Spanish spoken in Spain and Argentina (Bentivoglio and Sedano, 2011), but they are not considered separately in a Spanish PLM (Cañete et al, 2020).…”
Section: Linguistic Form and Style
confidence: 99%
“…The common methodology for training machine learning models (e.g., empirical loss minimisation) relies on maximising average performance across training examples (instead of groups, e.g., languages), which often leads to low minority performance, a phenomenon named representation disparity (Hashimoto et al, 2018). Model performance for minorities is often disregarded in favour of majority groups, as shown for race (Blodgett and O'Connor, 2017), gender (Jørgensen and Søgaard, 2021), and age (Zhang et al, 2021). Deriving fair models from biased data is a promising countermeasure (Mehrabi et al, 2021).…”
Section: Model Training
confidence: 99%
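The statement above contrasts empirical loss minimisation, which averages performance over all training examples, with performance measured per demographic group. A minimal sketch of that contrast is given below; the loss values, group labels, and sizes are hypothetical and are not data from this paper or any of the cited works. A distributionally robust objective (as discussed by Hashimoto et al., 2018, cited above) would target the worst-group loss rather than the overall average.

    # Minimal sketch (hypothetical values): average (ERM) loss vs. per-group loss,
    # illustrating how a small group's poor performance is hidden by the average.
    import numpy as np

    # Hypothetical per-example losses, each tagged with a demographic group label.
    losses = np.array([0.2, 0.3, 0.25, 0.28, 1.1, 1.3])
    groups = np.array(["majority"] * 4 + ["minority"] * 2)

    # Standard ERM objective: average loss over all examples.
    # The larger group dominates, so this number can look acceptable.
    erm_loss = losses.mean()

    # Group-aware view: average loss within each group.
    group_losses = {g: losses[groups == g].mean() for g in np.unique(groups)}

    # A group-robust objective would minimise the worst-group loss instead.
    worst_group_loss = max(group_losses.values())

    print(f"ERM (average) loss: {erm_loss:.3f}")
    print(f"Per-group losses:   {group_losses}")
    print(f"Worst-group loss:   {worst_group_loss:.3f}")

In this toy example the average loss looks moderate while the minority group's loss is several times higher, which is exactly the representation disparity the quoted statement describes.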
“…Regions with smaller populations than others contain fewer online users and are hence underrepresented in the training data. Particularly, Zhang et al (2021) show that most of the word embeddings reflect more of the language habits of European-educated males, neglecting other subsets of the population. This constitutes a biased selection of the population (Hershcovich et al, 2022;Ma et al, 2022) and raises concerns about the non-selected groups' representation within the dataset (Hershcovich et al, 2022;Wolfe and Caliskan, 2021), which will probably cause harms in applications.…”
Section: Introduction
confidence: 98%