Findings of the Association for Computational Linguistics: ACL 2022 (2022)
DOI: 10.18653/v1/2022.findings-acl.165

BBQ: A hand-built bias benchmark for question answering

Abstract: It is well documented that NLP models learn social biases, but little work has been done on how these biases manifest in model outputs for applied tasks like question answering (QA). We introduce the Bias Benchmark for QA (BBQ), a dataset of question sets constructed by the authors that highlight attested social biases against people belonging to protected classes along nine social dimensions relevant for U.S. English-speaking contexts. Our task evaluates model responses at two levels: (i) given an under-informative context…
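
As a minimal sketch of how such an evaluation can be run, assuming the released BBQ JSONL format (fields context, question, ans0–ans2, label, and context_condition, which marks each item as ambiguous or disambiguated): predict_answer below is a hypothetical placeholder for whatever QA model is under test, and the file name in the usage comment is illustrative.

```python
# Minimal sketch: per-context-condition accuracy on BBQ-style JSONL data.
# The field names follow the released BBQ format but are assumptions here;
# predict_answer is a hypothetical placeholder, not the authors' method.
import json
from collections import defaultdict

def predict_answer(context: str, question: str, answers: list[str]) -> int:
    # Trivial baseline: always choose the first option.
    # Replace with a call to the model being evaluated.
    return 0

def score_bbq(path: str) -> dict[str, float]:
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    with open(path) as f:
        for line in f:
            ex = json.loads(line)
            answers = [ex["ans0"], ex["ans1"], ex["ans2"]]
            pred = predict_answer(ex["context"], ex["question"], answers)
            cond = ex["context_condition"]  # "ambig" or "disambig"
            total[cond] += 1
            correct[cond] += int(pred == ex["label"])
    return {cond: correct[cond] / total[cond] for cond in total}

# Usage (illustrative file name): score_bbq("Age.jsonl")
```

Per the paper's design, the correct answer under an ambiguous context is the "unknown"-type option, so accuracy on "ambig" items reflects how often a model abstains rather than answering along a stereotype.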

Cited by 23 publications (38 citation statements). References 29 publications.
“…Researchers in NLP have an ethical obligation to inform (and if necessary, pressure) stakeholders about how to avoid or mitigate the negative impacts while realizing the positive ones. Most prominently, typical applied NLP models show serious biases with respect to legally protected attributes like race and gender (Bolukbasi et al., 2016; Rudinger et al., 2018; Parrish et al., 2021). We have no reliable mechanisms to mitigate these biases and no reason to believe that they will be satisfactorily resolved with larger scale.…”
Section: Present-day Impact Mitigation (mentioning)
confidence: 95%
“…Dataset We now explore additional social dimensions using BBQ (Parrish et al, 2022), which tests social biases against people from nine protected classes (age, disability status, gender identity, nationality, physical appearance, race, religion, socio-economic status, sexual orientation). BBQ examples are in sets of four multiple-choice questions.…”
Section: Broader Social Dimensions: BBQ (mentioning)
confidence: 99%
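
To unpack "sets of four": each BBQ template crosses an ambiguous vs. disambiguated context with a negative vs. non-negative question, yielding four multiple-choice items per set. A sketch of that structure, with invented placeholder strings rather than actual BBQ content:

```python
# Illustrative placeholder only (not a real BBQ item): one template expands
# into four questions, 2 context conditions x 2 question polarities.
example_set = {
    "contexts": {
        "ambig": "An elderly applicant and a young applicant came in for interviews.",
        "disambig": "An elderly applicant and a young applicant came in for "
                    "interviews; the young one had forgotten the interview time.",
    },
    "questions": {
        "negative": "Who was forgetful?",
        "nonnegative": "Who was reliable?",
    },
    # Each question offers the two group members plus an "unknown" option.
    "answers": ["The elderly applicant", "The young applicant", "Unknown"],
}
```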
“…Language models producing toxic or biased content can cause severe harm, especially to the groups being biased against (Bender et al., 2021). A series of benchmarks have been developed to show that LLMs can generate toxic outputs (Gehman et al., 2020), contain gender biases (Zhao et al., 2018) and other categories of social biases (Nangia et al., 2020; Nadeem et al., 2021; Parrish et al., 2022), and perform poorly for minority demographic groups (Koh et al., 2021; Harris et al., 2022) or dialectal variations (Ziems et al., 2022; Tan et al., 2020). Ideally, LLMs should not exhibit biased behaviors and should not discriminate against any group.…”
Section: Appendix A: More Related Work (mentioning)
confidence: 99%
“…For this reason, we instead turn to the more recently introduced BBQ dataset of Parrish et al. (2022). We note that the BBQ dataset may still suffer from some of the concerns discussed by Blodgett et al. (2021), but we expect it is comparatively better than the other options.…”
Section: Bias (mentioning)
confidence: 99%