2021
DOI: 10.48550/arxiv.2112.07868
Preprint

Few-shot Instruction Prompts for Pretrained Language Models to Detect Social Biases

Abstract: Warning: this paper contains content that may be offensive or upsetting. Detecting social bias in text is challenging due to nuance, subjectivity, and the difficulty of obtaining good-quality labeled datasets at scale, especially given the evolving nature of social biases and society. To address these challenges, we propose a few-shot instruction-based method for prompting pre-trained language models (LMs). We select a few label-balanced exemplars from a small support repository that are closest to the query to be labeled…
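The exemplar-selection step described in the abstract can be sketched roughly as follows: retrieve a small, label-balanced set of support examples closest to the query, then format them, together with an instruction, into a prompt for a pretrained LM. The embedding model (sentence-transformers MiniLM), the label names, and the prompt wording below are illustrative assumptions, not the paper's exact configuration.

```python
# Rough sketch: label-balanced nearest-exemplar selection + instruction prompt.
# Embedding model, labels, and prompt wording are assumptions for illustration.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Toy support repository of labeled statements.
support = [
    ("Statement A ...", "yes"),   # contains a social bias
    ("Statement B ...", "no"),
    ("Statement C ...", "yes"),
    ("Statement D ...", "no"),
]

def build_prompt(query, support, k_per_label=1):
    texts = [t for t, _ in support]
    sims = cosine_similarity(encoder.encode([query]), encoder.encode(texts))[0]
    exemplars = []
    for label in sorted({l for _, l in support}):
        # Keep the k most similar exemplars from each label (label-balanced).
        idx = [i for i, (_, l) in enumerate(support) if l == label]
        idx.sort(key=lambda i: sims[i], reverse=True)
        exemplars.extend(support[i] for i in idx[:k_per_label])
    lines = ["Does the statement below contain a social bias? Answer yes or no.", ""]
    for text, label in exemplars:
        lines += [f"Statement: {text}", f"Answer: {label}", ""]
    lines += [f"Statement: {query}", "Answer:"]
    return "\n".join(lines)

print(build_prompt("Statement to classify ...", support))
```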

Cited by 3 publications (5 citation statements)
References 37 publications

“…Additionally, these models also possess known (pre-deployment) safety issues for which we lack robust solutions [33] (e.g., How do you ensure the system does not generate inappropriate and harmful outputs, such as making overtly sexist or racist comments [65]? How do you identify bias issues in the system prior to deployment [8,53]? How do you ensure that when the model outputs a claim, it isn't making up facts [10]?, etc.…”
Section: Safety (mentioning)
confidence: 99%
“…To draw a connection to computational ethics, Schick et al. (2021) prompt GPT-2 and T5 for automated bias detection. Prabhumoye et al. (2021) extend this line of research using more structured prompts and perform few-shot experiments across different classes of LLMs with varying sizes. In this work, we extend the zero-shot toxicity detection approach explored in previous work (Schick et al., 2021) to its generative variant and demonstrate its greater competence.…”
Section: Related Work (mentioning)
confidence: 88%
“…Automatic toxicity detection facilitates online moderation, which is an important venue for NLP research to positively impact society. Recent works (Prabhumoye et al., 2021; Schick et al., 2021) demonstrate that large-scale pre-trained language models are able to detect toxic content without fine-tuning. Prompts can be carefully designed to harness the implicit knowledge about harmful text learned by MLM pre-training.…”
Section: Discussion (mentioning)
confidence: 99%
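The prompt-based detection these statements refer to can be illustrated with a rough, self-contained sketch: ask a pretrained LM whether a passage contains harmful language and compare the next-token probabilities of "Yes" and "No". The prompt wording, the choice of GPT-2, and the yes/no scoring below are assumptions for illustration, not the exact setup of Schick et al. (2021) or Prabhumoye et al. (2021).

```python
# Illustrative sketch of zero-shot toxicity probing with a pretrained LM.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def toxicity_score(text):
    # Hypothetical prompt wording; the cited papers use their own templates.
    prompt = (f'"{text}"\n'
              "Question: Does the above text contain rude or disrespectful language?\n"
              "Answer:")
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]   # next-token distribution
    probs = torch.softmax(logits, dim=-1)
    yes = probs[tokenizer.encode(" Yes")[0]].item()
    no = probs[tokenizer.encode(" No")[0]].item()
    return yes / (yes + no)                      # relative probability of "Yes"

print(toxicity_score("You people are all the same."))
```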
“…We follow the setting of Prabhumoye et al. (2022), which found that leveraging semantically close examples is effective for in-context learning. We use the vector space of all problem statements computed by Term Frequency-Inverse Document Frequency (TF-IDF) implemented in scikit-learn: https://tinyurl.com/scikitlearn-TF-IDF-vectorizer.…”
mentioning
confidence: 99%
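A minimal sketch of the retrieval step described in this statement, assuming cosine similarity over scikit-learn TF-IDF vectors of the problem statements; the variable names and the top-k choice are illustrative, not taken from the citing work.

```python
# Pick the problem statements most similar to a query as in-context examples,
# using scikit-learn's TfidfVectorizer (cosine similarity is an assumption).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

problem_statements = [
    "Problem statement 1 ...",
    "Problem statement 2 ...",
    "Problem statement 3 ...",
]

def closest_examples(query, statements, k=2):
    n = len(statements)
    matrix = TfidfVectorizer().fit_transform(statements + [query])
    sims = cosine_similarity(matrix[n], matrix[:n])[0]   # query vs. each statement
    ranked = sims.argsort()[::-1][:k]                     # indices of the k nearest
    return [statements[i] for i in ranked]

print(closest_examples("New problem ...", problem_statements))
```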