Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)
DOI: 10.18653/v1/2023.emnlp-main.84

Unveiling the Implicit Toxicity in Large Language Models

Jiaxin Wen, Pei Ke, Hao Sun, et al.

Abstract: The open-endedness of large language models (LLMs), combined with their impressive capabilities, may lead to new safety issues when they are exploited for malicious use. While recent studies primarily focus on probing toxic outputs that can be easily detected with existing toxicity classifiers, we show that LLMs can generate diverse implicit toxic outputs that are exceptionally difficult to detect via simple zero-shot prompting. Moreover, we propose a reinforcement learning (RL) based attacking method to further induce the implicit toxicity in LLMs…
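The truncated abstract sketches the attack's core idea: RL fine-tuning with a reward that prefers implicitly toxic outputs (text a strong judge deems toxic but an existing toxicity classifier misses) over both explicitly toxic and non-toxic ones. Below is a minimal Python sketch of that reward-shaping idea, not the paper's implementation; judge_is_toxic and classifier_score are hypothetical stand-ins for a strong toxicity oracle and an off-the-shelf detector.

    # Minimal sketch (assumed components, not the paper's code): reward outputs
    # that a strong judge flags as toxic but an existing classifier fails to
    # detect (implicit toxicity); penalize explicit toxicity and non-toxic text.
    def implicit_toxicity_reward(
        text: str,
        judge_is_toxic,       # hypothetical oracle: str -> bool
        classifier_score,     # hypothetical detector: str -> float in [0, 1]
        detect_threshold: float = 0.5,
    ) -> float:
        detected = classifier_score(text) >= detect_threshold
        if judge_is_toxic(text) and not detected:
            return 1.0    # implicitly toxic: toxic yet evades the classifier
        if judge_is_toxic(text):
            return -0.5   # explicitly toxic: toxic and easily flagged
        return -1.0       # non-toxic: carries no attack signal

    # Illustration with stand-in callables:
    reward = implicit_toxicity_reward(
        "some model output",
        judge_is_toxic=lambda t: True,    # pretend the oracle flags it
        classifier_score=lambda t: 0.1,   # pretend the classifier misses it
    )
    assert reward == 1.0

In an actual RL loop this scalar would feed a policy-gradient method such as PPO; the exact reward design and training setup are in the full paper, not the truncated abstract above.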

Cited by: 1 publication
References: 26 publications