Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)
DOI: 10.18653/v1/2023.emnlp-main.84

Unveiling the Implicit Toxicity in Large Language Models

Jiaxin Wen, Pei Ke, Hao Sun, et al.

Abstract: The open-endedness of large language models (LLMs), combined with their impressive capabilities, may lead to new safety issues when they are exploited for malicious use. While recent studies primarily focus on probing toxic outputs that can be easily detected with existing toxicity classifiers, we show that LLMs can generate diverse implicit toxic outputs that are exceptionally difficult to detect via simple zero-shot prompting. Moreover, we propose a reinforcement learning (RL) based attacking method to further induce the implicit toxicity in LLMs…
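The truncated abstract sketches the attack's core idea: RL fine-tuning with a reward that prefers implicitly toxic outputs (text a strong judge deems toxic but an existing toxicity classifier misses) over both explicitly toxic and non-toxic ones. Below is a minimal Python sketch of that reward-shaping idea, not the paper's implementation; judge_is_toxic and classifier_score are hypothetical stand-ins for a strong toxicity oracle and an off-the-shelf detector.

    # Minimal sketch (assumed components, not the paper's code): reward outputs
    # that a strong judge flags as toxic but an existing classifier fails to
    # detect (implicit toxicity); penalize explicit toxicity and non-toxic text.
    def implicit_toxicity_reward(
        text: str,
        judge_is_toxic,       # hypothetical oracle: str -> bool
        classifier_score,     # hypothetical detector: str -> float in [0, 1]
        detect_threshold: float = 0.5,
    ) -> float:
        detected = classifier_score(text) >= detect_threshold
        if judge_is_toxic(text) and not detected:
            return 1.0    # implicitly toxic: toxic yet evades the classifier
        if judge_is_toxic(text):
            return -0.5   # explicitly toxic: toxic and easily flagged
        return -1.0       # non-toxic: carries no attack signal

    # Illustration with stand-in callables:
    reward = implicit_toxicity_reward(
        "some model output",
        judge_is_toxic=lambda t: True,    # pretend the oracle flags it
        classifier_score=lambda t: 0.1,   # pretend the classifier misses it
    )
    assert reward == 1.0

In an actual RL loop this scalar would feed a policy-gradient method such as PPO; the exact reward design and training setup are in the full paper, not the truncated abstract above.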

Cited by: 1 publication
References: 26 publications