2022
DOI: 10.48550/arxiv.2212.10440
Preprint

Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data

Cited by 3 publications
(3 citation statements)
“…Guided by the aforementioned capabilities, we propose a pragmatic third-party detection method called LLMDet. Our approach is inspired by the observation that perplexity serves as a reliable signal for distinguishing the source of generated text, a finding that has been validated in previous work (Solaiman et al., 2019; Jansen et al., 2022; Mitchell et al., 2023). However, directly calculating perplexity requires access to LLMs, which compromises both safety and efficiency.…”
Section: Introduction (mentioning)
confidence: 94%
“…We use perplexity (Jansen et al 2022) as a proxy to measure the linguistic quality of the generated CN. We use the XLMR model 11 to calculate the perplexity of generated CNs.…”
Section: Metrics (mentioning)
confidence: 99%
“…The aforementioned properties allow for perplexity to be used for automatically distinguishing between high- and low-quality data [20], with one of the motives being the selection of data used to train new language models [21]. Perplexity can also be used for text classification based on language [22], the detection of harmful content [23], and fact checking [24].…”
Section: Definition (mentioning)
confidence: 99%
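
The citation statements above all rely on perplexity as a signal: text that a language model finds surprising gets a higher score than text that resembles the model's training data. As a rough illustration only, the sketch below computes perplexity as the exponential of the average negative log-probability per token. It assumes a toy add-one-smoothed unigram model and made-up example sentences; the function names `train_unigram` and `perplexity` are hypothetical, and this is not the model or pipeline used by the preprint or by any of the citing papers.

```python
import math
from collections import Counter


def train_unigram(corpus_tokens):
    """Build an add-one smoothed unigram model (toy assumption, not the paper's model)."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot shared by all unseen tokens

    def prob(token):
        return (counts.get(token, 0) + 1) / (total + vocab)

    return prob


def perplexity(tokens, prob):
    """Perplexity = exp(mean negative log-probability per token)."""
    if not tokens:
        raise ValueError("need at least one token")
    nll = -sum(math.log(prob(t)) for t in tokens) / len(tokens)
    return math.exp(nll)


# Made-up reference text standing in for the data the model was trained on.
reference = "the cat sat on the mat and the dog sat on the rug".split()
prob = train_unigram(reference)

ppl_in = perplexity("the cat sat on the rug".split(), prob)
ppl_out = perplexity("buy cheap pills now click here".split(), prob)

# Text resembling the reference data scores lower than unrelated text.
print(f"in-domain perplexity:     {ppl_in:.2f}")
print(f"out-of-domain perplexity: {ppl_out:.2f}")
```

In a filtering setting, documents would be kept or flagged by comparing such scores against a threshold chosen on held-out data; the threshold and the direction of the comparison depend on what the reference model was trained on.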