Pars-OFF: A Benchmark for Offensive Language Detection on Farsi Social Media

Ataei, Taha Shangipour; Darvishi, Kamyar; Javdan, Soroush; Pourdabiri, Amin; Minaei-Bidgoli, Behrouz; Pilehvar, Mohammad Taher

doi:10.1109/taffc.2022.3219229

Cited by 5 publications

(6 citation statements)

References 19 publications


“…Persian is one of the low-resource languages in this regard. The existing datasets for Persian hate speech detection include Pars-OFF (Ataei et al 2022), and two other non-public datasets introduced by Mozafari, Farahbakhsh, and Crespi (2022), and Alavi, Nikvand, and Shamsfard (2021). Pars-OFF comprises 7,381 normal and 3,182 offensive Persian tweets, organized into a three-level hierarchy as outlined in Zampieri et al (2019). The process of collecting tweets employed a combination of similarity-based and keyword-based data selection strategies.…”

Section: Hate Speech Datasets In Other Languages (mentioning)

confidence: 99%

“…The chosen approach for data selection can introduce biases or limitations to the datasets. Common approaches include searching for lists of slurs and derogatory keywords (Waseem and Hovy 2016; Kurrek, Saleem, and Ruths 2020), focusing on specific events or contexts (Grimminger and Klinger 2021), or adopting a mixture of strategies (Basile et al 2019; Ataei et al 2022; Fersini, Nozza, and Rosso 2018).…”

Section: Data Collection (mentioning)

confidence: 99%

Section: Introduction (mentioning)

confidence: 99%

“…To our knowledge, there are only three datasets for hate speech detection for this language (Ataei et al 2022; Mozafari, Farahbakhsh, and Crespi 2022; Alavi, Nikvand, and Shamsfard 2021). Unfortunately, among these, only Pars-OFF (Ataei et al 2022) is accessible to the public. While the dataset remains a valuable resource, it is essential to acknowledge its reliance on a weak heuristic during the data collection stage, resulting in a relatively trivial set of instances.…”

Section: Introduction (mentioning)

confidence: 99%


Spanning the Spectrum of Hatred Detection: A Persian Multi-Label Hate Speech Dataset with Annotator Rationales

Delbari, Moosavi, Pilehvar

2024

AAAI


With the alarming rise of hate speech in online communities, the demand for effective NLP models to identify instances of offensive language has reached a critical point. However, the development of such models heavily relies on the availability of annotated datasets, which are scarce, particularly for less-studied languages. To bridge this gap for the Persian language, we present a novel dataset specifically tailored to multi-label hate speech detection. Our dataset, called Phate, consists of an extensive collection of over seven thousand manually-annotated Persian tweets, offering a rich resource for training and evaluating hate speech detection models on this language. Notably, each annotation in our dataset specifies the targeted group of hate speech and includes a span of the tweet which elucidates the rationale behind the assigned label. The incorporation of this information expands the potential applications of our dataset, facilitating the detection of targeted online harm or allowing the benchmark to serve research on interpretability of hate speech detection models. The dataset, annotation guideline, and all associated code are accessible at https://github.com/Zahra-D/Phate.


“…Given the staggering volume of at least 500 million tweets being sent daily, manual detection of such content has become an unfeasible task. Consequently, researchers have turned to leveraging NLP and learning techniques to effectively address this issue [7]- [18], [50].…”

mentioning

confidence: 99%

Pars-HaO: Hate and Offensive Language Detection on Persian Tweets Using Machine Learning and Deep Learning

Karami Sheykhlan, Abdoljabbar, Karimpour

2023

Preprint


As social networks continue to gain widespread popularity, an urgent requirement arises to automatically identify and detect offensive language and hate speech. While there is a wealth of research and datasets available for English in this domain, there is currently a scarcity of research and datasets focused on identifying hate speech and offensive language in Persian text. This article introduces a 3-class dataset named Pars-HaO, consisting of 8013 tweets, to fill the gap in existing research. We collected the dataset by combining comments from pages that are more exposed to hate speech and using a keyword-based approach. Three annotators then labeled the tweets. In this study, we employed a combination of the Convolutional Neural Network (CNN) model and four widely recognized machine learning models, namely Support Vector Machine (SVM), Logistic Regression (LR), Random Forest (RF), and Decision Tree (DT), as a baseline. Then, we compared the base models with Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), and CNN models, each trained using the output of the last hidden state of Bidirectional Encoder Representations from Transformers (BERT). Experimental results on the Pars-HaO dataset demonstrated that the BERT with BiLSTM technique yielded the best outcome, achieving a macro F1-score of 70%.
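For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a macro F1-score is computed for a multi-class task: F1 is calculated per class and then averaged without weighting by class size. The function name `macro_f1` and the three label names in the usage example are hypothetical illustrations; the abstract does not name the classes, and this is not the authors' evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no support and no
        # predictions is scored 0.0 here (an assumption of this sketch).
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for a 3-class setup; not taken from the dataset.
gold = ["hate", "offensive", "neither", "neither"]
pred = ["hate", "neither", "neither", "neither"]
score = macro_f1(gold, pred)
```

Because every class contributes equally, a model that ignores a small class is penalized more under macro F1 than under accuracy, which is why the metric is commonly reported for imbalanced offensive-language datasets.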
