2022
DOI: 10.1007/s10639-022-11056-x
The design, construction and evaluation of annotated Arabic cyberbullying corpus

Abstract: Cyberbullying (CB) is classified as one of the severe misconducts on social media. Many CB detection systems have been developed for many natural languages to face this phenomenon. However, Arabic is one of the under-resourced languages suffering from the lack of quality datasets in many computational research areas. This paper discusses the design, construction, and evaluation of a multi-dialect, annotated Arabic Cyberbullying Corpus (ArCybC), a valuable resource for Arabic CB detection and motivation for fut…

Cited by 10 publications (11 citation statements)
References 60 publications
“…as (Abainia, 2020; Alsafari et al., 2020; Boucherit & Abainia, 2022; Mubarak et al., 2021; Shannag et al., 2022). Furthermore, some datasets were annotated by their respective authors (Alshehri et al., 2020; Badri et al., 2022; Khairy et al., 2022), while for others, no information was provided regarding the criteria used for selecting the annotators (Alam et al., 2022; De Smedt et al., 2018; Mohdeb et al., 2022; Obeidat et al., 2022; Raïdy & Harmanani, 2023).…”
Section: Note | mentioning | confidence: 99%
“…In some works, the crowdworkers are evaluated, without notifying them, by incorporating texts from the pre-annotated sample into each crowdworker task. Less accurate crowdworkers are disqualified based on comparison with the expert labels (Albadi et al., 2018, 2022; Alhelbawy et al., 2016; Chowdhury et al., 2020; Mubarak, Hassan, & Chowdhury, 2022; Shannag et al., 2022). Another approach is selecting crowdworkers with good reputation scores, which are provided on the crowdsourcing platform (Ousidhoum et al., 2019).…”
Section: Quality and Validation | mentioning | confidence: 99%
“…Arabic Cyberbullying Corpus (ArCybC) [52] is the first publicly available cyberbullying dataset for the Arabic language. Researchers can use it to classify tweets annotated as Cyberbullying (CB), Non-Cyberbullying (Non-CB), Offensive (Off), and Non-Offensive (Non-Off).…”
Section: Dataset Preparation | mentioning | confidence: 99%