Proceedings of the Fourteenth Workshop on Semantic Evaluation 2020
DOI: 10.18653/v1/2020.semeval-1.274

LIIR at SemEval-2020 Task 12: A Cross-Lingual Augmentation Approach for Multilingual Offensive Language Identification

Abstract: This paper presents our system, entitled 'LIIR', for SemEval-2020 Task 12 on Multilingual Offensive Language Identification in Social Media (OffensEval 2). We participated in Subtask A for the English, Danish, Greek, Arabic, and Turkish languages. We adapt and fine-tune the BERT and multilingual BERT models made available by Google AI for English and non-English languages, respectively. For the English language, we use a combination of two fine-tuned BERT models. For other languages, we propose a cross-lingual…
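The abstract describes fine-tuning BERT for English and multilingual BERT (mBERT) for the other languages on the binary Subtask A labels. As a rough illustration, here is a minimal sketch of that kind of fine-tuning, assuming the HuggingFace transformers library and the public bert-base-multilingual-cased checkpoint; the hyperparameters and toy data below are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch: fine-tuning multilingual BERT for binary offensive
# language classification (Subtask A style). Hyperparameters and the
# toy data are assumptions, not the paper's actual training setup.
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import BertTokenizer, BertForSequenceClassification

MODEL_NAME = "bert-base-multilingual-cased"  # mBERT, per the abstract

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

class OffenseDataset(Dataset):
    """Wraps (text, label) pairs; labels: 0 = NOT offensive, 1 = OFF."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True,
                             max_length=128, return_tensors="pt")
        self.labels = torch.tensor(labels)
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        return {k: v[i] for k, v in self.enc.items()}, self.labels[i]

# Toy examples standing in for the OffensEval training data.
train = OffenseDataset(["you are great", "you are an idiot"], [0, 1])
loader = DataLoader(train, batch_size=2, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # assumed LR
model.train()
for epoch in range(3):  # assumed epoch count
    for batch, labels in loader:
        optimizer.zero_grad()
        out = model(**batch, labels=labels)  # cross-entropy loss built in
        out.loss.backward()
        optimizer.step()
```

For English, the abstract mentions combining two fine-tuned BERT models; a common way to do this is to average the two models' class probabilities at prediction time, though the paper's exact combination method is not specified in the excerpt above.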

Cited by 8 publications (2 citation statements). References 20 publications.
“…As a result, many studies employed pre-trained multilingual word embeddings such as FastText (Bigoulaeva, Hangya & Fraser, 2021), MUSE (Pamungkas & Patti, 2019; Deshpande, Farris & Kumar, 2022; Aluru et al, 2020; Bigoulaeva, Hangya & Fraser, 2021), or LASER (Deshpande, Farris & Kumar, 2022; Aluru et al, 2020; Pelicon et al, 2021a; Vitiugin, Senarath & Purohit, 2021). Moreover, most research has focused on the use of pre-trained language models, essentially as classifiers: BERT (Vashistha & Zubiaga, 2021; Zahra El-Alami, Ouatik El Alaoui & En Nahnahi, 2022; Zia et al, 2022; Pamungkas, Basile & Patti, 2021a), AraBERT for Arabic data (Zahra El-Alami, Ouatik El Alaoui & En Nahnahi, 2022), CseBERT for English, Croatian, and Slovenian data (Pelicon et al, 2021b), multilingual BERT models (Shi et al, 2022; Bhatia et al, 2021; Deshpande, Farris & Kumar, 2022; Aluru et al, 2020; Zahra El-Alami, Ouatik El Alaoui & En Nahnahi, 2022; De la Peña Sarracén & Rosso, 2022; Tita & Zubiaga, 2021; Eronen et al, 2022; Ranasinghe & Zampieri, 2021a; Ghadery & Moens, 2020; Pelicon et al, 2021b; Awal et al, 2024; Montariol, Riabi & Seddah, 2022; Ahn et al, 2020a; Bigoulaeva et al, 2022, 2023; Pamungkas, Basile & Patti, 2021a; Pelicon et al, 2021a), the DistilmBERT model (Vitiugin, Senarath & Purohit, 2021), and RoBERTa (Zia et al, 2022).…”
Section: Approaches on Multilingual Hate Speech Detection (mentioning)
confidence: 99%
“…While there are a few studies published on languages such as Arabic [29] and Greek [35], most studies and datasets created thus far have focused on English. Data augmentation [15] and multilingual word embeddings [31] have been applied to take advantage of existing English datasets and improve the performance of systems dealing with languages other than English. To the best of our knowledge, however, state-of-the-art cross-lingual contextual embeddings such as XLM-R [11] have not yet been applied to offensive language identification.…”
Section: Introduction (mentioning)
confidence: 99%
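The statement above points to cross-lingual contextual embeddings such as XLM-R as a way to transfer from English data to other languages. A minimal sketch of that idea, assuming the HuggingFace transformers library and the public xlm-roberta-base checkpoint; the fine-tuning step is omitted, and the example input is a placeholder, not data from any of the cited papers.

```python
# Minimal sketch: zero-shot cross-lingual transfer with XLM-R.
# Fine-tune on English offensive-language data (omitted here), then
# apply the same weights to text in another language unchanged.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
clf = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)  # 0 = NOT offensive, 1 = OFF

# Without fine-tuning, the classification head is randomly initialized,
# so this prediction is meaningless; it only shows the inference path.
batch = tok("bir örnek tweet", return_tensors="pt")  # placeholder Turkish text
with torch.no_grad():
    pred = clf(**batch).logits.argmax(dim=-1).item()
print("OFF" if pred == 1 else "NOT")
```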