2021
DOI: 10.1145/3457610

Multilingual Offensive Language Identification for Low-resource Languages

Abstract: Offensive content is pervasive in social media and a reason for concern to companies and government organizations. Several studies have been recently published investigating methods to detect the various forms of such content (e.g., hate speech, cyberbullying, and cyberaggression). The clear majority of these studies deal with English partially because most annotated datasets available contain English data. In this article, we take advantage of available English datasets by applying cross-lingual contextual wo…


Cited by 40 publications (18 citation statements). References 36 publications.
“…Ranasinghe et al. [16,17] showed the effectiveness of cross-lingual transfer in offensive language identification in Hindi, Spanish, Danish, Greek, and Bengali. Their work showed that multilingual transformer models like mBERT and XLM-R can use the knowledge gained from higher-resource languages to achieve improved performance on a low-resource target.…”
Section: Abusive Language Detection
confidence: 99%
“…To get around this problem, it has been shown that with cross-lingual transfer, the performance on low-resource languages can be improved by leveraging knowledge from other higher-resource languages. This has also been demonstrated to be an effective technique for improving offensive content detection in low-resource languages by using cross-lingual word embeddings and multilingual transformer models [16,17,18,19].…”
Section: Introduction
confidence: 99%
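The cross-lingual word-embedding route mentioned in the statement above can be illustrated with a toy sketch: translation pairs from two languages share (nearly) the same vector in an aligned space, a classifier is trained only on labeled source-language (e.g., English) examples, and is then applied unchanged to target-language text. The tiny vocabulary, vectors, and perceptron below are purely illustrative assumptions, not the setup used in the cited work.

```python
import numpy as np

# Toy "aligned" embedding space: translation pairs get near-identical
# vectors, which is what cross-lingual alignment methods aim to produce.
EMB = {
    # English (source language, labeled data available)
    "idiot":  np.array([0.9, 0.1]),
    "stupid": np.array([0.8, 0.2]),
    "hello":  np.array([0.1, 0.9]),
    "friend": np.array([0.2, 0.8]),
    # Target language (no labels) - aligned to its translations
    "bewakoof": np.array([0.85, 0.15]),  # ~ "stupid/idiot"
    "dost":     np.array([0.15, 0.85]),  # ~ "friend"
}

def embed(sentence):
    """Average word vectors; out-of-vocabulary words are skipped."""
    vecs = [EMB[w] for w in sentence.split() if w in EMB]
    return np.mean(vecs, axis=0)

# Train a minimal perceptron on ENGLISH examples only (1 = offensive).
train = [("idiot stupid", 1), ("hello friend", 0), ("stupid", 1), ("friend", 0)]
w, b = np.zeros(2), 0.0
for _ in range(20):
    for text, y in train:
        x = embed(text)
        pred = 1 if x @ w + b > 0 else 0
        w += (y - pred) * x
        b += (y - pred)

def predict(text):
    return 1 if embed(text) @ w + b > 0 else 0

# Zero-shot transfer: classify TARGET-language words never seen in training.
print(predict("bewakoof"))  # -> 1 (offensive)
print(predict("dost"))      # -> 0 (benign)
```

Because the target-language vectors sit next to their English translations in the shared space, the English-trained decision boundary carries over; multilingual transformers like mBERT and XLM-R achieve the same effect with a shared subword vocabulary and joint pretraining instead of explicit alignment.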
“…1. sarcastic adaptation of the French madame; the word suggests the person does not deserve the title of lady, madam
2. a loud lower-class woman who is unrefined
3. slang specifically used to address women directly, similar to lady, but it implies the woman is lower class
4. it suggests that the woman addressed is older, unattractive and unrefined
5. of one of the politicians
6. the polite second person singular or plural
7. the polite second person singular or plural, or plain form of second person plural

Multilingual BERT: As opposed to the BOW and TF-IDF word representations, which do not contain any information about the context, sentence representations as given by modern transformer networks (Reimers and Gurevych, 2019) offer richer semantic information and have been successfully used in low-resource scenarios (Ranasinghe and Zampieri, 2021). As such, we use Sentence Transformer (Reimers and Gurevych, 2019) to extract embeddings from BERT-based models.…”
Section: Text Representation
confidence: 99%
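The contrast drawn in the statement above — BOW/TF-IDF features carry no contextual information, unlike transformer sentence embeddings — can be demonstrated with a few lines of standard-library Python (a minimal sketch, not code from the cited papers): two sentences with opposite meanings but the same words produce identical bag-of-words representations.

```python
from collections import Counter

def bow(sentence):
    """Bag-of-words representation: word counts only, order discarded."""
    return Counter(sentence.lower().split())

# Word order (and hence context) is invisible to BOW/TF-IDF features:
a = bow("the dog bit the man")
b = bow("the man bit the dog")
print(a == b)  # True - identical features despite opposite meanings
```

A Sentence Transformer embedding, by contrast, would assign these two sentences different vectors, which is why such representations are preferred in the low-resource scenarios discussed here.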
“…Substantial experiments by Fortuna et al. [34] showed that training with one dataset and testing with another can decrease performance by over 30%. Many potential factors, such as dataset size and annotation quality, can be seen as obstacles to generalisability [35,36,37,38]. However, little is known about their effects.…”
Section: Reliability Of Data Sets
confidence: 99%