Proceedings of the 13th International Workshop on Semantic Evaluation 2019
DOI: 10.18653/v1/s19-2087

The binary trio at SemEval-2019 Task 5: Multitarget Hate Speech Detection in Tweets

Abstract: The massive growth of user-generated web content through blogs, online forums and, most notably, social media networks has led to a widespread diffusion of hateful and abusive messages that need to be moderated. This paper proposes a supervised approach to detecting hate speech against immigrants and women in English tweets. Several models were developed, ranging from feature-engineering approaches to neural ones. We also carried out a detailed error analysis to show the main causes of misclassification.
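The abstract describes the approach without implementation details; as an illustration only, a minimal sketch of a feature-engineering baseline of the kind mentioned (TF-IDF n-grams with a linear classifier) is given below. The library choice, hyperparameters, and data handling are assumptions for this sketch, not the authors' actual configuration.

```python
# Minimal sketch of a feature-engineering baseline for binary hate speech
# detection (hateful vs. not hateful). Assumes scikit-learn is available;
# column names, hyperparameters, and the split are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score


def build_baseline():
    # Character n-grams within word boundaries are a common choice for tweets,
    # since they are robust to spelling variation and obfuscated slurs.
    return Pipeline([
        ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5),
                                  min_df=2, sublinear_tf=True)),
        ("clf", LinearSVC(C=1.0)),
    ])


def train_and_evaluate(tweets, labels):
    # tweets: list[str]; labels: list[int] with 1 = hateful, 0 = not hateful
    X_train, X_test, y_train, y_test = train_test_split(
        tweets, labels, test_size=0.2, random_state=42, stratify=labels)
    model = build_baseline()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    # SemEval-2019 Task 5 subtask A is evaluated with macro-averaged F1.
    return f1_score(y_test, preds, average="macro")
```

A neural counterpart would typically replace the TF-IDF features with word embeddings fed to a recurrent or transformer encoder, as discussed in the citing work below.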

Cited by 14 publications (16 citation statements)
References 13 publications
“…With the development of large pre-trained transformer models such as BERT and XLNet (Devlin et al., 2019; Yang et al., 2019), several studies have explored the use of general pre-trained transformers in offensive language identification (Liu et al., 2019; Bucur et al., 2021), as well as models retrained or fine-tuned on offensive language corpora, such as HateBERT (Caselli et al., 2020). While the vast majority of studies address offensive language identification using English data (Yao et al., 2019; Ridenhour et al., 2020), several recent studies have created new datasets for various languages and applied computational models to identify such content in Arabic (Mubarak et al., 2021), Dutch (Tulkens et al., 2016), French (Chiril et al., 2019), German (Wiegand et al., 2018), Greek (Pitenis et al., 2020), Hindi (Bohra et al., 2018), Italian (Poletto et al., 2017), Portuguese (Fortuna et al., 2019), Slovene (Fišer et al., 2017), Spanish (Plaza-del Arco et al., 2021), and Turkish (Çöltekin, 2020). A recent trend is the use of pre-trained multilingual models such as XLM-R (Conneau et al., 2019) to leverage available English resources to make predictions in languages with fewer resources (Plaza-del Arco et al., 2021; Zampieri, 2020, 2021c,b; Sai and Sharma, 2021).…”
Section: Related Work
confidence: 99%
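For context on the transformer-based approaches this citing passage describes, the sketch below fine-tunes a general pre-trained model for binary offensive language identification using the Hugging Face transformers Trainer API; the model name, toy data, and hyperparameters are assumptions for illustration and are not taken from any of the cited papers.

```python
# Illustrative sketch only: fine-tuning a pre-trained transformer for binary
# offensive language identification. Model name, hyperparameters, and the toy
# dataset below are assumptions, not a configuration from the cited papers.
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)


class TweetDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        # Tokenize once up front; padding to a fixed length keeps batching simple.
        self.enc = tokenizer(texts, truncation=True, padding="max_length",
                             max_length=max_len)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item


model_name = "bert-base-uncased"  # could be swapped for a retrained model such as HateBERT
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy data standing in for an annotated offensive language corpus.
train_ds = TweetDataset(["example offensive tweet", "example neutral tweet"],
                        [1, 0], tokenizer)

args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```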
“…Even though thousands of languages and dialects are widely used in social media, most studies on the automatic identification of such content consider English only, a language for which datasets and other resources such as pre-trained models exist (Rosenthal et al., 2021). In the past few years, researchers have studied this problem in languages such as Arabic (Mubarak et al., 2021), French (Chiril et al., 2019), and Turkish (Çöltekin, 2020), to name a few. In doing so, they have created new datasets for each of these languages.…”
Section: Introduction
confidence: 99%
“…In terms of languages, the majority of studies on this topic deal with English (Malmasi and Zampieri, 2017; Yao et al., 2019; Ridenhour et al., 2020; Rosenthal et al., 2020) due to the wide availability of language resources such as corpora and pre-trained models. In recent years, several studies have been published on identifying offensive content in other languages such as Arabic (Mubarak et al., 2020), Dutch (Tulkens et al., 2016), French (Chiril et al., 2019), Greek (Pitenis et al., 2020), Italian (Poletto et al., 2017), Portuguese (Fortuna et al., 2019), and Turkish (Çöltekin, 2020). Most of these studies have created new datasets and resources for these languages, opening avenues for multilingual models such as those presented in Ranasinghe and .…”
Section: Related Work
confidence: 99%
“…The dataset contained a 3-class classification problem (hate speech, offensive, or neither), a targeted community, as well as the spans that make the text hateful or offensive. Furthermore, offensive language datasets have been annotated in other languages such as Arabic (Mubarak et al., 2017), Danish (Sigurbergsson and Derczynski, 2020), Dutch (Tulkens et al., 2016), French (Chiril et al., 2019), Greek (Pitenis et al., 2020), Portuguese (Fortuna et al., 2019), Spanish (Basile et al., 2019b), and Turkish (Çöltekin, 2020).…”
Section: Related Work
confidence: 99%