Proceedings of the Fourth Workshop on Online Abuse and Harms 2020
DOI: 10.18653/v1/2020.alw-1.5

HurtBERT: Incorporating Lexical Features with BERT for the Detection of Abusive Language

Abstract: The detection of abusive or offensive remarks in social texts has received significant attention in research. In several related shared tasks, BERT has been shown to be the state-of-the-art. In this paper, we propose to utilize lexical features derived from a hate lexicon towards improving the performance of BERT in such tasks. We explore different ways to utilize the lexical features in the form of lexicon-based encodings at the sentence level or embeddings at the word level. We provide an extensive dataset ev…
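The abstract names two fusion strategies: lexicon-based encodings at the sentence level and embeddings at the word level. Below is a minimal sketch of the sentence-level variant, assuming PyTorch and the Hugging Face transformers library; the class name HurtBertStyleClassifier, the concatenation-based fusion, the 256-unit hidden layer, and the 17-dimensional lexicon feature vector are illustrative assumptions, not the authors' exact architecture.

```python
# Illustrative sketch (not the authors' exact code): fuse a BERT sentence
# representation with a lexicon-derived feature vector by concatenation.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class HurtBertStyleClassifier(nn.Module):  # hypothetical class name
    def __init__(self, lexicon_dim: int, num_labels: int = 2,
                 model_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size
        # The classifier sees BERT's [CLS] vector concatenated with the
        # sentence-level lexicon encoding (e.g. counts per lexicon category).
        self.classifier = nn.Sequential(
            nn.Linear(hidden + lexicon_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_labels),
        )

    def forward(self, input_ids, attention_mask, lexicon_feats):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]            # [CLS] token representation
        fused = torch.cat([cls, lexicon_feats], dim=-1)
        return self.classifier(fused)

# Usage sketch: lexicon_feats would be a per-sentence vector of lexicon
# category counts; zeros are used here purely as a placeholder.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok(["example input text"], return_tensors="pt", padding=True)
model = HurtBertStyleClassifier(lexicon_dim=17)      # 17 is an assumed category count
logits = model(enc["input_ids"], enc["attention_mask"], torch.zeros(1, 17))
```

The word-level alternative mentioned in the abstract would instead attach a lexicon embedding to each token before (or alongside) the contextual representation; the sketch above covers only the sentence-level case.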

Cited by 56 publications (42 citation statements)
References 26 publications
“…Building upon BERT, a handful of recent studies suggest that additional hate-specific knowledge from outside the fine-tuning dataset might help with generalisation. Such knowledge can come from further masked language modelling pre-training on an abusive corpus (Caselli et al., 2021), or features from a hate speech lexicon (Koufakou et al., 2020).…”
Section: Generalisation Studies in Hate Speech Detection
confidence: 99%
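The first route this citation describes, continued masked language modelling on an abusive corpus (as in Caselli et al., 2021), can be sketched with the Hugging Face Trainer API. The corpus file name and all hyperparameters below are placeholders, not the setup reported in the cited work.

```python
# Hedged sketch of domain-adaptive MLM pre-training on an abusive corpus.
# "abusive_corpus.txt" and the hyperparameters are assumptions for illustration.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# One plain-text post per line in a local file (hypothetical).
ds = load_dataset("text", data_files={"train": "abusive_corpus.txt"})["train"]
ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=128),
            batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm=True,
                                           mlm_probability=0.15)
args = TrainingArguments(output_dir="mlm-abuse", num_train_epochs=1,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=ds,
        data_collator=collator).train()
# The adapted checkpoint is then fine-tuned on the abuse-detection dataset as usual.
```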
“…Some other studies adopted several neural-based models, including convolutional neural networks (CNN) [75,141], long short-term memory (LSTM) [8,75,92,94,145], bidirectional LSTM (Bi-LSTM) [115], and gated recurrent units (GRU) [27]. The most recent works focus more on investigating the transferability or generalizability of state-of-the-art transformer-based models such as Bidirectional Encoder Representations from Transformers (BERT) [19,48,66,79,83,90,92,134] and its variants such as RoBERTa [48] in the cross-domain abusive language detection task.…”
Section: Models
confidence: 99%
“…Transformer based: Infused specific hateful lexicon called HurtLex into BERT model to transfer knowledge across domains. [66] Multiple models: Besides experimenting with a wide coverage of models including traditional (linear SVM), (LSTM), and (BERT), they also exploited HurtLex as domain-independent features for knowledge transfer between domains. [92] Neural based: Experimented with augmenting all training data from different domains, resulting in the performance improvement of the models based on BERT and RoBERTa representation.…”
Section: Models
confidence: 99%
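The HurtLex-as-features idea referenced above amounts to counting lexicon hits per category for each input and feeding those counts to the classifier alongside the contextual model. A toy sketch follows; the word-to-category mapping and the category codes are assumptions for illustration, and the real HurtLex resource is distributed as a multi-field TSV rather than a flat dictionary.

```python
# Hedged sketch: turn a HurtLex-style lexicon into sentence-level features.
# The toy LEXICON entries and CATEGORIES subset are illustrative assumptions.
from collections import Counter

CATEGORIES = ["ps", "rci", "asf", "om", "qas"]           # illustrative subset
LEXICON = {"idiot": "ps", "scum": "ps", "vermin": "om"}  # toy entries only

def lexicon_features(text: str) -> list[float]:
    """Count lexicon hits per category for one sentence (domain-independent)."""
    counts = Counter(LEXICON[w] for w in text.lower().split() if w in LEXICON)
    return [float(counts[c]) for c in CATEGORIES]

print(lexicon_features("You absolute idiot and scum"))   # [2.0, 0.0, 0.0, 0.0, 0.0]
```

Because these counts depend only on the lexicon and not on any training corpus, they act as the domain-independent signal described in the cited survey passage.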