Overview of the HASOC Track at FIRE 2020: Hate Speech and Offensive Language Identification in Tamil, Malayalam, Hindi, English and German

Mandl, Thomas; Modha, Sandip; M, Anand Kumar; Chakravarthi, Bharathi Raja

doi:10.1145/3441501.3441517

Cited by 103 publications

(78 citation statements)

References 36 publications


“…Initially, a set of words was used to collect tweets, and then some keywords that were not frequent in offensive content were excluded during the trial annotation. Similarly, for the dataset of the HASOC track [18], the data were acquired using hashtags and keywords with offensive content. Here, we aim to use a keyword-based technique to evaluate our keyword extraction method by analyzing how these keywords can influence the detection of offensive language.…”

Section: Abstract In Offensive Language Detection (mentioning)

confidence: 99%

Offensive keyword extraction based on the attention mechanism of BERT and the eigenvector centrality using a graph representation

Sarracén

Rosso

2021

Pers Ubiquit Comput


The proliferation of harmful content on social media affects a large part of the user community. Therefore, several approaches have emerged to control this phenomenon automatically. However, this is still a quite challenging task. In this paper, we explore the offensive language as a particular case of harmful content and focus our study in the analysis of keywords in available datasets composed of offensive tweets. Thus, we aim to identify relevant words in those datasets and analyze how they can affect model learning. For keyword extraction, we propose an unsupervised hybrid approach which combines the multi-head self-attention of BERT and a reasoning on a word graph. The attention mechanism allows to capture relationships among words in a context, while a language model is learned. Then, the relationships are used to generate a graph from what we identify the most relevant words by using the eigenvector centrality. Experiments were performed by means of two mechanisms. On the one hand, we used an information retrieval system to evaluate the impact of the keywords in recovering offensive tweets from a dataset. On the other hand, we evaluated a keyword-based model for offensive language detection. Results highlight some points to consider when training models with available datasets.



“…In terms of computational methods, recent work has employed deep neural models such as convolutional neural networks (CNNs) and long short-term memory (LSTM). With the introduction of transformer-based models, most notably BERT [23], neural transformer models [24] have been widely applied in offensive language identification, topping the leaderboards of competitions such as HatEval [3], HASOC [25], OffensEval [2], and TRAC [18].…”

Section: Related Work (mentioning)

confidence: 99%

“…The HASOC shared task, which stands for "hate speech and offensive content identification", in Indo-European Languages is arguably the most well-known series of competitions including languages from India [25,31]. It has been organized in 2019 and 2020 at the Forum for Information Retrieval (FIRE).…”

Section: Offensive Language Identification In Languages From India (mentioning)

confidence: 99%

An Evaluation of Multilingual Offensive Language Identification Methods for the Languages of India

Ranasinghe

Zampieri

2021

Information


The pervasiveness of offensive content in social media has become an important reason for concern for online platforms. With the aim of improving online safety, a large number of studies applying computational models to identify such content have been published in the last few years, with promising results. The majority of these studies, however, deal with high-resource languages such as English due to the availability of datasets in these languages. Recent work has addressed offensive language identification from a low-resource perspective, exploring data augmentation strategies and trying to take advantage of existing multilingual pretrained models to cope with data scarcity in low-resource scenarios. In this work, we revisit the problem of low-resource offensive language identification by evaluating the performance of multilingual transformers in offensive language identification for languages spoken in India. We investigate languages from different families such as Indo-Aryan (e.g., Bengali, Hindi, and Urdu) and Dravidian (e.g., Tamil, Malayalam, and Kannada), creating important new technology for these languages. The results show that multilingual offensive language identification models perform better than monolingual models and that cross-lingual transformers show strong zero-shot and few-shot performance across languages.


“…Prior work has either designed methods for identifying conversations that are likely to go awry (Zhang […] Chang et al, 2020) or detecting offensive content and labelling posts at the instances level - this has been the focus in the recent shared tasks like HASOC at FIRE 2019 (Mandl et al, 2019a) and FIRE 2020 (Mandl et al, 2020), GermEval 2019 Task 2 (Struß et al, 2019), TRAC (Kumar et al, 2018), HatEval (Basile et al, 2019a), OffensEval at SemEval-2019 (Zampieri et al, 2019b) and SemEval-2020.…”

Section: Introduction (mentioning)

confidence: 99%

WLV-RIT at SemEval-2021 Task 5: A Neural Transformer Framework for Detecting Toxic Spans

Ranasinghe

Sarkar

Zampieri

et al. 2021

Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)


In recent years, the widespread use of social media has led to an increase in the generation of toxic and offensive content on online platforms. In response, social media platforms have worked on developing automatic detection methods and employing human moderators to cope with this deluge of offensive content. While various state-of-the-art statistical models have been applied to detect toxic posts, there are only a few studies that focus on detecting the words or expressions that make a post offensive. This motivates the organization of the SemEval-2021 Task 5: Toxic Spans Detection competition, which has provided participants with a dataset containing toxic spans annotation in English posts. In this paper, we present the WLV-RIT entry for the SemEval-2021 Task 5. Our best performing neural transformer model achieves an 0.68 F1-Score. Furthermore, we develop an open-source framework for multilingual detection of offensive spans, i.e., MUDES, based on neural transformers that detect toxic spans in texts.

