Hateful Speech Detection in Public Facebook Pages for the Bengali Language

Ishmam, Alvi Md.; Sharmin, Sadia

doi:10.1109/icmla.2019.00104

Cited by 61 publications

(24 citation statements)

References 10 publications

“…Nevertheless, with the advances in multilingual parsers and deep learning technology, together with increasing pressures from policy-makers to handle hate speech issues at local resources, non-English HS detection toolkits have seen a steady increase. The figure indicates that about 51% of all works in this field are performed on English dataset, with an increase of proportion of other languages as well where Arabic (13%) [93,59,12,143], Turkish (6%) [143,104], Greek (4%) [143,6,136], Danish (5%) [106,143], Hindi (4%) [121,22,88], German (4%) [72,120], Malayalam (3%) [130,109], Tamil (3%) [130,20], Chinese (1%) [138,139,155], Italian (2%) [116], Urdu (1%) [126,95,7], Russian (1%) [17], Bengali (1%) [62,127,69], Korean (1%) [91], French (1%) [16,102,50], Indonesian (1%) [14], Portuguese (1%) [14], Spanish (1%) [56] and Polish (1%) [118] seem to dominate the rest of the languages in this field.…”

Section: Statistical Trends Of Results (mentioning)

confidence: 99%

A systematic review of Hate Speech automatic detection using Natural Language Processing

Saroar, Oussalah

2021

Preprint

With the multiplication of social media platforms, which offer anonymity, easy access and online community formation and online debate, the issue of hate speech detection and tracking becomes a growing challenge to society, individual, policy-makers and researchers. Despite efforts for leveraging automatic techniques for automatic detection and monitoring, their performances are still far from satisfactory, which constantly calls for future research on the issue. This paper provides a systematic review of literature in this field, with a focus on natural language processing and deep learning technologies, highlighting the terminology, processing pipeline, core methods employed, with a focal point on deep learning architecture. From a methodological perspective, we adopt PRISMA guideline of systematic review of the last 10 years literature from ACM Digital Library and Google Scholar. In the sequel, existing surveys, limitations, and future research directions are extensively discussed.

“…The main challenge is the lack of sufficient data. To the best of our knowledge, many of the datasets were around 5000 corpora [5], [8] and [12]. There was a publicly available corpus containing around 10000 corpora, which were annotated into five different classes [2].…”

Section: Literature Review (mentioning)

confidence: 99%

Hate Speech Detection in the Bengali Language: A Dataset and Its Baseline Evaluation

Romim, Ahmed, Talukder et al.

2021

Algorithms for Intelligent Systems

Social media sites such as YouTube and Facebook have become an integral part of everyone's life and in the last few years, hate speech in the social media comment section has increased rapidly. Detection of hate speech on social media websites faces a variety of challenges including small imbalanced data sets, the finding of an appropriate model and also the choice of feature analysis method. Furthermore, this problem is more severe for the Bengali speaking community due to the lack of gold standard labelled datasets. This paper presents a new dataset of 30,000 user comments tagged by crowdsourcing and verified by expert. All the user comments were collected from YouTube and Facebook comment sections and classified into seven categories: sports, entertainment, religion, politics, crime, celebrity, and TikTok & meme. A total of 50 annotators annotated each comment three times, and the majority vote was taken as the final annotation. Nevertheless, we have conducted baseline experiments and several deep learning models along with extensive pretrained Bengali word embedding such as Word2Vec, FastText, and BengFastText on this dataset to facilitate future research opportunities. The experiment illustrated that although all the deep learning models performed well, SVM achieved the best result with 87.5% accuracy. Our core contribution is to make this benchmark dataset available and accessible to facilitate further research in the field of Bengali hate speech detection.

“…In Bengali, several works investigated the presence of abusive language in social media data by leveraging supervised ML classifiers and labeled data (Ishmam & Sharmin, 2019;Banik & Rahman, 2019). Sazzed (2021) annotated 3,000 transliterated Bengali comments into two classes, abusive and non-abusive, 1,500 comments for each.…”

Section: Related Work (mentioning)

confidence: 99%

Identifying vulgarity in Bengali social media textual content

Sazzed

2021

PeerJ Computer Science

The presence of abusive and vulgar language in social media has become an issue of increasing concern in recent years. However, research pertaining to the prevalence and identification of vulgar language has remained largely unexplored in low-resource languages such as Bengali. In this paper, we provide the first comprehensive analysis on the presence of vulgarity in Bengali social media content. We develop two benchmark corpora consisting of 7,245 reviews collected from YouTube and manually annotate them into vulgar and non-vulgar categories. The manual annotation reveals the ubiquity of vulgar and swear words in Bengali social media content (i.e., in two corpora), ranging from 20% to 34%. To automatically identify vulgarity, we employ various approaches, such as classical machine learning (CML) classifiers, Stochastic Gradient Descent (SGD) optimizer, a deep learning (DL) based architecture, and lexicon-based methods. Although small in size, we find that the swear/vulgar lexicon is effective at identifying the vulgar language due to the high presence of some swear terms in Bengali social media. We observe that the performances of machine learning (ML) classifiers are affected by the class distribution of the dataset. The DL-based BiLSTM (Bidirectional Long Short Term Memory) model yields the highest recall scores for identifying vulgarity in both datasets (i.e., in both original and class-balanced settings). Besides, the analysis reveals that vulgarity is highly correlated with negative sentiment in social media comments.
