Hate Speech Detection in the Bengali Language: A Dataset and Its Baseline Evaluation

Romim, Nauros; Ahmed, Mosahed; Talukder, Hriteshwar; Islam, Md. Saiful

doi:10.1007/978-981-16-0586-4_37

Cited by 63 publications

(30 citation statements)

References 16 publications


“…As reported, the proposed CNN + BiLSTM + CNN model frequently outperforms baseline (Das et al, 2021). In another dataset of Bengali hate speech detection (Romim et al, 2021), the fusion model with self-attention CNN + attn. + BiLSTM + CNN outperforms all the previous DNN and ML implementations, as evident from Table 4.…”

Section: GLUE Benchmark With Artificial Data Scarcity (mentioning)

confidence: 51%

Alternative non-BERT model choices for the textual classification in low-resource languages and environments

Maheen, Faisal, Karim

2022

Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing


Natural Language Processing (NLP) tasks in non-dominant and low-resource languages have not experienced significant progress. Although pre-trained BERT models are available, GPU dependency, large memory requirements, and data scarcity often limit their applicability. As a solution, this paper proposes a fusion chain architecture comprising one or more layers of CNN, LSTM, and BiLSTM and identifies the precise configuration and chain length. The study shows that a simpler, CPU-trainable non-BERT fusion CNN + BiLSTM + CNN is sufficient to surpass the textual classification performance of the BERT-related models in resource-limited languages and environments. The fusion architecture competitively approaches the state-of-the-art accuracy in several Bengali NLP tasks and a six-class emotion detection task for a newly developed Bengali dataset. Interestingly, the performance of the identified fusion model, for instance, CNN + BiLSTM + CNN, also holds for other low-resource languages and environments. An efficacy study shows that the CNN + BiLSTM + CNN model outperforms the BERT implementation for Vietnamese and performs almost equally in English NLP tasks experiencing artificial data scarcity. For the GLUE benchmark and other datasets such as Emotion, IMDB, and Intent classification, the CNN + BiLSTM + CNN model often surpasses or competes with BERT-base, TinyBERT, DistilBERT, and mBERT. Besides, a position-sensitive self-attention layer further improves the fusion models' performance in the Bengali emotion classification. The models are also compressible to roughly 5× smaller through pruning and retraining, making them more viable for resource-constrained environments. Together, this study may help NLP practitioners and serve as a blueprint for NLP model choices in textual classification for low-resource languages and environments.

“…The study considers datasets across different languages and contexts for the efficacy demonstration of CNN + BiLSTM + CNN fusion. We developed a new Bengali corpus for 6-class emotion classification, as well as used other previously developed Bengali datasets for different NLP tasks-i) Sixclass emotion Bengali dataset (Das et al, 2021), ii) Hate Speech Bengali dataset (Romim et al, 2021), and iii) DeepHateExplainer Bengali dataset (Karim et al, 2020). As examples of non-Bengali languages that relate the low-resource contexts, we consider the Vietnamese (Ho et al, 2019) and Indonesian (Saputri et al, 2018) datasets.…”

Section: Datasets (mentioning)

confidence: 99%

Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing

2022


The lack of resources for languages in the Americas has proven to be a problem for the creation of digital systems such as machine translation, search engines, chat bots, and more. The scarceness of digital resources for a language causes a higher impact on populations where the language is spoken by millions of people. We introduce the first official large combined corpus for deep learning of an indigenous South American low-resource language spoken by millions called Quechua. Specifically, our curated corpus is created from text gathered from the southern region of Peru where a dialect of Quechua is spoken that has not traditionally been used for digital systems as a target dialect in the past. In order to make our work repeatable by others, we also offer a public, pre-trained BERT model called Qu-BERT which is the largest linguistic model ever trained for any Quechua type, not just the southern region dialect. We furthermore test our corpus and its corresponding BERT model on two major tasks: (1) named-entity recognition (NER) and (2) part-of-speech (POS) tagging by using state-of-the-art techniques where we achieve results comparable to other work on higher-resource languages. In this article, we describe the methodology, challenges, and results from the creation of QuBERT, which is on par with other state-of-the-art multilingual models for natural language processing, achieving between 71 and 74% F1 score on NER and 84-87% on POS tasks.


“…But, despite being the world's seventh most spoken language with 240 million native speakers [4], research on sarcasm detection in the Bengali language is unexplored and overlooked. Due to the limited resources and the scarcity of large-scale sarcasm data, identifying sarcasm from Bengali text is currently a difficult challenge for the researchers of NLP [5].…”

Section: Introduction (mentioning)

confidence: 99%

Ben-Sarc: A Corpus for Sarcasm Detection from Bengali Social Media Comments and Its Baseline Evaluation

Lora, M., Nazmin, et al.

2022

Preprint


Sarcasm detection research of the Bengali language so far can be considered to be narrow due to the unavailability of resources. In this paper, we introduce a large-scale self-annotated Bengali corpus for the sarcasm detection research problem in the Bengali language named 'Ben-Sarc', containing 25,636 comments, manually collected from different public Facebook pages and evaluated by external evaluators. Then we present a complete strategy to utilize different models of traditional machine learning, deep learning, and transfer learning to detect sarcasm from text using the Ben-Sarc corpus. Finally, we demonstrate a comparison between the performance of traditional machine learning, deep learning, and transfer learning models on our Ben-Sarc corpus. Transfer learning using Indic-Transformers Bengali BERT as a pre-trained source model has achieved the highest accuracy of 75.05%. The second-highest accuracy is obtained by the LSTM model with 72.48%, and Multinomial Naive Bayes achieves the third-highest with 72.36% accuracy, for deep learning and machine learning, respectively. The Ben-Sarc corpus is made publicly available in the hope of advancing the Bengali Natural Language Processing community.

