B-NER: A Novel Bangla Named Entity Recognition Dataset With Largest Entities and Its Baseline Evaluation

Haque, Md. Zahidul; Zaman, Sakib; Saurav, Jillur Rahman; Haque, Summit; Islam, Md Saiful; Amin, Mohammad Ruhul

doi:10.1109/access.2023.3267746

Cited by 4 publications

(3 citation statements)

References 44 publications


“…After closely examining the samples in the dataset, we found that some of them contained digits and numbers with no apparent semantic meaning. Phone numbers, currencies, and percentages are examples of numerical entities that are commonly identified and categorized using Named Entity Recognition (NER) tools [48]. However, there is currently no NER tool available for Chittagonian text.…”

Section: Removing English Digits

Citation type: mentioning

confidence: 99%

Exhaustive Study into Machine Learning and Deep Learning Methods for Multilingual Cyberbullying Detection in Bangla and Chittagonian Texts

Mahmud, Ptaszynski, Masui

2024

Electronics


Cyberbullying is a serious problem in online communication. It is important to find effective ways to detect cyberbullying content to make online environments safer. In this paper, we investigated the identification of cyberbullying contents from the Bangla and Chittagonian languages, which are both low-resource languages, with the latter being an extremely low-resource language. In the study, we used both traditional baseline machine learning methods, as well as a wide suite of deep learning methods especially focusing on hybrid networks and transformer-based multilingual models. For the data, we collected over 5000 both Bangla and Chittagonian text samples from social media. Krippendorff’s alpha and Cohen’s kappa were used to measure the reliability of the dataset annotations. Traditional machine learning methods used in this research achieved accuracies ranging from 0.63 to 0.711, with SVM emerging as the top performer. Furthermore, employing ensemble models such as Bagging with 0.70 accuracy, Boosting with 0.69 accuracy, and Voting with 0.72 accuracy yielded promising results. In contrast, deep learning models, notably CNN, achieved accuracies ranging from 0.69 to 0.811, thus outperforming traditional ML approaches, with CNN exhibiting the highest accuracy. We also proposed a series of hybrid network-based models, including BiLSTM+GRU with an accuracy of 0.799, CNN+LSTM with 0.801 accuracy, CNN+BiLSTM with 0.78 accuracy, and CNN+GRU with 0.804 accuracy. Notably, the most complex model, (CNN+LSTM)+BiLSTM, attained an accuracy of 0.82, thus showcasing the efficacy of hybrid architectures. Furthermore, we explored transformer-based models, such as XLM-Roberta with 0.841 accuracy, Bangla BERT with 0.822 accuracy, Multilingual BERT with 0.821 accuracy, BERT with 0.82 accuracy, and Bangla ELECTRA with 0.785 accuracy, which showed significantly enhanced accuracy levels. 
Our analysis demonstrates that deep learning methods can be highly effective in addressing the pervasive issue of cyberbullying in several different linguistic contexts. We show that transformer models can efficiently circumvent the language dependence problem that plagues conventional transfer learning methods. Our findings suggest that hybrid approaches and transformer-based embeddings can effectively tackle the problem of cyberbullying across online platforms.


“…It will also help relieve the strain on healthcare resources and the increased demand for medical consultations. Haque et al [96] proposed the unique dataset B-NER, the biggest fine-grained Bangla NER dataset by employing the BIO tagging approach which have been produced by using 22,144 sentences that have been directly annotated and gathered from Bangla newspapers and Bangla Wikipedia. There are 9,895 separate phrases in this dataset that have been manually classified into eight different categories, including organizations, events, people, time, artifacts, markers, geopolitical entities, geographic locations and natural phenomena.…”

Section: (C) Deep Learning Approach

Citation type: mentioning

confidence: 99%

State-of-art approach for Indian Language based on NER: Comprehensive Review

Pandey, Nathani

2024

Preprint


Named Entity Recognition (NER) is a fundamental task of natural language processing (NLP) that focuses on the identification and classification of named entities such as name of individual persons, location, organization and dates within the text. NER plays a pivotal role in various NLP applications, including information extraction, question answering, text summarization and sentiment analysis. Natural language processing's (NLPs) fundamental issue is named entity recognition (NER). While extensive research has been conducted on NER for English and Hindi, the complexities of Indian languages present unique challenges that require customized solutions. Working with NER for Indian languages is a difficult endeavor with limited resources available. This article provides a comprehensive review of NER approaches tailored for Indian languages. Indian languages pose unique challenges to NER due to their rich morphological and syntactic variations, script diversity and limited annotated data availability. This paper reviews the various techniques and methodologies employed in NER for Indian languages, including rule-based, machine learning and deep learning approaches. It analyzes the strengths and limitations of each approach. Additionally, this article examines the recent advancements in transfer learning and multilingual models, showcasing their potential in improving NER performance across Indian languages. This paper aims to guide researchers and practitioners in the development of NER systems for Indian languages and foster further advancements in this field. This article also provides a comprehensive review of the diverse approaches employed for NER in Indian languages, highlighting the strength and limitations as well.


“…Removing English digits: Upon thorough examination of the dataset samples, we observed the presence of digits and numbers that did not carry specific semantic meaning. In standard practice, named entity recognition (NER) [61] tools are employed to identify and categorize such numerical entities, such as phone numbers, percentages, and currencies. However, for the Chittagonian dialect, no NER tool is currently available.…”

Section: Data Preprocessing

Citation type: mentioning

confidence: 99%

Automatic Vulgar Word Extraction Method with Application to Vulgar Remark Detection in Chittagonian Dialect of Bangla

Mahmud, Ptaszynski, Masui

2023

Applied Sciences


The proliferation of the internet, especially on social media platforms, has amplified the prevalence of cyberbullying and harassment. Addressing this issue involves harnessing natural language processing (NLP) and machine learning (ML) techniques for the automatic detection of harmful content. However, these methods encounter challenges when applied to low-resource languages like the Chittagonian dialect of Bangla. This study compares two approaches for identifying offensive language containing vulgar remarks in Chittagonian. The first relies on basic keyword matching, while the second employs machine learning and deep learning techniques. The keyword-matching approach involves scanning the text for vulgar words using a predefined lexicon. Despite its simplicity, this method establishes a strong foundation for more sophisticated ML and deep learning approaches. An issue with this approach is the need for constant updates to the lexicon. To address this, we propose an automatic method for extracting vulgar words from linguistic data, achieving near-human performance and ensuring adaptability to evolving vulgar language. Insights from the keyword-matching method inform the optimization of machine learning and deep learning-based techniques. These methods initially train models to identify vulgar context using patterns and linguistic features from labeled datasets. Our dataset, comprising social media posts, comments, and forum discussions from Facebook, is thoroughly detailed for future reference in similar studies. The results indicate that while keyword matching provides reasonable results, it struggles to capture nuanced variations and phrases in specific vulgar contexts, rendering it less robust for practical use. This contradicts the assumption that vulgarity solely relies on specific vulgar words. In contrast, methods based on deep learning and machine learning excel in identifying deeper linguistic patterns. 
Comparing SimpleRNN models using Word2Vec and fastText embeddings, which achieved accuracies ranging from 0.84 to 0.90, logistic regression (LR) demonstrated remarkable accuracy at 0.91. This highlights a common issue with neural network-based algorithms, namely, that they typically require larger datasets for adequate generalization and competitive performance compared to conventional approaches like LR.

