2021
DOI: 10.48550/arxiv.2101.00204
Preprint

BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla

Cited by 7 publications (16 citation statements); references 0 publications.
“…Our experiment results can serve as benchmarks for future work. We hope the present study will encourage researchers to make use of said models for various tasks in Bangla NLP (as indicated in [28]), and serve as a stepping stone for future endeavors that will contribute to enriching BNLP research.…”
Section: Discussion (mentioning)
Confidence: 94%
“…The dataset has 1,313 parallel sentences, in which English sentences were collected from the Penn Treebank corpus. (7) Global Voices: The Global Voices corpus consists of the translations of spoken languages.…”
Section: Machine (mentioning)
Confidence: 99%
“…• We have built an annotation management system from scratch to annotate Bangla SA data. We have made both the annotation management system and the SentiGOLD dataset publicly available upon request. • To establish a benchmark, we have investigated different architectures and training methodologies on this dataset and achieved 0.62 macro F1 for 5 classes with BanglaBERT [4].…”
Section: Introduction (mentioning)
Confidence: 99%
“…We have made both the annotation management system and the SentiGOLD dataset publicly available upon request. • To establish a benchmark, we have investigated different architectures and training methodologies on this dataset and achieved 0.62 macro F1 for 5 classes with BanglaBERT [4]. • We employ cross-dataset testing to showcase the generalization capability of the proposed dataset.…”
Section: Introduction (mentioning)
Confidence: 99%
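The benchmark quoted above (0.62 macro F1 over 5 sentiment classes with BanglaBERT) corresponds to a standard fine-tune-and-evaluate loop. The sketch below shows one plausible version of that setup; the Hugging Face model ID csebuetnlp/banglabert, the toy example rows, and the training hyperparameters are illustrative assumptions, not the SentiGOLD authors' exact configuration (the dataset itself is stated to be available only upon request).

# Minimal sketch (assumptions flagged in comments): fine-tune a BanglaBERT
# checkpoint for 5-class sentiment classification and report macro F1,
# mirroring the kind of benchmark quoted in the excerpt above.
import numpy as np
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_ID = "csebuetnlp/banglabert"   # assumed Hugging Face ID for BanglaBERT
NUM_LABELS = 5                       # five sentiment classes, as in the excerpt

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=NUM_LABELS)

def tokenize(batch):
    # Truncate Bangla sentences to a fixed length so they batch cleanly.
    return tokenizer(batch["text"], truncation=True, max_length=128)

# Placeholder rows; a real run would load the request-only SentiGOLD splits.
train_ds = Dataset.from_dict(
    {"text": ["চমৎকার সেবা পেয়েছি", "খুবই খারাপ অভিজ্ঞতা"], "label": [4, 0]}
).map(tokenize, batched=True)
eval_ds = train_ds  # illustrative only; use a held-out split in practice

def compute_metrics(eval_pred):
    # Macro F1 averages per-class F1 scores, so minority classes count equally.
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return {"macro_f1": f1_score(eval_pred.label_ids, preds, average="macro")}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="banglabert-sentigold",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())   # includes eval_macro_f1 among the reported metrics

The cross-dataset testing mentioned in the excerpt would amount to swapping eval_ds for a split drawn from a different Bangla sentiment corpus while keeping the fine-tuned model fixed.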