2022
DOI: 10.1109/access.2022.3195236

Compilation, Analysis and Application of a Comprehensive Bangla Corpus KUMono

Abstract: Research in Natural Language Processing (NLP) and computational linguistics depends heavily on a good-quality, representative corpus of the language under study. Bangla is one of the most widely spoken languages in the world, but Bangla NLP research is still at an early stage of development because of the lack of a quality public corpus. This article describes the detailed compilation methodology of a comprehensive monolingual Bangla corpus, KUMono. The newly developed corpus consists of more than 350 million word tokens and more tha…

citations
Cited by 6 publications
(1 citation statement)
references
References 25 publications
(33 reference statements)
“…Lexical metrics, such as the Total Word Count and Unique Word Count, offer insights into the textual richness and the diversity of the vocabulary. These metrics are crucial for evaluating the potential of datasets to provide varied linguistic input necessary for training robust NLP models [ 39 41 ]. By examining the total count of words alongside the count of unique words, the assessment not only evaluates the volume of linguistic content available but also its variety, which is indicative of the potential complexity and nuance that NLP models must grapple with.…”
Section: Results (citation type: mentioning)
confidence: 99%
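
The two lexical metrics named in the citation statement above can be illustrated with a minimal sketch. It is not the method of the KUMono authors or of the citing paper: the function name `lexical_metrics`, the whitespace tokenization, and the set of punctuation stripped from token edges are all assumptions made for illustration. Whitespace splitting is used rather than a `\w+` regular expression because Bangla vowel signs are combining marks, which Python's `\w` does not match.

```python
import string

# Assumption: punctuation to strip from token edges. The Bangla danda ("।"),
# double danda and curly quotes are added by hand to the ASCII set.
EDGE_PUNCT = string.punctuation + "।॥“”‘’"


def lexical_metrics(text):
    """Return total and unique word counts for a whitespace-tokenized text."""
    # Split on whitespace, then trim punctuation from both ends of each token.
    tokens = [word.strip(EDGE_PUNCT) for word in text.split()]
    # Drop tokens that consisted only of punctuation.
    tokens = [token for token in tokens if token]
    return {
        "total_words": len(tokens),        # volume of linguistic content
        "unique_words": len(set(tokens)),  # vocabulary variety
    }


if __name__ == "__main__":
    sample = "আমি ভাত খাই। আমি বই পড়ি।"
    print(lexical_metrics(sample))
```

A real corpus pipeline would likely normalize Unicode and use a language-aware tokenizer before counting; this sketch only shows how the two counts relate to each other.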