2021
DOI: 10.7717/peerj-cs.412

Vector representation based on a supervised codebook for Nepali documents classification

Abstract: Document representation with outlier tokens exacerbates the classification performance due to the uncertain orientation of such tokens. Most existing document representation methods in different languages including Nepali mostly ignore the strategies to filter them out from documents before learning their representations. In this article, we propose a novel document representation method based on a supervised codebook to represent the Nepali documents, where our codebook contains only semantic tokens without o…

Cited by 14 publications (18 citation statements)
References 31 publications (42 reference statements)
“…First, most of the existing works [2, 4, 5, 12] on COVID-19-related tweets are performed in high-resource languages such as English and Arabic. The approach used by high-resource language might be inapplicable to low-resource languages such as Nepali, which is based on Devanagari script and has 36 consonants (33 are distinct consonants and 3 are combined consonants), 13 vowels, and 10 numerals (Figure 1) [1, 15, 16]. Second, their investigation mainly targets either clustering the tweets into various themes/topics or classifying their polarity into three classes (negative, positive, or neutral) using the well-established feature extraction methods such as BERT, Word2Vec, and Glove.…”
Section: Introduction (mentioning)
confidence: 99%
“…For example, semantic features based on the COVID-19-related tweets could learn more informative features. For this, we employ the probabilistic feature extraction approach as suggested by Sitaula et al. [1] recently, which calculates the probability of each input word across all categories and finally attains the feature vector depending on the number of categories present in the dataset. Last, with the help of the domain-agnostic method, we capture the semantic information using the cross-domain approach, which means that we transfer the knowledge to current COVID-19 domain from another domain such as news categories.…”
Section: Introduction (mentioning)
confidence: 99%
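
The probabilistic feature extraction described in the statement above can be illustrated with a minimal sketch. It assumes the per-word probability is estimated as P(category | word) from raw token counts in labelled documents, and that a document vector is the average of its words' probability vectors; neither detail is given in the quoted text, so both are assumptions. The function names `fit_token_category_probs` and `document_vector` are hypothetical.

```python
from collections import Counter, defaultdict


def fit_token_category_probs(docs, labels):
    """Estimate P(category | token) from token counts in labelled documents.

    docs   : list of token lists
    labels : list of category names, one per document
    """
    categories = sorted(set(labels))
    counts = defaultdict(Counter)  # token -> Counter of category occurrences
    for tokens, label in zip(docs, labels):
        for tok in tokens:
            counts[tok][label] += 1
    probs = {}
    for tok, per_cat in counts.items():
        total = sum(per_cat.values())
        # One probability per category, in the fixed order of `categories`.
        probs[tok] = [per_cat[c] / total for c in categories]
    return categories, probs


def document_vector(tokens, categories, probs):
    """Average the probability vectors of known tokens (assumed aggregation).

    The result has one dimension per category; unknown tokens are skipped.
    """
    vec = [0.0] * len(categories)
    known = [probs[t] for t in tokens if t in probs]
    if not known:
        return vec
    for p in known:
        for i, v in enumerate(p):
            vec[i] += v
    return [v / len(known) for v in vec]


# Toy data, invented purely for illustration.
docs = [["election", "vote"], ["match", "goal"], ["vote", "goal"]]
labels = ["politics", "sports", "sports"]
categories, probs = fit_token_category_probs(docs, labels)
vec = document_vector(["election", "vote", "unseen"], categories, probs)
```

With two categories in the toy data, `vec` has two dimensions, which matches the statement that the feature vector size depends on the number of categories in the dataset.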
“…During word representation learning, fastText considers not only the word itself but also groups of characters from that word and subword information such as character unigrams, bigrams, and trigrams [20]. However, GloVe and word2vec fail to provide any vector representation for words that are not in the model dictionary [21]. As a result, in this study, fastText is used as a word representation model.…”
Section: Proposed Methods (mentioning)
confidence: 99%
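
The subword idea in the statement above can be sketched as plain character n-gram extraction. This is an illustration only, not fastText's implementation: the helper `char_ngrams` is hypothetical, and details such as word-boundary markers or n-gram hashing are left out.

```python
def char_ngrams(word, min_n=1, max_n=3):
    """Return all character n-grams of `word` for n in [min_n, max_n].

    Characters are Python code points, so in Devanagari text a combining
    vowel sign counts as its own character.
    """
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(word) - n + 1):
            grams.append(word[i:i + n])
    return grams


grams = char_ngrams("cat")
```

Because an unseen word still shares n-grams with known words, a model that stores vectors for n-grams can compose a representation for it, which a word-level lookup table cannot do.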
“…Various research have also been conducted for the recognition of handwritten characters and texts from other languages such as recognizing Baybayin scripts using SVM (Sitaula, Basnet & Aryal, 2021), analyzing handwritten Hebrew document (Biller et al., 2016), recognizing English handwritings (Pham et al., 2020), and many more. For English handwritten digit and character recognition tasks, CNN-based architectures have yielded better performance than other techniques (Baldominos, Saez & Isasi, 2019; Ranzato et al., 2007; Cireşan et al., 2011), and so on.…”
Section: Related Work (mentioning)
confidence: 99%