2016 19th International Conference on Computer and Information Technology (ICCIT)
DOI: 10.1109/iccitechn.2016.7860236
Bengali word embeddings and it's application in solving document classification problem

Cited by 27 publications (10 citation statements)
References 7 publications
“…In future work, we would like to exploit the local contexts and topic structure of the n-grams to identify the dispersion, and would like to develop a supervised information-theoretic technique to efficiently aggregate the dispersed features in a preprocessing step to the supervised dimension reduction step. The local context and topic structure have been successfully exploited previously in text analysis in various ways [1,2,6,22,25,34]. Bringing awareness of these structures to information extraction would be a valuable addition to this pipeline, which uses the interdependence between the n-gram features and class-labels to obtain a low-dimensional discriminatory document representation.…”
Section: Discussion (confidence: 99%)
“…When p > n, or the explanatory variables are highly collinear, the sample covariance matrix Σ is singular. The precursor work [11] overcomes this rank deficiency with ridge regularization, adding a diagonal matrix sI_p to the sample covariance matrix in (1), where s is the ridge regularization parameter and I_p is the identity matrix. The precursor work [11] also uses a variant of the regularized SIR method; in particular, it uses the localized sliced inverse regression (LSIR) [33].…”
Section: Methods (confidence: 99%)
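The ridge fix described in the excerpt above can be sketched in a few lines of numpy; the dimensions and the value of s below are illustrative, not taken from the cited work:

```python
import numpy as np

# When p > n the sample covariance is rank-deficient; adding s * I_p
# (the ridge term from the excerpt) restores full rank and invertibility.
rng = np.random.default_rng(0)
n, p, s = 10, 50, 0.1                 # illustrative sizes: p > n
X = rng.standard_normal((n, p))

Xc = X - X.mean(axis=0)               # center the columns
Sigma = Xc.T @ Xc / n                 # sample covariance, rank <= n - 1 < p
Sigma_reg = Sigma + s * np.eye(p)     # ridge-regularized covariance

print(np.linalg.matrix_rank(Sigma))      # 9  (singular)
print(np.linalg.matrix_rank(Sigma_reg))  # 50 (full rank)
```

Since every eigenvalue of `Sigma_reg` is at least s, the matrix is invertible even though `Sigma` itself is not.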
“…In another study, Amin et al [21] performed sentiment analysis on Bengali comments, which were analyzed using the Word2Vec approach and achieved 75.5% in each of the two classes. Ahmad et al [22] used Word2Vec for the Bengali document classification problem using Support Vector Machine (SVM) and obtained an F1-score of almost 91%.…”
Section: Bengali Word Embedding (confidence: 99%)
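The Word2Vec-plus-SVM pipeline the excerpt attributes to Ahmad et al. [22] can be sketched as follows. Everything here is an illustrative stand-in: the toy word vectors replace trained Bengali Word2Vec embeddings, the two-class corpus is invented, and a tiny subgradient trainer replaces the off-the-shelf SVM the study would have used:

```python
import numpy as np

# Toy stand-ins for trained Bengali word vectors (words, 2-d vectors,
# labels, and documents below are illustrative, not from the paper).
emb = {
    "game":  np.array([ 1.0,  0.0]), "ball": np.array([ 1.0,  1.0]),
    "team":  np.array([ 1.0, -1.0]), "law":  np.array([-1.0,  0.0]),
    "judge": np.array([-1.0,  1.0]), "writ": np.array([-1.0, -1.0]),
}

def doc_vector(tokens):
    """Represent a document as the mean of its word vectors (OOV words skipped)."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

docs = [["game", "ball"], ["team", "game"], ["law", "judge"], ["writ", "law"]]
y = np.array([1, 1, -1, -1])          # +1 = sports, -1 = legal (toy labels)
X = np.stack([doc_vector(d) for d in docs])

# Minimal linear SVM fitted by full-batch subgradient descent on the
# regularized hinge loss; a stand-in for a standard SVM solver.
w, lam = np.zeros(2), 0.1
for t in range(1, 501):
    eta = 1.0 / (lam * t)             # decaying step size
    viol = y * (X @ w) < 1            # documents violating the margin
    grad = lam * w - (y[viol][:, None] * X[viol]).sum(axis=0) / len(y)
    w -= eta * grad

preds = np.sign(X @ w)
print((preds == y).mean())            # → 1.0 on this toy training set
```

Averaging word vectors discards word order but yields a fixed-length document representation, which is what makes a linear classifier such as an SVM directly applicable.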
“…Word Embedding tools, technologies and pre-trained models are widely available for resource-rich languages such as English (Mikolov et al., 2013; Pennington et al., 2014) and Chinese (Li et al., 2018; Chen et al., 2015). Due to the wide use of Word Embeddings, pre-trained models are increasingly available for resource-poor languages such as Portuguese (Hartmann et al., 2017), Arabic (Elrazzaz et al., 2017; Soliman et al., 2017), and Bengali (Ahmad and Amin, 2016).…”
Section: Related Work (confidence: 99%)