2020
DOI: 10.20944/preprints202012.0600.v1
Preprint
Towards Bengali Word Embedding: Corpus Creation, Intrinsic and Extrinsic Evaluations

Abstract: Distributional word vector representation, or word embedding, has become an essential ingredient in many natural language processing (NLP) tasks such as machine translation, document classification, information retrieval and question answering. Investigation of embedding models helps to reduce the feature space and improves textual semantic as well as syntactic relations. This paper presents three embedding techniques (Word2Vec, GloVe, and FastText) with different hyperparameters implemented on a Bengali cor…
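The distributional idea behind the embeddings in the abstract can be illustrated without any of the three libraries the paper uses: words that occur in similar contexts receive similar vectors. A minimal sketch, assuming a toy corpus and a window size of 2 (the paper's actual models are Word2Vec, GloVe, and FastText trained on a Bengali corpus; the corpus and helper names below are illustrative only):

```python
from collections import defaultdict
from math import sqrt

def cooccurrence_vectors(sentences, window=2):
    """Build sparse distributional vectors from co-occurrence counts in a window."""
    vecs = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vecs[w][tokens[j]] += 1
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cat and a dog played".split(),
]
vecs = cooccurrence_vectors(corpus)
# "cat" and "dog" share contexts, so their vectors are more similar
# than "cat" and "rug":
print(cosine(vecs["cat"], vecs["dog"]) > cosine(vecs["cat"], vecs["rug"]))
# prints: True
```

Intrinsic evaluation, as used in the paper, scores exactly this kind of pairwise similarity against human judgements; extrinsic evaluation instead feeds the vectors into a downstream task such as document classification.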

Cited by 11 publications (6 citation statements) · References 18 publications
“…The experiment was performed on 180 million Bengali words. Extrinsic performance provides better results than intrinsic performance, as discussed in a previous study [2]. A Bengali text document classifier was developed using GloVe embedding and a very deep convolution neural network (VDCNN).…”
Section: Introduction (mentioning)
confidence: 89%
“…One notable application of text classification is evaluating word vector representations trained on foreign languages, a domain where standardized intrinsic procedures are yet to be established. For instance, Hossain et al [117] utilized text classification to evaluate embeddings derived from a Bengali corpus encompassing 180 million word tokens. The authors developed their classification model using Convolutional Neural Network (CNN) architecture.…”
Section: Domain-knowledge Extrinsic Evaluation: Concepts and Model Ar... (mentioning)
confidence: 99%
“…The proposed authorship classification system is evaluated in three ways: embedding model evaluation, training phase evaluation and testing phase evaluation. Embedding model evaluation refers to judging the quality of feature vectors, which is an essential task for low-resource languages [60]. Intrinsic and extrinsic evaluations are used for evaluating the embedding model.…”
Section: A. Evaluation Measures (mentioning)
confidence: 99%
“…Various combinations of hyperparameters of three embedding techniques (GloVe, FastText and Word2Vec) generated 90 local contextual embedding models [18 for GloVe, 36 for FastText (Skip-gram and CBOW), 36 for Word2Vec (Skip-gram and CBOW)]. Intrinsic evaluators are used to evaluate all 90 models using syntactic and semantic similarity measures [60]. Based on the intrinsic evaluation performance, the 9 top-performing embedding models are selected to perform the downstream task (i.e., authorship classification).…”
Section: A. Embedding Models Evaluation (mentioning)
confidence: 99%
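The 90-model count quoted above follows directly from the grid arithmetic: one base grid of hyperparameter combinations gives 18 GloVe models, and doubling that grid over the two architectures (Skip-gram and CBOW) gives 36 models each for FastText and Word2Vec. A minimal sketch, assuming an illustrative 3 × 6 grid of vector dimensions and window sizes (the actual hyperparameter values used in [60] are not given in this excerpt):

```python
from itertools import product

# Illustrative hyperparameter grid; the concrete values are assumptions.
dims = [100, 200, 300]
windows = [2, 5, 7, 10, 12, 15]
base_grid = list(product(dims, windows))  # 3 * 6 = 18 combinations

# GloVe has no Skip-gram/CBOW choice; Word2Vec and FastText have both.
glove_models = [("GloVe", d, w, None) for d, w in base_grid]
fasttext_models = [("FastText", d, w, a)
                   for d, w in base_grid for a in ("skip-gram", "cbow")]
word2vec_models = [("Word2Vec", d, w, a)
                   for d, w in base_grid for a in ("skip-gram", "cbow")]

models = glove_models + fasttext_models + word2vec_models
print(len(glove_models), len(fasttext_models), len(word2vec_models), len(models))
# prints: 18 36 36 90
```

Selecting the 9 top performers would then sort `models` by their intrinsic syntactic/semantic similarity scores (not reproduced here) and keep the first nine for the downstream authorship-classification task.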