Reducing the Effect of Imbalance in Text Classification Using SVD and GloVe with Ensemble and Deep Learning

Hossain, Tajbia; Mauni, Humaira Zahin; Rab, Raqeebir

doi:10.31577/cai_2022_1_98

Cited by 9 publications

(3 citation statements)

References 0 publications

“…GloVe is used to represent words using an embedding matrix containing many words. Each of these words corresponds to several numerical values, representing the vectors embedding this word, which are then employed as the input layer for neural networks of deep learning classifiers [13], [14]. Recurrent neural network (RNN) is one type of deep learning classifier based on keeping the output of a certain layer and feeding it back to the input to predict the layer's output, but it suffers from the problem of vanishing and exploding gradients.…”

Section: Preliminaries (mentioning)

confidence: 99%

Word embedding for detecting cyberbullying based on recurrent neural networks

Shaker, Dhannoon

2024

IJ-AI

The phenomenon of cyberbullying has spread to become one of the biggest problems facing users of social media sites, with significant adverse effects on society and on victims in particular. Finding appropriate ways to detect and reduce cyberbullying has become necessary to mitigate these impacts. Twitter comments from two datasets are used to detect cyberbullying: an Arabic cyberbullying dataset and an English cyberbullying dataset. Three different pre-trained global vectors (GloVe) corpora with different dimensions were used on the original and preprocessed datasets to represent the words. Recurrent neural network (RNN), long short-term memory (LSTM), bidirectional LSTM (BiLSTM), gated recurrent unit (GRU), and bidirectional GRU (BiGRU) classifiers were utilized, evaluated, and compared. The GRU outperformed the other classifiers on both datasets: its accuracy on the Arabic cyberbullying dataset using the 256-dimensional Arabic GloVe corpus is 87.83%, while its accuracy on the English dataset using the 100-dimensional pre-trained GloVe corpus is 93.38%.
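
As a rough illustration of the embedding lookup described in the citation statement above, the following Python sketch maps each token to a vector and pads the result to a fixed length so it can serve as network input. The `EMBEDDINGS` table, its three-dimensional values, and the `embed` helper are hypothetical and made up for illustration; real GloVe vectors are pre-trained and much larger, and the cited papers' own preprocessing is not shown in the source.

```python
# Toy 3-dimensional "embeddings". These values are invented for illustration
# and are NOT real GloVe vectors (those are pre-trained, e.g. 100 dimensions).
EMBEDDINGS = {
    "you": [0.1, 0.3, -0.2],
    "are": [0.0, 0.5, 0.1],
    "great": [0.7, -0.1, 0.4],
}
DIM = 3
PAD = [0.0] * DIM  # assumption: zeros for padding and out-of-vocabulary tokens


def embed(tokens, max_len):
    """Map tokens to vectors, then truncate or zero-pad to max_len rows."""
    rows = [list(EMBEDDINGS.get(tok, PAD)) for tok in tokens[:max_len]]
    rows += [list(PAD) for _ in range(max_len - len(rows))]
    return rows


# A max_len x DIM matrix that a recurrent classifier could consume.
matrix = embed("you are really great".split(), max_len=5)
```

Here "really" is not in the toy table, so its row is all zeros, as is the final padding row.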

“…The relevance of the terms in the corpus texts is assessed using TF-IDF. The TF-IDF equation is shown below [14,44]. Here $tf_{i,j}$ = the total number of occurrences of $i$ in $j$, $df_i$ = the total number of documents containing $i$, and $N$ = the total number of documents.…”

Section: Feature Extraction (mentioning)

confidence: 99%

Machine Learning-Based Text Classification Comparison: Turkish Language Context

Alzoubi, Topcu, Erkaya

2023

Applied Sciences

The growth in textual data associated with the increased usage of online services, and the ease of access to these data, has resulted in a rise in the number of text classification research papers. Text classification has a significant influence on several domains such as news categorization, the detection of spam content, and sentiment analysis. The classification of Turkish text is the focus of this work since only a few studies have been conducted in this context. We utilize data obtained from customers’ inquiries that come to an institution to evaluate the proposed techniques. Classes are assigned to such inquiries as specified in the institution’s internal procedures. The Support Vector Machine, Naïve Bayes, Long Short-Term Memory, Random Forest, and Logistic Regression algorithms were used to classify the data. The performance of the various techniques was then analyzed before and after data preparation, and the results were compared. The Long Short-Term Memory technique demonstrated superior effectiveness in terms of accuracy, achieving an 84% accuracy rate and surpassing the best accuracy of the traditional techniques, which was 78% for the Support Vector Machine. The techniques performed better once the number of categories in the dataset was reduced. Moreover, the findings show that data preparation and coherence between the classes’ number and the number of training sets are significant variables influencing the techniques’ performance. The findings of this study and the text classification technique utilized may be applied to data in dialects other than Turkish.
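
The citation statement above names the TF-IDF quantities but the equation itself is not reproduced on this page. A minimal Python sketch, assuming the common weighting w(i,j) = tf(i,j) × log(N / df(i)) with whitespace tokenization, might look like this; the `tf_idf` helper and the sample documents are hypothetical, and the citing paper may use a different variant (for example, a smoothed logarithm).

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, with w = tf * log(N / df)."""
    n_docs = len(docs)  # N: total number of documents
    tokenized = [doc.lower().split() for doc in docs]
    # df[i]: number of documents that contain term i at least once
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # tf[i]: occurrences of term i in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Illustrative documents only; not taken from any dataset in the source.
docs = ["spam offer now", "meeting agenda now", "spam spam offer"]
scores = tf_idf(docs)
```

Under this variant a term that occurs in every document receives weight zero, since log(N / N) = 0.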

“…It is worth highlighting that the majority of systems for diagnosing thyroid disease relied on attribute selection whereas, model training was carried out using an imbalanced dataset. Numerous research demonstrated that skewed results are produced by imbalanced data [32, 33]. Nevertheless, since they lack sufficient prior knowledge, they may even provide overfitted or under-fitted predictions [34].…”

Section: Introduction (mentioning)

confidence: 99%

SSC: The novel self-stack ensemble model for thyroid disease prediction

2024

PLoS ONE

Thyroid disease presents a significant health risk, lowering the quality of life and increasing treatment costs. The diagnosis of thyroid disease can be challenging, especially for inexperienced practitioners. Machine learning has been established as one of the methods for disease diagnosis based on previous studies. This research introduces a novel and more effective technique for predicting thyroid disease by utilizing machine learning methodologies, surpassing the performance of previous studies in this field. This study utilizes the UCI thyroid disease dataset, which consists of 9172 samples and 30 features, and exhibits a highly imbalanced target class distribution. However, machine learning algorithms trained on imbalanced thyroid disease data face challenges in reliably detecting minority data and disease. To address this issue, re-sampling is employed, which modifies the ratio between target classes to balance the data. In this study, the down-sampling approach is utilized to achieve a balanced distribution of target classes. A novel RF-based self-stacking classifier is presented in this research for efficient thyroid disease detection. The proposed approach demonstrates the ability to diagnose primary hypothyroidism, increased binding protein, compensated hypothyroidism, and concurrent non-thyroidal illness with an accuracy of 99.5%. The recommended model exhibits state-of-the-art performance, achieving 100% macro precision, 100% macro recall, and 100% macro F1-score. A thorough comparative assessment is conducted to demonstrate the viability of the proposed approach, including several machine learning classifiers, deep neural networks, and ensemble voting classifiers. The results of K-fold cross-validation provide further support for the efficacy of the proposed self-stacking classifier.
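
The down-sampling step mentioned in the abstract above can be sketched as random under-sampling, where every class is reduced to the size of the smallest one. This is only an illustration under that assumption: the `downsample` helper, the seed, and the toy labels are hypothetical, and the study's exact re-sampling procedure is not given here.

```python
import random
from collections import defaultdict


def downsample(samples, labels, seed=0):
    """Randomly keep as many items per class as the smallest class has."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    target = min(len(items) for items in by_class.values())
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    balanced = []
    for y, items in by_class.items():
        # sample without replacement from each class
        balanced.extend((x, y) for x in rng.sample(items, target))
    rng.shuffle(balanced)
    return balanced


# Toy imbalanced data: 8 "neg" items and 2 "pos" items.
samples = list(range(10))
labels = ["neg"] * 8 + ["pos"] * 2
balanced = downsample(samples, labels)
```

Down-sampling discards majority-class examples, so it trades training data for balance; that trade-off is a design choice rather than something this page evaluates.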
