Overcoming Data Imbalance Problems in Sexual Harassment Classification with SMOTE

Putrada, Aji Gautama; Wijaya, Irfan Dwi; Oktaria, Dita

doi:10.21108/ijoict.v8i1.622

Cited by 12 publications

(2 citation statements)

References 27 publications

(29 reference statements)

“…where $x_{1i}$ is the first variable with data item $i$, $m$ is the dataset size, $\bar{x}_1$ is the average of the first variable, $x_{2i}$ is the second variable with data item $i$, and $\bar{x}_2$ is the average of the second variable. Furthermore, we apply random oversampling [14]. The application of random oversampling is for imbalanced data.…”

Section: A Chicken Egg Harvesting Data and Pre-processing (mentioning)

confidence: 99%
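
The variable definitions in the quoted passage match the usual form of the Pearson correlation coefficient (PCC), which the citing paper's abstract says it uses to review the dataset features. The following is a minimal sketch, assuming the standard PCC formula is the one intended; the function name `pearson` is illustrative and does not come from the paper.

```python
from math import sqrt


def pearson(x1, x2):
    """Pearson correlation coefficient of two equal-length sequences.

    r = sum((x1_i - mean1) * (x2_i - mean2))
        / sqrt(sum((x1_i - mean1)^2) * sum((x2_i - mean2)^2))
    """
    m = len(x1)  # dataset size
    if m != len(x2) or m < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean1 = sum(x1) / m  # average of the first variable
    mean2 = sum(x2) / m  # average of the second variable
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(x1, x2))
    var1 = sum((a - mean1) ** 2 for a in x1)
    var2 = sum((b - mean2) ** 2 for b in x2)
    if var1 == 0 or var2 == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / sqrt(var1 * var2)
```

A value near 1 or -1 indicates a strong linear relationship between the two variables, and a value near 0 indicates little linear relationship.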

NS-SVM: Bolstering Chicken Egg Harvesting Prediction with Normalization and Standardization

Putrada

Alamsyah

Fauzan

et al. 2023

JUITA

Breeding chickens and chicken eggs are poignant, and recent studies have applied computer science to optimize this field, including chicken egg harvesting prediction. However, existing research does not emphasize the importance of data transformation to obtain optimum chicken egg harvesting prediction. This paper proposes the normalization and standardization-bolstered support vector machine (NS-SVM) method, namely normalization, and standardization, to improve the prediction of chicken egg harvest using SVM. First, we obtain the chicken egg dataset from Africa using Kaggle. The problem and solution become urgent, whereas chicken egg production can ease businesspeople to invest in chicken eggs. We adopt the normalization and standardization method from previous research. However, the notation is to differentiate the method from legacy SVM. The dataset has up to 13 features. Then we apply standard pre-processing such as label encoding and random oversampling. We also review the dataset feature using the Pearson correlation coefficient (PCC). We use two SVM kernels: radial basis function (RBF) and the 2nd-degree polynomial. Then we again apply the same model but by applying normalization and standardization. We use cross-validation with $K = 10$ to measure the Accuracy of the compared models. The results show that normalization and standardization positively affect the prediction model of the two SVM kernels. The model with the highest performance is NS-SVM with a 2nd-degree kernel, namely Accuracy = 0.996. At the same time, the model with the lowest performance is SVM with RBF, namely Accuracy = 0.986. In addition, the results of ROC AUC analysis show that the performance of our model on the imbalanced dataset with a moderate degree is AUC = 0.927 to 0.993.
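
The pre-processing steps named in this abstract (normalization, standardization, and random oversampling) can be sketched with the Python standard library alone. This is an illustrative sketch rather than the authors' implementation: it assumes min-max scaling for normalization and z-score scaling for standardization, and the function names and the `seed` parameter are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: nothing to scale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # constant feature: nothing to scale
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels
```

When oversampling is combined with cross-validation, a common precaution is to oversample only the training folds, so that duplicated rows do not also appear in the held-out fold.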

“…An imbalanced dataset is when, in a dataset, the number of one label (majority label) is far greater than the other (minority labels) [30]. Imbalanced datasets can affect the performance of machine learning models, then the validity of a measurement metric [31].…”

Section: ROC Threshold Selection (mentioning)

confidence: 99%
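
The quoted point about metric validity can be illustrated with a small, made-up example: on a dataset with 95 majority labels and 5 minority labels, a model that always predicts the majority label scores high accuracy while finding none of the minority cases. The data and function names below are hypothetical and chosen only for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true positive-class items that were predicted positive."""
    hits = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    total = sum(t == positive for t in y_true)
    return hits / total


# Hypothetical imbalanced dataset: 95 majority (0) and 5 minority (1) labels.
y_true = [0] * 95 + [1] * 5
# A trivial model that always predicts the majority label.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))  # high, despite the model being useless
print(recall(y_true, y_pred))    # no minority case is detected
```

This is why work on imbalanced data tends to report metrics such as AUC, sensitivity, specificity, or g-mean alongside or instead of accuracy.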

ImbGAFS: GA Feature Selection for AUC in Bird Strike Prediction

Putrada

Prabowo

2023

Machine Learning Techniques and NLP

Several studies discuss airplane failure prediction due to bird strikes. However, these studies need to analyze further the imbalance in their dataset. Our research aim is to create an airplane failure prediction by bird strike using a machine learning method optimized using GA feature selection. GA feature selection uses AUC maximization as the objective function to tackle imbalance problems in the bird strike dataset. First, we obtained the airplane bird strike dataset from Kaggle. We carry out preprocessing on the dataset. We then compared and chose one of four state-of-the-art machine learning methods: SVM, MLP, logistic regression, and random forest. The selection process involves oversampling methods, synthetic minority oversampling technique (SMOTE), and optimum threshold selection, which involves geometric mean (g-mean) and area under curve (AUC) values. Finally, we optimize airplane failure prediction by performing AUC maximization using GA feature selection. Our test results show that random forest is the best machine learning method in airplane failure prediction compared to SVM, logistic regression, and MLP. SMOTE can increase random forest AUC from 0.845 to 0.878. Finally, the random forest model from ImbGAFS is better than the conventional method without feature selection. The increase in the AUC value is from 0.878 to 0.889. Then, after carrying out optimal threshold selection, ImbGAFS+random forest also has better sensitivity, specificity, and g-mean than conventional methods. The increase is from 0.7737, 0.8350, and 0.8037 to 0.8033, 0.8301, and 0.8166, respectively.
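
Two ideas from this abstract lend themselves to a short sketch: SMOTE-style interpolation and g-mean-based threshold selection. The code below is a simplified illustration under stated assumptions, not the authors' pipeline. It generates one synthetic point by interpolating between a minority sample and one of its `k` nearest minority neighbours, and it scans candidate thresholds for the one that maximizes g-mean, taken here as the square root of sensitivity times specificity. All function names and parameters are hypothetical.

```python
import random
from math import dist, sqrt


def smote_sample(minority, k=2, rng=None):
    """Create one synthetic minority point (needs at least two points)."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # k nearest minority neighbours of the chosen base point
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment base -> neighbour
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


def g_mean(sensitivity, specificity):
    """Geometric mean of sensitivity and specificity."""
    return sqrt(sensitivity * specificity)


def best_threshold(scores, labels):
    """Return (threshold, g_mean) maximizing g-mean; label 1 is positive."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    best = (None, -1.0)
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        g = g_mean(tp / positives, tn / negatives)
        if g > best[1]:
            best = (t, g)
    return best
```

As a consistency check, applying `g_mean` to the sensitivity and specificity pairs reported in the abstract (0.7737 and 0.8350; 0.8033 and 0.8301) gives values that agree with the reported g-means of 0.8037 and 0.8166 to within about 1e-4, which is in line with rounding of the inputs.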

Fine-tuned generative LLM oversampling can improve performance over traditional techniques on multiclass imbalanced text classification

Cloutier

Japkowicz

2023

2023 IEEE International Conference on Big Data (BigData)
