B²FSE framework for high dimensional imbalanced data: A case study for drug toxicity prediction

Hooda, Nishtha; Bawa, Seema; Rana, Prashant Singh

doi:10.1016/j.neucom.2017.04.081

Cited by 23 publications

(13 citation statements)

References 36 publications


“…Note that the dataset is common in both the criteria, giving us a total of 11 datasets. We choose these two categories because they are of special interest in research related to imbalanced datasets and have received extensive attention in this research area (Anand et al 2010; Hooda et al 2018; Jing et al 2019; Blagus and Lusa 2013).…”

Section: Datasets Used For Validation (mentioning)

confidence: 99%

LoRAS: an oversampling approach for imbalanced datasets

et al. 2020


The Synthetic Minority Oversampling TEchnique (SMOTE) is widely used for the analysis of imbalanced datasets. It is known that SMOTE frequently over-generalizes the minority class, leading to misclassifications for the majority class, and affecting the overall balance of the model. In this article, we present an approach that overcomes this limitation of SMOTE, employing Localized Random Affine Shadowsampling (LoRAS) to oversample from an approximated data manifold of the minority class. We benchmarked our algorithm with 14 publicly available imbalanced datasets using three different Machine Learning (ML) algorithms and compared the performance of LoRAS, SMOTE and several SMOTE extensions that share the concept of using convex combinations of minority class data points for oversampling with LoRAS. We observed that LoRAS, on average, generates better ML models in terms of F1-Score and Balanced accuracy. Another key observation is that while most of the extensions of SMOTE we have tested improve the F1-Score with respect to SMOTE on average, they compromise on the Balanced accuracy of a classification model. LoRAS, on the contrary, improves both the F1-Score and the Balanced accuracy and thus produces better classification models. Moreover, to explain the success of the algorithm, we have constructed a mathematical framework to prove that the LoRAS oversampling technique provides a better estimate for the mean of the underlying local data distribution of the minority class data space.
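The abstract refers to oversampling by convex combinations of minority class data points. A minimal Python sketch of that general idea follows, assuming a SMOTE-like scheme that interpolates between a minority point and one of its nearest minority neighbours. It is not the LoRAS algorithm itself, and the function names, the toy data, and the parameters `k` and `seed` are hypothetical.

```python
import random

def interpolate(a, b, t):
    """Return the convex combination (1 - t) * a + t * b of two points."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

def nearest_neighbours(point, others, k):
    """Return the k points in `others` closest to `point` (squared Euclidean distance)."""
    def sq_dist(q):
        return sum((x - y) ** 2 for x, y in zip(point, q))
    return sorted(others, key=sq_dist)[:k]

def oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on a segment between two minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = minority[:i] + minority[i + 1:]
        neighbour = rng.choice(nearest_neighbours(base, others, k))
        # t is drawn uniformly from [0, 1), so the new point stays on the segment.
        synthetic.append(interpolate(base, neighbour, rng.random()))
    return synthetic

# Hypothetical toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = oversample(minority, n_new=5)
```

Because every synthetic point is a convex combination of two existing minority points, it always lies inside the convex hull of the minority class, which is the property the abstract attributes to SMOTE and its extensions.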

“…The main function of class balancing is to balance the class symmetry of instances. There are several conventional approaches to handle the class imbalance problem, which are undersampling, oversampling, and the synthetic minority oversampling technique (SMOTE) [17,18]. Here, the class imbalance problem is resolved by the ensemble learning method, as ensemble learning is more effective than data sampling techniques to enhance the classification performance of imbalanced data.…”

Section: S No Name Description (mentioning)

confidence: 99%

“…If n is the number of records and d is the depth of the tree, then the time complexity of the random forest algorithm is O(ntree * mtry * d * n) and the space complexity of random forest algorithm is O(n * d). Therefore, we can say that the random forest model depends on the depth and size of the decision tree [17].…”

Section: Random Forest Model (mentioning)

confidence: 99%
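As a rough illustration of the complexity expressions in the statement above, the following Python sketch evaluates ntree * mtry * d * n and n * d for one set of inputs, taking the quoted expressions at face value. The function names and all numeric values are hypothetical and chosen only to show the arithmetic; they are not taken from the citing paper.

```python
def rf_time_cost(ntree, mtry, depth, n_records):
    """Operation count implied by the quoted bound O(ntree * mtry * d * n)."""
    return ntree * mtry * depth * n_records

def rf_space_cost(depth, n_records):
    """Storage count implied by the quoted bound O(n * d)."""
    return n_records * depth

# Hypothetical settings: 100 trees, 10 candidate features per split,
# tree depth 20, and 9000 records.
time_units = rf_time_cost(ntree=100, mtry=10, depth=20, n_records=9000)
space_units = rf_space_cost(depth=20, n_records=9000)

print(time_units)   # 180000000
print(space_units)  # 180000
```

Under these expressions, doubling either the tree depth or the number of records doubles both estimates, which matches the statement that the model depends on the depth and size of the trees.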

Toxicity prediction of small drug molecules of aryl hydrocarbon receptor using a proposed ensemble model

Gupta, Rana

2019

Turk J Elec Eng & Comp Sci

Self Cite


Quantitative structure-activity relationships and quantitative structure-property relationships have proved their usefulness for predicting toxicities of drug molecules regarding their biological activities. In silico toxicity prediction techniques are essential for reducing testing on rodents (in vivo) and for a less time-consuming and more cost-efficient alternative for the identification of toxic effects at an early stage of drug development. The authors aim to build a prediction model for better assessment of toxicity to quickly and efficiently test whether certain chemical compounds have the potential to disrupt the processes in the human body that may adversely affect human health. Here, we have proposed a computational method (in silico) for the toxicity prediction of small drug molecules using their various physicochemical properties (molecular descriptors) that can bind to the aryl hydrocarbon receptor. Pharmaceutical data exploration laboratory software is used for extracting the features of drug molecules. The dataset of the aryl hydrocarbon receptor contains 9008 drug molecules, where 1063 are active and 7945 are inactive, and each drug molecule contains 1444 features. It is a novel prediction model based on ensemble learning that can efficiently classify active (binding) and inactive (nonbinding) compounds of the dataset. In our proposed ensemble model, we primarily performed feature selection using the Boruta library in R, after which we resolved the class imbalance problem itself by ensemble learning where we divided the dataset into seven data frames, which have approximately equal numbers of active and inactive drug molecules. An ensemble model based upon the votes of seven random forest models is proposed, which gives an accuracy of 93.76%. K-fold cross-validation is conducted to measure the consistency of the model. Finally, the validity of the proposed ensemble model for some drug molecules of acquired immune deficiency syndrome therapy and androgen receptor has been proved.
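The abstract describes splitting the dataset into seven roughly balanced data frames and combining seven random forest models by voting. The Python sketch below shows one way such a split could work, assuming each frame reuses all 1063 active molecules and receives a disjoint seventh of the 7945 inactive ones. That assignment, the function names, and the placeholder sample tuples are assumptions for illustration; the abstract does not specify the exact procedure, and the classifiers themselves are omitted.

```python
from collections import Counter

def balanced_frames(active, inactive, n_frames):
    """Pair all active samples with one disjoint slice of the inactive samples per frame."""
    frames = []
    for i in range(n_frames):
        chunk = inactive[i::n_frames]  # every n_frames-th inactive sample, starting at offset i
        frames.append(active + chunk)
    return frames

def majority_vote(votes):
    """Return the most common label among the per-model votes."""
    return Counter(votes).most_common(1)[0][0]

# Placeholder samples with the class counts reported in the abstract.
active = [("active", i) for i in range(1063)]
inactive = [("inactive", i) for i in range(7945)]

frames = balanced_frames(active, inactive, n_frames=7)
frame_sizes = [len(frame) for frame in frames]

# Hypothetical votes from seven models for a single molecule.
label = majority_vote(["active"] * 4 + ["inactive"] * 3)
```

With these counts each frame holds 1063 active and 1135 inactive molecules, which is consistent with the "approximately equal numbers" wording. Using an odd number of voters also rules out ties in a two-class vote.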


“…These metrics performed on different classifiers like Bayes Net (BN), Naive Bayes (NB), Logistic Regression (LR), SVM/SMO, Random Forest (RF), Adaboost, Adabag, and J48 [2]. In all these classifiers, it can be observed that Random Forest gives the highest accuracy and Adaboost has the lowest, which is 71%.…”

Section: Performance Evaluation (mentioning)

confidence: 99%

“…Artificial intelligence and machine learning noble techniques have helped many researchers in finding cost effective solution in diverse domains like drugs discovery, audits, etc. [2][3][4]. By using Artificial Intelligence in drug discovery, it increases the drugs market rapidly.…”

Section: Introduction (mentioning)

confidence: 99%

Optimized Ensemble Machine Learning Framework for High Dimensional Imbalanced Bio Assays

Sharma, Hooda

2019

RIA

Self Cite


In pharmaceutical research, a recent hotspot is the study of the activity of bioactive compounds and drugs with computational intelligence. The relevant studies often adopt machine learning techniques to speed up the modelling, and rely on bioassay to evaluate the effect and potency of a compound or drug. This paper aims to design an efficient and accurate method to assess the activity of bioactive compounds and drugs. First, the authors performed virtual screening on the data on bioactive compounds and drugs, eliminating the imbalanced classes and high dimensionality of drug descriptors. Next, eight machine learning algorithms, namely Bayes Net, Naive Bayes, SMO, J48, Random Forest, AdaBoost, AdaBag and logistic regression, were trained by the virtually screened data, and used to predict the activity or inactivity of a drug through bioassays. The synthetic minority oversampling technique (SMOTE) was employed to solve the numerous imbalanced datasets in bioassay. On this basis, the ensemble machine learning model of random forest was optimized. Experimental results show that the optimized random forest machine learning framework achieved better results than the other ensemble-based machine learning methods. The research provides an effective way to perform bioassays on high-dimensional imbalanced data.
