2020
DOI: 10.1007/s42979-020-00156-5
Using SMOTE to Deal with Class-Imbalance Problem in Bioactivity Data to Predict mTOR Inhibitors

Abstract: Machine learning algorithms give sub-optimal performance on class-imbalanced datasets. Mammalian target of rapamycin (mTOR) is a serine/threonine protein kinase that plays an integral role in the autophagy pathway. Autophagy is a cellular pathway for recycling macromolecules (proteins, lipids, and organelles), which enables eukaryotic cells to adapt their metabolism and survive under adverse growth conditions. Targeting mTOR through therapeutic interventions of the autophagy pathway establishes mT…

Cited by 8 publications (9 citation statements)
References 20 publications (19 reference statements)
“…For solving these problems, Chawla proposed the Synthetic Minority Over-sampling technique (SMOTE), which creates synthetic samples from the minority class. The SMOTE samples are linear combinations of two similar samples from the minority class [29].…”
Section: Data-Level Methods
confidence: 99%
“…To overcome this problem, Chawla [13] proposed the SMOTE technique, which generates synthetic samples from the minority class. Samples created by the SMOTE technique are a linear combination of two similar samples from the minority class [13, 26, 28]. The SMOTE over-sampling algorithm works as follows. Let S be the size of the minority class; for a sample j of the minority class, let x_j denote its feature vector, j ∈ {1, …, S}:
1. Find the k nearest neighbours of x_j among all S minority samples (using, for example, the Euclidean distance), denoted x_j(near), near ∈ {1, …, k}.
2. Select a sample x_j(nn) at random from the k neighbours, and generate a random number β₁ between 0 and 1 to synthesise a new sample x_j1 = x_j + β₁ · (x_j(nn) − x_j).
3. Repeat step 2 M times to synthesise M new samples x_j(new), new ∈ {1, …, M}. …”
Section: Proposed Methods
confidence: 99%
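The interpolation steps quoted above can be sketched as a short Python function. This is a minimal illustration under stated assumptions, not the cited authors' implementation; the function name `smote_oversample` and its parameters are hypothetical, and neighbours are computed by brute-force Euclidean distance for clarity:

```python
import numpy as np

def smote_oversample(X_min, k=5, M=1, rng=None):
    """Generate M synthetic SMOTE samples per minority sample.

    X_min : (S, d) array of minority-class feature vectors.
    Returns an (S * M, d) array of synthetic samples, each a convex
    combination x_j + beta * (x_j(nn) - x_j) of a sample and one of
    its k nearest minority-class neighbours.
    """
    rng = np.random.default_rng(rng)
    S = X_min.shape[0]
    # Pairwise squared Euclidean distances among minority samples.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                # exclude the sample itself
    neighbors = np.argsort(d2, axis=1)[:, :k]   # k nearest neighbours per row
    synthetic = []
    for j in range(S):
        for _ in range(M):
            nn = rng.choice(neighbors[j])       # random neighbour x_j(nn)
            beta = rng.random()                 # beta_1 in [0, 1)
            synthetic.append(X_min[j] + beta * (X_min[nn] - X_min[j]))
    return np.array(synthetic)
```

Because each synthetic point lies on the line segment between two real minority samples, the generated data stays inside the convex hull of the minority class, which is what distinguishes SMOTE from naive random over-sampling (duplication).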
“…Over-sampling techniques add more instances of the minority class to the training set. The simplest method is random over-sampling [26], but its drawback is over-fitting [27]. To overcome this problem, Chawla [13] proposed the SMOTE technique, which generates synthetic samples from the minority class. Samples created by the SMOTE technique are a linear combination of two similar samples from the minority class [13, 26, 28].…”
Section: Sampling
confidence: 99%
“…Moreover, Tox21 is severely imbalanced; the volume of the inactive (negative/nontoxic) data set is much larger than that of the active (positive/toxic) data set. As a result, multitask deep learning models are unable to thoroughly explore the essence of the minority-class data set consisting of the positive compounds. To address this issue, several data augmentation approaches have been introduced in previous studies, such as resampling and the synthetic minority oversampling technique (SMOTE). Manual preprocessing is an essential ingredient of these data augmentation strategies, which may impact the objectivity and performance of the models to some extent. Data augmentation technologies are also not directly applicable to chemical data.…”
Section: Introduction
confidence: 99%