Oversampling to Overcome Overfitting: Exploring the Relationship between Data Set Composition, Molecular Descriptors, and Predictive Modeling Methods

Chang, Chien-Chih; Hsu, Ming-Tsung; Esposito, Emilio Xavier; Tseng, Yufeng J.

doi:10.1021/ci4000536

Cited by 42 publications (51 citation statements)

References 44 publications (85 reference statements)


“…Strategies proposed for dealing with imbalanced dataset range mainly from affecting specific costs to training set 58,59, re-sampling the training set, either by over-sampling the minority class 60,61, and/or under-sampling the majority class 62,63. Many variants of these techniques exist and have been reviewed by López et al.…”

Section: Introduction (mentioning)

confidence: 99%
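The two re-sampling strategies named in the statement above can be sketched in a few lines. This is a minimal illustration of plain random over-sampling and random under-sampling, not the procedure used by Chang et al. or by any of the citing papers; the function names `random_oversample` and `random_undersample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def _group_by_class(X, y):
    """Collect the rows of X under their class label."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    return by_class


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until every
    class is as large as the largest one (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = _group_by_class(X, y)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class
    (hypothetical helper). Discarded rows are lost information."""
    rng = random.Random(seed)
    by_class = _group_by_class(X, y)
    target = min(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        for row in rng.sample(rows, target):
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


# Toy training set: four majority-class rows (0) and one minority row (1).
X_train = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y_train = [0, 0, 0, 0, 1]

X_over, y_over = random_oversample(X_train, y_train)
X_under, y_under = random_undersample(X_train, y_train)
print(Counter(y_over))   # both classes now have 4 rows
print(Counter(y_under))  # both classes now have 1 row
```

As the statement says, re-sampling is applied to the training set; the test set is left at its original class ratio so that reported performance reflects the real imbalance.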

“…Although imbalanced data has been used in many studies dealing with soil classification 34,39,66 no such method has, to our knowledge, been applied for legacy soil data from a tropical semi-arid environment. In addition, we compared this method with the random oversampling (ROS) approach 59,60 . Having considered the pruning approach, we hypothesized that instance selection on the majority soil group, along with model-based feature selection, would improve the performance of the RF models and result in a stronger response of the minority soil groups.…”

Section: Introduction (mentioning)

confidence: 99%


Predicting reference soil groups using legacy data: A data pruning and Random Forest approach for tropical environment (Dano catchment, Burkina Faso)

Hounkpatin, Schmidt, Stumpf, et al. 2018. Sci Rep


Predicting taxonomic classes can be challenging with dataset subject to substantial irregularities due to the involvement of many surveyors. A data pruning approach was used in the present study to reduce such source errors by exploring whether different data pruning methods, which result in different subsets of a major reference soil groups (RSG) – the Plinthosols – would lead to an increase in prediction accuracy of the minor soil groups by using Random Forest (RF). This method was compared to the random oversampling approach. Four datasets were used, including the entire dataset and the pruned dataset, which consisted of 80% and 90% respectively, and standard deviation core range of the Plinthosols data while cutting off all data points belonging to the outer range. The best prediction was achieved when RF was used with recursive feature elimination along with the non-oversampled 90% core range dataset. This model provided a substantial agreement to observation, with a kappa value of 0.57 along with 7% to 35% increase in prediction accuracy for smaller RSG. The reference soil groups in the Dano catchment appeared to be mainly influenced by the wetness index, a proxy for soil moisture distribution.


“…Under-sampling is suitable for such applications where the number of majority samples is immense and decreasing the training samples will reduce the model training time. However, a drawback with under-sampling that discards samples leads to the loss of information for the majority class [17]. …”

Section: Introduction (mentioning)

confidence: 99%

“…Guha et al [8] constructed Random Forest (RF) ensemble models to classify the cell proliferation datasets in PubChem, producing classification rate on the prediction sets in a range between 70% to 85% depending on the nature of datasets and descriptors employed. Chang et al [17] applied the over-sampling technique to explore the relationship between dataset composition, molecular descriptor and predictive modeling method, concluding that SVM models constructed from over-sampled dataset exhibited better predictive ability for the training and external test sets compared to previous results in the literature. Though several proposed methods have successfully countered the imbalanced datasets in PubChem, however, many of the previous works were time consuming in calculation and little work explored the problem of enhancement in the computational efficiency in addition to the statistical performance, which in turn should be largely addressed in the era of big data.…”

Section: Introduction (mentioning)

confidence: 99%

An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data

Ming, Wang, Bryant. 2014. Analytica Chimica Acta


It is common that imbalanced datasets are often generated from high-throughput screening (HTS). For a given dataset without taking into account the imbalanced nature, most classification methods tend to produce high predictive accuracy for the majority class, but significantly poor performance for the minority class. In this work, an efficient algorithm, GLMBoost, coupled with Synthetic Minority Over-sampling TEchnique (SMOTE) is developed and utilized to overcome the problem for several imbalanced datasets from PubChem BioAssay. By applying the proposed combinatorial method, those data of rare samples (active compounds), for which usually poor results are generated, can be detected apparently with high balanced accuracy (Gmean). As a comparison with GLMBoost, Random Forest (RF) combined with SMOTE is also adopted to classify the same datasets. Our results show that the former (GLMBoost + SMOTE) not only exhibits higher performance as measured by percentage correct classification for the rare samples (Sensitivity) and Gmean, but also demonstrates greater computational efficiency than the latter (RF + SMOTE). Therefore, we hope that the proposed combinatorial algorithm based on GLMBoost and SMOTE could be extensively used to tackle the imbalanced classification problem.
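The abstract above reports Sensitivity and Gmean without defining them. The sketch below assumes the conventional definitions (sensitivity as the fraction of actual positives that are recovered, and G-mean as the square root of sensitivity times specificity); the abstract does not state the formula, so treat this as an assumption. The function names are hypothetical.

```python
import math


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for a binary classification."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    sensitivity, specificity = sensitivity_specificity(y_true, y_pred, positive)
    return math.sqrt(sensitivity * specificity)


# Toy imbalanced labels: 2 actives (1) among 10 compounds.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A classifier that always predicts "inactive" is 80% accurate here,
# yet it finds no actives, so its G-mean is zero.
always_inactive = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, always_inactive)) / len(y_true)
print(accuracy, g_mean(y_true, always_inactive))

# A classifier that finds one of the two actives at the cost of one
# false positive scores lower on accuracy-style intuition but has a
# non-zero G-mean.
mixed = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(g_mean(y_true, mixed))
```

The first case shows why plain accuracy is misleading on imbalanced screening data: the majority class dominates the score, while G-mean collapses to zero whenever one class is missed entirely.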


“…In a study of Chang et al., 92 the simple oversampling technique was used to develop SVM models that classify compounds according to predicted cytotoxicity against the Jurkat cell line. It was demonstrated that oversampling of the minority class (toxic compounds) leads to SVM models with better predictive ability for both the training and external test sets, compared to results reported in previous studies.…”

Section: Dealing With Data Imbalance Issues In PubChem Data (mentioning)

confidence: 99%

Getting the most out of PubChem for virtual screening

Kim. 2016. Expert Opinion on Drug Discovery



Introduction: With the emergence of the “big data” era, the biomedical research community has great interest in exploiting publicly available chemical information for drug discovery. PubChem is an example of public databases that provide a large amount of chemical information free of charge. Areas covered: This article provides an overview of how PubChem’s data, tools, and services can be used for virtual screening and reviews recent publications that discuss important aspects of exploiting PubChem for drug discovery. Expert opinion: PubChem offers comprehensive chemical information useful for drug discovery. It also provides multiple programmatic access routes, which are essential to build automated virtual screening pipelines that exploit PubChem data. In addition, PubChemRDF allows users to download PubChem data and load them into a local computing facility, facilitating data integration between PubChem and other resources. PubChem resources have been used in many studies for developing bioactivity and toxicity prediction models, discovering polypharmacologic (multi-target) ligands, and identifying new macromolecule targets of compounds (for drug-repurposing or off-target side effect prediction). These studies demonstrate the usefulness of PubChem as a key resource for computer-aided drug discovery and related area.

