Intelligent Control and Automation
DOI: 10.1007/11816492_89
Under-Sampling Approaches for Improving Prediction of the Minority Class in an Imbalanced Dataset

Abstract: The training data is the most important factor in improving classification accuracy. However, data in real-world applications often have an imbalanced class distribution: most of the samples belong to the majority class and few to the minority class. In this case, if all the data are used for training, the classifier tends to predict that most incoming data belong to the majority class. Hence, it is important to select suitable training data for classification in…
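The abstract's core idea, selecting a balanced subset of training data rather than using everything, can be illustrated with the simplest variant: random under-sampling of the majority class. This is a minimal sketch, not the paper's specific approach; the function name and list-based representation are assumptions for illustration.

```python
import random

def undersample(majority, minority, seed=0):
    """Randomly down-sample the majority class to the size of the
    minority class, then return the combined balanced training set.
    Hypothetical helper illustrating random under-sampling."""
    rng = random.Random(seed)
    kept = rng.sample(majority, len(minority))  # equal-sized majority subset
    return kept + minority

# 100 majority samples vs. 5 minority samples -> 10-sample balanced set
majority = [("maj", i) for i in range(100)]
minority = [("min", i) for i in range(5)]
balanced = undersample(majority, minority)
print(len(balanced))  # 10
```

Discarding majority samples loses information, which is why the paper studies more careful selection of which majority samples to keep.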

Cited by 60 publications (36 citation statements)
References 3 publications
“…The SMOTE algorithm has been applied with several different classifiers and has also been integrated with boosting and bagging. SMOTE generates synthetic examples with the positive class label while disregarding the negative class examples, which may lead to overgeneralization (Yen and Lee, 2006; Maciejewski and Stefanowski, 2011; Yen and Lee, 2009). This strategy may be especially problematic in the case of highly skewed class distributions, where the minority class examples are very sparse, resulting in a greater chance of class mixture.…”
Section: Re-sampling
confidence: 99%
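The overgeneralization risk mentioned above comes from how SMOTE builds synthetic points: each one is a random interpolation between a minority sample and one of its nearest minority neighbors, placed without checking where majority samples lie. A minimal brute-force sketch of that interpolation step (function names and the list-of-lists representation are assumptions, not the original SMOTE implementation):

```python
import random

def nearest_minority(x, minority, k=3):
    """k nearest minority-class neighbors of x (Euclidean, brute force)."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return sorted((m for m in minority if m != x), key=lambda m: dist(x, m))[:k]

def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    a random minority sample and one of its k nearest minority neighbors.
    Majority samples are never consulted -- the source of overgeneralization."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_new):
        x = rng.choice(minority)
        nb = rng.choice(nearest_minority(x, minority, k))
        gap = rng.random()  # random point on the segment between x and nb
        out.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return out
```

Because every synthetic point lies on a segment between two minority samples, sparse minority regions that happen to span majority territory get filled with positive-labeled points, which is exactly the class-mixture problem the quoted statement describes.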
“…Resolving the imbalanced data problem. To resolve the imbalanced data problem, we used the SMOTE method (see Methods for details) and the under-sampling method 29 on our genome data. The work flow of the mirexplorer classifier.…”
Section: Results
confidence: 99%
“…The number of non-binding segments was much larger than that of the binding segments, which led to a heavy imbalance in the datasets (Table 1). Following the methods of previous works (Yen and Lee, 2006; Roy et al, 2015), we took the number of positive samples as the standard and randomly extracted an equal number of negative samples. The negative samples were randomly selected 10 times to ensure the credibility of the results.…”
Section: Benchmark Dataset
confidence: 99%
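The protocol in the last statement, fixing the positive-sample count and drawing equal-sized negative subsets ten times, can be sketched as repeated balanced under-sampling. This is an illustrative reading of the quoted procedure, not the cited authors' code; the function name and 10-repeat default are assumptions.

```python
import random

def repeated_undersample(positives, negatives, repeats=10, seed=0):
    """Build `repeats` balanced datasets: all positives plus a fresh
    random subset of negatives of the same size each time, so results
    can be averaged over the random negative draws."""
    rng = random.Random(seed)
    return [positives + rng.sample(negatives, len(positives))
            for _ in range(repeats)]
```

Averaging a classifier's performance over the ten balanced datasets reduces the variance introduced by any single random choice of negatives.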