Class prediction for high-dimensional class-imbalanced data

Blagus, Rok; Lusa, Lara

doi:10.1186/1471-2105-11-523

Cited by 197 publications

(130 citation statements)

References 40 publications

(45 reference statements)

“…Class imbalance occurs frequently in QSAR and drug discovery datasets [14, 65, 66, 67]. This could be for a number of reasons; however in this context it is due to lack of publically available data for the minority class, poorly-moderately absorbed compounds, in the literature.…”

Section: Results (mentioning)

confidence: 99%

“…Another problem with under-sampling is that in order to assess the predictability of the balanced training set fairly, the validation set will also have to be adjusted to mirror the training set in terms of distribution of the data, but again this reduces the dataset size in the validation set and increases the variability of the results [14]. However the models built using this equal distribution should be better models to predict both poorly and highly-absorbed compounds if a big enough dataset is used.…”

Section: Introduction (mentioning)

confidence: 99%

Coping with Unbalanced Class Data Sets in Oral Absorption Models

Newby; Freitas; Ghafourian. J. Chem. Inf. Model., 2013.

Class imbalance occurs frequently in drug discovery datasets. In oral absorption datasets in the literature, there are considerably more highly-absorbed compounds than poorly-absorbed compounds. This produces models that are biased towards highly-absorbed compounds and that lack generalization to industry settings, where more early stage drug candidates are poorly-absorbed. This paper presents two strategies to cope with unbalanced class datasets: under-sampling the majority high absorption class, and misclassification costs using classification decision trees. The published dataset by Hou et al. (2007), which contained percentage human intestinal absorption of 645 drug and drug-like compounds, was used for the development and validation of classification trees using C&RT analysis. The results indicate that under-sampling the majority class, highly-absorbed compounds, leads to a balanced distribution (50:50) training set which can achieve better accuracies for poorly-absorbed compounds, whereas the biased training set achieved higher accuracies for highly-absorbed compounds. The use of misclassification costs resulted in improved class predictions when applied to reduce false positives or false negatives. Moreover, it was shown that the classical overall accuracy measure used in many publications is particularly misleading in the case of unbalanced datasets, and the more appropriate measures presented here may be used for a more realistic assessment of the classification models' performance. Thus, these strategies offer improvements to cope with unbalanced class datasets to obtain classification models applicable in industry.
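The two points in this abstract (overall accuracy is misleading on unbalanced data, and under-sampling the majority class yields a 50:50 training set) can be illustrated with a minimal sketch. The 90:10 label split, the class names "high" and "poor", the always-predict-majority classifier, and the helper names `undersample_majority` and `class_accuracies` are all assumptions made for illustration; they are not the authors' data, code, or C&RT setup.

```python
import random


def undersample_majority(labels, seed=0):
    """Return sorted indices of a balanced subset.

    Every class is randomly reduced to the size of the smallest class,
    so a two-class problem ends up with a 50:50 distribution.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    keep = []
    for idx in by_class.values():
        keep.extend(rng.sample(idx, n_min))
    return sorted(keep)


def class_accuracies(y_true, y_pred):
    """Return the fraction of correctly predicted samples per true class."""
    totals, hits = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        hits[t] = hits.get(t, 0) + (t == p)
    return {c: hits[c] / totals[c] for c in totals}


# Hypothetical unbalanced dataset: 90 highly-absorbed, 10 poorly-absorbed.
y_true = ["high"] * 90 + ["poor"] * 10
# A trivial classifier that always predicts the majority class.
y_pred = ["high"] * 100

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_class = class_accuracies(y_true, y_pred)
balanced = sum(per_class.values()) / len(per_class)

# Balanced (50:50) training subset obtained by random under-sampling.
subset = undersample_majority(y_true)
subset_labels = [y_true[i] for i in subset]

print(overall, per_class, balanced, len(subset))
```

In this toy case the overall accuracy looks high even though no minority-class sample is predicted correctly, which the per-class accuracies and their mean make visible. The under-sampled subset keeps all minority samples and an equal number of randomly chosen majority samples.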

“…However, considering it with class-imbalance presents an additional source of difficulties for prediction, as it biases classification towards majority class for most classifiers (see, e.g. experimental analyses from Blagus and Lusa (2010)). The attribute (feature) selection is often applied in standard balanced classification to enhance predictive performance.…”

Section: Feature Ensembles and Class Imbalance (mentioning)

confidence: 99%

Multi-class and feature selection extensions of Roughly Balanced Bagging for imbalanced data

Lango; Stefanowski. J Intell Inf Syst, 2017.

Roughly Balanced Bagging is one of the most efficient ensembles specialized for class imbalanced data. In this paper, we study its basic properties that may influence its good classification performance. We experimentally analyze them with respect to bootstrap construction, deciding on the number of component classifiers, their diversity, and ability to deal with the most difficult types of the minority examples. Then, we introduce two generalizations of this ensemble for dealing with a higher number of attributes and for adapting it to handle multiple minority classes. Experiments with synthetic and real life data confirm usefulness of both proposals.

“…Imbalanced datasets might lead to overfitting of the training algorithms to the most common class and many mistakes in the least common class, leading to a poor generalisation performance (Huang et al. 2006; Blagus and Lusa 2010). A common solution to the overfitting problem in imbalanced datasets is using cost-sensitive learning ANN.…”

Section: Third Step: Optimisation Of the Cost-sensitive Learning Para (mentioning)

confidence: 99%

Automated early detection of drops in commercial egg production using neural networks

Ramírez-Morales; Fernández-Blanco; Rivero. British Poultry Science, 2017.

1. The purpose of this work was to support decision-making in poultry farms by performing automatic early detection of anomalies in egg production. 2. Unprocessed data were collected from a commercial egg farm on a daily basis over 7 years. Records from a total of 24 flocks, each with approximately 20 000 laying hens, were studied. 3. Other similar works have required a prior feature extraction by a poultry expert, and this method is dependent on time and expert knowledge. 4. The present approach reduces the dependency on time and expert knowledge because of the automatic selection of relevant features and the use of artificial neural networks capable of cost-sensitive learning. 5. The optimum configuration of features and parameters in the proposed model was evaluated on unseen test data obtained by a repeated cross-validation technique. 6. The accuracy, sensitivity, specificity and positive predictive value are presented and discussed at 5 forecasting intervals. The accuracy of the proposed model was 0.9896 for the day before a problem occurs.
