2008
DOI: 10.1016/j.knosys.2008.03.013

A selective Bayes Classifier for classifying incomplete data based on gain ratio

Cited by 30 publications (12 citation statements)
References 5 publications (6 reference statements)
“…In addition, some other classifiers have a better performance compared with previously published results, for example, best‐first, J48, and REPT classifiers in the analysis of TH dataset; RF classifier in the analysis of BW dataset; RS classifier in the analysis of HC dataset; MLP and RBF classifiers in the analysis of HH dataset; RBF, SMO, and RF classifiers in the analysis of SH dataset; RBF and RAF classifiers in the analysis of HE dataset; SMO classifier in the analysis of LY dataset; J48 in the analysis of BC dataset; NBU classifier in the analysis of HS dataset; SC classifier in the analysis of HC dataset; RF classifier in the analysis of AR dataset; RS, STC, and zeroR classifiers in the analysis of PO dataset; and NBU in the analysis of PT dataset (Table ). Furthermore, some combinations of feature extraction and classification methods, hybrid‐based, and evolutionary learning‐based classification methods in the published literature have a better accuracy than the selected classification methods in the analysis of selected healthcare data, such as AISWNB, CFSWNB, GRWNB, MIWNB, ReFWNB, Tree‐WNB, and RMWNB (Wu et al, ), LWNB and AODE (Jiang, Zhang, et al, ), pedagogical, decompositional, SVMs, a combination of HC and pedagogical (Stoean & Stoean, ), PSO, ABC, and GSA (Bahrololoum, Nezamabadi‐Pour, Bahrololoum, & Saeed, ), and RBF‐MS, RBF‐HS, and RBF‐NNTS (Jaganathan & Kuppuchamy, ) in the analysis of BW dataset; a combination of SVM and feature selection methods (Sun et al, ) in the analysis of DE dataset; a combination of PPPCA and SVM (Shah et al, ) in the analysis of HH dataset; CDW‐NN (Paredes & Vidal, ) and ABC (Schiezaro & Pedrini, ) in the analysis of SH dataset; RF (Azar, Elshazly, Hassanien, & Elkorany, ) in the analysis of LY dataset; NPBC (Soria, Garibaldi, Ambrogi, Biganzoli, & Ellis, ) in the analysis of HS dataset; SRBC and SRBCBG (Chen et al, ) in the analysis of LC dataset; FND (Rodríguez, García‐Osorio, & Maudes, ), a combination of SVM and 
feature selection methods (Sun et al, ), and SRBC and SRBCBG (Chen et al, ) in the analysis of AR dataset; and PELM (Rong, Ong, Tan, & Zhu, ) in the analysis of LY dataset (Table ). It is also important to note that in the analysis of some datasets, such as TH, BW, and DE, most of the classification methods have a better performance, whereas for some other datasets, such as PO, LI, and PT, a poor performance is observed.…”
Section: Discussion
confidence: 99%
“…Combinations of feature extraction and classification methods and different classification methods have been also implemented in healthcare data classification, for instance, the combination of radial basis function (RBF) neural network, mean selection, half selection, and neural network for threshold selection (NNTS)‐based feature selection method in the classification of BW dataset (accuracy of 95.85–97.28%), PI dataset (accuracy of 73.83–76.04%), HS dataset (accuracy of 84.44–85.19%), HE dataset (accuracy of 82.58–85.16%), and HC dataset (accuracy of 81.75–84.46%; Jaganathan & Kuppuchamy, ); combination of random forest (RAF) and J48 in the classification of BW dataset (accuracy of 97.31%) and SH dataset (accuracy of 83.68%; Tan et al, ); combination of principal component analysis (PCA) and SVM, and kernel PCA and SVM in the classification of BW and PT datasets (maximum accuracy of 96.35% for BW dataset using PCA + SVM method; Li, Liu, & Hu, ); combination of probabilistic PCA and SVM in the classification of HC and HH datasets (maximum accuracy of 85.82% for HH dataset; Shah et al, ), and combinations of C4.5 with standard, ad hoc, and pairwise learning in the classification of BW and PI datasets (maximum accuracy of 44.89% for PO dataset; Fernández et al, 2013). 
Besides, some novel and hybrid classification methods have been developed in healthcare data classification, for example, artificial immune system‐based self‐adaptive attribute weighting method for NB in the classification of TH, AU, BC, BW, HC, HH, HS, HE, LY, and PT datasets (maximum accuracy of 75.49 ± 8% for AU dataset; Wu et al, ); superparent‐one‐dependence estimator method for classification of TH, AU, BC, BW, PI, HC, HH, HS, HE, LY, and PT datasets (maximum accuracy of 48.38% for PT dataset; Wu, Pan, Zhu, Zhang, & Zhang, ); a two‐stage evolutionary algorithm in the classification of DE, BC, LY, and PI datasets (maximum accuracy of 69.34 ± 2.30% in the classification of BC dataset; Gutiérrez, Hervás‐Martínez, Martínez‐Estudillo, & Carbonero, ); and selective robust Bayes classifier (SRBC) and SRBC for incomplete data based on a gain ratio (SRBCBG) in the classification of AR, AU, BC, and LC datasets (maximum accuracy of 76.08 ± 0.74% for AR dataset; Chen, Huang, Tian, & Tian, ).…”
Section: Introduction
confidence: 99%
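The excerpt above refers to a selective Bayes classifier built on the gain ratio criterion. For readers unfamiliar with that criterion, the following is a minimal sketch of how gain ratio is computed for a categorical attribute (information gain normalized by split information, as in C4.5); the function names are illustrative and not taken from the cited paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Information gain of a categorical attribute divided by its
    split information; returns 0.0 when the split is degenerate."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Conditional entropy of the class given the attribute value.
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - cond
    # Split information penalizes attributes with many distinct values.
    split_info = -sum((len(g) / n) * math.log2(len(g) / n)
                      for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0
```

A selective classifier in this spirit would rank attributes by `gain_ratio` and keep only the top-scoring ones before fitting the Bayes model.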
“…To improve the naive Bayes performance, many different approaches have been proposed in the literature [7,8,11]. Solutions like the Extended Naive Bayes [10] and the NB+ [12] are only two examples of techniques validated over a number of data sets to show a better classification accuracy than the traditional naive Bayes.…”
Section: Discussion
confidence: 99%
“…According to Chen and colleagues [11] methods of constructing classifiers for incomplete data deserve more attention. Classifiers such as naive Bayes classifiers and C4.5 often adopt two simple strategies to deal with incomplete data: to ignore the instances with unknown entries or to ascribe these unknown entries to a specified dummy value of the respective attribute variables.…”
Section: Introduction
confidence: 99%
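The two simple strategies mentioned in the excerpt above — ignoring unknown entries or mapping them to a dummy value of the attribute — can be sketched for a categorical naive Bayes as follows. This is a simplified illustration under the assumption of categorical attributes with Laplace smoothing; it is not the method of the cited paper, and all names are hypothetical.

```python
import math
from collections import Counter, defaultdict

MISSING = None  # sentinel for an unknown attribute entry


def train_nb(rows, labels, use_dummy=True):
    """Fit per-attribute, per-class value counts. Unknown entries are
    either folded into a dummy category (use_dummy=True) or skipped
    per attribute (use_dummy=False)."""
    class_counts = Counter(labels)
    counts = defaultdict(Counter)  # (attr_index, class) -> value counts
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            if v is MISSING and not use_dummy:
                continue  # strategy 1: ignore the unknown entry
            key = v if v is not MISSING else "__missing__"  # strategy 2
            counts[(j, y)][key] += 1
    return class_counts, counts


def predict(class_counts, counts, row, use_dummy=True):
    """Return the class maximizing the log-posterior under naive Bayes."""
    n = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for y, cy in class_counts.items():
        score = math.log(cy / n)  # log prior
        for j, v in enumerate(row):
            if v is MISSING and not use_dummy:
                continue
            key = v if v is not MISSING else "__missing__"
            c = counts[(j, y)]
            # Laplace smoothing over the values seen for this attribute/class.
            score += math.log((c[key] + 1) / (sum(c.values()) + len(c) + 1))
        if score > best_score:
            best, best_score = y, score
    return best
```

Under the dummy-value strategy, "missingness" itself becomes informative (its frequency per class is modeled), whereas the ignore strategy treats a missing value as carrying no evidence for any class.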