Software is becoming an integral part of human life, and the rapid development of software engineering demands that software be highly reliable. Reliability can be checked with efficient software testing methods that use historical software defect prediction data to support the development of a quality software system. Machine learning plays a vital role in optimizing the prediction of defect-prone modules in real-life software. Software defect prediction data suffers from a class imbalance problem: the ratio of the defective class to the non-defective class is low, which degrades classification performance unless an efficient machine learning classification technique is used. To alleviate this problem, this paper introduces a novel hybrid instance-based classification approach that combines distribution base balance based instance selection with a radial basis function neural network classifier model (DBBRBF) to obtain better prediction than existing research. Class-imbalanced data sets from NASA, Promise and Softlab were used for the experimental analysis. The experimental results in terms of Accuracy, F-measure, AUC, Recall, Precision and Balance show the effectiveness of the proposed approach. Finally, statistical significance tests are carried out to assess the suitability of the proposed model.

with possible threats to the validity of our approach provided in Section 6. Finally, the conclusion and future scope are presented in Section 7.
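To make the evaluation measures named above concrete, the following minimal sketch computes Accuracy, Recall, Precision, F-measure and Balance from confusion-matrix counts. It is an illustration under stated assumptions, not the paper's implementation: the function name `defect_metrics` and the example counts are hypothetical, and Balance is taken in the form commonly used in the defect prediction literature, i.e. one minus the normalized Euclidean distance from the ideal point (pf = 0, pd = 1), which the text here does not define. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def defect_metrics(tp, fp, tn, fn):
    """Compute defect-prediction measures from confusion-matrix counts.

    tp: defective modules predicted defective
    fp: non-defective modules predicted defective
    tn: non-defective modules predicted non-defective
    fn: defective modules predicted non-defective
    """
    total = tp + fp + tn + fn
    # Recall is also called probability of detection (pd).
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Probability of false alarm (pf).
    pf = fp / (fp + tn) if (fp + tn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    # Assumed definition: distance from the ideal point (pf=0, pd=1),
    # normalized by sqrt(2) and subtracted from 1.
    balance = 1 - math.sqrt((0 - pf) ** 2 + (1 - recall) ** 2) / math.sqrt(2)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f_measure": f_measure,
        "balance": balance,
    }


if __name__ == "__main__":
    # Hypothetical imbalanced test set: 10 defective, 90 non-defective modules.
    print(defect_metrics(tp=8, fp=10, tn=80, fn=2))
    # A trivial classifier that labels every module non-defective.
    print(defect_metrics(tp=0, fp=0, tn=90, fn=10))
```

On the hypothetical counts above, the trivial classifier that labels every module non-defective still reaches an accuracy of 0.9 while its recall is 0, which is why imbalance-aware measures such as Recall, F-measure and Balance are reported alongside Accuracy.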
Related work

The authors of [18] propose a novel machine learning approach that uses a multiple linear regression model to predict bug proneness in the Eclipse JDT Core software defect prediction data. Treating software defect prediction as a classification task, the authors of [19] propose a SMOTE (Synthetic Minority Over-sampling Technique) ensemble-based approach to deal effectively with the class imbalance problem of the data sets used and to achieve high accuracy. In [20], the authors aim to help software developers by identifying software defects on the basis of existing software metrics using various classification techniques. In [21], software defect prediction is evaluated via the Maximal Information Coefficient with Hierarchical Agglomerative Clustering (MICHAC) method on 11 widely studied NASA projects using three classifiers (Naive Bayes, RIPPER and Random Forest) and four performance metrics (precision, recall, F-measure and AUC), and the method is reported to be effective in comparison to others. The authors of [22] discuss the application of data mining to software defect prediction for static and dynamic defects, clone defects and other defect types, and highlight its importance in assisting software engineering tasks. A good overview of the data quality of the NASA MDP datasets [24] is presented in [23], as reported in [25], where comprehensive rules for data cleansing are used for software defect prediction. Six state-of-the-art within-project defect prediction approaches such as: naive Bayes, Decision tree, Logistic regression, K-nearest neighbor, random forest and Bayesian n...