An empirical analysis of data preprocessing for machine learning-based software cost estimation

Huang, Jianglin; Li, Yan‐Fu; Xie, Min

doi:10.1016/j.infsof.2015.07.004

Cited by 162 publications

(124 citation statements)

References 98 publications

Supporting

Mentioning

112

Contrasting

Unclassified


“…Less attention has been focused on MDT methods themselves. In a more recent study, Huang et al (2015) found that only some of the former software effort estimation studies have considered the significance of the MDTs, of which only Minku and Yao (2011) (Myrtveit et al, 2001;Strike et al, 2001), and the prediction error may be introduced (Mittas and Angelis, 2010). MEI is efficient and has been involved in SEE as the most popular imputation approach; however, it will cause bias to data.…”

Section: Knn Imputation Improvement

mentioning

confidence: 99%

“…For example, a well-known technique called listwise deletion, had been widely adopted for handling missing values during data-preprocessing, but it potentially impairs the completeness of data and introduces undesirable biases in estimation (Huang et al, 2015). By contrast, missing data imputation methods replace missing variables by artificial estimates (Song et al, 2008); at the same time maintain the data completeness.…”

Section: Introduction

mentioning

confidence: 99%

“…Huang et al (2015) has found that MEI monopolizes the imputation approaches in recent software effort estimation studies. A review of the other MDTs in SEE studies is presented at last.…”

Section: Introduction

mentioning

confidence: 99%


Cross-validation based K nearest neighbor imputation for software quality datasets: An empirical study

Huang, Keung, Sarro, et al. 2017

Journal of Systems and Software

Self Cite


Being able to predict software quality is essential, but it also poses significant challenges in software engineering. Historical software project datasets are often used together with various machine learning algorithms for fault-proneness classification. Unfortunately, missing values in datasets have a negative impact on estimation accuracy and could therefore lead to inconsistent results. As a method for handling missing data, K nearest neighbor (KNN) imputation is gradually gaining acceptance in empirical studies owing to its exemplary performance and simplicity. To date, researchers still call for optimized parameter settings for KNN imputation to further improve its performance. In this work, we develop a novel incomplete-instance based KNN imputation technique, which utilizes a cross-validation scheme to optimize the parameters for each missing value. An experimental assessment is conducted on eight quality datasets under various missingness scenarios. The study also compares the proposed imputation approach with mean imputation and three other KNN imputation approaches. The results show that our proposed approach is superior to the others in general. The relatively optimal fixed parameter settings for KNN imputation for software quality data are also determined. It is observed that classification accuracy is improved or at least maintained by using our approach for missing data imputation.
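
The general idea in this abstract, KNN imputation with the neighborhood size chosen by a validation loop, can be illustrated with a minimal sketch. This is not the authors' incomplete-instance based technique: the function names, the toy data, the distance over jointly observed features, and the leave-one-out error criterion are all assumptions made for illustration.

```python
import math

def distance(a, b):
    """Root mean squared difference over features observed in both rows (None = missing)."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

def knn_impute_value(rows, i, j, k):
    """Estimate rows[i][j] as the mean of column j over the k nearest donor rows."""
    donors = [(distance(rows[i], r), r[j])
              for idx, r in enumerate(rows)
              if idx != i and r[j] is not None]
    if not donors:
        raise ValueError("no donor row has an observed value in this column")
    donors.sort(key=lambda d: d[0])
    nearest = [value for _, value in donors[:k]]
    return sum(nearest) / len(nearest)

def choose_k(rows, j, candidates):
    """Pick the k with the lowest mean absolute error when each observed
    value in column j is hidden in turn and re-imputed (leave-one-out)."""
    best_k, best_err = None, math.inf
    for k in candidates:
        errors = []
        for i, row in enumerate(rows):
            if row[j] is None:
                continue
            hidden = [list(r) for r in rows]
            hidden[i][j] = None  # pretend this known value is missing
            errors.append(abs(knn_impute_value(hidden, i, j, k) - row[j]))
        err = sum(errors) / len(errors)
        if err < best_err:
            best_k, best_err = k, err
    return best_k

# Hypothetical toy data: two features per project, one value missing.
rows = [
    [1.0, 10.0],
    [1.1, 11.0],
    [5.0, 50.0],
    [5.2, 52.0],
    [1.05, None],
]
best_k = choose_k(rows, j=1, candidates=[1, 2, 4])
imputed = knn_impute_value(rows, i=4, j=1, k=2)
```

The sketch selects one k per column; the abstract describes optimizing parameters for each missing value, which would move the validation loop inside the per-value imputation.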


“…They concluded that regression trees or analogy-based methods are the best performers and offered means to address the conclusion instability issue. In (Huang et al 2015) several data preprocessing techniques were empirically assessed on the effectiveness of machine learning methods for effort estimation. The results indicate that data preprocessing techniques may significantly influence the predictions, but sometimes it might have negative impacts on prediction performance.…”

Section: Framework For Benchmarking Prediction Models

mentioning

confidence: 99%

“…Software project managers need to be able to estimate the effort and cost of development early in the life cycle, as it affects the success of software project management (Huang et al 2015).…”

mentioning

confidence: 99%

A genetic algorithm based framework for software effort prediction

Murillo-Morera, Quesada-López, Castro-Herrera, et al. 2017

J Softw Eng Res Dev


Background: Several prediction models have been proposed in the literature, using different techniques and obtaining different results in different contexts. The need for accurate effort predictions for projects is one of the most critical and complex issues in the software industry. The automated selection and combination of techniques in alternative ways could improve the overall accuracy of the prediction models. Objectives: In this study, we validate an automated genetic framework and then conduct a sensitivity analysis across different genetic configurations. We then compare the framework with a random-guessing baseline and an exhaustive framework. Lastly, we investigate the performance results of the best learning schemes. Methods: In total, six hundred learning schemes, formed by combining eight data preprocessors, five attribute selectors and fifteen modeling techniques, represent our search space. The genetic framework, through the elitism technique, selects the best learning schemes automatically. The best learning scheme in this context means the combination of data preprocessing + attribute selection + learning algorithm with the highest possible correlation coefficient. The selected learning schemes are applied to eight datasets extracted from the ISBSG R12 Dataset. Results: The genetic framework performs as well as an exhaustive framework. The analysis of the standardized accuracy (SA) measure revealed that all best learning schemes selected by the genetic framework outperform the random-guessing baseline by 45-80%. The sensitivity analysis confirms the stability between different genetic configurations. Conclusions: The genetic framework is stable, performs better than a random guessing approach, and is as good as an exhaustive framework. Our results confirm previous ones in the field: simple regression techniques with transformations could perform as well as nonlinear techniques, and ensembles of learning machine techniques such as SMO, M5P or M5R could optimize effort predictions.
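
The size of the search space described in this abstract follows from the three component counts: 8 × 5 × 15 = 600. The sketch below enumerates such a space and shows a bare-bones elitist selection step. The component names, the `fitness` stand-in, and the random refill (used here in place of crossover and mutation) are all assumptions for illustration; the abstract names only the counts and the use of elitism.

```python
import itertools
import random

# Placeholder component names: the abstract gives only the counts (8, 5, 15).
preprocessors = [f"preprocessor_{i}" for i in range(8)]
selectors = [f"selector_{i}" for i in range(5)]
learners = [f"learner_{i}" for i in range(15)]

# Every learning scheme is one (preprocessor, selector, learner) triple.
search_space = list(itertools.product(preprocessors, selectors, learners))

def fitness(scheme):
    """Stand-in score in [0, 1), deterministic per scheme. A real run would
    return the scheme's correlation coefficient on held-out data."""
    return random.Random("|".join(scheme)).random()

def next_generation(population, elite_size, rng):
    """Carry the elite_size fittest schemes over unchanged (elitism) and
    refill the remaining slots with randomly drawn schemes."""
    elite = sorted(population, key=fitness, reverse=True)[:elite_size]
    newcomers = rng.sample(search_space, len(population) - elite_size)
    return elite + newcomers

rng = random.Random(42)
population = rng.sample(search_space, 20)
best_scores = [max(fitness(s) for s in population)]
for _ in range(10):
    population = next_generation(population, elite_size=4, rng=rng)
    best_scores.append(max(fitness(s) for s in population))

print(len(search_space))  # 8 * 5 * 15 = 600
```

Because the fittest schemes are always copied into the next generation, the best score in `best_scores` can never decrease from one generation to the next.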


The state‐of‐the‐art in software development effort estimation

Gautam, Singh 2018

J Software Evolu Process


Software developers and researchers have faced difficulties with software development effort estimation (SDEE) since the 1960s. Both overestimation and underestimation are problematic for future software development. The software engineering field is continuously adopting new technologies and development methodologies, so there is always a need for an accurate SDEE method that can cater to the needs of the continually growing software industry. The major purpose of this state‐of‐the‐art review is to provide additional insight into existing SDEE studies while considering five points of reference: techniques used to construct models, strengths and weaknesses of different models, availability of benchmark data sets, data set characteristics, and generalization ability of models. We have performed a comprehensive review of SDEE studies published in the period 1981‐2016. We have defined a new scheme for categorizing existing SDEE models. We have found that a majority of available data sets do not include complete information on projects, which misleads the direction of research. To compare SDEE models, we recommend using the same data sets while focusing on specific aspects of accuracy, as no SDEE study has yet been able to compare all the existing models over the same data sets while considering the same aspects of accuracy.
