Cross-validation based K nearest neighbor imputation for software quality datasets: An empirical study

Huang, Jianglin; Keung, Jacky; Sarro, Federica; Li, Yan‐Fu; Yu, Yuen-Tak; Chan, W. K.; Sun, Hongyi

doi:10.1016/j.jss.2017.07.012

Cited by 74 publications

(40 citation statements)

References 65 publications

(95 reference statements)

“…Missing Data Ignoring can be recommended in the case of MCAR found in a dataset or with a low level of missing data [17,24] b) Missing Data Toleration: The strategy of this technique is based on the internal treatment where missing data in the dataset is tolerated and analysis is directly performed on the dataset. One such kind of toleration approach is to assign a NULL value to replace the missing piece of data [17,18,26]. c) Missing Data Imputation: There are various strategies employed for missing data imputation, in which the missing values found in the dataset are filled, which lets the complete dataset being analyzed.…”

Section: Mechanisms Of Missing (mentioning)

confidence: 99%
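
The contrast between ignoring missing data and imputing it, as described in the statement above, can be illustrated with a minimal sketch. The toy `rows` data and the helper names `listwise_delete` and `mean_impute` are assumptions made for illustration only; they do not come from the cited papers, and mean imputation stands in here as the simplest imputation strategy.

```python
from statistics import mean

# Hypothetical toy dataset: each row is a project, None marks a missing value.
rows = [
    [1.0, 10.0],
    [2.0, None],
    [3.0, 30.0],
    [None, 40.0],
]

def listwise_delete(data):
    """Missing data ignoring: drop every row that contains a missing value."""
    return [r for r in data if all(v is not None for v in r)]

def mean_impute(data):
    """Missing data imputation: fill each gap with the mean of the observed column values."""
    n_cols = len(data[0])
    col_means = [
        mean(r[j] for r in data if r[j] is not None) for j in range(n_cols)
    ]
    return [
        [col_means[j] if r[j] is None else r[j] for j in range(n_cols)]
        for r in data
    ]

deleted = listwise_delete(rows)   # keeps only the fully observed rows
imputed = mean_impute(rows)       # keeps all rows, with gaps filled
```

Deletion shrinks the dataset to the complete rows, while imputation keeps every row, which is why the choice matters more as the share of missing values grows.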

“…Idri, et al [18] conducted a study to evaluate the impact of different missing data techniques on ABE using KNN. Huang, et al [17] performed an empirical study on cross-validation of KNN imputation for software quality dataset, though the study compared KNN imputation and Mean imputation, it was specifically on software quality dataset, they did not focus on estimation or ABE. The related studies indicate the importance of imputing the missing data in past projects, especially for ABE.…”

Section: Related Work (mentioning)

confidence: 99%
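
One way to picture cross-validation based KNN imputation is sketched below: known cells are hidden fold by fold, each candidate K is used to re-estimate them, and the K with the lowest error is kept. This is a simplified illustration under stated assumptions, not the procedure from Huang et al. The function names `knn_impute_value` and `choose_k`, the distance measure, the error metric (mean absolute error), the fold count, and the toy data are all hypothetical choices.

```python
import math
import random

def knn_impute_value(data, row_idx, col, k):
    """Estimate data[row_idx][col] as the mean of that column over the k nearest donor rows."""
    target = data[row_idx]
    candidates = []
    for i, other in enumerate(data):
        if i == row_idx or other[col] is None:
            continue
        # Compare only columns observed in both rows, excluding the column being imputed.
        shared = [j for j in range(len(target))
                  if j != col and target[j] is not None and other[j] is not None]
        if not shared:
            continue
        dist = math.sqrt(sum((target[j] - other[j]) ** 2 for j in shared) / len(shared))
        candidates.append((dist, other[col]))
    if not candidates:
        # Assumed fallback: use the column mean when no comparable donor exists.
        observed = [r[col] for r in data if r[col] is not None]
        return sum(observed) / len(observed)
    candidates.sort(key=lambda c: c[0])
    nearest = candidates[:k]
    return sum(v for _, v in nearest) / len(nearest)

def choose_k(data, candidate_ks, n_folds=5, seed=0):
    """Return the k with the lowest mean absolute error on artificially hidden cells."""
    cells = [(i, j) for i, row in enumerate(data)
             for j, v in enumerate(row) if v is not None]
    random.Random(seed).shuffle(cells)
    folds = [cells[f::n_folds] for f in range(n_folds)]
    errors = {}
    for k in candidate_ks:
        total, count = 0.0, 0
        for fold in folds:
            # Hide this fold's cells, then try to recover them from the rest.
            masked = [list(row) for row in data]
            for i, j in fold:
                masked[i][j] = None
            for i, j in fold:
                total += abs(knn_impute_value(masked, i, j, k) - data[i][j])
                count += 1
        errors[k] = total / count
    return min(errors, key=errors.get), errors

# Hypothetical complete dataset with three correlated numeric columns.
data = [[float(i), 2.0 * i, 3.0 * i] for i in range(12)]
best_k, errors = choose_k(data, candidate_ks=[1, 3, 5])
```

The selected `best_k` would then be used to impute the cells that are genuinely missing. In practice the features would usually be normalized before distances are computed, which this sketch leaves out.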

MINN: A Missing Data Imputation Technique for Analogy-based Effort Estimation

Shah, Dayang, Adham, et al. 2019. ijacsa

Success and failure of a complex software project are strongly associated with the accurate estimation of development effort. There are numerous estimation models developed but the most widely used among those is Analogy-Based Estimation (ABE). ABE model follows human nature as it estimates the future project's effort by making analogies with the past project's data. Since ABE relies on the historical datasets, the quality of the datasets affects the accuracy of estimation. Most of the software engineering datasets have missing values. The researchers either delete the projects containing missing values or avoid treating the missing values which reduce the ABE performance. In this study, Numeric Cleansing (NC), K-Nearest Neighbor Imputation (KNNI) and Median Imputation of the Nearest Neighbor (MINN) methods are used to impute the missing values in Desharnais and DesMiss datasets for ABE. MINN technique is introduced in this study. A comparison among these imputation methods is performed to identify the suitable missing data imputation method for ABE. The results suggested that MINN imputes more realistic values in the missing datasets as compared to values imputed through NC and KNNI. It was also found that the imputation treatment method helped in better prediction of the software development effort on ABE model.

“…In this study, the optimal choice of K was determined by 10-fold cross-validation [37]. Optimal K based on research from [35], [38] was used in this study.…”

Section: The Multiple Face Recognition Algorithm (mentioning)

confidence: 99%

Efficient K-Nearest Neighbor Searches for Multiple-Face Recognition in the Classroom based on Three Levels DWT-PCA

Santoso, Harjoko, Putra. 2017. ijacsa

The main weakness of the k-Nearest Neighbor algorithm in face recognition is that it calculates the distance to, and sorts, all training data on each prediction, which can be slow if there are a large number of training instances. This problem can be solved by utilizing the priority k-d tree search to speed up the process of k-NN classification. This paper proposes a method for student attendance systems in the classroom using facial recognition techniques by combining three levels of Discrete Wavelet Transforms (DWT) and Principal Component Analysis (PCA) to extract facial features, followed by applying the priority k-d tree search to speed up facial classification using k-Nearest Neighbor. The proposed algorithm is tested on two datasets: the Honda/UCSD video dataset and our own dataset (AtmafaceDB). This research looks for the best value of k for facial recognition using k-fold cross-validation. 10-fold cross-validation at level 3 DWT-PCA shows that face recognition using k-Nearest Neighbor on our dataset is 95.56% with k = 5, whereas on the Honda/UCSD dataset it is only 82% with k = 3. The proposed method gives a recognition time of 40 milliseconds on our dataset.

“…The second solution is based on missing value imputation. It can provide estimations for missing values by reasoning from the observed data (i.e., complete data) [13, 14, 20]. …”

Section: Introduction (mentioning)

confidence: 99%

“…The experimental results have shown that missing value imputation is a better choice than case deletion when the incomplete datasets contain a certain amount of missing values. Model-based missing value imputation algorithms based on machine learning techniques, such as k-nearest neighbor, multilayer perceptron neural networks, and support vector machines, have recently been widely considered [14, 16, 21]. …”

Section: Introduction (mentioning)

confidence: 99%

Outlier Removal in Model-Based Missing Value Imputation for Medical Datasets

Huang, Lin, Tsai. 2018. Journal of Healthcare Engineering

Many real-world medical datasets contain some proportion of missing (attribute) values. In general, missing value imputation can be performed to solve this problem, which is to provide estimations for the missing values by a reasoning process based on the (complete) observed data. However, if the observed data contain some noisy information or outliers, the estimations of the missing values may not be reliable or may even be quite different from the real values. The aim of this paper is to examine whether a combination of instance selection from the observed data and missing value imputation offers better performance than performing missing value imputation alone. In particular, three instance selection algorithms, DROP3, GA, and IB3, and three imputation algorithms, KNNI, MLP, and SVM, are used in order to find out the best combination. The experimental results show that performing instance selection can have a positive impact on missing value imputation over the numerical data type of medical datasets, and specific combinations of instance selection and imputation methods can improve the imputation results over the mixed data type of medical datasets. However, instance selection does not have a definitely positive impact on the imputation result for categorical medical datasets.
