2017 3rd International Conference on Big Data Computing and Communications (BIGCOM)
DOI: 10.1109/bigcom.2017.48
Cluster-Based Best Match Scanning for Large-Scale Missing Data Imputation

Cited by 4 publications (5 citation statements) · References 8 publications
“…Yu et al. [3] proposed a modification of the K-NN imputation algorithm, known as Cluster-Based Best Match Scanning (CBMS), which improves computational complexity and space/memory usage while achieving accuracy comparable to K-NN. The simulation was carried out on a large smart-meter reading dataset, and imputation accuracy was measured using mean absolute deviation.…”
Section: A. Completeness DQD
confidence: 99%
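The cited paper does not reproduce the CBMS algorithm itself, but the general idea described above — restrict the nearest-neighbor scan to one cluster of complete records instead of the whole dataset, then copy missing fields from the best-matching record — can be sketched as follows. This is a minimal illustration of that idea, not the authors' implementation; the clustering step (a plain Lloyd k-means), the cluster count, and the distance on observed dimensions only are all assumptions.

```python
import numpy as np

def cluster_impute(data, n_clusters=4, seed=0):
    """Sketch of cluster-based best-match imputation: cluster the
    complete rows, then for each incomplete row scan only the nearest
    cluster for its best match (closest on the observed dimensions)
    and copy that match's values into the missing fields. Limiting
    the scan to one cluster is what reduces time and memory versus
    scanning the whole dataset as plain K-NN imputation does."""
    rng = np.random.default_rng(seed)
    complete = data[~np.isnan(data).any(axis=1)]
    # Simple Lloyd k-means on the complete rows.
    centers = complete[rng.choice(len(complete), n_clusters, replace=False)]
    for _ in range(10):
        labels = np.argmin(((complete[:, None] - centers) ** 2).sum(-1), axis=1)
        for c in range(n_clusters):
            if (labels == c).any():
                centers[c] = complete[labels == c].mean(axis=0)
    out = data.copy()
    for i, row in enumerate(out):
        miss = np.isnan(row)
        if not miss.any():
            continue
        obs = ~miss
        # Nearest cluster, measured on observed dimensions only.
        c = np.argmin(((centers[:, obs] - row[obs]) ** 2).sum(-1))
        members = complete[labels == c]
        if len(members) == 0:
            members = complete  # fall back to a full scan if the cluster is empty
        # Best match = cluster member closest on the observed dimensions.
        best = members[np.argmin(((members[:, obs] - row[obs]) ** 2).sum(-1))]
        out[i, miss] = best[miss]
    return out

def mad(truth, imputed):
    """Mean absolute deviation, the accuracy metric named in the excerpt."""
    return np.abs(truth - imputed).mean()
```

Comparing `mad(truth, cluster_impute(data))` against a full K-NN scan on the same held-out cells is the kind of accuracy test the excerpt describes.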
“…Wu and Zhu [16] proposed two main approaches to the problem of noisy data: 1) applying data cleansing methods to eliminate data quality issues as far as possible, and 2) making data mining applications more robust so that they tolerate the presence of noisy data. The first approach has several drawbacks: (1) data cleansing algorithms handle only certain types of errors, (2) data cleansing cannot produce perfect data, (3) data cleansing cannot always be applied to all data sources, (4) eliminating noisy data may discard data that is crucial for further mining/analytics, and (5) after cleansing, the mining/analytics algorithm can no longer consider the original data-source context. The second approach, making data mining applications more tolerant of noisy data, rests on an important assumption: that the types of errors present in a dataset are sufficiently well known before the actual analytics is applied.…”
Section: B. Accuracy DQD
confidence: 99%