Missing Data Imputation for Geolocation-based Price Prediction Using KNN–MCF Method

Sanjar, Karshiev; Olimov, Bekhzod; Kim, Jae-Soo; Paul, Anand; Kim, Jeonghong

doi:10.3390/ijgi9040227

Cited by 31 publications

(17 citation statements)

References 23 publications

(30 reference statements)


“…The technique can aid the description, generalization and categorization of a given set of data by breaking the dataset into smaller subsets while incrementally developing an associated decision tree with decision nodes and leaf nodes. We used Grid search to get the best set of hyperparameters for the model, we tested different values for the min sample split s= [5,10,15,20] and s=10 was found to be the best for the model with max depth of 3. A 10-fold cross validation was used to estimate the performance of the model.…”

Section: Decision Tree (Dt) (mentioning)

confidence: 99%

“…LCS, however, are prone to various failures including bias, drifts, precision degradation, and loss of considerable amount of data due to operational issues [2]. Missing data is a pervasive issue which occur in most real-world datasets including medical records [3,4], geo-informatics [5], traffic flow [6] and industrial applications [7,8]. The European Union Data Quality Directive (EU-DQD) [9] defined the data quality objective (DQO) that a monitoring method needs to comply with to be used as indicative measurement for regulative purposes.…”

Section: Introduction (mentioning)

confidence: 99%


Missing Data Imputation on IoT Sensor Networks: Implications for on-Site Sensor Calibration

Okafor

2021

Preprint


IoT sensors are gaining popularity in environmental monitoring due to their relatively small size, cost of acquisition and ease of installation and operation. They are becoming an increasingly important supplement to traditional monitoring systems, particularly for in-situ monitoring. However, data collected by IoT sensors are often plagued with missing values, usually as a result of sensor faults, network failures, drifts and other operational issues. Several imputation strategies have been proposed for handling missing values in various application domains. This paper examines the performance of different imputation techniques, including Multiple Imputation by Chained Equations (MICE), random-forest-based imputation (missForest) and K-Nearest Neighbour (KNN), for handling missing values on sensor networks deployed for the quantification of greenhouse gases (GHGs). Two tasks were conducted. First, ozone (O3) and NO2/O3 concentration data, collected using Aeroqual and Cairclip sensors respectively over a six-month period, were corrupted by removing data intervals of different missing periods (p), where p ∈ {1 day, 1 week, 2 weeks, 1 month}, and also at random points in the dataset at varying proportions (r), where r ∈ {5%, 10%, 30%, 50%, 70%}. The missing data were then filled using the different imputation strategies and the imputation accuracy was calculated. Second, the performance of sensor calibration by different regression models, including Multi Linear Regression (MLR), Decision Tree (DT), Random Forest (RF) and XGBoost (XGB), trained on the different imputed datasets was evaluated. The analysis showed that MICE outperformed the other techniques in imputing the missing values on both the O3 and NO2/O3 datasets when missingness was introduced over periods p. MissForest, however, outperformed the rest when missingness was introduced as randomly occurring point errors.
While the analysis demonstrated the effects of missing and imputed data on sensor calibration, experimental results showed that a simple model trained on the imputed dataset can achieve state-of-the-art results on in-situ sensor calibration, improving the data quality of the sensor.
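
The random-point corruption and KNN filling described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the synthetic two-column dataset, the helper names (`mask_at_random`, `knn_impute`, `rmse`), the choice of k = 3 and the mean-imputation baseline are all assumptions made for illustration, and only a single column is corrupted. MICE and missForest are not shown.

```python
import math
import random

def mask_at_random(rows, col, r, seed=0):
    """Return a copy of rows with a fraction r of column `col` set to None."""
    rng = random.Random(seed)
    n_missing = int(round(r * len(rows)))
    missing_idx = set(rng.sample(range(len(rows)), n_missing))
    masked = [list(row) for row in rows]
    for i in missing_idx:
        masked[i][col] = None
    return masked, sorted(missing_idx)

def knn_impute(rows, col, k=3):
    """Fill None entries in `col` with the mean of that column over the k
    nearest complete rows (Euclidean distance on the remaining columns)."""
    other = [j for j in range(len(rows[0])) if j != col]
    donors = [row for row in rows if row[col] is not None]
    filled = [list(row) for row in rows]
    for row in filled:
        if row[col] is None:
            point = [row[j] for j in other]
            nearest = sorted(
                donors, key=lambda d: math.dist(point, [d[j] for j in other])
            )[:k]
            row[col] = sum(d[col] for d in nearest) / len(nearest)
    return filled

def rmse(truth, filled, idx, col):
    """Root-mean-square error over the positions that were masked."""
    return math.sqrt(sum((filled[i][col] - truth[i][col]) ** 2 for i in idx) / len(idx))

# Hypothetical synthetic data: column 0 is a reference value, column 1 a reading.
data = [(float(t), 2.0 * t + 1.0) for t in range(50)]

# Corrupt 30% of column 1 at random points (one of the proportions r listed above).
masked, missing_idx = mask_at_random(data, col=1, r=0.3)
filled = knn_impute(masked, col=1, k=3)

# Baseline for comparison: fill every gap with the mean of the observed values.
observed = [row[1] for row in masked if row[1] is not None]
mean_value = sum(observed) / len(observed)
mean_filled = [[row[0], mean_value if row[1] is None else row[1]] for row in masked]

knn_error = rmse(data, filled, missing_idx, col=1)
mean_error = rmse(data, mean_filled, missing_idx, col=1)
print(f"KNN RMSE: {knn_error:.3f}  mean-imputation RMSE: {mean_error:.3f}")
```

Scoring only the masked positions against the held-back true values mirrors the evaluation idea in the abstract: the data are corrupted deliberately so that imputation accuracy can be measured.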


“…It is most useful when there are only a few hyperparameters to optimize but would usually be outperformed by other weighted-random search algorithms when the model grows in complexity. We tested different values for the min sample split s= [5,10,15,20] and s=10 was found to be the best for the model with max depth of 3. A 10-fold cross validation was used to estimate the performance of the model.…”

Section: B Decision Tree (Dt) (mentioning)

confidence: 99%

“…LCS, however, are prone to diverse issues including bias, drifts, precision degradation, and loss of considerable amount of data due to operational issues [2]. Missing data is a pervasive issue, affecting most real-world datasets including medical records [3], [4], geo-informatics [5], traffic flow [6] and industrial applications [7], [8]. The European Union Data Quality Directive (EU-DQD) defined the data quality objective (DQO) that a monitoring method needs to comply with to be used as indicative measurement for regulative purposes [9].…”

Section: Introduction (mentioning)

confidence: 99%

Missing Data Imputation on IoT Sensor Networks: Implications for on-site Sensor Calibration

Okafor, Delaney

2021

Preprint


“…Later, the methods of mathematical statistics were introduced to simplify the behavior prediction into a two-category problem [16, 17]. KNN is one of the simplest classification methods, which is widely used in vehicle sales forecast [18], health monitoring [19], housing price forecast [20], and other fields. Similar to KNN, SVM is also a popular classification method, which is based on the structural risk minimization (SRM) principle of statistical learning theory and has excellent generalization performance [21–23].…”

Section: Introduction (mentioning)

confidence: 99%

An improved deep forest model for prediction of e-commerce consumers’ repurchase behavior

Zhang

Wang

2021

PLoS ONE


As the Internet retail industry continues to grow, more and more consumers choose to shop online, especially Chinese consumers. Using consumer behavior data left on the Internet to predict repurchase behavior is of great significance for companies seeking to achieve precision marketing. This paper proposes an improved deep forest model in which the interactive behavior characteristics of users and goods are added to the original feature model to predict the repurchase behavior of e-commerce consumers. Based on the Alibaba mobile e-commerce platform data set, we first construct a feature set that includes user characteristics, product characteristics, and interactive behavior characteristics, and then use the proposed model to make predictions. Experiments show that the model performs better overall, with higher accuracy, when the interactive behavior features are added. Compared with existing prediction models, the improved deep forest model has certain advantages: it not only improves prediction accuracy but also reduces the cost of training time.
