Advanced methods for missing values imputation based on similarity learning

Fouad, Khaled M.; Ismail, Mahmoud M.; Azar, Ahmad Taher; Arafa, Mona M.

doi:10.7717/peerj-cs.619

Cited by 21 publications

(5 citation statements)

References 41 publications


“…The easiest way to deal with the missing data is to remove the corresponding entries completely [47], but that would lead to loss of crucial information. Another method is to impute the missing values with the mean value of the available data [48], however that would not preserve the relationships between inputs and outputs.…”

Section: Techniques To Handle Erroneous and Missing Data (mentioning)

confidence: 99%

Unveil the unseen: Exploit information hidden in noise

Zviazhynski, Conduit. 2022. Appl Intell


Noise and uncertainty are usually the enemy of machine learning, noise in training data leads to uncertainty and inaccuracy in the predictions. However, we develop a machine learning architecture that extracts crucial information out of the noise itself to improve the predictions. The phenomenology computes and then utilizes uncertainty in one target variable to predict a second target variable. We apply this formalism to PbZr0.7Sn0.3O3 crystal, using the uncertainty in dielectric constant to extrapolate heat capacity, correctly predicting a phase transition that otherwise cannot be extrapolated. For the second example – single-particle diffraction of droplets – we utilize the particle count together with its uncertainty to extrapolate the ground truth diffraction amplitude, delivering better predictions than when we utilize only the particle count. Our generic formalism enables the exploitation of uncertainty in machine learning, which has a broad range of applications in the physical sciences and beyond.
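The first citation statement above contrasts dropping incomplete entries with mean imputation. As a minimal sketch of that trade-off, the Python below compares the two on a perfectly linear toy relationship. The data and the helper `pearson` are hypothetical and are not taken from either paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: y = 2x exactly; None marks a missing input value.
x = [1.0, 2.0, None, 4.0, None, 6.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

# Option 1: remove every record that has a missing input.
complete = [(a, b) for a, b in zip(x, y) if a is not None]
x_del = [a for a, _ in complete]
y_del = [b for _, b in complete]

# Option 2: replace each missing input with the mean of the observed inputs.
observed_mean = sum(x_del) / len(x_del)
x_mean = [observed_mean if a is None else a for a in x]

r_deleted = pearson(x_del, y_del)  # relationship intact, fewer records
r_mean = pearson(x_mean, y)        # all records kept, relationship weakened

print(len(x_del), r_deleted, r_mean)
```

On this toy data, deletion keeps the correlation at 1 but discards two of the six records, while mean imputation keeps every record and pulls the correlation below 1. That is the loss of input-output relationship the quoted statement warns about.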



“…In this study, an iterative sequential imputation process was executed via regression with the lightgbm algorithm. Other techniques have been proposed such as a hybrid missing data imputation method incorporating records similarity using the global correlation structure by using k-nearest neighbors and iterative imputation algorithms 86 or by merits integration of decision trees and fuzzy clustering into an iterative learning approach. 87 A quantile-based discretization function was performed in this study to discretize features into bins.…”

Section: Discussion (mentioning)

confidence: 99%

Data-Driven Quantitative Intrinsic Hazard Criteria for Nanoproduct Development in a Safe-by-Design Paradigm: A Case Study of Silver Nanoforms

Furxhi, Bengalli, Motta, et al. 2023. ACS Appl. Nano Mater.


The current European (EU) policies, that is, the Green Deal, envisage safe and sustainable practices for chemicals, which include nanoforms (NFs), at the earliest stages of innovation. A theoretically safe and sustainable by design (SSbD) framework has been established from EU collaborative efforts toward the definition of quantitative criteria in each SSbD dimension, namely, the human and environmental safety dimension and the environmental, social, and economic sustainability dimensions. In this study, we target the safety dimension, and we demonstrate the journey toward quantitative intrinsic hazard criteria derived from findable, accessible, interoperable, and reusable data. Data were curated and merged for the development of new approach methodologies, that is, quantitative structure−activity relationship models based on regression and classification machine learning algorithms, with the intent to predict a hazard class. The models utilize system (i.e., hydrodynamic size and polydispersity index) and non-system (i.e., elemental composition and core size)-dependent nanoscale features in combination with biological in vitro attributes and experimental conditions for various silver NFs, functional antimicrobial textiles, and cosmetics applications. In a second step, interpretable rules (criteria) followed by a certainty factor were obtained by exploiting a Bayesian network structure crafted by expert reasoning. The probabilistic model shows a predictive capability of ≈78% (average accuracy across all hazard classes). In this work, we show how we shifted from the conceptualization of the SSbD framework toward the realistic implementation with pragmatic instances. This study reveals (i) quantitative intrinsic hazard criteria to be considered in the safety aspects during synthesis stage, (ii) the challenges within, and (iii) the future directions for the generation and distillation of such criteria that can feed SSbD paradigms. 
Specifically, the criteria can guide material engineers to synthesize NFs that are inherently safer from alternative nanoformulations, at the earliest stages of innovation, while the models enable a fast and cost-efficient in silico toxicological screening of previously synthesized and hypothetical scenarios of yet-to-be synthesized NFs.
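The citation statement for this paper mentions a quantile-based discretization function that sorts features into bins. A minimal sketch of the general idea, equal-frequency binning by rank, is shown below. The function name `quantile_bins` and the sample values are hypothetical, and the cited study's actual implementation is not shown in the source.

```python
def quantile_bins(values, n_bins):
    """Assign each value a bin index in 0..n_bins-1 so bins hold equal counts.

    Rank-based sketch: tied values may land in different bins, which
    library implementations typically handle differently.
    """
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        # Map the rank proportionally onto the available bins.
        bins[idx] = rank * n_bins // len(values)
    return bins

# Hypothetical feature column, discretized into quartiles.
feature = [0.1, 5.0, 0.3, 9.0, 2.0, 7.5, 0.2, 3.1]
labels = quantile_bins(feature, n_bins=4)
print(labels)  # [0, 2, 1, 3, 1, 3, 0, 2]
```

Each of the four bins receives two of the eight values, regardless of how unevenly the raw values are spread.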


“…Khaled M. Fouad. et al [14] proposed a method incorporating KNN and iterative imputation algorithms which can impute missing data depended on the similarity between records.…”

Section: Discussion (mentioning)

confidence: 99%

A novel method for handling missing data in health care real-world study: Optimal Intact Subset Method

Chang, Tong, et al. 2022. Preprint


Background Handling missing data is indispensable in health care real-world data processing. Deleting or imputing missing data may introduce error or lead to multicollinearity. Therefore, we tried to explore a novel missing data processing method to avoid the above issues. Method By exploring an optimal deleting way of columns and rows with missing data, we developed a missing data processing method which can retain most information of original datasets. Traditionally, the goal can be realized by traversing all possible deleting combinations. But the computational cost is too high to use in large datasets. Therefore, we established an Optimal Intact Subset Method (OIS.Method) by using an indicator containing missing information of both columns and rows to determine an optimal deleting order of columns. OIS.Method can ascertain the optimal deleting way and simplify computing meanwhile. In order to validate the effectiveness of OIS.Method, we compared OIS.Method with five other data-imputation methods in 700 classification datasets (simulated datasets 1) generated by computer. In order to simulate real-world datasets, we generated simulated datasets 2: introducing redundant variables in simulated datasets 1. We also compared OIS.Method with control methods on that. Finally, we validated OIS.Method in two real-world classification tasks: 1. predict the risk of hypotension during dialysis, 2. predict the risk of drug adverse reaction in elderly patients with type 2 diabetes. Results In simulated datasets 1, we found that OIS.Method performed well when the distribution of missing data was unbalanced among columns. In simulated datasets 2, the comprehensive performance of OIS.Method was better in all evaluating dimensions. In two real-world datasets, OIS.Method could acquire better classification performance. 
We used the area under the ROC curve (AUC) to evaluate it: OIS.Method vs. Simple Impute vs. Random Forest vs. Modified Random Forest, 0.8179 ± 0.0005 vs. 0.8116 ± 0.0002 vs. 0.8087 ± 0.0009 vs. 0.8093 ± 0.0014 in task 1, and 0.7028 vs. 0.6963 vs. 0.6957 vs. 0.6699 in task 2. Conclusions Our study provided a novel method for handling missing data in real-world studies. Compared with other existing missing data processing methods, the computational cost of OIS.Method is lower, and OIS.Method can reflect the true data situation of original datasets. Moreover, OIS.Method is well-suited for real-world datasets with large sample size and multiple variables.
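The citation statement for this preprint describes the cited work as combining KNN with iterative imputation, filling gaps from similar records. The sketch below illustrates only the generic nearest-neighbor step under stated assumptions. The function `knn_impute`, its distance definition, and the sample records are hypothetical, and this is not the method of Fouad et al.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the column mean over the k most similar rows.

    Similarity is a root-mean-square distance over the features both rows
    have observed (an assumption made for this sketch).
    """
    def distance(a, b):
        shared = [(u, v) for u, v in zip(a, b)
                  if u is not None and v is not None]
        if not shared:
            return math.inf  # no common features, so not comparable
        return math.sqrt(sum((u - v) ** 2 for u, v in shared) / len(shared))

    filled = [list(row) for row in rows]  # leave the input untouched
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that have column j observed.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:
                filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled

# Hypothetical records; the second one is missing its third feature.
records = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [0.9, 1.9, 2.8],
    [8.0, 9.0, 10.0],
]
completed = knn_impute(records, k=2)
print(completed[1])
```

Here the missing value is filled from the two closest records (the first and third), so the distant fourth record does not influence it. An iterative variant would repeat such passes, re-estimating values until they stabilize.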

