An automated data cleaning method for Electronic Health Records by incorporating clinical knowledge

Shi, Xi; Prins, Charlotte; Pottelbergh, Gijs Van; Mamouris, Pavlos; Vaes, Bert; Moor, Bart De

doi:10.1186/s12911-021-01630-7

Cited by 16 publications

(12 citation statements)

References 20 publications


“…8 Algorithms and programs have also been designed that not only detect pre-existing errors in the process of data cleaning, but also remove and correct diagnosed errors. 14,18 Semi-automatic procedures may complement automatic procedures in data gathering that can further improve the quality of the extraction process with data cleaning. 10…”

Section: Discussion (mentioning)

confidence: 99%

The Data Error Criteria (DEC) for retrospective studies: development and preliminary application

Buczek, Azar, Bauzon, et al. 2023

Journal of Investigative Medicine


Retrospective chart review (RCR) studies rely on the collection and analysis of documented clinical data, a process that can be prone to errors. The aim of this study was to develop a defined set of criteria to evaluate RCR datasets for potential data errors. The Data Error Criteria (DEC) were developed by identifying data coding and data entry errors via literature review and then classifying them based on error types. Three components comprise the DEC: general errors, numerical-specific errors, and categorical variable-specific errors. Two reviewers independently applied these criteria via a manual review process to an existing de-identified database. A total of 10,168 errors were identified out of a total of 28,656 data points. The total number of errors included redundancies as certain errors may be included in multiple categories. These included 2515 general errors, 39 numerical-specific errors, and 7614 categorical variable-specific errors. Input-related categorical variable-specific errors occurred most frequently, followed by errors secondary to blank cells. Inter-rater agreement was near perfect for all categories. Identifying errors outlined in the DEC can be crucial for the data analysis stage as they can lead to inaccurate calculations and delay study timelines. The DEC offers a framework to evaluate datasets while reducing time and efforts needed to create high-quality RCR-related databases.
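The counts reported in this abstract can be checked with a few lines of arithmetic. The sketch below only restates the published numbers; the variable names are illustrative, and because the abstract notes that one error may fall into several categories, the errors-per-data-point figure is not a true per-cell error rate.

```python
# Counts taken directly from the abstract above.
total_errors = 10168
data_points = 28656
by_category = {
    "general": 2515,
    "numerical-specific": 39,
    "categorical variable-specific": 7614,
}

# The three DEC components should add up to the reported total.
assert sum(by_category.values()) == total_errors

# Share of all identified errors contributed by each component.
shares = {name: count / total_errors for name, count in by_category.items()}

# Errors per data point. Not a per-cell error rate, since the abstract
# states that a single error may be counted in multiple categories.
errors_per_point = total_errors / data_points

for name, share in shares.items():
    print(f"{name}: {share:.1%} of identified errors")
print(f"errors per data point: {errors_per_point:.3f}")
```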



“…In general, MIMIC and eICU-CRD may be excellent benchmark databases, but we found that "real-world" data exported directly from a hospital's IT infrastructure pose many challenges that are not present in these databases. [26] presented a medical data cleaning pipeline that explicitly addresses some of the issues that we also encountered in our research. They considered laboratory tests and similar measurements and proposed manually curated validation rules for numerical variables and an automatic strategy for harmonizing (misspelled) units of measurement through fuzzy search and variable-dependent conversion rules.…”

Section: Xsl • Fo (mentioning)

confidence: 99%
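The statement above describes two ideas: validation rules for numerical variables, and harmonizing misspelled units through fuzzy search followed by variable-dependent conversion. A minimal sketch of how these could fit together is given below. It is not the pipeline of Shi et al; the unit list, target units, valid ranges, cutoff, and function names are all assumptions made for illustration, and fuzzy matching is done here with Python's standard `difflib`.

```python
from difflib import get_close_matches

# Hypothetical canonical unit spellings (assumption, not from the cited paper).
CANONICAL_UNITS = ["mg/dL", "mmol/L", "g/dL", "g/L"]

# Hypothetical target unit per variable.
TARGET_UNIT = {"glucose": "mmol/L", "hemoglobin": "g/dL"}

# Variable-dependent conversion factors: (variable, source unit) -> multiplier.
CONVERSIONS = {
    ("glucose", "mg/dL"): 10 / 180.16,  # molar mass of glucose is about 180.16 g/mol
    ("hemoglobin", "g/L"): 0.1,
}

# Illustrative plausibility ranges in the target unit (not clinical guidance).
VALID_RANGE = {"glucose": (0.5, 60.0), "hemoglobin": (1.0, 25.0)}


def harmonize_unit(raw, cutoff=0.8):
    """Map a possibly misspelled unit string to a canonical spelling, or None."""
    lookup = {unit.lower(): unit for unit in CANONICAL_UNITS}
    matches = get_close_matches(raw.strip().lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


def clean_measurement(variable, value, raw_unit):
    """Return (value, unit) in the target unit, or None if the record is rejected."""
    unit = harmonize_unit(raw_unit)
    if unit is None or variable not in TARGET_UNIT:
        return None
    target = TARGET_UNIT[variable]
    if unit != target:
        factor = CONVERSIONS.get((variable, unit))
        if factor is None:
            return None  # no known conversion for this variable/unit pair
        value = value * factor
    low, high = VALID_RANGE[variable]
    if not (low <= value <= high):
        return None  # fails the validation rule
    return value, target


print(clean_measurement("glucose", 90.0, "mgdl"))
print(clean_measurement("glucose", 9000.0, "mg/dL"))
```

In this sketch a record is dropped rather than repaired when no unit match, conversion, or valid range applies; a real pipeline might instead flag such records for manual review.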

“…They considered laboratory tests and similar measurements and proposed manually curated validation rules for numerical variables and an automatic strategy for harmonizing (misspelled) units of measurement through fuzzy search and variable-dependent conversion rules. The focus of Shi et al [26] is on improving the quality of data [27][28][29], whereas Wang et al [15], Tang et al [16], and Mandyam et al [17] are mainly concerned with transforming data into a form suitable for ML. A more detailed evaluation of FIDDLE, MIMIC-Extract, and cleaning and organization pipeline for EHR computational and analytical tasks and the approach to our data by Shi et al [26] can be found in Multimedia Appendix 1 [15][16][17]26].…”

Section: Xsl • Fo (mentioning)

confidence: 99%

“…The focus of Shi et al [26] is on improving the quality of data [27][28][29], whereas Wang et al [15], Tang et al [16], and Mandyam et al [17] are mainly concerned with transforming data into a form suitable for ML. A more detailed evaluation of FIDDLE, MIMIC-Extract, and cleaning and organization pipeline for EHR computational and analytical tasks and the approach to our data by Shi et al [26] can be found in Multimedia Appendix 1 [15][16][17]26].…”

Section: Xsl • Fo (mentioning)

confidence: 99%

“…Shi et al [26] presented a medical data cleaning pipeline that explicitly addresses some of the issues that we also encountered in our research. They considered laboratory tests and similar measurements and proposed manually curated validation rules for numerical variables and an automatic strategy for harmonizing (misspelled) units of measurement through fuzzy search and variable-dependent conversion rules.…”

Section: Introduction (mentioning)

confidence: 99%


Lifting Hospital Electronic Health Record Data Treasures: Challenges and Opportunities

et al. 2022


Electronic health records (EHRs) have been successfully used in data science and machine learning projects. However, most of these data are collected for clinical use rather than for retrospective analysis. This means that researchers typically face many different issues when attempting to access and prepare the data for secondary use. We aimed to investigate how raw EHRs can be accessed and prepared in retrospective data science projects in a disciplined, effective, and efficient way. We report our experience and findings from a large-scale data science project analyzing routinely acquired retrospective data from the Kepler University Hospital in Linz, Austria. The project involved data collection from more than 150,000 patients over a period of 10 years. It included diverse data modalities, such as static demographic data, irregularly acquired laboratory test results, regularly sampled vital signs, and high-frequency physiological waveform signals. Raw medical data can be corrupted in many unexpected ways that demand thorough manual inspection and highly individualized data cleaning solutions. We present a general data preparation workflow, which was shaped in the course of our project and consists of the following 7 steps: obtain a rough overview of the available EHR data, define clinically meaningful labels for supervised learning, extract relevant data from the hospital’s data warehouses, match data extracted from different sources, deidentify them, detect errors and inconsistencies therein through a careful exploratory analysis, and implement a suitable data processing pipeline in actual code. Only few of the data preparation issues encountered in our project were addressed by generic medical data preprocessing tools that have been proposed recently. Instead, highly individualized solutions for the specific data used in one’s own research seem inevitable. 
We believe that the proposed workflow can serve as guidance for practitioners, helping them to identify and address potential problems early and avoid some common pitfalls.


Development and validation of the SickKids Enterprise-wide Data in Azure Repository (SEDAR)

Guo, Calligan, Vettese, et al. 2023

Heliyon

