Identifying and categorizing spurious weight data in electronic medical records

Chen, Sunny; Banks, William A.; Sheffrin, Meera; Bryson, William C.; Black, Marissa; Thielke, Stephen

doi:10.1093/ajcn/nqx056

Cited by 18 publications

(23 citation statements)

References 8 publications


“…SITAR (Superimposition by Translation And Rotation) [28] and the 'Outliergram' [29] are visualisation methods that allow individual trajectories to be viewed but are specific to each dataset they are applied to and require subjective judgements to be made, which can be time consuming when applied to large datasets. Algorithms that examine the change between two measurements are simple to apply in comparison with many longitudinal methods but are limited by poor specificity and are not capable of identifying consecutive errors [30]. Daymont and colleagues designed an automated data cleaning technique based on exponentially weighted moving average standard deviation scores combined with a decision-making algorithm to identify implausible growth data.…”

Section: Introduction (mentioning)

confidence: 99%
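
The limitation noted in the statement above (poor specificity, and no ability to catch consecutive errors) can be illustrated with a minimal sketch of a change-between-two-measurements rule. This is not the method of any cited paper: the function name `flag_large_changes`, the 20% threshold, and the weight values are all hypothetical, chosen only for illustration.

```python
def flag_large_changes(weights, max_rel_change=0.2):
    """Flag a measurement when it differs from the previous one by more
    than max_rel_change, expressed as a fraction of the previous value.
    The 0.2 default is an assumed, illustrative threshold."""
    flags = [False] * len(weights)
    for i in range(1, len(weights)):
        prev, cur = weights[i - 1], weights[i]
        if abs(cur - prev) / prev > max_rel_change:
            flags[i] = True
    return flags


# Hypothetical weights in kg; 170.0 and 171.0 stand in for entry errors.
single_error = [70.0, 71.0, 170.0, 72.0]
consecutive_errors = [70.0, 71.0, 170.0, 171.0, 72.0]

print(flag_large_changes(single_error))
# [False, False, True, True]
print(flag_large_changes(consecutive_errors))
# [False, False, True, False, True]
```

In the first series the valid 72.0 is flagged along with the error, because the rule only sees the jump back down. In the second, the repeated error 171.0 passes unflagged because it sits close to the erroneous value before it.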

Is it time to stop sweeping data cleaning under the carpet? A novel algorithm for outlier management in growth data

et al. 2020


All data are prone to error and require data cleaning prior to analysis. An important example is longitudinal growth data, for which there are no universally agreed standard methods for identifying and removing implausible values and many existing methods have limitations that restrict their usage across different domains. A decision-making algorithm that modified or deleted growth measurements based on a combination of pre-defined cutoffs and logic rules was designed. Five data cleaning methods for growth were tested with and without the addition of the algorithm and applied to five different longitudinal growth datasets: four uncleaned canine weight or height datasets and one pre-cleaned human weight dataset with randomly simulated errors. Prior to the addition of the algorithm, data cleaning based on non-linear mixed effects models was the most effective in all datasets and had on average a minimum of 26.00% higher sensitivity and 0.12% higher specificity than other methods. Data cleaning methods using the algorithm had improved data preservation and were capable of correcting simulated errors according to the gold standard; returning a value to its original state prior to error simulation. The algorithm improved the performance of all data cleaning methods and increased the average sensitivity and specificity of the non-linear mixed effects model method by 7.68% and 0.42% respectively. Using non-linear mixed effects models combined with the algorithm to clean data allows individual growth trajectories to vary from the population by using repeated longitudinal measurements, identifies consecutive errors or those within the first data entry, avoids the requirement for a minimum number of data entries, preserves data where possible by correcting errors rather than deleting them and removes duplications intelligently. 
This algorithm is broadly applicable to cleaning anthropometric data in different mammalian species and could be adapted for use in a range of other domains.
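
The abstract compares cleaning methods by sensitivity and specificity. As a rough sketch of how those two figures are obtained when the true error status of each measurement is known (as with simulated errors), the following computes them from two parallel lists of booleans. The function name and the example labels are hypothetical and are not taken from the paper.

```python
def sensitivity_specificity(is_error, flagged):
    """Return (sensitivity, specificity) for a cleaning method.

    is_error: True where a measurement is a known (e.g. simulated) error.
    flagged:  True where the cleaning method marked the measurement.
    """
    pairs = list(zip(is_error, flagged))
    tp = sum(e and f for e, f in pairs)          # errors correctly flagged
    fn = sum(e and not f for e, f in pairs)      # errors missed
    tn = sum(not e and not f for e, f in pairs)  # valid values left alone
    fp = sum(not e and f for e, f in pairs)      # valid values wrongly flagged
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical example: two true errors, one caught; one valid value wrongly flagged.
is_error = [False, False, True, True, False]
flagged = [False, False, True, False, True]
sens, spec = sensitivity_specificity(is_error, flagged)
print(sens, round(spec, 3))
# 0.5 0.667
```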


“…Approaches include the use of change scores (13, 18) and % change (3), residuals from OLS regression (28, 32), multilevel models (MLM) (23, 35) and non-parametric smoothing routines (9), conditional growth scores (36) and ratios of Euclidean distances between a set of 3 measures (3). Some studies report using manual verification of growth histories (3, 9, 15, 17, 26), which is considered the gold standard approach. Certain methods require more data points which may be a limitation, and some methods perform poorly when the error load (magnitude and frequency of errors) is high (9, 35).…”

Section: Methods (mentioning)

confidence: 99%

“…Lawman et al.'s 2016 review (8) of approaches to identify height and weight BIVs was used as a starting point to identify the different approaches that have been used for error detection. The 12 studies reported in their Table 1 (11-22) were screened along with all subsequent citations of Lawman et al (8) up to Oct 2020 (n=12) (3, 9, 23-32). The reference lists of these citing papers were also screened to identify methodological papers describing cleaning algorithms or approaches (n=5) (7, 33-36).…”

Section: Scoping Review Of Approaches For Identifying Errors (mentioning)

confidence: 99%

“…Data entry error rates have been shown to vary from 0.05 to 9% (2) and since growth or change is characterised by repeated measures, as well as sex and age (often derived from two dates), the error rate per individual will be much higher. One study found at least one error in the clinical weight measurements of 20% of patients (3). A single error will also contaminate measures of growth and any derived variables such as BMI.…”

Section: Introduction (mentioning)

confidence: 99%
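
The claim in the statement above, that the error rate per individual is much higher than the rate per data entry, can be made concrete under a simplifying assumption that is not in the source: each of an individual's n measurements is wrong independently, with the same probability p. The values of p and n in the example are illustrative only.

```latex
P(\text{at least one error among } n \text{ measurements}) = 1 - (1 - p)^{n}

% Illustrative values (assumed): p = 0.01, n = 10
1 - (1 - 0.01)^{10} = 1 - 0.99^{10} \approx 0.096
```

Under these assumptions, a 1% per-entry error rate leaves roughly 1 in 10 individuals with at least one erroneous measurement across 10 entries.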


Screening & diagnosing errors in longitudinal measures of body size

Wills

2020

Preprint


This paper presents a novel multi-step automated algorithm to screen for errors in longitudinal height and weight data and describes the frequency and characteristics of errors in three datasets. It also offers a taxonomy of published cleaning routines from a scoping review. Illustrative data are from three Norwegian retrospective cohorts containing 87,792 assessments (birth to 14y) from 8,428 children. Each has different data pipelines, quality control and data structure. The algorithm contains 43 steps split into 3 sections: (a) dates, (b) identifiable data entry errors, (c) biologically impossible/implausible change, and uses logic checks, and cross-sectional and longitudinal routines. The WHO cross-sectional approach was also applied as a comparison. Published cleaning routines were taxonomized by their design, the marker used to screen errors, the reference threshold and how the threshold was selected. Fully automated error detection was not possible without false positives or reduced sensitivity. Error frequencies in the cohorts were 0.4%, 2.1% and 2.4% of all assessments, and the percentage of children with ≥1 error was 4.1%, 13.4% and 15.3%. In two of the datasets, more than two-thirds of errors could be classified as inliers (within ±3SD scores). Children with errors had a similar distribution of HT and WT to those without error. The WHO cross-sectional approach lacked sensitivity (range 0-55%), flagged many false positives (range: 7-100%) and biased estimates of overweight and thinness. Elements of this algorithm may have utility for built-in data entry rules, data harmonisation and sensitivity analyses. The reported error frequencies and structure may also help design more realistic simulation studies to test routines.
Multi-step distribution-wide algorithmic approaches are recommended to systematically screen and document the wide range of ways in which errors can occur and to maximise sensitivity for detecting errors; naive cross-sectional trimming as a stand-alone method may do more harm than good.
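
The abstract's point that many errors are inliers, and so escape cross-sectional trimming, can be sketched as follows. This is a generic ±3 SD cutoff against a single reference mean and SD, not the WHO procedure or the paper's algorithm. The function name, the reference values, and the measurements are all hypothetical.

```python
def flag_cross_sectional(values, ref_mean, ref_sd, cutoff=3.0):
    """Flag values whose SD score against a reference distribution
    exceeds the cutoff in absolute value (naive cross-sectional trimming)."""
    return [abs((v - ref_mean) / ref_sd) > cutoff for v in values]


# Hypothetical weights (kg) for one age band, with an assumed reference
# mean of 20.0 kg and SD of 3.0 kg. Suppose 24.5 is a mis-keyed 19.5 and
# 81.5 is a mis-keyed 18.5.
weights_kg = [18.0, 18.5, 24.5, 19.0, 81.5]
flags = flag_cross_sectional(weights_kg, ref_mean=20.0, ref_sd=3.0)
print(flags)
# [False, False, False, False, True]
```

The gross error 81.5 is caught, but 24.5 has an SD score of 1.5 and passes as plausible. Detecting it would need the child's own trajectory, which a purely cross-sectional rule does not use.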


“…When looking into the biomedical literature, we were surprised to find quite scarce reporting about clinical data cleansing (Beaulieu-Jones et al 2018; Chen et al 2018; Coiera et al 2016; Ehrenstein et al 2017). It appears to us that not all in the community are aware of the scale of the problems.…”

Section: Usefulness Of Patient Datasets For Biomedical Research: What (mentioning)

confidence: 99%

Hypocrisy Around Medical Patient Data: Issues of Access for Biomedical Research, Data Quality, Usefulness for the Purpose and Omics Data as Game Changer

Tantoso

Wong

Tay

et al. 2019

ABR


Whether due to simplicity or hypocrisy, the question of access to patient data for biomedical research is widely seen in the public discourse only from the angle of patient privacy. At the same time, the desire to live and to live without disability is of much higher value to the patients. This goal can only be achieved by extracting research insight from patient data in addition to working on model organisms, something that is well understood by many patients. Yet, most biomedical researchers working outside of clinics and hospitals are denied access to patient records when, at the same time, clinicians who guard the patient data are not optimally prepared for the data's analysis. Medical data collection is a time- and cost-intensive process that is most of all tedious, with few elements of intellectual and emotional satisfaction on its own. In this process, clinicians and bioinformaticians, each group with their own interests, have to join forces with the goal to generate medical data sets both from clinical trials and from routinely collected electronic health records that are, as much as possible, free from errors and obvious inconsistencies. The data cleansing effort, as we have learned during curation of Singaporean clinical trial data, is not a trivial task. The introduction of omics and sophisticated imaging modalities into clinical practice that are only partially interpreted in terms of diagnosis and therapy with today's level of knowledge warrants the creation of clinical databases with full patient history. This opens up opportunities for re-analyses and cross-trial studies at future time points with more sophisticated analyses of the same data, the collection of which is very expensive.


Identifying and categorizing spurious weight data in electronic medical records

Abstract: Spurious weights are common in EMRs. Straightforward algorithms can identify and remove them, and thus enhance the reliability of EMR data.
