Background Informaticians at any institution who are developing clinical research support infrastructure are tasked with populating research databases with data extracted and transformed from their institution’s operational databases, such as electronic health records (EHRs). These data must be properly extracted from the source systems, transformed into a standard data structure, and then loaded into the data warehouse while maintaining their integrity. We validated the correctness of the extract, transform, and load (ETL) process behind the West Virginia Clinical and Translational Science Institute’s Integrated Data Repository (IDR), a clinical data warehouse that includes data extracted from two EHR systems. Methods Four hundred ninety-eight observations were randomly selected from the IDR and compared with the two source EHR systems. Results Of the 498 observations, 479 were concordant and 19 discordant. The discordant observations fell into three general categories: a) design decision differences between the IDR and source EHRs, b) timing differences, and c) user interface settings. After resolving the apparent discordances, our IDR was found to be 100% accurate relative to its source EHR systems. Conclusion Any institution whose clinical data warehouse is populated from operational databases, such as EHRs, employs some form of ETL process. As secondary use of EHR data begins to transform the research landscape, the importance of basic validation of the extracted EHR data cannot be overestimated, and it should start with validation of the extraction process itself.
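The sampling-and-comparison step described in Methods can be sketched as follows. This is a minimal illustration in which in-memory dictionaries keyed by (patient, concept) stand in for the IDR and a source EHR; all identifiers and values are hypothetical:

```python
import random

# Hypothetical stand-ins for the IDR and a source EHR:
# each maps (patient_id, concept) -> recorded value.
idr = {("p1", "sbp"): 120, ("p2", "hr"): 88, ("p3", "temp"): 37.2}
ehr = {("p1", "sbp"): 120, ("p2", "hr"): 90, ("p3", "temp"): 37.2}

def validate_sample(idr, ehr, n, seed=42):
    """Randomly sample n observations from the IDR and compare each
    against the source EHR, returning concordant and discordant keys."""
    rng = random.Random(seed)
    keys = rng.sample(sorted(idr), n)
    concordant = [k for k in keys if ehr.get(k) == idr[k]]
    discordant = [k for k in keys if ehr.get(k) != idr[k]]
    return concordant, discordant

conc, disc = validate_sample(idr, ehr, n=3)
print(f"{len(conc)} concordant, {len(disc)} discordant")
```

Discordant keys would then be reviewed by hand, as in the study, to decide whether each reflects a true ETL error or an explainable design, timing, or interface difference.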
Background The United States, and especially West Virginia, has a tremendous burden of coronary artery disease (CAD). Undiagnosed familial hypercholesterolemia (FH) is an important factor for CAD in the U.S. Identification of a CAD phenotype is an initial step toward finding families with FH. Objective We hypothesized that a CAD phenotype detection algorithm that uses discrete data elements from electronic health records (EHRs) can be validated against EHR information housed in a data repository. Methods We developed an algorithm to detect a CAD phenotype that searched through discrete data elements, such as diagnosis, problem list, medical history, billing, and procedure (International Classification of Diseases [ICD]-9/10 and Current Procedural Terminology [CPT]) codes. The algorithm was applied to two cohorts of 500 patients, each with varying characteristics. The second (younger) cohort consisted of parents from a school child screening program. We then determined which patients had CAD by systematic, blinded review of EHRs. Following this, we revised the algorithm by refining the acceptable diagnoses and procedures. We ran the second algorithm on the same cohorts and determined the accuracy of the modification. Results CAD phenotype Algorithm I was 89.6% accurate, 94.6% sensitive, and 85.6% specific for group 1. The revised algorithm (CAD Algorithm II), applied to the same groups, was 98.2% sensitive, 87.8% specific, and 92.4% accurate for group 1, and 93% accurate for group 2. The group 1 F1 score was 92.4%. Specific ICD-10 and CPT codes, such as “coronary angiography through a vein graft,” were more useful than generic terms. Conclusion We have created an algorithm, CAD Algorithm II, that detects CAD on a large scale with high accuracy and sensitivity (recall). It has proven useful among varied patient populations. Use of this algorithm can extend to monitoring a registry of patients in an EHR and/or identifying a group such as those with likely FH.
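The code-set lookup described in Methods can be sketched as below. The ICD-10 and CPT sets shown are tiny hypothetical stand-ins, not the curated sets used by CAD Algorithm I or II:

```python
# Hypothetical code sets for the CAD phenotype (illustrative only).
CAD_ICD10 = {"I25.10", "I21.4", "I25.700"}  # e.g., chronic ischemic heart disease
CAD_CPT = {"93455", "92928"}                # e.g., coronary angiography, stenting

def has_cad_phenotype(patient):
    """Flag a patient as CAD-positive if any diagnosis, problem-list,
    history, or billing code falls in the ICD-10 set, or any
    procedure code falls in the CPT set."""
    codes = (set(patient.get("diagnoses", [])) | set(patient.get("problem_list", []))
             | set(patient.get("history", [])) | set(patient.get("billing", [])))
    return bool(codes & CAD_ICD10) or bool(set(patient.get("procedures", [])) & CAD_CPT)

pt = {"diagnoses": ["E78.01"], "procedures": ["93455"]}
print(has_cad_phenotype(pt))  # True: the procedure code matches
```

Refining the algorithm, as in the revision from Algorithm I to Algorithm II, amounts to tightening or expanding these code sets and re-measuring performance against the chart-review gold standard.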
Background Though electronic health record (EHR) data have been linked to national and state death registries, such linkages have rarely been validated for an entire hospital system's EHR. Objectives The aim of this study was to validate West Virginia University Medicine's (WVU Medicine) linkage of its EHR to three external death registries: the Social Security Death Master File (SSDMF), the National Death Index (NDI), and the West Virginia Department of Health and Human Resources (DHHR) vital statistics records (WVDMF). Methods Probabilistic matching was used to link patients to the NDI, and deterministic matching was used for the SSDMF and WVDMF. In a subanalysis, we used deaths recorded in Epic (n = 30,217) to further validate the subset of deaths captured by the SSDMF, NDI, and WVDMF. Results Of the deaths captured by the SSDMF, 59.8 and 68.5% were captured by the NDI and WVDMF, respectively; for deaths captured by the NDI, this co-capture rate was 80 and 78%, respectively, for the SSDMF and WVDMF. Kappa statistics were strongest for NDI and WVDMF (61.2%) and NDI and SSDMF (60.6%) and weakest for SSDMF and WVDMF (27.9%). Of deaths recorded in Epic, 84.3, 85.5, and 84.4% were captured by the SSDMF, NDI, and WVDMF, respectively. Less than 2% of patients' deaths recorded in Epic were not found in any of the death registries. Finally, approximately 0.2% of “decedents” in any death registry re-emerged in Epic at least 6 months after their recorded death date, a very small fraction that further validates the linkages. Conclusion The NDI had the greatest validity in capturing deaths in our EHR. Since the SSDMF and state vital statistics records show similar, though slightly lower, capture and agreement rates in identifying deaths, these registries may be reasonable alternatives to the NDI for research and quality assurance studies utilizing entire EHRs from large hospital systems. Investigators should also be aware that a very small fraction of “dead” patients will re-emerge in the EHR.
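The kappa statistics reported in Results measure agreement between registries beyond what chance alone would produce. A minimal sketch of Cohen's kappa for two binary capture vectors follows; the example vectors are illustrative, not study data:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two parallel binary vectors (1 = death captured)."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    pa = sum(a) / n                              # P(registry A captures)
    pb = sum(b) / n                              # P(registry B captures)
    pe = pa * pb + (1 - pa) * (1 - pb)           # agreement expected by chance
    return (po - pe) / (1 - pe)

# Illustrative capture flags for eight patients in two registries.
ndi   = [1, 1, 0, 0, 1, 0, 1, 0]
wvdmf = [1, 0, 0, 0, 1, 0, 1, 1]
print(round(cohens_kappa(ndi, wvdmf), 3))  # 0.5
```

Expressed as a percentage, a kappa of 0.5 would sit between the study's strongest pairings (NDI/WVDMF at 61.2%) and its weakest (SSDMF/WVDMF at 27.9%).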
Introduction: Electronic Health Records (EHRs) benefit record keeping, information collation, error prevention, and charge capture. They provide a large database of clinical information that can be used for research. Sorting vast amounts of data manually is inefficient; hence, an effective, validated method is required to uncover information from large data sets and generate knowledge. The U.S., and especially West Virginia, has a tremendous burden of cardiovascular disease (CVD). Undiagnosed Familial Hypercholesterolemia (FH) is an important factor for CVD in the U.S. FH results in elevated low-density lipoprotein (LDL) levels from childhood and early atherosclerotic disease. We are interested in better screening processes for FH. One method is to detect adults with coronary artery disease (CAD) and determine whether their lipid levels are indicative of FH. Relatives and children can then be screened for FH and treated. Efficient identification of a CAD phenotype from EHRs is an important initial step in this screening process. Hypothesis: We hypothesized that a CAD phenotype detection algorithm that uses discrete data elements from EHRs can be validated as a precursor to detection of FH. Methods: We developed an algorithm to detect a CAD phenotype, which searched through discrete data elements, such as diagnosis lists (ICD-10) and procedure (CPT) codes. Direct inspection of discrete EHR data avoided the need for artificial intelligence, such as natural language processing. The algorithm was applied to a cohort of 1,000 patients with varying characteristics. We then determined which patients had CAD by systematic review of their EHRs. Following this, we revised the algorithm by refining the constraints under which it operated. We ran the revised algorithm on the same 1,000 patients and determined its accuracy. Results: Manual validation of the 1,000 patients resulted in 413 with CAD and 587 without.
The original algorithm classified 488 patients as CAD positive and 512 as CAD negative; it was 89% accurate, 96% sensitive, and 85% specific. After revising the algorithm and applying it to the same cohort, it identified 474 patients with CAD and 526 without; this version was 93% accurate, 99% sensitive, and 89% specific. Conclusion: EHR usage has created a large pool of minable clinical data. However, without an efficient method for drawing inferences from it, the information cannot be effectively utilized. We have created an algorithm that detects CAD on a large scale with high accuracy. It has proven useful among a varied patient population. Since the constraints it uses, such as ICD and CPT codes, are universal, it can be utilized across many hospital systems, although local validation is prudent. Using this algorithm can select a population with a propensity for FH, thereby allowing us to screen and manage patients with undiagnosed FH or other familial dyslipidemias.
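The accuracy, sensitivity, and specificity figures above follow from a standard confusion matrix. As a sketch, the cell counts below are an illustrative reconstruction approximately consistent with the revised algorithm's reported rates (413 true CAD patients, 474 flagged positive); they are not the study's published counts:

```python
def metrics(tp, fp, fn, tn):
    """Confusion-matrix summary: sensitivity (recall), specificity,
    accuracy, and F1 score."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    acc = (tp + tn) / (tp + fp + fn + tn)
    prec = tp / (tp + fp)
    f1 = 2 * prec * sens / (prec + sens)
    return sens, spec, acc, f1

# Illustrative counts: 409 of 413 CAD patients flagged (4 missed),
# 65 of 587 non-CAD patients flagged in error.
sens, spec, acc, f1 = metrics(tp=409, fp=65, fn=4, tn=522)
print(f"sens={sens:.0%} spec={spec:.0%} acc={acc:.0%}")
```

Working back from reported rates to counts in this way is a useful sanity check when an abstract gives percentages but not the underlying two-by-two table.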
Introduction: West Virginia exhibits pervasive cardiovascular disease (CVD) that may relate to a combination of ancestry and shared environment in families, including vulnerabilities related to diet, physical activity, and tobacco use. Coronary Artery Risk Detection in Appalachian Communities (CARDIAC) is a school-based child risk factor screening program that has evaluated over 90,000 WV fifth graders in the past 20 years. Reverse cascade screening for Familial Hypercholesterolemia (FH) has been difficult with the CARDIAC population. The WVU CTSI Integrated Data Repository (IDR) includes over 2 million records. Hypothesis: Linkage of child CARDIAC data to parent IDR data will allow new information discovery to inform management of CVD. Methods: We used direct demographic data linkage via Oracle, with Soundex conversion of names, in the IDR to find parents of the CARDIAC participants. Data were analyzed in the VMware SSL environment. Results: 4,759 children had at least one parent identified; 959 mothers and 524 fathers had an LDL level in the IDR. Race, BMI, and gender were recorded from CARDIAC. 6.8% of children, 40% of mothers, and 44.8% of fathers had an abnormal LDL level (>130 mg/dl) in the IDR. The positive predictive value of an abnormal child lipid level (≥130 mg/dl) for an abnormal level in at least one parent was 17% (56/325). Four parents had LDL >190 mg/dl with a child LDL >160 mg/dl, indicating likely FH in the pair (0.27% of pairs, or 1 in 371 pairs). Conclusion: Formation of a virtual cohort of CARDIAC children and parents allows virtual reverse cascade screening to find FH. This project highlights the importance of familial tendency to hyperlipidemia, which can aid detection of early lipid abnormality and cardiovascular risk in children and their young parents, to promote wellness and potentially avoid early coronary artery disease.
We are constructing a virtual longitudinal cohort to study CVD in WV as part of a Learning Health System in which data management is at the forefront of healthcare improvement.
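The Soundex conversion used in the linkage step blunts spelling variation between the CARDIAC roster and IDR demographics; Oracle's SOUNDEX function implements the same classic algorithm. A minimal Python sketch of American Soundex:

```python
def soundex(name):
    """American Soundex: the first letter plus three digits coding the
    remaining consonants, so names that sound alike share a code."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    out, prev = name[0], codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "HW":  # H and W do not break a run of like-coded letters
            prev = code
    return (out + "000")[:4]  # pad or truncate to four characters

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Matching on Soundex codes plus exact demographic fields (e.g., date of birth and address) is a common deterministic-linkage pattern; the trade-off is that phonetic blocking admits some false candidate pairs that downstream criteria must reject.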