Background Informaticians at any institution developing clinical research support infrastructure are tasked with populating research databases with data extracted and transformed from their institution’s operational databases, such as electronic health records (EHRs). These data must be properly extracted from the source systems, transformed into a standard data structure, and then loaded into the data warehouse while maintaining their integrity. We validated the correctness of the extract, transform, and load (ETL) process for the West Virginia Clinical and Translational Science Institute’s Integrated Data Repository (IDR), a clinical data warehouse that includes data extracted from two EHR systems. Methods Four hundred ninety-eight observations were randomly selected from the IDR and compared with the two source EHR systems. Results Of the 498 observations, 479 were concordant and 19 were discordant. The discordant observations fell into three general categories: a) design decision differences between the IDR and source EHRs, b) timing differences, and c) user interface settings. After resolving these apparent discordances, the IDR was found to be 100% accurate relative to its source EHR systems. Conclusion Any institution that uses a clinical data warehouse built on extraction processes from operational databases, such as EHRs, employs some form of ETL process. As secondary use of EHR data begins to transform the research landscape, the importance of basic validation of the extracted EHR data cannot be overstated, and it should start with validation of the extraction process itself.
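The comparison described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' procedure: the record keys, field names, and toy values are hypothetical, and only the counts 498 and 19 come from the abstract.

```python
from typing import Dict, List, Tuple

# Hypothetical key for one sampled observation: (patient_id, field_name).
ObservationKey = Tuple[str, str]


def find_discordant(
    idr: Dict[ObservationKey, str], source: Dict[ObservationKey, str]
) -> List[ObservationKey]:
    """Return keys of sampled IDR observations whose value differs from,
    or is missing in, the source EHR."""
    return [key for key, value in idr.items() if source.get(key) != value]


def concordance_rate(n_total: int, n_discordant: int) -> float:
    """Fraction of sampled observations that match the source system."""
    return (n_total - n_discordant) / n_total


# Toy data (assumed, not from the study): one of three observations differs.
idr_sample = {
    ("p1", "sex"): "F",
    ("p2", "birth_year"): "1970",
    ("p3", "dx"): "I25.10",
}
source_ehr = {
    ("p1", "sex"): "F",
    ("p2", "birth_year"): "1971",
    ("p3", "dx"): "I25.10",
}
discordant = find_discordant(idr_sample, source_ehr)

# Raw concordance before resolution, using the counts reported in the abstract.
raw_rate = concordance_rate(498, 19)
print(discordant)
print(f"{raw_rate:.1%}")
```

Each discordant key would then be reviewed by hand and assigned to a category such as design decision, timing, or user interface setting before the final accuracy is stated.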
Background The study conducts statistical and spatial analyses to investigate amounts and types of permitted surface water pollution discharges in relation to population mortality rates for cancer and non-cancer causes nationwide and by urban-rural setting. Data from the Environmental Protection Agency's (EPA) Discharge Monitoring Report (DMR) were used to measure the location, type, and quantity of a selected set of 38 discharge chemicals for 10,395 facilities across the contiguous US. Exposures were refined by weighting amounts of chemical discharges by their estimated toxicity to human health, and by estimating the discharges that occur not only in a local county, but area-weighted discharges occurring upstream in the same watershed. Centers for Disease Control and Prevention (CDC) mortality files were used to measure age-adjusted population mortality rates for cancer, kidney disease, and total non-cancer causes. Analysis included multiple linear regressions to adjust for population health risk covariates. Spatial analyses were conducted by applying geographically weighted regression to examine the geographic relationships between releases and mortality. Results Greater non-carcinogenic chemical discharge quantities were associated with significantly higher non-cancer mortality rates, regardless of toxicity weighting or upstream discharge weighting. Cancer mortality was higher in association with carcinogenic discharges only after applying toxicity weights. Kidney disease mortality was related to higher non-carcinogenic discharges only when both applying toxicity weights and including upstream discharges. Effects for kidney mortality and total non-cancer mortality were stronger in rural areas than urban areas. Spatial results show correlations between non-carcinogenic discharges and cancer mortality for much of the contiguous United States, suggesting that chemicals not currently recognized as carcinogens may contribute to cancer mortality risk. The geographically weighted regression results suggest spatial variability in effects, and also indicate that some rural communities may be impacted by upstream urban discharges. Conclusions There is evidence that permitted surface water chemical discharges are related to population mortality. Toxicity weights and upstream discharges are important for understanding some mortality effects. Chemicals not currently recognized as carcinogens may nevertheless play a role in contributing to cancer mortality risk. Spatial models allow for the examination of geographic variability not captured through the regression models.
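One way to picture the exposure refinement described above is the sketch below. It is an assumption-laden illustration only: the abstract does not give the weighting formula, so the function names, the chemical names, the toxicity weights, and the use of a single area fraction per upstream county are all hypothetical.

```python
from typing import Dict, List, Tuple


def toxicity_weighted(discharges: Dict[str, float], weights: Dict[str, float]) -> float:
    """Sum of discharge quantities, each multiplied by its toxicity weight."""
    return sum(quantity * weights[chem] for chem, quantity in discharges.items())


def county_exposure(
    local: Dict[str, float],
    upstream: List[Tuple[Dict[str, float], float]],
    weights: Dict[str, float],
) -> float:
    """Local toxicity-weighted discharge plus upstream discharges, each scaled
    by an assumed area fraction (share of the upstream county that lies in
    the same watershed)."""
    total = toxicity_weighted(local, weights)
    for discharges, area_fraction in upstream:
        total += area_fraction * toxicity_weighted(discharges, weights)
    return total


# Hypothetical inputs: two chemicals, one upstream county.
weights = {"chem_a": 2.0, "chem_b": 0.5}
local = {"chem_a": 10.0, "chem_b": 4.0}   # 10*2.0 + 4*0.5 = 22.0
upstream = [({"chem_a": 5.0}, 0.4)]       # 0.4 * (5*2.0) = 4.0
exposure = county_exposure(local, upstream, weights)
print(exposure)
```

Setting every weight to 1.0 or passing an empty upstream list gives the unweighted and local-only variants that the Results compare against.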
Background The United States, and especially West Virginia, has a tremendous burden of coronary artery disease (CAD). Undiagnosed familial hypercholesterolemia (FH) is an important factor for CAD in the U.S. Identification of a CAD phenotype is an initial step to find families with FH. Objective We hypothesized that a CAD phenotype detection algorithm that uses discrete data elements from electronic health records (EHRs) can be validated from EHR information housed in a data repository. Methods We developed an algorithm to detect a CAD phenotype which searched through discrete data elements, such as diagnosis, problem lists, medical history, billing, and procedure (International Classification of Diseases [ICD]-9/10 and Current Procedural Terminology [CPT]) codes. The algorithm was applied to two cohorts of 500 patients, each with varying characteristics. The second (younger) cohort consisted of parents from a school child screening program. We then determined which patients had CAD by systematic, blinded review of EHRs. Following this, we revised the algorithm by refining the acceptable diagnoses and procedures. We ran the second algorithm on the same cohorts and determined the accuracy of the modification. Results CAD phenotype Algorithm I was 89.6% accurate, 94.6% sensitive, and 85.6% specific for group 1. After revising the algorithm (denoted CAD Algorithm II) and applying it to the same groups 1 and 2, sensitivity was 98.2%, specificity was 87.8%, and accuracy was 92.4% for group 1; accuracy was 93% for group 2. Group 1 F1 score was 92.4%. Specific ICD-10 and CPT codes such as “coronary angiography through a vein graft” were more useful than generic terms. Conclusion We have created an algorithm, CAD Algorithm II, that detects CAD on a large scale with high accuracy and sensitivity (recall). It has proven useful among varied patient populations. Use of this algorithm can extend to monitoring a registry of patients in an EHR and/or to identifying a group such as those with likely FH.
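The performance measures quoted in the Results all follow from a 2x2 confusion matrix of algorithm output against chart review. The sketch below shows the arithmetic. The four counts are an assumption: the abstract does not report them, and they were back-calculated here only so that the rates come out close to the Algorithm I figures for group 1.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard measures for a binary phenotype algorithm versus chart review."""
    sensitivity = tp / (tp + fn)            # recall: true cases that were flagged
    specificity = tn / (tn + fp)            # non-cases that were correctly not flagged
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts for a 500-patient cohort (NOT reported in the abstract).
metrics = classification_metrics(tp=210, fp=40, tn=238, fn=12)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Because F1 combines precision and recall, it can differ noticeably from accuracy when the cohort is imbalanced, which is one reason to report both.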
Background Though electronic health record (EHR) data have been linked to national and state death registries, such linkages have rarely been validated for an entire hospital system's EHR. Objectives The aim of the study is to validate West Virginia University Medicine's (WVU Medicine) linkage of its EHR to three external death registries: the Social Security Death Masterfile (SSDMF), the National Death Index (NDI), and the West Virginia Department of Health and Human Resources (DHHR). Methods Probabilistic matching was used to link patients to NDI and deterministic matching for the SSDMF and DHHR vital statistics records (WVDMF). In a subanalysis, we used deaths recorded in Epic (n = 30,217) to further validate a subset of deaths captured by the SSDMF, NDI, and WVDMF. Results Of the deaths captured by the SSDMF, 59.8% and 68.5% were captured by NDI and WVDMF, respectively; for deaths captured by NDI, the co-capture rates were 80% and 78% for the SSDMF and WVDMF, respectively. Kappa statistics were strongest for NDI and WVDMF (61.2%) and NDI and SSDMF (60.6%) and weakest for SSDMF and WVDMF (27.9%). Of deaths recorded in Epic, 84.3%, 85.5%, and 84.4% were captured by SSDMF, NDI, and WVDMF, respectively. Fewer than 2% of patients' deaths recorded in Epic were not found in any of the death registries. Finally, approximately 0.2% of “decedents” in any death registry re-emerged in Epic at least 6 months after their death date, a very small percentage that further validates the linkages. Conclusion NDI had the greatest validity in capturing deaths in our EHR. Because similar, though slightly lower, capture and agreement rates are observed for SSDMF and state vital statistics records, these registries may be reasonable alternatives to NDI for research and quality assurance studies utilizing entire EHRs from large hospital systems. Investigators should also be aware that a very small fraction of “dead” patients will re-emerge in the EHR.
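The kappa statistics in the Results measure agreement between two registries beyond what chance alone would produce. A minimal sketch of Cohen's kappa for a 2x2 table follows. The counts are hypothetical and chosen for easy arithmetic; the abstract reports only the resulting kappa values, expressed as percentages.

```python
def cohens_kappa(both: int, only_a: int, only_b: int, neither: int) -> float:
    """Cohen's kappa for two registries, A and B, each flagging a patient as
    deceased or not.

    both:    deaths captured by A and B
    only_a:  deaths captured by A but not B
    only_b:  deaths captured by B but not A
    neither: patients captured by neither registry
    """
    n = both + only_a + only_b + neither
    observed = (both + neither) / n
    # Chance agreement from each registry's marginal capture rates.
    p_yes = ((both + only_a) / n) * ((both + only_b) / n)
    p_no = ((only_b + neither) / n) * ((only_a + neither) / n)
    expected = p_yes + p_no
    return (observed - expected) / (1 - expected)


# Hypothetical counts (NOT from the study): 100 patients in total.
kappa = cohens_kappa(both=40, only_a=10, only_b=5, neither=45)
print(f"kappa = {kappa:.1%}")
```

With these counts, observed agreement is 0.85 and chance agreement is 0.50, which gives a kappa of 0.70.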