Introduction: Healthcare information systems can generate and/or record huge volumes of data, some of which may be reused for research, clinical trials, or teaching. However, these databases can be affected by data quality problems; hence, an important step in the data reuse process consists of detecting and rectifying these issues. To facilitate the assessment of data quality, we developed a taxonomy of data quality problems in operational databases. Material: We searched the literature for publications that mentioned "data quality problems", "data quality taxonomy", "data quality assessment", or "dirty data". The publications were then reviewed, compared, summarized, and structured using a bottom-up approach, in order to provide an operational taxonomy of data quality problems. The problems were illustrated with fictional but realistic examples from clinical databases. Results: Twelve publications were selected, and 286 instances of data quality problems were identified and classified according to six distinct levels of granularity. We used the classification defined by Oliveira et al. to structure our taxonomy. The extracted items were grouped into 53 data quality problems. Discussion: This taxonomy facilitates the systematic assessment of data quality in databases by presenting data quality problems according to their granularity. The definition of this taxonomy is the first step in the data cleaning process. The subsequent steps include the definition of associated quality assessment methods and data cleaning methods. Conclusion: Our new taxonomy enabled the classification and illustration of 53 data quality problems found in hospital databases.
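To make the idea of granularity concrete, the following minimal sketch checks one fictional data set at three levels (a single value, a column, and a row). The field names, the three checks, and the level labels are illustrative assumptions; they are not the 53 problems or the six levels of the taxonomy itself.

```python
from datetime import date

# Fictional hospital stays; all field names are hypothetical.
stays = [
    {"stay_id": 1, "birth_date": date(1950, 3, 1),
     "admission": date(2020, 1, 5), "discharge": date(2020, 1, 9)},
    {"stay_id": 2, "birth_date": None,
     "admission": date(2020, 2, 1), "discharge": date(2020, 2, 3)},
    {"stay_id": 2, "birth_date": date(1980, 7, 12),
     "admission": date(2020, 3, 10), "discharge": date(2020, 3, 8)},
]


def missing_values(rows, field):
    """Value level: indexes of rows where a mandatory field is empty."""
    return [i for i, row in enumerate(rows) if row[field] is None]


def duplicated_keys(rows, key):
    """Column level: key values that occur more than once."""
    seen, dupes = set(), set()
    for row in rows:
        (dupes if row[key] in seen else seen).add(row[key])
    return sorted(dupes)


def inconsistent_dates(rows):
    """Row level: indexes of rows where discharge precedes admission."""
    return [i for i, row in enumerate(rows)
            if row["discharge"] < row["admission"]]


report = {
    "missing birth_date": missing_values(stays, "birth_date"),
    "duplicated stay_id": duplicated_keys(stays, "stay_id"),
    "discharge before admission": inconsistent_dates(stays),
}
for problem, hits in report.items():
    print(f"{problem}: {hits}")
```

Grouping checks by the level at which a problem can be observed keeps each check small and lets an assessment proceed from single values up to whole tables.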
Data collected in clinical registries or obtained through data reuse require some modifications to suit research needs. Several common operations are frequently applied to select relevant patients from the cohort, combine data from multiple sources, add new variables if needed, and create unique tables depending on the research purpose. We carried out a qualitative survey by conducting semi-structured interviews with 7 experts in data reuse and proposed a standard workflow for health data management. To make the data management workflow easier to understand, we implemented an R tutorial based on a synthetic data set using Jupyter Notebook.
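The operations listed above (combining sources, adding a variable, selecting patients, and producing one analysis table) can be sketched as follows. This is an illustrative Python example on made-up data, not the authors' R tutorial; the table layout, the field names, and the `min_age` threshold are assumptions.

```python
# Two fictional source tables; all field names are hypothetical.
patients = [
    {"patient_id": 1, "birth_year": 1950},
    {"patient_id": 2, "birth_year": 1992},
    {"patient_id": 3, "birth_year": 1978},
]
stays = [
    {"patient_id": 1, "admission_year": 2020, "length_of_stay": 4},
    {"patient_id": 1, "admission_year": 2021, "length_of_stay": 2},
    {"patient_id": 3, "admission_year": 2021, "length_of_stay": 7},
]


def build_analysis_table(patients, stays, min_age=50):
    """Join stays to patients, derive age at admission, keep eligible stays."""
    by_id = {p["patient_id"]: p for p in patients}
    table = []
    for stay in stays:
        patient = by_id.get(stay["patient_id"])
        if patient is None:
            continue  # stay without a matching patient record
        # New variable computed from both sources.
        age = stay["admission_year"] - patient["birth_year"]
        if age < min_age:
            continue  # selection criterion on the derived variable
        table.append({**stay, "age_at_admission": age})
    return table


analysis_table = build_analysis_table(patients, stays)
for row in analysis_table:
    print(row)
```

Keeping the whole pipeline in one function that returns a new table leaves the source tables untouched, which makes each run reproducible.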
Health data science is an emerging discipline that bridges computer science, statistics, and health domain knowledge. It consists of taking advantage of large volumes of often complex data to extract information that improves decision-making. We have created a Master’s degree in Health Data Science to meet the growing need for data scientists in companies and institutions. Over two years, the program offers courses covering computer science, mathematics and statistics, and health and biology. With more than 60 professors and lecturers and a total of 835 hours of classes (not including the mandatory 5 months of internship per year), the curriculum has enrolled a total of 53 students to date. Feedback from students and alumni allowed us to identify new training needs, which may help us adapt the program for the coming academic years. In particular, we will offer an additional module covering data management, from the design of the clinical report form to the implementation of a data warehouse with an ETL process. Git and application lifecycle management will be included in programming courses or multidisciplinary projects.
Background: Despite the many opportunities data reuse offers, its implementation presents many difficulties, and raw data cannot be reused directly. Information is not always directly available in the source database and must be computed afterwards from raw data by defining an algorithm. Objective: The main purpose of this article is to present a standardized description of the steps and transformations required during the feature extraction process when conducting retrospective observational studies. A secondary objective is to identify how the features could be stored in the schema of a data warehouse. Methods: This study involved the following 3 main steps: (1) the collection of relevant study cases related to feature extraction and based on the automatic and secondary use of data; (2) the standardized description of raw data, steps, and transformations, which were common to the study cases; and (3) the identification of an appropriate table to store the features in the Observational Medical Outcomes Partnership (OMOP) common data model (CDM). Results: We interviewed 10 researchers from 3 French university hospitals and a national institution, who were involved in 8 retrospective and observational studies. Based on these studies, 2 states (track and feature) and 2 transformations (track definition and track aggregation) emerged. A “track” is a time-dependent signal or period of interest, defined by a statistical unit, a value, and 2 milestones (a start event and an end event). A “feature” is time-independent high-level information with dimensionality identical to the statistical unit of the study, defined by a label and a value. In a feature, the time dimension has become implicit in the value or name of the variable. We propose the 2 tables “TRACK” and “FEATURE” to store variables obtained in feature extraction and extend the OMOP CDM. Conclusions: We propose a standardized description of the feature extraction process. The process combines the 2 steps of track definition and track aggregation. Dividing feature extraction into these 2 steps confines the difficulty to track definition. The standardization of tracks requires great expertise with regard to the data, but allows the application of an infinite number of complex transformations. In contrast, track aggregation is a very simple operation with a finite number of possibilities. A complete description of these steps could enhance the reproducibility of retrospective studies.
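The two transformations can be sketched as follows, under stated assumptions. The `Track` and `Feature` classes mirror the definitions given in the abstract (statistical unit, value, start and end for a track; label and value for a feature), but the attribute names, the threshold rule used to define tracks, and the three aggregation functions are hypothetical choices made for illustration. They are not the authors' implementation or the proposed OMOP table layout.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Track:
    unit_id: int      # statistical unit, e.g. a hospital stay
    label: str
    value: float
    start: datetime   # start milestone
    end: datetime     # end milestone


@dataclass
class Feature:
    unit_id: int
    label: str
    value: float


def define_tracks(unit_id, readings, threshold, label):
    """Track definition: group consecutive readings below a threshold
    into episodes; each episode keeps its lowest value."""
    tracks, episode = [], []

    def close(ep):
        return Track(unit_id, label, min(v for _, v in ep), ep[0][0], ep[-1][0])

    for time, value in readings:
        if value < threshold:
            episode.append((time, value))
        elif episode:
            tracks.append(close(episode))
            episode = []
    if episode:
        tracks.append(close(episode))
    return tracks


# A small, finite set of aggregation choices (hypothetical names).
AGGREGATORS = {
    "count": lambda ts: float(len(ts)),
    "min": lambda ts: min(t.value for t in ts),
    "total_minutes": lambda ts: sum(
        (t.end - t.start).total_seconds() for t in ts) / 60,
}


def aggregate_tracks(tracks, label, how):
    """Track aggregation: collapse each unit's tracks into one feature."""
    by_unit = {}
    for track in tracks:
        by_unit.setdefault(track.unit_id, []).append(track)
    return [Feature(unit, label, AGGREGATORS[how](ts))
            for unit, ts in by_unit.items()]


# Fictional blood pressure readings for one stay, one every 5 minutes.
values = [80, 60, 58, 75, 62, 70]
readings = [(datetime(2024, 1, 1, 10, 5 * i), v) for i, v in enumerate(values)]

tracks = define_tracks(1, readings, threshold=65, label="low_pressure")
episodes = aggregate_tracks(tracks, "n_low_pressure_episodes", "count")
lowest = aggregate_tracks(tracks, "lowest_pressure", "min")
duration = aggregate_tracks(tracks, "low_pressure_minutes", "total_minutes")
print(tracks, episodes, lowest, duration, sep="\n")
```

In this sketch the data-specific expertise sits entirely in `define_tracks`, while `aggregate_tracks` only picks from a short list of generic functions, which reflects the split between the two steps described above.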
Background: Endometriosis is defined by implantation and invasive growth of endometrial tissue in extra-uterine locations, causing heterogeneous symptoms and a unique clinical picture for each patient. Understanding the complex biological mechanisms underlying these symptoms and the protein networks involved may be useful for early diagnosis and identification of pharmacological targets. Methods: In the present study, we combined three approaches: (i) a text-mining analysis to perform a systematic search of proteins over existing literature, (ii) a functional enrichment analysis to identify the biological pathways in which proteins are most involved, and (iii) a protein–protein interaction (PPI) network to identify which proteins most strongly modulate the symptomatology of endometriosis. Results: Two hundred seventy-eight proteins associated with endometriosis symptomatology in the scientific literature were extracted. Thirty-five proteins were selected according to degree and betweenness score criteria. The most enriched biological pathways associated with these symptoms were (i) Interleukin-4 and Interleukin-13 signaling (p = 1.11 × 10^-16), (ii) Signaling by Interleukins (p = 1.11 × 10^-16), (iii) Cytokine signaling in Immune system (p = 1.11 × 10^-16), and (iv) Interleukin-10 signaling (p = 5.66 × 10^-15). Conclusion: Our study identified some key proteins with the ability to modulate endometriosis symptomatology. Our findings indicate that both pro- and anti-inflammatory biological pathways may play important roles in the symptomatology of endometriosis. This approach represents a genuine systemic method that may complement traditional experimental studies. The current data can be used to identify promising biomarkers for early diagnosis and potential therapeutic targets.
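The degree and betweenness scores mentioned above can be computed on any interaction network. The sketch below does so on a toy undirected graph with placeholder node names (`P1` to `P5`), using Brandes' algorithm for unnormalized betweenness. The graph, the node names, and the absence of any selection threshold are assumptions; the study's actual network, tools, and cut-offs are not given in the abstract.

```python
from collections import deque

# Toy undirected network; node names are placeholders, not real proteins.
edges = [("P1", "P2"), ("P2", "P3"), ("P2", "P4"), ("P4", "P5")]

graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def degree(graph):
    """Number of direct interaction partners of each node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


def betweenness(graph):
    """Unnormalized betweenness centrality for an unweighted,
    undirected graph (Brandes' algorithm)."""
    scores = {node: 0.0 for node in graph}
    for source in graph:
        order = []                               # nodes in BFS visit order
        preds = {node: [] for node in graph}     # shortest-path predecessors
        paths = {node: 0 for node in graph}      # number of shortest paths
        dist = {node: -1 for node in graph}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:                  # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:       # v lies on a shortest path to w
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to the source.
        delta = {node: 0.0 for node in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                scores[w] += delta[w]
    # Each unordered pair of endpoints was counted twice.
    return {node: score / 2 for node, score in scores.items()}


deg = degree(graph)
btw = betweenness(graph)
for node in sorted(graph, key=lambda n: (-btw[n], -deg[n])):
    print(node, deg[node], btw[node])
```

Degree captures how many partners a node has, whereas betweenness captures how often it sits on the shortest route between two other nodes, so ranking on both favors nodes that are well connected and also act as bridges.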