Background Managing research data in biomedical informatics requires solid data governance rules to guarantee sustainable operation, as such research generally involves several professions and multiple sites. Because every discipline involved in biomedical research applies its own set of tools and methods, research data and the applied methods tend to branch out into numerous intermediate and output data objects, making it very difficult to reproduce research results. Objectives This article gives an overview of the implementation status of the Findability, Accessibility, Interoperability, and Reusability (FAIR) Guiding Principles for scientific data management and stewardship in our research data management pipeline, focusing on the software tools in use. Methods We analyzed our progress in FAIRifying the whole data management pipeline, from processing non-FAIR data up to data usage. We examined software tools for data integration, data storage, and data usage, as well as how the FAIR Guiding Principles helped us choose appropriate tools for each task. Results We were able to advance the degree of FAIRness of our data integration and data storage solutions, but have not yet enabled more of the FAIR Guiding Principles regarding data usage. Existing evaluation methods for the FAIR Guiding Principles (FAIRmetrics) were not applicable to our analysis of software tools. Conclusion Using the FAIR Guiding Principles, we FAIRified relevant parts of our research data management pipeline, improving findability, accessibility, interoperability, and reuse of datasets and research results. We aim to apply the FAIRmetrics to our data management infrastructure and, where required, to contribute to the FAIRmetrics for research data in the biomedical informatics domain as well as for software tools, in order to achieve a higher degree of FAIRness of our research data management pipeline.
In this opinion paper, we provide an overview of challenges concerning data provenance in biomedical research. We reflect on the current literature and present examples of existing implicit or explicit provenance aspects in standard data types in translational research. Furthermore, we assess the need for further data provenance standardization in biomedical informatics. Basic data provenance should provide a record of the origin of the data and of the transformation steps applied to it, and should support replication and presentation of the data. Even though usable concepts for documenting data provenance can be found in other fields as early as 2005, the penetration rate in biomedical projects and in the biomedical literature is quite low. Awareness of the necessity of basic data provenance has to be raised, and the education of data managers has to be further improved.
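The basic provenance requirements named above (origin of the data, transformation steps, support for replication) can be made concrete with a minimal record structure. The following sketch is illustrative only and is not taken from any of the cited papers; the class and field names (`ProvenanceRecord`, `source`, `steps`) are assumptions for this example, not an established schema such as W3C PROV.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceStep:
    """One documented transformation applied to a data object."""
    tool: str       # software component that performed the step
    action: str     # human-readable description of what was done
    timestamp: str  # when it was done (ISO 8601, UTC)

@dataclass
class ProvenanceRecord:
    """Minimal provenance for a derived data object: origin plus ordered steps."""
    source: str                                # origin of the data
    steps: list = field(default_factory=list)  # transformation history

    def add_step(self, tool: str, action: str) -> None:
        """Append a step with the current UTC time, preserving order for replication."""
        self.steps.append(ProvenanceStep(
            tool=tool,
            action=action,
            timestamp=datetime.now(timezone.utc).isoformat()))

# Hypothetical usage: file name and tool names are invented for illustration.
record = ProvenanceRecord(source="hospital_lab_export.csv")
record.add_step("cleaner.py", "removed rows with missing patient identifier")
record.add_step("mapper.py", "mapped local laboratory codes to LOINC")
```

Keeping the steps ordered and timestamped is what allows a third party to replay the pipeline and check whether the same derived data object results.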
Background Secondary use of routine medical data is key to large-scale clinical and health services research. In a maximum care hospital, the volume of data generated reaches big data dimensions on a daily basis. These so-called "real-world data" are essential to complement knowledge and results from clinical trials. Furthermore, big data may help in establishing precision medicine. However, manual data extraction and annotation workflows to transfer routine data into research data would be complex and inefficient. Generally, best practices for managing research data focus on data output rather than the entire data journey from primary sources to analysis. To make routinely collected data usable and available for research, many hurdles have to be overcome. In this work, we present the implementation of an automated framework for the timely processing of clinical care data, including free texts and genetic data (non-structured data), and for centralized storage as Findable, Accessible, Interoperable, and Reusable (FAIR) research data in a maximum care university hospital. Methods We identify the data processing workflows necessary to operate a medical research data service unit in a maximum care hospital. We decompose structurally equal tasks into elementary sub-processes and propose a framework for general data processing. We base our processes on open-source software components and, where necessary, custom-built generic tools. Results We demonstrate the application of the proposed framework in practice by describing its use in our Medical Data Integration Center (MeDIC). Our microservices-based and fully open-source data processing automation framework incorporates complete recording of data management and manipulation activities. The prototype implementation also includes a metadata schema for data provenance and a process validation concept.
All requirements of a MeDIC are orchestrated within the proposed framework: data input from many heterogeneous sources, pseudonymization and harmonization, integration into a data warehouse, and finally possibilities for extraction or aggregation of data for research purposes in accordance with data protection requirements. Conclusion Though the framework is not a panacea for bringing routine-based research data into compliance with the FAIR principles, it provides a much-needed means to process data in a fully automated, traceable, and reproducible manner.
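One elementary sub-process named above, pseudonymization, can be sketched as a keyed one-way mapping of identifiers. This is a generic illustration, not the MeDIC implementation described in the paper; the key handling, field names, and pseudonym length are assumptions. In practice the key would be managed by a trusted third party and never stored alongside the data.

```python
import hashlib
import hmac

# Assumption for illustration: in a real deployment this key is held by a
# trusted third party, not hard-coded in the processing service.
SECRET_KEY = b"replace-with-securely-managed-key"

def pseudonymize(patient_id: str) -> str:
    """Derive a stable, non-reversible pseudonym via keyed hashing (HMAC-SHA256).

    The same input always yields the same pseudonym, so records from
    heterogeneous sources can still be linked after pseudonymization.
    """
    digest = hmac.new(SECRET_KEY, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for readability; length is a design choice

# Hypothetical record with invented field names, before warehouse integration.
record = {"patient_id": "P-000123", "diagnosis": "E11.9"}
record["patient_id"] = pseudonymize(record["patient_id"])
```

Using a keyed hash rather than a plain hash prevents dictionary attacks on the (small, structured) identifier space, while determinism preserves record linkage across sources, both properties the harmonization and integration steps depend on.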