2020
DOI: 10.1093/jamia/ocaa139

Democratizing EHR analyses with FIDDLE: a flexible data-driven preprocessing pipeline for structured clinical data

Abstract: Objective: In applying machine learning (ML) to electronic health record (EHR) data, many decisions must be made before any ML is applied; such preprocessing requires substantial effort and can be labor-intensive. As the role of ML in health care grows, there is an increasing need for systematic and reproducible preprocessing techniques for EHR data. Thus, we developed FIDDLE (Flexible Data-Driven Pipeline), an open-source framework that streamlines the preprocessing of data extracted from the…
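Since the abstract describes FIDDLE only at a high level, a brief illustration of the kind of transformation such a preprocessing pipeline automates may help. The pandas sketch below is not FIDDLE's actual API or output format; the column names, bin width, and aggregation choice are hypothetical analysis decisions invented for the example.

```python
# Illustrative sketch (not FIDDLE's actual API): the class of preprocessing a
# data-driven EHR pipeline automates -- turning a long-format event table
# (one row per recorded value) into a patient-by-feature matrix with
# fixed-width time bins and one-hot encoded categorical variables.
import pandas as pd

# Hypothetical long-format extract: one row per (patient, time, variable, value).
events = pd.DataFrame({
    "ID":       [1, 1, 1, 2, 2],
    "t":        [0.5, 1.2, 30.0, 2.0, 10.0],          # hours since admission
    "variable": ["HR", "HR", "Service", "HR", "Service"],
    "value":    [88, 92, "MICU", 110, "SICU"],
})

DT = 12.0  # bin width in hours (an analysis choice, not a FIDDLE default)
events["bin"] = (events["t"] // DT).astype(int)

# Aggregate numeric variables per (patient, bin) -- here by mean.
numeric = events[events["variable"] == "HR"].copy()
numeric["value"] = numeric["value"].astype(float)
hr = numeric.pivot_table(index="ID", columns="bin", values="value", aggfunc="mean")
hr.columns = [f"HR_mean_bin{b}" for b in hr.columns]

# One-hot encode categorical variables, ignoring time for this toy example.
categorical = events[events["variable"] == "Service"]
svc = pd.get_dummies(categorical.set_index("ID")["value"], prefix="Service")
svc = svc.groupby(level=0).max()

features = hr.join(svc, how="outer")
print(features)
```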

Cited by 51 publications (51 citation statements)
References: 44 publications
“…Noticeably, only one of the tools reviewed directly supported getting all features for a cohort as FIBER does. 21 Providing a Python interface and working on an i2b2 star schema data format, FIBER stands out in facilitating information exchange and cohort comparability between different health organizations following this schema (eg, the JSON cohort definitions can easily be shared across institutions). Generalizability of data extraction pipelines for these institutions has always been challenging, and we anticipate FIBER to alleviate this issue.…”
Section: Discussion (citation type: mentioning)
confidence: 99%
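The shareable cohort definitions mentioned in this citation statement lend themselves to a small illustration. The sketch below is hypothetical: the field names and structure are invented for the example and do not reflect FIBER's actual schema; the point is only that a declarative, serialized definition can be exchanged between institutions that share an underlying data model such as the i2b2 star schema.

```python
# Hypothetical sketch of a shareable cohort definition. Field names are
# invented for illustration and are NOT FIBER's actual schema.
import json

cohort_definition = {
    "name": "adult_icu_admissions",
    "inclusion": [
        {"concept": "age_at_admission", "operator": ">=", "value": 18},
        {"concept": "unit_type", "operator": "in", "value": ["MICU", "SICU"]},
    ],
    "exclusion": [
        {"concept": "length_of_stay_hours", "operator": "<", "value": 4},
    ],
}

# Serialize once, share the file, and re-apply the same definition at another
# institution whose warehouse follows the same (e.g., i2b2) schema.
print(json.dumps(cohort_definition, indent=2))
```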
“…Future work will explore the impact that Phenoflow has on the portability of additional types of phenotype definitions, including probabilistic definitions, the development of which is likely to leverage data processing tools such as the Flexible Data-Driven Pipeline (FIDDLE) framework. 20 In addition, future work will investigate how the multidimensional annotations of the structured definition model can be leveraged in order to introduce new search and discovery capabilities into phenotype repositories.…”
Section: Discussion (citation type: mentioning)
confidence: 99%
“…It incorporates good practices in ML training, testing, and model evaluation (Teschendorff, 2019; Topçuoğlu et al., 2020). Furthermore, it provides data preprocessing steps based on the FIDDLE (FlexIble Data-Driven pipeLinE) framework outlined in Tang et al. (2020) and post-training permutation importance steps to estimate the importance of each feature in the trained models (Breiman, 2001; Fisher et al., 2018).…”
Section: Statement of Need (citation type: mentioning)
confidence: 99%
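For readers unfamiliar with the permutation importance procedure cited above, here is a minimal sketch of the general technique using scikit-learn in Python (the cited tool itself is R-based, so this illustrates the method, not that package's implementation); the dataset and model choices are arbitrary placeholders.

```python
# Minimal sketch of permutation importance (Breiman, 2001; Fisher et al., 2018):
# after training, each feature column is shuffled in turn and the drop in
# held-out performance is taken as that feature's importance.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature n_repeats times and record the decrease in test score.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i, (mean, std) in enumerate(zip(result.importances_mean, result.importances_std)):
    print(f"feature {i}: importance = {mean:.3f} +/- {std:.3f}")
```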
“…preprocess_data() takes continuous and categorical data, re-factors categorical data into binary features, and provides options to normalize continuous data, remove features with near-zero variance, and keep only one instance of perfectly correlated features. We set the default options based on those implemented in FIDDLE (Tang et al., 2020). More details on how to use preprocess_data() can be found in the accompanying vignette.…”
Section: Preprocessing Data (citation type: mentioning)
confidence: 99%
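As a rough illustration of the steps described in this citation statement, the following Python sketch mirrors the described behavior of preprocess_data() (the cited function is from an R package; this is not its implementation). The function name, variance threshold, and toy data below are assumptions made for the example rather than the package's defaults.

```python
# Hedged Python sketch of the preprocessing steps described above: one-hot
# encode categoricals, normalize continuous columns, drop near-zero-variance
# features, and keep one column from each set of perfectly correlated features.
import pandas as pd

def preprocess(df: pd.DataFrame, variance_threshold: float = 1e-8) -> pd.DataFrame:
    # Re-factor categorical columns into binary indicator features.
    df = pd.get_dummies(df, dtype=float)
    # Normalize features to zero mean and unit variance (guard against std == 0).
    df = (df - df.mean()) / df.std(ddof=0).replace(0, 1)
    # Remove features with (near-)zero variance.
    df = df.loc[:, df.var(ddof=0) > variance_threshold]
    # Keep only one instance of perfectly correlated feature pairs.
    corr = df.corr().abs()
    drop, cols = set(), list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if b not in drop and corr.loc[a, b] >= 1.0 - 1e-12:
                drop.add(b)
    return df.drop(columns=list(drop))

# Tiny usage example with hypothetical data.
toy = pd.DataFrame({
    "age":      [34, 51, 29, 62],
    "age_copy": [34, 51, 29, 62],   # perfectly correlated with "age"
    "constant": [1, 1, 1, 1],       # near-zero variance
    "sex":      ["F", "M", "F", "M"],  # categorical
})
print(preprocess(toy))
```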