The Mass General Brigham Biobank Portal: an i2b2-based data repository linking disparate and high-dimensional patient data to support multimodal analytics
Abstract:
Objective
Integrating and harmonizing disparate patient data sources into one consolidated data portal enables researchers to conduct analysis efficiently and effectively.
Materials and Methods
We describe an implementation of Informatics for Integrating Biology and the Bedside (i2b2) to create the Mass General Brigham (MGB) Biobank Portal data repository. The repository integrates data from primary and curated data sources a…
“…The primary data source was the MGB RPDR, an EHR data warehouse covering 4.6 million patients across the MGB HealthCare hospital system (formerly Partners HealthCare) including Brigham & Women’s Hospital, Massachusetts General Hospital, and other affiliated hospitals in the greater Boston area. To assemble the cohort for this study, we queried the MGB RPDR for 1,546,440 patients who self-identified as non-Hispanic White (that is, 74% of the overall MGB patient population) having at least three visits after 2005, more than 30 days apart between the first and last visits, and at least one visit greater than age 10 and less than age 90, as of February 2020 (22,23) (see Figure 1). The race and ethnicity restriction was applied here because the subsequent PRS were based on samples of European ancestry.…”
Section: Methods (mentioning)
confidence: 99%
“…Each consented participant was asked to provide blood samples (e.g., plasma, serum, DNA), which are then linked to their clinical data in the EHRs as well as survey data on lifestyle, behavioral and environmental factors, and family history. Leveraging in-person and electronic recruitment methods, the MGB Biobank has currently enrolled more than 130,000 participants, collected 82,092 DNA samples, and generated genotyping microarray data for more than 56,923 participants (4,920 using the Illumina MEGA, 5,334 using the Illumina MEGA EX, 26,144 using the Illumina MEG, and 24,789 using the Illumina GSA) (23). This research was conducted as part of the PsycheMERGE Consortium (24), under approval from the MGB Institutional Review Board.…”
Section: Methods (mentioning)
confidence: 99%
Background: Hospital-based biobanks have become an increasingly prominent resource for evaluating the clinical impact of disease-related polygenic risk scores (PRS). However, biobank cohorts typically rely on the selection of volunteers who may differ systematically from non-participants.
Methods: PRS weights for schizophrenia, bipolar disorder, and depression were derived using summary statistics from the largest available genomic studies. These PRS were then calculated in a sample of 24,153 European ancestry participants in the Mass General Brigham (MGB) Biobank. To correct for selection bias, we fitted a model with inverse probability (IP) weights estimated using 1,839 sociodemographic and clinical features extracted from electronic health records (EHRs) of eligible MGB patients. Finally, we tested the utility of a modular specification of the IP weight model for selection.
Results: Case prevalence of bipolar disorder among participants in the top decile of bipolar disorder PRS was 10.0% (95% CI: 8.8%-11.2%) in the unweighted analysis but only 6.2% (5.0%-7.5%) when selection bias was accounted for using IP weights. Similarly, case prevalence of depression among those in the top decile of depression PRS was reduced from 33.5% (31.7%-35.4%) in the unweighted analysis to 28.9% (25.8%-31.9%) after IP weighting. Modular correction for selection bias in intermediate selection steps did not substantially impact PRS effect estimates.
Conclusions: Non-random selection of participants into volunteer biobanks may induce clinically relevant selection bias that could impact the implementation of PRS and risk communication in clinical practice. As efforts to integrate PRS in medical practice expand, recognition and mitigation of these biases should be considered.
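The inverse probability (IP) weighting described in the Methods above can be illustrated with a minimal sketch. This is not the authors' model: they estimated weights from 1,839 EHR-derived features, whereas the sketch below assumes a single hypothetical stratum variable and estimates the selection probability as the enrolled fraction within each stratum. The function name, the toy population, and all counts are invented for illustration.

```python
import numpy as np


def ip_weighted_prevalence(stratum, enrolled, case):
    """Return (unweighted, IP-weighted) case prevalence among enrolled patients.

    stratum  : hypothetical covariate observed for every eligible patient
    enrolled : True if the patient joined the biobank
    case     : case status (only used for enrolled patients)
    """
    stratum = np.asarray(stratum)
    enrolled = np.asarray(enrolled, dtype=bool)
    case = np.asarray(case, dtype=bool)

    # Estimate P(enrolled | stratum) from the full eligible population.
    p_select = np.empty(stratum.shape, dtype=float)
    for s in np.unique(stratum):
        in_s = stratum == s
        p_select[in_s] = enrolled[in_s].mean()

    # Each enrolled patient is weighted by 1 / P(enrolled | stratum),
    # so under-sampled strata count for more.
    weights = 1.0 / p_select[enrolled]
    cases = case[enrolled].astype(float)

    unweighted = cases.mean()
    weighted = np.average(cases, weights=weights)
    return unweighted, weighted


def toy_population():
    """Invented data: stratum 1 enrolls more often and has more cases."""
    # Stratum 0: 8000 eligible, 800 enrolled, 40 cases among the enrolled.
    # Stratum 1: 2000 eligible, 1000 enrolled, 200 cases among the enrolled.
    stratum = np.repeat([0, 1], [8000, 2000])
    enrolled = np.zeros(10000, dtype=bool)
    enrolled[:800] = True
    enrolled[8000:9000] = True
    case = np.zeros(10000, dtype=bool)
    case[:40] = True
    case[8000:8200] = True
    return stratum, enrolled, case


if __name__ == "__main__":
    unweighted, weighted = ip_weighted_prevalence(*toy_population())
    print(f"unweighted: {unweighted:.3f}, IP-weighted: {weighted:.3f}")
```

In this toy population the unweighted prevalence among enrolled patients is 240/1800 (about 0.133), while the IP-weighted prevalence is 0.080, because the high-prevalence stratum is over-represented among volunteers. The direction of the shift mirrors the pattern reported in the Results, but the numbers themselves carry no empirical meaning.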
“…The platform has been used for a wide spectrum of use cases including clinical-trial enrollment (Bucalo et al., 2021), population management (Wagholikar et al., 2019), biobanking (Castro et al., 2021; Mate et al., 2017; Segagni et al., 2011), clinical decision support and epidemiological analysis (Klann and Murphy, 2013; Murchison et al., 2021; Pfiffner et al., 2016; Segagni et al., 2011; Wagholikar et al., 2017a, b). However, despite its impact and open-source availability, the deployment of the platform is largely limited to large academic medical centers.…”
Motivation
The i2b2 platform is used at major academic health institutions and research consortia for querying electronic health data. However, a major obstacle to wider use of the platform is the complexity of data loading, which entails a steep learning curve for the platform’s complex data schemas. To address this problem, we have developed the i2b2-etl package, which simplifies the data-loading process and will facilitate wider deployment and utilization of the platform.
Results
We have implemented i2b2-etl as a Python application that imports ontology and patient data using simplified input file schemas and provides built-in record-number de-identification and data validation. We describe a real-world deployment of i2b2-etl for a population-management initiative at Mass General Brigham.
Availability
i2b2-etl is a free, open-source application implemented in Python available under the Mozilla 2 license. The application can be downloaded as compiled docker images. A live demo is available at https://i2b2clinical.org/demo-i2b2etl/ (username: demo, password: Etl@2021).
Supplementary information
Supplementary data are available at Bioinformatics online.
“…The data source was the Mass General Brigham (MGB) Electronic Health Records (EHR) and Biobank, which stores blood samples from patients at MGB who have consented to provide samples for research purposes. 3 We identified 15 patients who received mepolizumab for the treatment of asthma and who had samples in the biobank, of whom 5 were determined to be responders to mepolizumab, and 10 were non-responders. We compared the pre-treatment serum level of each cytokine at mepolizumab initiation between responders and non-responders using a moderated t-test to select the cytokines with the highest differential expression between responders and non-responders.…”