Gilberto Gonzalez-Arroyo scite author profile

Introduction A major challenge to enabling precision health at a global scale is the bias between those who enroll in state sponsored genomic research and those suffering from chronic disease. More than 30 million people have been genotyped by direct-to-consumer (DTC) companies such as 23andMe, Ancestry DNA, and MyHeritage, providing a potential mechanism for democratizing access to medical interventions and thus catalyzing improvements in patient outcomes as the cost of data acquisition drops. However, much of these data are sequestered in the initial provider network, without the ability for the scientific community to either access or validate. Here, we present a novel geno-pheno platform that integrates heterogeneous data sources and applies learnings to common chronic disease conditions including Type 2 diabetes (T2D) and hypertension. Methods We collected genotyped data from a novel DTC platform where participants upload their genotype data files and were invited to answer general health questionnaires regarding cardiometabolic traits over a period of 6 months. Quality control, imputation, and genome-wide association studies were performed on this dataset, and polygenic risk scores were built in a case–control setting using the BASIL algorithm. Results We collected data on N = 4,550 (389 cases / 4,161 controls) who reported being affected or previously affected for T2D and N = 4,528 (1,027 cases / 3,501 controls) for hypertension. We identified 164 out of 272 variants showing identical effect direction to previously reported genome-significant findings in Europeans. Performance metric of the PRS models was AUC = 0.68, which is comparable to previously published PRS models obtained with larger datasets including clinical biomarkers. Discussion DTC platforms have the potential of inverting research models of genome sequencing and phenotypic data acquisition. Quality control (QC) mechanisms proved to successfully enable traditional GWAS and PRS analyses. The direct participation of individuals has shown the potential to generate rich datasets enabling the creation of PRS cardiometabolic models. More importantly, federated learning of PRS from reuse of DTC data provides a mechanism for scaling precision health care delivery beyond the small number of countries who can afford to finance these efforts directly. Conclusions The genetics of T2D and hypertension have been studied extensively in controlled datasets, and various polygenic risk scores (PRS) have been developed. We developed predictive tools for both phenotypes trained with heterogeneous genotypic and phenotypic data generated outside of the clinical environment and show that our methods can recapitulate prior findings with fidelity. From these observations, we conclude that it is possible to leverage DTC genetic repositories to identify individuals at risk of debilitating diseases based on their unique genetic landscape so that informed, timely clinical interventions can be incorporated.

show abstract

Validating and automating learning of cardiometabolic polygenic risk scores from direct-to-consumer genetic and phenotypic data: implications for scaling precision health research

Pineda

Vernekar

Grau³

et al. 2022

Preprint

View full text Add to dashboard Cite

IntroductionA major challenge to enabling precision health at a global scale is the bias between those who enroll in state sponsored genomic research and those suffering from chronic disease. More than 30 million people have been genotyped by direct-to-consumer (DTC) companies such as 23andMe, Ancestry DNA, and MyHeritage, providing a potential mechanism for democratizing access to medical interventions and thus catalyzing improvements in patient outcomes as the cost of data acquisition drops. However, much of these data are sequestered in the initial provider network, without the ability for the scientific community to either access or validate. Here, we present a novel geno-pheno platform that integrates heterogeneous data sources and applies learnings to common chronic disease conditions including Type 2 diabetes (T2D) and hypertension.MethodsWe collected genotyped data from a novel DTC platform where participants upload their genotype data files, and were invited to answer general health questionnaires regarding cardiometabolic traits over a period of 6 months. Quality control, imputation and genome-wide association studies were performed on this dataset, and polygenic risk scores were built in a case-control setting using the BASIL algorithm.ResultsWe collected data on N=4,550 (389 cases / 4,161 controls) who reported being affected or previously affected for T2D; and N=4,528 (1,027 cases / 3,501 controls) for hypertension. We identified 164 out of 272 variants showing identical effect direction to previously reported genome-significant findings in Europeans. Performance metric of the PRS models was AUC=0.68, which is comparable to previously published PRS models obtained with larger datasets including clinical biomarkers.DiscussionDTC platforms have the potential of inverting research models of genome sequencing and phenotypic data acquisition. Quality control (QC) mechanisms proved to successfully enable traditional GWAS and PRS analyses. The direct participation of individuals has shown the potential to generate rich datasets enabling the creation of PRS cardiometabolic models. More importantly, federated learning of PRS from reuse of DTC data provides a mechanism for scaling precision health care delivery beyond the small number of countries who can afford to finance these efforts directly.ConclusionsThe genetics of T2D and hypertension have been studied extensively in controlled datasets, and various polygenic risk scores (PRS) have been developed. We developed predictive tools for both phenotypes trained with heterogeneous genotypic and phenotypic data generated outside of the clinical environment and show that our methods can recapitulate prior findings with fidelity. From these observations, we conclude that it is possible to leverage DTC genetic repositories to identify individuals at risk of debilitating diseases based on their unique genetic landscape so that informed, timely clinical interventions can be incorporated.

show abstract

A binational cancer registry analysis of Hispanic patients in Mexico and the United States.

Pineda¹,

Garrido-Sanchez

Gonzalez-Arroyo³

et al. 2022

JCO

View full text Add to dashboard Cite

e22513 Background: A marked disparity in cancer burden exists between regions of the world. This global picture can only be obtained because of data obtained from population-based cancer registries, which allow estimates for different geographic areas. We investigate the data differences, clinical characteristics, tumor information, and patient outcomes of Hispanic patients in North America. Methods: We used data from Mexico’s National Cancer Registry and the United States Surveillance, Epidemiology, and End Results Program (SEER) program with four years of follow-up on each registry. of Mexican registry, we excluded outliers and records with more than 80% missing variable; of the United States registry, only patients identified as hispanic or latino were included. On the Mexican registry we re-coded all variables and the United States registry data on secondary neoplasm was excluded to match boths registry. Baseline characteristics and a geographical information system was constructed to map cancer registries and their relative frequencies. A survival analysis was done to understand the correlation between covariates. Lastly, we build machine learning models to predict 3-year overall survival. Results: The Hispanic cancer database (HCB) consists of 291,178 patients (19,904 from Mexico, and 271,274 from the United States). The top three most frequent cancer types were breast, prostate and hematological. The age of diagnosis was 55±17 years. Mexico has a slight skewness towards the earlier age of diagnosis of females. The registries with the highest burden of cancer were New Mexico (USA) and Baja California Sur (Mexico). Average survival months seem very stable across registries in the United States, but not in Mexico. Our linear regression model achieved a coefficient of determination (R-squared) of 0.49, while the logistic regression achieved an AUC of 0.82, with an F1-score of 0.88. Conclusions: Cancer registries are important tools for prevention and development of control programs. Hispanics are a traditionally neglected population in oncological clinical trials, with low enrollment of patients outside of the United States. Mexico enacted its National Cancer Registry Law in 2017, which alongside the Hispanic data on the SEER program in the United States offers enormous opportunities for continued collaboration and understanding of cancer. Both Mexico and the United States can strengthen their cancer prevention strategies and generate trans-border collaboration, research, and patient-support networks.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.