Genetic heterogeneity: Challenges, impacts, and methods through an associative lens

Woodward, Alexa; Urbanowicz, Ryan J.; Naj, Adam C.; Moore, Jason H.

doi:10.1002/gepi.22497

Cited by 18 publications

(18 citation statements)

References 178 publications

(210 reference statements)

“…The standard GWAS approach is a single-locus one—each variant is tested for association with the trait, and it is implicitly assumed that the presence of other causative loci does not affect marginal associations [1]. This is well-suited for identifying common variants with relatively large effect but is not designed for more complex situations [2,3]. Attempting to map multiple causal variants using single-locus models will generally decrease power and can bias estimates.…”

Section: Figure (mentioning)

confidence: 99%

“…In summary, GARFIELD combines the efficient variable selection (among thousands of markers from a given region or set of regions) and prediction advantages of random forests to produce pseudo-genotypes that help identify complex interactions, and subsequently use logic gates to explore and describe them. We note that several other methods for detecting allelic heterogeneity exist [3,18-20], particularly various collapsing tests for capturing the cumulative effects of many rare variants [21,22]. Likewise, random forests have been used for variant selection for nearly 20 years [23,24] (although our use of logic gates appears to be novel).…”

mentioning

confidence: 99%
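The two citation statements above argue that single-locus tests lose power when several alleles can independently cause the same phenotype, and that combining loci with a logic gate can recover the signal. A toy sketch of that argument follows. It is not GARFIELD or any published method: the two-locus model, the function names `build_joint` and `phi`, and all parameter values (carrier frequencies, penetrance, baseline risk) are assumptions chosen purely for illustration. It computes the exact correlation between the phenotype and (a) one locus alone, (b) an OR-combined pseudo-genotype.

```python
from itertools import product
from math import sqrt


def build_joint(p_a=0.1, p_b=0.1, penetrance=0.8, baseline=0.05):
    """Exact joint distribution over (a, b, y).

    Assumed toy model: carrier status at loci A and B is independent, and
    phenotype risk is raised to `penetrance` if EITHER locus carries a
    causal allele (allelic/locus heterogeneity); otherwise it is `baseline`.
    """
    joint = {}
    for a, b in product((0, 1), repeat=2):
        p_geno = (p_a if a else 1 - p_a) * (p_b if b else 1 - p_b)
        risk = penetrance if (a or b) else baseline
        joint[(a, b, 1)] = p_geno * risk
        joint[(a, b, 0)] = p_geno * (1 - risk)
    return joint


def phi(joint, f, g):
    """Pearson (phi) correlation between two binary functions of an outcome."""
    ef = sum(p * f(o) for o, p in joint.items())
    eg = sum(p * g(o) for o, p in joint.items())
    efg = sum(p * f(o) * g(o) for o, p in joint.items())
    return (efg - ef * eg) / sqrt(ef * (1 - ef) * eg * (1 - eg))


joint = build_joint()
phenotype = lambda o: o[2]

# (a) Marginal, single-locus association: locus A versus phenotype.
r_single = phi(joint, lambda o: o[0], phenotype)

# (b) OR logic gate over both loci, used as a pseudo-genotype.
r_or = phi(joint, lambda o: int(o[0] or o[1]), phenotype)

print(f"single-locus r^2: {r_single ** 2:.3f}")
print(f"OR-combined  r^2: {r_or ** 2:.3f}")
```

Because the expected association test statistic grows roughly with sample size times r^2, the gap between the two printed values gives a rough sense of how much power a single-locus scan gives up under this assumed model.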

On the contribution of genetic heterogeneity to complex traits

Liu, Swarts, et al. 2024

Preprint

Genetic heterogeneity, where different alleles or loci are responsible for similar phenotypes, reduces the power of genome-wide association studies and can cause misleading results. Although many striking examples have been identified, the general importance of genetic heterogeneity for complex traits is unclear. Here, we use a novel interpretative machine-learning approach to look for evidence of genetic heterogeneity in plants and humans. Our approach helps identify new loci/alleles influencing trait variation in several agriculturally important species, and we show that at least 6% of maize eQTL, half of them newly identified, exhibit evidence of allelic heterogeneity. Finally, we search for evidence of synthetic associations in human GWAS data, and find that as many as 3-5% may be affected. Our results highlight the need to take genetic heterogeneity seriously, and provide a simple approach for doing so.

“…MultiSURF (49) and collective feature selection (50)), and modeling (i.e. ExSTraCS (51), a rule-based algorithm designed specifically to address the challenges of detecting and characterizing epistasis (52) and heterogeneous associations (53) in biomedical data), (4) conduct statistical significance comparisons (between algorithms and datasets), (5) collectively compare and contrast feature importance (FI) estimates across modelling algorithms, and (6) generate a comprehensive sharable summary report. With respect to overall AutoML design and goals, STREAMLINE currently is most closely related to MLIJAR-supervised (33) and MLme (32) AutoML tools.…”

Section: Introduction (mentioning)

confidence: 99%

STREAMLINE: A Simple, Transparent, End-To-End Automated Machine Learning Pipeline Facilitating Data Analysis and Algorithm Comparison

Urbanowicz, Zhang, Cui, et al. 2023

Genetic and Evolutionary Computation

Objective: While machine learning (ML) includes a valuable array of tools for analyzing biomedical data with multivariate and complex underlying associations, significant time and expertise is required to assemble effective, rigorous, comparable, reproducible, and unbiased pipelines. Automated ML (AutoML) tools seek to facilitate ML application by automating a subset of analysis pipeline elements. In this study we develop and validate a Simple, Transparent, End-to-end Automated Machine Learning Pipeline (STREAMLINE) and apply it to investigate the added utility of photography-based phenotypes for predicting obstructive sleep apnea (OSA); a common and underdiagnosed condition associated with a variety of health, economic, and safety consequences. Methods: STREAMLINE is designed to tackle biomedical binary classification tasks while (1) avoiding common mistakes, (2) accommodating complex associations and common data challenges, and (3) allowing scalability, reproducibility, and model interpretation. It automates the majority of established, generalizable, and reliably automatable aspects of an ML analysis pipeline while incorporating cutting edge algorithms and providing opportunities for human-in-the-loop customization. We present a broadly refactored and extended release of STREAMLINE, validating and benchmarking performance across simulated and real-world datasets. Then we applied STREAMLINE to evaluate the utility of demographics (DEM), self-reported comorbidities (DX), symptoms (SYM), and photography-based craniofacial (CF) and intraoral (IO) anatomy measures in predicting 'any OSA' or 'moderate/severe OSA' using 3,111 participants from Sleep Apnea Global Interdisciplinary Consortium (SAGIC). Results: Benchmarking analyses validated the efficacy of STREAMLINE across data simulations with increasingly complex patterns of association including epistatic interactions and genetic heterogeneity. 
OSA analyses identified a significant increase in ROC-AUC when adding CF to DEM+DX+SYM to predict 'moderate/severe' OSA. Additionally, a consistent but non-significant increase in PRC-AUC was observed with the addition of each subsequent feature set to predict 'any OSA', with CF and IO yielding minimal improvements. Conclusion: STREAMLINE is an effective, rigorous, transparent, and easy-to-use AutoML approach to a comparative ML analysis that adheres to best practices in data science. Application of STREAMLINE to OSA data suggests that CF features provide additional value in predicting moderate/severe OSA, but neither CF nor IO features meaningfully improved the prediction of 'any OSA' beyond established demographics, comorbidity and symptom characteristics.

Keywords: automated machine learning • obstructive sleep apnea • data science • predictive modeling • craniofacial traits • intraoral anatomy

…user-specification of feature types (which cannot always be reliably automated) and one-hot-encoding of categorical features for modeling, (3) engineering of 'missingness features' to consider missingness as a potentially informati...

“…In the absence of the right subgrouping, phenotypic heterogeneity may compromise statistical analysis by leading to a substantial power loss and potentially low reproducibility rates in detecting and understanding the underlying mechanisms of heterogeneous phenotypes [10,25-27]. Since the sub-classification or heterogeneity nature of the molecular background of phenotypes is typically unknown, it becomes a computational and statistical challenge to find surrogates to the subtypes.…”

Section: Introduction (mentioning)

confidence: 99%

Phenotypic subtyping via contrastive learning

Gorla, Sankararaman, Burchard, et al. 2023

Preprint

Defining and accounting for subphenotypic structure has the potential to increase statistical power and provide a deeper understanding of the heterogeneity in the molecular basis of complex disease. Existing phenotype subtyping methods primarily rely on clinically observed heterogeneity or metadata clustering. However, they generally tend to capture the dominant sources of variation in the data, which often originate from variation that is not descriptive of the mechanistic heterogeneity of the phenotype of interest; in fact, such dominant sources of variation, such as population structure or technical variation, are, in general, expected to be independent of subphenotypic structure. We instead aim to find a subspace with signal that is unique to a group of samples for which we believe that subphenotypic variation exists (e.g., cases of a disease). To that end, we introduce Phenotype Aware Components Analysis (PACA), a contrastive learning approach leveraging canonical correlation analysis to robustly capture weak sources of subphenotypic variation. In the context of disease, PACA learns a gradient of variation unique to cases in a given dataset, while leveraging control samples for accounting for variation and imbalances of biological and technical confounders between cases and controls. We evaluated PACA using an extensive simulation study, as well as on various subtyping tasks using genotypes, transcriptomics, and DNA methylation data. Our results provide multiple strong evidence that PACA allows us to robustly capture weak unknown variation of interest while being calibrated and well-powered, far superseding the performance of alternative methods. This renders PACA as a state-of-the-art tool for defining de novo subtypes that are more likely to reflect molecular heterogeneity, especially in challenging cases where the phenotypic heterogeneity may be masked by a myriad of strong unrelated effects in the data.
