Eun Yong Kang scite author profile

Identifying Causal Variants at Loci with Multiple Signals of Association

Hormozdiari¹, Kostem², ³ et al. 2014

Although genome-wide association studies have successfully identified thousands of risk loci for complex traits, only a handful of the biologically causal variants, responsible for association at these loci, have been successfully identified. Current statistical methods for identifying causal variants at risk loci either use the strength of the association signal in an iterative conditioning framework or estimate probabilities for variants to be causal. A main drawback of existing methods is that they rely on the simplifying assumption of a single causal variant at each risk locus, which is typically invalid at many risk loci. In this work, we propose a new statistical framework that allows for the possibility of an arbitrary number of causal variants when estimating the posterior probability of a variant being causal. A direct benefit of our approach is that we predict a set of variants for each locus that under reasonable assumptions will contain all of the true causal variants with a high confidence level (e.g., 95%) even when the locus contains multiple causal variants. We use simulations to show that our approach provides 20-50% improvement in our ability to identify the causal variants compared to the existing methods at loci harboring multiple causal variants. We validate our approach using empirical data from an expression QTL study of CHI3L2 to identify new causal variants that affect gene expression at this locus. CAVIAR is publicly available online at http://genetics.cs.ucla.edu/caviar/.

Although genome-wide association studies (GWAS) reproducibly identified thousands of risk loci (Hakonarson et al. 2007; Sladek et al. 2007; Zeggini et al. 2007; Yang et al. 2011a,b; Kottgen et al. 2013; Lu et al. 2013; Ripke et al. 2013), only a handful of causal genetic variants (i.e., variants that biologically alter disease risk) have been found (Altshuler et al. 2008; Manolio et al. 2008; McCarthy et al. 2008), thus prohibiting the mechanistic understanding of the genetic basis of common diseases. The linkage disequilibrium (LD) (Pritchard and Przeworski 2001; Reich et al. 2001) structure of the human genome has greatly benefited GWAS in interrogating only a subset of all variants to assay common variation across the genome. Unfortunately, LD hinders the identification of causal variants at risk loci in fine-mapping studies as at each locus, there are often tens to hundreds of variants tightly linked to the reported associated single-nucleotide polymorphism (SNP) (Malo et al. 2008; Maller et al. 2012; Yang et al. 2012). In a continued effort to identify causal variants, many fine-mapping studies that assess genetic variation at known GWAS risk loci are currently underway (Bauer et al. 2013; Coram et al. 2013; Diogo et al. 2013; Gong et al. 2013; Marigorta and Navarro 2013; Peters et al. 2013; Wu et al. 2013).

Fine-mapping studies typically follow a two-step procedure. First, a statistical analysis of the association signal is performed to identify a minimum set of SNPs that can explain the signal. Second, the SNPs that ar...

Genetic and environmental control of host-gut microbiota interactions

Org¹, Parks², Joo³ et al. 2015

Genetics provides a potentially powerful approach to dissect host-gut microbiota interactions. Toward this end, we profiled gut microbiota using 16S rRNA gene sequencing in a panel of 110 diverse inbred strains of mice. This panel has previously been studied for a wide range of metabolic traits and can be used for high-resolution association mapping. Using a SNP-based approach with a linear mixed model, we estimated the heritability of microbiota composition. We conclude that, in a controlled environment, the genetic background accounts for a substantial fraction of abundance of most common microbiota. The mice were previously studied for response to a high-fat, high-sucrose diet, and we hypothesized that the dietary response was determined in part by gut microbiota composition. We tested this using a cross-fostering strategy in which a strain showing a modest response, SWR, was seeded with microbiota from a strain showing a strong response, A×B19. Consistent with a role of microbiota in dietary response, the cross-fostered SWR pups exhibited a significantly increased response in weight gain. To examine specific microbiota contributing to the response, we identified various genera whose abundance correlated with dietary response. Among these, we chose Akkermansia muciniphila, a common anaerobe previously associated with metabolic effects. When A. muciniphila was administered to strain A×B19 by gavage, the dietary response was significantly blunted for obesity, plasma lipids, and insulin resistance. In an effort to further understand host-microbiota interactions, we mapped loci controlling microbiota composition and prioritized candidate genes. Our publicly available data provide a resource for future studies.

Fine Mapping in 94 Inbred Mouse Strains Using a High-Density Haplotype Resource

Kirby¹, ², Wade³ et al. 2010

The genetics of phenotypic variation in inbred mice has for nearly a century provided a primary weapon in the medical research arsenal. A catalog of the genetic variation among inbred mouse strains, however, is required to enable powerful positional cloning and association techniques. A recent whole-genome resequencing study of 15 inbred mouse strains captured a significant fraction of the genetic variation among a limited number of strains, yet the common use of hundreds of inbred strains in medical research motivates the need for a high-density variation map of a larger set of strains. Here we report a dense set of genotypes from 94 inbred mouse strains containing 10.77 million genotypes over 121,433 single nucleotide polymorphisms (SNPs), dispersed at 20-kb intervals on average across the genome, with an average concordance of 99.94% with previous SNP sets. Through pairwise comparisons of the strains, we identified an average of 4.70 distinct segments over 73 classical inbred strains in each region of the genome, suggesting limited genetic diversity between the strains. Combining these data with genotypes of 7570 gap-filling SNPs, we further imputed the untyped or missing genotypes of 94 strains over 8.27 million Perlegen SNPs. The imputation accuracy among classical inbred strains is estimated at 99.7% for the genotypes imputed with high confidence. We demonstrated the utility of these data in high-resolution linkage mapping through power simulations and statistical power analysis and provide guidelines for developing such studies. We also provide a resource of in silico association mapping between the complex traits deposited in the Mouse Phenome Database with our genotypes. We expect that these resources will facilitate effective designs of both human and mouse studies for dissecting the genetic basis of complex traits.

Hybrid mouse diversity panel: a panel of inbred mouse strains suitable for analysis of complex genetic traits

Ghazalpour¹, Rau², Farber³ et al. 2012

We have developed an association-based approach using classical inbred strains of mice in which we correct for population structure, which is very extensive in mice, using an efficient mixed-model algorithm. Our approach includes inbred parental strains as well as recombinant inbred strains in order to capture loci with effect sizes typical of complex traits in mice (in the range of 5% of total trait variance). Over the last few years, we have typed the hybrid mouse diversity panel (HMDP) strains for a variety of clinical traits as well as intermediate phenotypes and have shown that the HMDP has sufficient power to map genes for highly complex traits with resolution that is in most cases less than a megabase. In this essay, we review our experience with the HMDP, describe various ongoing projects, and discuss how the HMDP may fit into the larger picture of common diseases and different approaches.

A powerful and efficient set test for genetic markers that handles confounders

Listgarten¹, Lippert², ³ et al. 2013

Motivation: Approaches for testing sets of variants, such as a set of rare or common variants within a gene or pathway, for association with complex traits are important. In particular, set tests allow for aggregation of weak signal within a set, can capture interplay among variants and reduce the burden of multiple hypothesis testing. Until now, these approaches did not address confounding by family relatedness and population structure, a problem that is becoming more important as larger datasets are used to increase power.

Results: We introduce a new approach for set tests that handles confounders. Our model is based on the linear mixed model and uses two random effects: one to capture the set association signal and one to capture confounders. We also introduce a computational speedup for two random-effects models that makes this approach feasible even for extremely large cohorts. Using this model with both the likelihood ratio test and score test, we find that the former yields more power while controlling type I error. Application of our approach to richly structured Genetic Analysis Workshop 14 data demonstrates that our method successfully corrects for population structure and family relatedness, whereas application of our method to a 15 000 individual Crohn’s disease case–control cohort demonstrates that it additionally recovers genes not recoverable by univariate analysis.

Availability: A Python-based library implementing our approach is available at http://mscompbio.codeplex.com.

Contact: jennl@microsoft.com or lippert@microsoft.com or heckerma@microsoft.com

Supplementary information: Supplementary data are available at Bioinformatics online.

Co-expression networks reveal the tissue-specific regulation of transcription and splicing

Saha¹, Kim², Gewirtz³ et al. 2017

Gene co-expression networks capture biologically important patterns in gene expression data, enabling functional analyses of genes, discovery of biomarkers, and interpretation of genetic variants. Most network analyses to date have been limited to assessing correlation between total gene expression levels in a single tissue or small sets of tissues. Here, we built networks that additionally capture the regulation of relative isoform abundance and splicing, along with tissue-specific connections unique to each of a diverse set of tissues. We used the Genotype-Tissue Expression (GTEx) project v6 RNA sequencing data across 50 tissues and 449 individuals. First, we developed a framework called Transcriptome-Wide Networks (TWNs) for combining total expression and relative isoform levels into a single sparse network, capturing the interplay between the regulation of splicing and transcription. We built TWNs for 16 tissues and found that hubs in these networks were strongly enriched for splicing and RNA binding genes, demonstrating their utility in unraveling regulation of splicing in the human transcriptome. Next, we used a Bayesian biclustering model that identifies network edges unique to a single tissue to reconstruct Tissue-Specific Networks (TSNs) for 26 distinct tissues and 10 groups of related tissues. Finally, we found genetic variants associated with pairs of adjacent nodes in our networks, supporting the estimated network structures and identifying 20 genetic variants with distant regulatory impact on transcription and splicing. Our networks provide an improved understanding of the complex relationships of the human transcriptome across tissues.

Proc. Natl. Acad. Sci. U.S.A.

Identification of individuals by trait prediction using whole-genome sequencing data

Lippert¹, Sabatini², Maher³ et al. 2017

Prediction of human physical traits and demographic information from genomic data challenges privacy and data deidentification in personalized medicine. To explore the current capabilities of phenotype-based genomic identification, we applied whole-genome sequencing, detailed phenotyping, and statistical modeling to predict biometric traits in a cohort of 1,061 participants of diverse ancestry. Individually, for a large fraction of the traits, their predictive accuracy beyond ancestry and demographic information is limited. However, we have developed a maximum entropy algorithm that integrates multiple predictions to determine which genomic samples and phenotype measurements originate from the same person. Using this algorithm, we have reidentified an average of >8 of 10 held-out individuals in an ethnically mixed cohort and an average of 5 of either 10 African Americans or 10 Europeans. This work challenges current conceptions of personal privacy and may have far-reaching ethical and legal implications.