Abstract: In metagenomic studies, testing the association between microbiome composition and clinical outcomes translates to testing the nullity of variance components. Motivated by a lung human immunodeficiency virus (HIV) microbiome project, we study longitudinal microbiome data by using variance component models with more than two variance components. Current testing strategies only apply to models with exactly two variance components and when sample sizes are large. Therefore, they are not applicable to longitudinal…
“…Although under regularity conditions, and are consistent estimators for variance component parameters and under the null hypothesis , the classical score-type variance component test above treats them as fixed numbers and ignores the variability in their estimation, which could result in not well-calibrated p values in finite samples. This is a known issue for score-type variance component tests in microbiome association studies with small sample sizes [ 48 – 50 ]. Despite large sample sizes in biobank-scale cohorts, the local IBD matrix Ψ l for genetic locus l is often sparse, which could invalidate asymptotic inference on the quadratic form .…”
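To make the point in the excerpt above concrete, here is a minimal sketch of a plug-in score-type variance component test. It is not the cited authors' method. It assumes an intercept-only null model, a hypothetical kernel matrix `K`, and a Monte Carlo null distribution in place of the usual asymptotic mixture of chi-squares. All function names are illustrative. The key feature is that the residual variance estimate `sigma2_hat` is computed once and then treated as a fixed number when simulating the null, which is exactly the simplification the excerpt warns about in small samples.

```python
import random


def center(values):
    """Subtract the sample mean (residuals of an intercept-only model)."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def quadratic_form(r, K):
    """Compute r' K r for a vector r and a square matrix K (list of lists)."""
    n = len(r)
    return sum(r[i] * K[i][j] * r[j] for i in range(n) for j in range(n))


def plug_in_score_test(y, K, n_draws=2000, seed=0):
    """Score-type statistic Q = r' K r with a plug-in Monte Carlo p value.

    The residual variance is estimated once and then held fixed while
    simulating the null, so its estimation variability is ignored.
    """
    rng = random.Random(seed)
    n = len(y)
    r = center(y)
    sigma2_hat = sum(v * v for v in r) / (n - 1)  # plug-in estimate
    q_obs = quadratic_form(r, K)
    sd = sigma2_hat ** 0.5
    exceed = 0
    for _ in range(n_draws):
        # Null draw: no random effect, noise variance fixed at sigma2_hat.
        r_null = center([rng.gauss(0.0, sd) for _ in range(n)])
        if quadratic_form(r_null, K) >= q_obs:
            exceed += 1
    p_value = (1 + exceed) / (1 + n_draws)
    return q_obs, p_value


if __name__ == "__main__":
    # Hypothetical data: two groups of 10 subjects; K marks group sharing.
    n = 20
    K = [[1.0 if (i < 10) == (j < 10) else 0.0 for j in range(n)]
         for i in range(n)]
    y = [2.0] * 10 + [-2.0] * 10
    q, p = plug_in_score_test(y, K)
    print("Q =", q, "plug-in p value =", p)
```

A finite-sample correction would instead re-estimate the variance inside each null draw, or use a statistic that is free of the nuisance variance; the sketch deliberately omits that step so the plug-in behaviour is visible.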
Although genome-wide association studies (GWAS) have identified tens of thousands of genetic loci, the genetic architecture is still not fully understood for many complex traits. Most GWAS and sequencing association studies have focused on single nucleotide polymorphisms or copy number variations, including common and rare genetic variants. However, phased haplotype information is often ignored in GWAS or variant set tests for rare variants. Here we leverage the identity-by-descent (IBD) segments inferred from a random projection-based IBD detection algorithm in the mapping of genetic associations with complex traits, to develop a computationally efficient statistical test for IBD mapping in biobank-scale cohorts. We used sparse linear algebra and random matrix algorithms to speed up the computation, and a genome-wide IBD mapping scan of more than 400,000 samples finished within a few hours. Simulation studies showed that our new method had well-controlled type I error rates under the null hypothesis of no genetic association in large biobank-scale cohorts, and outperformed traditional GWAS single-variant tests when the causal variants were untyped and rare, or in the presence of haplotype effects. We also applied our method to IBD mapping of six anthropometric traits using the UK Biobank data and identified a total of 3,442 associations, 2,131 (62%) of which remained significant after conditioning on suggestive tag variants in the ± 3 centimorgan flanking regions from GWAS.
“…This is a known issue for score-type variance component tests in microbiome association studies with small sample sizes. [43][44][45] Despite large sample sizes in biobank-scale cohorts, the local IBD matrix 𝛹 𝑙 for genetic locus 𝑙 is often sparse, which could invalidate asymptotic inference on the quadratic form 𝑄 𝑙 =…”
Although genome-wide association studies (GWAS) have identified tens of thousands of genetic loci, the genetic architecture is still not fully understood for many complex traits. Most GWAS and sequencing association studies have focused on single nucleotide polymorphisms or copy number variations, including common and rare genetic variants. However, phased haplotype information is often ignored in GWAS or variant set tests for rare variants. Here we leverage the identity-by-descent (IBD) segments inferred from a random projection-based IBD detection algorithm in the mapping of genetic associations with complex traits, to develop a computationally efficient statistical test for IBD mapping in biobank-scale cohorts. We used sparse linear algebra and random matrix algorithms to speed up the computation, and a genome-wide IBD mapping scan of more than 400,000 samples finished within a few hours. Simulation studies showed that our new method had well-controlled type I error rates under the null hypothesis of no genetic association in large biobank-scale cohorts, and outperformed traditional GWAS approaches and variant set tests when the causal variants were untyped and rare, or in the presence of haplotype effects. We also applied our method to IBD mapping of six anthropometric traits using the UK Biobank data and identified a 4 cM region on chromosome 8 associated with multiple traits related to body fat distribution or weight.
“…However, inference on the variance components is less studied and often requires strong distributional assumptions on the random effects and the error terms. When the underlying distributions are assumed to be multivariate normal, classical inference methods, such as the likelihood ratio test, the restricted likelihood ratio test, and the score test (Self and Liang, 1987; Zhang and Lin, 2003; Koh et al., 2019; Zhai et al., 2019), can be applied. However, these parametric methods are often restrictive and not robust if the model assumptions are violated.…” (Source header: Statistica Sinica, newly accepted paper; accepted author version subject to English editing.)
Section: Introduction
“…The linear structure holds when each component of D(θ * ) is a linear function of θ * (Lin, 1997). This encompasses nested, crossed, and clustered designs (Michalski and Zmyślony, 1996; Zhai et al., 2019; Chen et al., 2019; Li et al., 2021). See Section 5.1 for a specific example of such a random-effect model for modeling the family data that includes additive genetic effect, common environment and unique subject-specific We first introduce some notation.…”
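As a hedged illustration of the linear structure described in the excerpt above (not taken from the cited paper), one common way to write it is as a weighted sum of known matrices. The symbols G_k, Φ, C, and the variance names below are assumptions introduced for this sketch: G_k are known design-derived matrices, Φ is a kinship matrix, C indicates a shared household, and I_n is the identity matrix.

```latex
% Linear covariance structure: each entry is linear in theta.
\[
  D(\theta) \;=\; \sum_{k=1}^{K} \theta_k \, G_k ,
  \qquad \theta_k \ge 0 .
\]
% A family-data instance with additive genetic, common-environment,
% and subject-specific components (the standard ACE-type decomposition).
\[
  \operatorname{Cov}(y) \;=\; \sigma_a^2 \,(2\Phi)
  \;+\; \sigma_c^2 \, C
  \;+\; \sigma_e^2 \, I_n .
\]
```

Testing whether one component is absent then amounts to testing a hypothesis such as σ_a² = 0, which places the null value on the boundary of the parameter space.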
Linear mixed-effects models are widely used in analyzing repeated measures data, including clustered and longitudinal data, where inferences on both fixed effects and variance components are of interest. Unlike inference on fixed effects, which has been well studied, inference on the variance components is more challenging because the null value lies on the boundary of the parameter space and the unknown fixed effects act as nuisance parameters. Existing methods require strong distributional assumptions