The interpretation of the results of large association studies encompassing much or all of the human genome faces a fundamental statistical problem: a correspondingly large number of single nucleotide polymorphism markers will be spuriously flagged as significant. A common method of dealing with these false positives is to make the significance level for the individual tests for association of each marker more stringent. Any such adjustment for multiple testing is ultimately based on a more or less precise estimate of the actual overall type I error probability. We estimate this probability for association tests for correlated markers and show that it depends in a nonlinear way on the significance level for the individual tests. This dependence of the effective number of tests is not taken into account by existing multiple-testing corrections, leading to widely overestimated results. We demonstrate a simple correction for multiple testing, which can easily be calculated from the pairwise correlation and gives far more realistic estimates for the effective number of tests than previous formulae. The calculation is considerably faster than with other methods and hence applicable on a genome-wide scale. The efficacy of our method is shown on a constructed example with highly correlated markers as well as on real data sets, including a full genome scan where a conservative estimate only 8% above the permutation estimate is obtained in about 1% of the computation time. As the calculation is based on pairwise correlations between markers, it can be performed at the stage of study design using public databases.

INTRODUCTION

Case-control studies to identify single nucleotide polymorphism (SNP) markers associated with a disease are commonly used to pinpoint genes that may play a central role in understanding the genetic background of complex diseases.
The availability of efficient and reliable genotyping techniques has made it possible to extend the scope of association studies to encompass the whole genome, cf. the recent [WTCCC, 2007] project.

A fundamental difficulty in the interpretation of the results of such large-scale association studies is presented by the following simple fact of statistical theory, also known as the multiple-testing problem. When a number N of statistical tests are performed, each of which has a type I error probability α, the expected number of (false) significant findings, assuming the null hypothesis in each test, is equal to Nα, irrespective of whether the tests are statistically independent or not, and whether they test the same or different hypotheses. As a consequence, in a large-scale association study testing a large number of SNP marker loci, any true association result will be accompanied and obscured by a correspondingly large number of spurious associations.

A widely accepted approach to deal with this problem is a multiple-testing correction, adjusting the significance level for each test to a value α such that the overall type I error for the study, i.e. the probability P...
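The expected count of false positives described above (the number of tests times the per-test type I error probability) can be checked with a few lines of arithmetic. The sketch below is an illustration only and is not the correction proposed in the paper: it assumes the standard Bonferroni and Šidák adjustments, the latter valid for independent tests, and all function names are hypothetical.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of false positives under the global null.

    Holds whether or not the tests are independent (linearity of expectation).
    """
    return n_tests * alpha


def familywise_error_independent(n_tests: int, alpha: float) -> float:
    """Probability of at least one false positive, ASSUMING independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_level(n_tests: int, overall_alpha: float) -> float:
    """Per-test level from the Bonferroni bound (valid under any dependence)."""
    return overall_alpha / n_tests


def sidak_level(n_tests: int, overall_alpha: float) -> float:
    """Per-test level from the Sidak formula (exact for independent tests)."""
    return 1.0 - (1.0 - overall_alpha) ** (1.0 / n_tests)


if __name__ == "__main__":
    n, alpha = 500_000, 0.05  # illustrative numbers, not taken from the paper
    print("expected false positives:", expected_false_positives(n, alpha))
    print("Bonferroni per-test level:", bonferroni_level(n, alpha))
    print("Sidak per-test level:", sidak_level(n, alpha))
```

For correlated markers both adjustments are conservative, which is the gap that an effective number of tests smaller than N is meant to close.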
Summary

Traditionally, controls in genetic case-control studies have been screened to exclude subjects with a personal history of illness. Such a control group has the advantage of optimal power to detect loci involved in illness, but requires more work and may incur substantial recruitment costs. An alternative approach is to use unscreened controls sampled from the general population. Such controls are generally plentiful and inexpensive, but there is a risk that some may have the same disease as the cases, which will reduce power to detect associations. We have quantified the extent of this power loss and produced mathematical formulae for the number of unscreened controls necessary to achieve the same power as a fixed sample of screened controls. The effect of using unscreened controls also depends on the ratio of the number of screened controls to cases specified in the original study design, and this is also investigated. We have also investigated the cost-benefits of the screened and unscreened approaches according to variation in the relative costs of sampling screened and unscreened controls, together with genotyping costs. We have thus identified the range of situations in which using unscreened controls is a cost-effective alternative to the screened control method and could be considered when designing a study. In many of the typical, real-world situations in complex genetics, the use of unscreened controls is potentially cost-effective and can, in general, be considered for disorders with population prevalence K_p < 0.2. With the steady reduction in genotyping costs and the availability of common sets of "population controls", this design is likely to become increasingly cost-effective.
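The source of the power loss can be illustrated with a simple mixture argument. The sketch below is not the paper's sample-size formula; it assumes that unscreened controls are a random population sample in which a fraction equal to the prevalence is affected, and that screening removes affected subjects perfectly. Function and parameter names are hypothetical.

```python
def unscreened_control_frequency(p_case: float, p_screened: float,
                                 prevalence: float) -> float:
    """Expected allele frequency in unscreened controls.

    Assumption: a fraction `prevalence` of unscreened controls are affected
    and carry the case allele frequency; the rest match screened controls.
    """
    return prevalence * p_case + (1.0 - prevalence) * p_screened


def frequency_difference(p_case: float, p_screened: float,
                         prevalence: float) -> float:
    """Case versus unscreened-control allele frequency difference."""
    return p_case - unscreened_control_frequency(p_case, p_screened, prevalence)


if __name__ == "__main__":
    # Illustrative values only, not taken from the paper.
    p_case, p_screened, prevalence = 0.30, 0.25, 0.10
    print("difference vs screened controls:", p_case - p_screened)
    print("difference vs unscreened controls:",
          frequency_difference(p_case, p_screened, prevalence))
```

Under these assumptions the detectable allele frequency difference shrinks by the factor one minus the prevalence, so the dilution is mild for rare disorders and grows as the disorder becomes more common.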
Polygenic risk scores (PRSs) are a method to summarize the additive trait variance captured by a set of SNPs, and can increase the power of set‐based analyses by leveraging public genome‐wide association study (GWAS) datasets. A PRS aims to assess the genetic liability to some phenotype on the basis of polygenic risk for the same or a different phenotype estimated from independent data. We propose the application of PRSs as a set‐based method with an additional component of adjustment for linkage disequilibrium (LD), with potential extension of the PRS approach to analyze biologically meaningful SNP sets. We call this method POLARIS: POlygenic Ld‐Adjusted RIsk Score. POLARIS identifies the LD structure of SNPs using spectral decomposition of the SNP correlation matrix and replaces the individuals' SNP allele counts with LD‐adjusted dosages. Using a raw genotype dataset together with SNP effect sizes from a second independent dataset, POLARIS can be used for set‐based analysis. MAGMA is an alternative set‐based approach employing principal component analysis to account for LD between markers in a raw genotype dataset. We used simulations, with both simple constructed and real LD structure, to compare the power of these methods. POLARIS shows more power than MAGMA applied to the raw genotype dataset only, but less than or comparable power to the combined analysis of both datasets. POLARIS has the advantages that it produces a risk score per person per set using all available SNPs, and aims to increase power by leveraging the effect sizes from the discovery set in a self‐contained test of association in the test dataset.
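As a point of reference, a conventional PRS without any LD adjustment is a weighted sum of allele counts. The sketch below shows only that baseline; it does not implement the POLARIS LD adjustment, whose details are not given in this abstract, and the function name and example values are hypothetical.

```python
from typing import Sequence


def polygenic_risk_score(allele_counts: Sequence[float],
                         effect_sizes: Sequence[float]) -> float:
    """Unadjusted PRS for one individual over one SNP set.

    allele_counts: risk-allele counts (0, 1 or 2, or dosages) per SNP.
    effect_sizes: per-SNP effect estimates (e.g. log odds ratios) taken
        from an independent discovery dataset.
    """
    if len(allele_counts) != len(effect_sizes):
        raise ValueError("allele_counts and effect_sizes must have equal length")
    return sum(count * beta for count, beta in zip(allele_counts, effect_sizes))


if __name__ == "__main__":
    counts = [0, 1, 2]           # illustrative genotype for three SNPs
    betas = [0.10, -0.20, 0.05]  # illustrative effect sizes
    print("PRS:", polygenic_risk_score(counts, betas))
```

Because correlated SNPs contribute overlapping signal to this sum, replacing the raw allele counts with LD-adjusted dosages, as the abstract describes, changes the inputs to the score rather than its weighted-sum form.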
It is shown that the spectrum of a one-dimensional Dirac operator with a potential q tending to infinity at infinity, and such that the positive variation of 1/q is bounded, covers the whole real line and is purely absolutely continuous. An example is given to show that in general, pure absolute continuity is lost if the condition on the positive variation is dropped. The appendix contains a direct proof for the special case of subordinacy theory used.
A major controversy in psychiatric genetics is whether nonadditive genetic interaction effects contribute to the risk of highly polygenic disorders. We applied a support vector machine (SVM) approach, which is capable of building linear and nonlinear models using kernel methods, to distinguish cases from controls in a large schizophrenia case–control sample of 11,853 subjects (5,554 cases and 6,299 controls) and compared its prediction accuracy with the polygenic risk score (PRS) approach. We also investigated whether SVMs are a suitable approach to detecting nonlinear genetic effects, that is, interactions. We found that PRS provided more accurate case/control classification than either linear or nonlinear SVMs, and we give a tentative explanation of why PRS outperforms both multivariate regression and linear kernel SVMs. In addition, we observed that nonlinear kernel SVMs showed higher classification accuracy than linear SVMs when a large number of SNPs were entered into the model. We conclude that SVMs are a potential tool for assessing the presence of interactions, prior to searching for them explicitly.