Germline copy number variants (CNVs) increase risk for many diseases, yet detecting CNVs and quantifying their contribution to disease risk in large-scale studies is challenging. We developed an approach called CNPBayes to identify latent batch effects, to provide probabilistic estimates of integer copy number across the estimated batches, and to fully integrate the copy number uncertainty in the association model for disease. We demonstrate this approach in a Pancreatic Cancer Case Control study of 7,598 participants where the major sources of technical variation were not captured by study site and varied across the genome. Candidate associations aided by this approach include deletions of 8q24 near regulatory elements of the oncogene MYC and deletions of Tumor Suppressor Candidate 3 (TUSC3). This study provides a robust Bayesian inferential framework for estimating copy number and evaluating the role of copy number in heritable diseases.
Methods
The Pancreatic Cancer Case and Control Consortium: Clinical and demographic characteristics of the cases and controls in PanC4 and recruitment methods have been previously described [20]. All samples were processed using GenomeStudio (version 2011.1, Genotyping Module 1.9.4). For GC-correction, we sampled a random subset of 30,000 Illumina probes, fit LOESS with span 1/3 to the scatterplot of log2 R ratios and probe GC content, and predicted the log2 R ratios for the full probeset from the LOESS model. For spatial correction, we applied LOESS to the GC-corrected log2 R ratios at SNPs with balanced allele frequencies (0.4 < BAF < 0.6) ordered by genomic position within each chromosome arm and predicted the GC-corrected log2 R ratios for the full probeset, including SNPs with imbalanced allele frequencies. The residuals from the spatial LOESS were used in all downstream analyses with CNPBayes.
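The GC-correction step can be illustrated with a minimal sketch. It is not the code used in this study: the function names `loess_predict` and `gc_correct` are hypothetical, the smoother is a plain degree-1 local regression with tricube weights rather than a specific LOESS implementation, and we assume that the corrected value for each probe is the observed log2 R ratio minus the LOESS prediction. The pure-Python loop is written for readability and would be slow at the scale of a full array.

```python
import random


def loess_predict(x_fit, y_fit, x_new, span=1/3):
    """Degree-1 local regression with tricube weights (a basic LOESS)."""
    k = max(2, round(span * len(x_fit)))  # neighbours used in each local fit
    points = list(zip(x_fit, y_fit))
    preds = []
    for x0 in x_new:
        # Keep the k nearest fitting points; the bandwidth h is the
        # distance from x0 to the farthest of them.
        near = sorted(points, key=lambda p: abs(p[0] - x0))[:k]
        h = max(abs(near[-1][0] - x0), 1e-12)
        sw = swu = swy = swuu = swuy = 0.0
        for x, y in near:
            u = x - x0  # centre on x0 so the fitted intercept is the prediction
            w = (1.0 - min(abs(u) / h, 1.0) ** 3) ** 3  # tricube weight
            sw += w
            swu += w * u
            swy += w * y
            swuu += w * u * u
            swuy += w * u * y
        denom = sw * swuu - swu * swu
        if sw == 0.0:
            # All neighbours sit exactly at the bandwidth: use their plain mean.
            preds.append(sum(y for _, y in near) / len(near))
        elif abs(denom) < 1e-12:
            # Degenerate design (no spread in x): use the weighted mean.
            preds.append(swy / sw)
        else:
            # Intercept of the weighted least-squares line at u = 0.
            preds.append((swuu * swy - swu * swuy) / denom)
    return preds


def gc_correct(lrr, gc, n_subset=30000, span=1/3, seed=0):
    """Fit on a random probe subset, predict for all probes, return residuals.

    Assumption: the GC-corrected value is observed minus predicted.
    """
    rng = random.Random(seed)
    n = len(lrr)
    idx = rng.sample(range(n), min(n_subset, n))
    fitted = loess_predict([gc[i] for i in idx], [lrr[i] for i in idx], gc, span)
    return [y - f for y, f in zip(lrr, fitted)]
```

Under the same assumptions, the spatial correction would reuse `loess_predict` with genomic position as the predictor, fitting only on SNPs with 0.4 < BAF < 0.6 within each chromosome arm and predicting for every probe on that arm.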
CNV regions: CNV regions identified for further analysis by CNPBayes were obtained from the collection of CNVs identified from a hidden Markov model as well as known CNV regions from the 1000 Genomes Project. For the former, we fit a 5-state hidden Markov model implemented in the R package VanillaICE (version 1.40.0) using default parameter settings [35]. To obtain a high confidence call set, we removed CNVs with fewer than 10 probes and CNVs with posterior probability less than 0.9, and restricted inference to autosomal chromosomes. To assess the effect of spatial adjustment on copy number inference, we stratified the samples into deciles of median absolute deviation and autocorrelation and compared the results of the 5-state hidden Markov model fit after GC-correction to the CNVs identified after spatial correction. Concordance of CNVs identified by the hidden Markov models was defined by ≥ 50% reciprocal overlap [36]. CNV regions were defined by the set of non-overlapping disjoint intervals across the pooled set of all CNVs from cases and controls. We computed the number of subjects with a CNV overlapping each disjoint interval, retaining intervals where CNVs were identified in at l...