The role of rare variants in complex traits remains uncharted. Here, we conduct deep whole genome sequencing of 1,457 individuals from an isolated population, and test for rare variant burdens across six cardiometabolic traits. We identify a role for rare regulatory variation, which has hitherto been missed. We find evidence of rare variant burdens overlapping with, and mostly independent of, established common variant signals (ADIPOQ and adiponectin, P=4.2×10⁻⁸; APOC3 and triglyceride levels, P=1.58×10⁻²⁶; GGT1 and gamma-glutamyltransferase, P=2.3×10⁻⁶; UGT1A9 and bilirubin, P=1.9×10⁻⁸), and identify replicating evidence for a burden associated with triglyceride levels in FAM189A (P=2.26×10⁻⁸), indicating a role for this gene in lipid metabolism.
The role of rare sequence variants in the genetic architecture of medically relevant complex traits is not well understood. Population-scale deep whole genome sequencing can capture genetic variation across the entire allele frequency spectrum, traversing the coding and noncoding genome. Here, to improve our understanding of the role of rare variants, we perform cohort-wide deep whole genome sequencing of 1,457 individuals from a deeply phenotyped, isolated population from Crete, Greece (the HELIC-MANOLIS cohort 1-3) at an average depth of 22.5x (Supplementary Figure 1), capturing 98% of true single nucleotide variants (SNVs) (Online Methods and Supplementary Figure 2). We address open questions on whole genome sequencing study design, analysis and interpretation, and identify burdens of coding and regulatory rare variants associated with cardiometabolic traits.
RESULTS
Effect of sequencing depth
Comparing whole genome sequencing at various depths ranging from 15x to 30x (Online Methods), we find that 96.4% of singletons, 97.9% of doubletons and 97.6% of variants called using 30x sequencing are recapitulated at 22.5x depth. Genotype accuracy (as measured by r²) is 99.7% for 22.5x depth and 98.5% for 15x depth, suggesting that increases between 15x and 30x translate into marginal improvements in both call rate and quality of very rare SNVs (Figure 1, Supplementary Figure 3 and Methods). We find that false discovery rates and genotype accuracy are substantially more dependent on sequencing depth for INDELs than for SNVs (Figure 1).
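The two depth-comparison metrics above can be illustrated with a minimal sketch. It is not the pipeline used in this study: the function names, the representation of a variant site as a (chromosome, position, ref, alt) tuple and the coding of genotypes as alternate-allele dosages (0, 1, 2) are all assumptions made for illustration. Which call set served as the truth set for r² is also assumed here to be the higher-depth one.

```python
def recapitulation_rate(reference_sites, test_sites):
    """Fraction of sites in the reference call set (e.g. 30x) that are
    also present in the test call set (e.g. 22.5x).

    Sites are hashable keys; (chrom, pos, ref, alt) tuples are assumed here.
    """
    reference = set(reference_sites)
    if not reference:
        raise ValueError("reference call set is empty")
    return len(reference & set(test_sites)) / len(reference)


def genotype_r2(reference_dosages, test_dosages):
    """Squared Pearson correlation between two equal-length vectors of
    alternate-allele dosages (0, 1 or 2) at the same variant.

    Returns NaN when either vector has no variance (r is undefined).
    """
    if len(reference_dosages) != len(test_dosages):
        raise ValueError("dosage vectors must have the same length")
    n = len(reference_dosages)
    if n == 0:
        raise ValueError("dosage vectors are empty")
    mean_ref = sum(reference_dosages) / n
    mean_test = sum(test_dosages) / n
    cov = sum((a - mean_ref) * (b - mean_test)
              for a, b in zip(reference_dosages, test_dosages))
    var_ref = sum((a - mean_ref) ** 2 for a in reference_dosages)
    var_test = sum((b - mean_test) ** 2 for b in test_dosages)
    if var_ref == 0 or var_test == 0:
        return float("nan")
    return cov * cov / (var_ref * var_test)


# Toy data (hypothetical, not taken from the study).
sites_high_depth = [("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T"),
                    ("chr2", 500, "G", "A"), ("chr2", 900, "T", "C")]
sites_low_depth = [("chr1", 1000, "A", "G"), ("chr2", 500, "G", "A"),
                   ("chr2", 900, "T", "C")]

print(recapitulation_rate(sites_high_depth, sites_low_depth))
print(genotype_r2([0, 0, 1, 2], [0, 1, 1, 2]))
```

In practice such metrics would be computed per variant class (singletons, doubletons, INDELs) and per depth, then summarised across the genome; the sketch only shows the per-call-set and per-variant arithmetic.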
Landscape of sequence variation
Following quality control (QC), we call 24,163,896 non-monomorphic SNVs and INDELs, 97.9% of which are biallelic. 14,281,180 (60.31%) of the biallelic SNVs are rare (minor allele frequency [MAF]<0.01); 3,103,273 (13.1%) are low-frequency (MAF 0.01-0.05); and 6,292,726 (26.57%) are common (MAF>0.05). We call 8,294 non-monomorphic variants annotated as loss-of-function (LoF) with low confidence (LC) 4, and 438 variants annotated as LoF with high confidence (HC) (Supplementary Figure 4). On average, each individual carries 405 (s=19) LC LoF variants and 31 (s=6) HC LoF variants, compared to 149 LoF variants per sample in a whole genome sequencing study of 2,636 Icelanders 5. 0.6% and 1% of HC and LC LoF c...