Identifying genetic variants that are associated with methylation variation -an analysis commonly referred to as methylation quantitative trait locus (mQTL) mapping --is important for understanding the epigenetic mechanisms underlying genotype-trait associations. Here, we develop a statistical method, IMAGE, for mQTL mapping in sequencing-based methylation studies. IMAGE properly accounts for the count nature of bisulfite sequencing data and incorporates allele-specific methylation patterns from heterozygous individuals to enable more powerful mQTL discovery. We compare IMAGE with existing approaches through extensive simulation. We also apply IMAGE to analyze two bisulfite sequencing studies, in which IMAGE identifies more mQTL than existing approaches.Keywords: allelic specific methylation; ASM; methylation quantitative trait locus; mQTL; IMAGE; bisulfite sequencing; binomial mixed model; penalized quasilikelihood.mapping approaches (as allele-specific expression estimates have been shown to do for eQTL mapping: [51, 52]). Early attempts to perform mQTL mapping with bisulfite sequencing data have yielded promising results [35, 49, 53]. However, existing mQTL mapping methods are designed with array data in mind [37, 38]. To maximize power, mQTL mapping using sequencing data requires new statistical methods development that can properly account for two of its distinctive features.First, methylation data collected in sequencing studies are counts, not continuous representations like those produced by arrays. Specifically, methylation level estimates at a given cytosine base are based on both the total read count at the site and the subset of those reads that are unconverted by sodium bisulfite (or other processes [43]). Previous mQTL studies have dealt with these data by first computing a ratio between the methylated count and the total count, and then treating this ratio as an estimate of the true methylation level [35, 49]. However, the count nature of the raw data means that the mean and variance of the computed ratio are highly interdependent. This relationship is not captured by previously deployed linear regression methods, which likely leads to loss of power. Indeed, similar losses of power are well-documented for differential methylation analysis [40] and differential expression analysis of RNA-seq data [54][55][56][57]. To overcome this challenge, statistical methods for sequencing-based differential methylation analysis now adapt overdispersed count models, including beta-binomial models [58][59][60][61][62] and binomial mixed models [40, 63, 64], to properly model the mean-variance relationship and potential over-dispersion. In differential methylation analysis, these approaches can substantially improve power compared with normalization based approaches [30, 65, 66]. Because mQTL mapping is conceptually similar and can be effectively viewed as genotype-based differential methylation analysis, extending over-dispersed binomial models to mQTL mapping is a promising approach.
ResultsMethod Overview and S...