Fast MCMC sampling for hidden Markov models to determine copy number variations

Mahmud, Pavel; Schliep, Alexander

doi:10.1186/1471-2105-12-428

Cited by 9 publications

(27 citation statements)

References 35 publications


“…Similarly, HMM algorithms were originally formulated for aCGH platforms [22] and many innovations were subsequently proposed. For example, distance-based transition probabilities [6], fully Bayesian HMMs [23], reversible jump and approximate sampling Markov chain Monte Carlo (MCMC) [24,25], iterative approaches to parameter estimation [26], alternatives to the Viterbi algorithm [27], and higher order Markov chains [28]. As HMMs readily accommodate multiple data sequences, the observation that copy number can be estimated from genotyping arrays [29] led to the development of several HMMs that jointly model copy number and genotypes at SNPs [30-37].…”

Section: Introduction (mentioning)

confidence: 99%

Fast detection of de novo copy number variants from SNP arrays for case-parent trios

et al. 2012


Background: In studies of case-parent trios, we define copy number variants (CNVs) in the offspring that differ from the parental copy numbers as de novo and of interest for their potential functional role in disease. Among the leading array-based methods for discovery of de novo CNVs in case-parent trios is the joint hidden Markov model (HMM) implemented in the PennCNV software. However, the computational demands of the joint HMM are substantial and the extent to which false positive identifications occur in case-parent trios has not been well described. We evaluate these issues in a study of oral cleft case-parent trios.

Results: Our analysis of the oral cleft trios reveals that genomic waves represent a substantial source of false positive identifications in the joint HMM, despite a wave-correction implementation in PennCNV. In addition, the noise of low-level summaries of relative copy number (log R ratios) is strongly associated with batch and correlated with the frequency of de novo CNV calls. Exploiting the trio design, we propose a univariate statistic for relative copy number referred to as the minimum distance that can reduce technical variation from probe effects and genomic waves. We use circular binary segmentation to segment the minimum distance and maximum a posteriori estimation to infer de novo CNVs from the segmented genome. Compared to PennCNV on simulated data, MinimumDistance identifies fewer false positives on average and is comparable to PennCNV with respect to false negatives. Genomic waves contribute to discordance of PennCNV and MinimumDistance for high coverage de novo calls, while highly concordant calls on chromosome 22 were validated by quantitative PCR. Computationally, MinimumDistance provides a nearly 8-fold increase in speed relative to the joint HMM in a study of oral cleft trios.

Conclusions: Our results indicate that batch effects and genomic waves are important considerations for case-parent studies of de novo CNV, and that the minimum distance is an effective statistic for reducing technical variation contributing to false de novo discoveries. Coupled with segmentation and maximum a posteriori estimation, our algorithm compares favorably to the joint HMM with MinimumDistance being much faster.


“…Following [62], we report F-measures ($F_1$ scores) for binary classification into normal and aberrant segments (Fig. 2), using the usual definition of $F = \frac{2\pi\rho}{\pi+\rho}$ being the harmonic mean of precision $\pi = \frac{TP}{TP+FP}$ and recall $\rho = \frac{TP}{TP+FN}$, where TP, FP, TN and FN denote true/false positives/negatives, respectively.…”

Section: Simulated aCGH Data (mentioning)

confidence: 99%
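
The F-measure quoted above can be computed directly from the confusion counts. The following is a minimal sketch only; the function name `f_measure` and the choice to return 0.0 when precision or recall is undefined are assumptions made for illustration and are not taken from the cited paper.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall from confusion counts.

    Returns 0.0 when precision or recall is undefined (an assumption
    made here for convenience, not a convention from the cited paper).
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts: 8 aberrant positions found, 2 false calls, 2 missed.
    print(f_measure(tp=8, fp=2, fn=2))
```

Note that true negatives do not enter the formula, which is why the measure is often preferred when normal segments vastly outnumber aberrant ones.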

“…Though there are several schemes available to sample q, [58] has argued strongly in favor of Forward-Backward sampling [57], which yields Forward-Backward Gibbs sampling (FBG) above. Variations of this have been implemented for segmentation of aCGH data before [60,62,78]. However, since in each iteration a quadratic number of terms has to be calculated at each position to obtain the forward variables, and a state has to be sampled at each position in the backward step, this method is still expensive for large data.…”

Section: Bayesian Hidden Markov Models (mentioning)

confidence: 99%
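
To make the cost described in this statement concrete, here is a minimal sketch of the state-path step of Forward-Backward sampling: a forward pass that computes filtered state probabilities, followed by a backward pass that samples one state per position. It assumes Gaussian emissions and fixed parameters; the names `sample_state_path`, `init`, `trans`, `means` and `sds` are hypothetical, and a full Gibbs sampler would also resample the parameters in each iteration.

```python
import math
import random


def gaussian_pdf(y, mean, sd):
    """Density of a normal distribution at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def sample_state_path(y, init, trans, means, sds, rng):
    """Draw one state sequence q given observations y and fixed parameters."""
    n_states, length = len(init), len(y)

    # Forward pass: each position sums over all (previous, current) state
    # pairs, i.e. a quadratic number of terms per position.
    alpha = []
    for t in range(length):
        row = []
        for j in range(n_states):
            if t == 0:
                prior = init[j]
            else:
                prior = sum(alpha[t - 1][i] * trans[i][j] for i in range(n_states))
            row.append(prior * gaussian_pdf(y[t], means[j], sds[j]))
        total = sum(row)
        alpha.append([a / total for a in row])  # normalise to avoid underflow

    # Backward pass: sample the last state, then each earlier state
    # conditioned on the state already sampled to its right.
    q = [0] * length
    q[-1] = rng.choices(range(n_states), weights=alpha[-1])[0]
    for t in range(length - 2, -1, -1):
        weights = [alpha[t][i] * trans[i][q[t + 1]] for i in range(n_states)]
        q[t] = rng.choices(range(n_states), weights=weights)[0]
    return q


if __name__ == "__main__":
    rng = random.Random(0)
    y = [0.0] * 5 + [3.0] * 5  # toy data with one clear change point
    path = sample_state_path(
        y, init=[0.5, 0.5], trans=[[0.9, 0.1], [0.1, 0.9]],
        means=[0.0, 3.0], sds=[0.3, 0.3], rng=rng,
    )
    print(path)
```

Both passes visit every position, so the per-iteration cost grows linearly with the sequence length and quadratically with the number of states, which is the expense the statement refers to.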

“…However, since in each iteration a quadratic number of terms has to be calculated at each position to obtain the forward variables, and a state has to be sampled at each position in the backward step, this method is still expensive for large data. Recently, [62] have introduced compressed FBG by sampling over a shorter sequence of sufficient statistics of data segments which are likely to come from the same underlying state. Let $B := (B_w)_{w=1}^{W}$ be a partition of y into W blocks.…”

Section: Bayesian Hidden Markov Models (mentioning)

confidence: 99%
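
The idea of working with sufficient statistics of blocks can be illustrated as follows. This is a sketch under stated assumptions, not the method of the cited papers: block boundaries are taken as given, emissions are assumed Gaussian, and the names `block_stats` and `block_loglik` are hypothetical. For a Gaussian, the count, sum and sum of squares of a block are enough to evaluate the likelihood of the whole block under a single state without revisiting the individual observations.

```python
import math


def block_stats(y, starts):
    """Sufficient statistics (count, sum, sum of squares) for each block.

    `starts` holds the sorted start index of every block, beginning with 0.
    """
    ends = list(starts[1:]) + [len(y)]
    stats = []
    for start, end in zip(starts, ends):
        block = y[start:end]
        stats.append((len(block), sum(block), sum(v * v for v in block)))
    return stats


def block_loglik(stat, mean, sd):
    """Log-likelihood of a whole block under one Gaussian state."""
    n, s, ss = stat
    var = sd * sd
    # Expands sum((v - mean)^2) as ss - 2*mean*s + n*mean^2.
    squared_dev = ss - 2.0 * mean * s + n * mean * mean
    return -0.5 * n * math.log(2.0 * math.pi * var) - squared_dev / (2.0 * var)


if __name__ == "__main__":
    y = [0.1, -0.2, 0.0, 2.9, 3.1]
    stats = block_stats(y, starts=[0, 3])  # two blocks: y[0:3] and y[3:5]
    for stat in stats:
        print(stat, block_loglik(stat, mean=0.0, sd=0.5))
```

A sampler that operates on W blocks instead of every position would evaluate one such term per block and state, which is where the saving comes from when W is much smaller than the length of y.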


Fast Bayesian Inference of Copy Number Variants using Hidden Markov Models with Wavelet Compression

Wiedenhoeft, Brugel, Schliep. 2015. Preprint. Self-citation.

By integrating Haar wavelets with Hidden Markov Models, we achieve drastically reduced running times for Bayesian inference using Forward-Backward Gibbs sampling. We show that this improves detection of genomic copy number variants (CNV) in array CGH experiments compared to the state-of-the-art, including standard Gibbs sampling. The method concentrates computational effort on chromosomal segments which are difficult to call, by dynamically and adaptively recomputing consecutive blocks of observations likely to share a copy number. This makes routine diagnostic use and re-analysis of legacy data collections feasible; to this end, we also propose an effective automatic prior. An open source software implementation of our method is available at http://schlieplab.org/Software/HaMMLET/ (DOI: 10.5281/zenodo.46262). This paper was selected for oral presentation at RECOMB 2016, and an abstract is published in the conference proceedings.


“…Our approach uses a Poisson hidden Markov model (PHMM) to 1) estimate (hidden) states of gene expression levels in terminal exon 3′ UTRs, 2) infer shortening of the region in human liver and brain cortex tissues and 3) demonstrate tissue-specific APA. Others have used hidden Markov models (HMMs) in a similar fashion to dynamically map chromatin states (Ernst et al, 2011), to integrate genomic data (Day et al, 2007) and for determination of gene copy number variations (Mahmud and Schliep, 2011) just to name a few. We compare our results to those obtained by MISO, a probabilistic approach to quantification of transcripts at the 3′ UTR (Katz et al, 2010) and Cufflinks, based on de novo transcript assembly (Trapnell et al, 2010).…”

Section: Introduction (mentioning)

confidence: 99%

Dynamic expression of 3′ UTRs revealed by Poisson hidden Markov modeling of RNA-Seq: Implications in gene expression profiling

Lü, Bushel. 2013. Gene.

RNA sequencing (RNA-Seq) allows for the identification of novel exon-exon junctions and quantification of gene expression levels. We show that from RNA-Seq data one may also detect utilization of alternative polyadenylation (APA) in 3′ untranslated regions (3′ UTRs) known to play a critical role in the regulation of mRNA stability, cellular localization and translation efficiency. Given the dynamic nature of APA, it is desirable to examine the APA on a sample by sample basis. We used a Poisson hidden Markov model (PHMM) of RNA-Seq data to identify potential APA in human liver and brain cortex tissues leading to shortened 3′ UTRs. Over three hundred transcripts with shortened 3′ UTRs were detected with sensitivity >75% and specificity >60%. Tissue-specific 3′ UTR shortening was observed for 32 genes with a q-value ≤ 0.1. When compared to alternative isoforms detected by Cufflinks or MISO, our PHMM method agreed on over 100 transcripts with shortened 3′ UTRs. Given the increasing usage of RNA-Seq for gene expression profiling, using PHMM to investigate sample-specific 3′ UTR shortening could be an added benefit from this emerging technology.
