Pavel Mahmud scite author profile

Pavel Mahmud

4 Publications

35 Citation Statements Received

183 Citation Statements Given

[Charts: "How they've been cited" and "How they cite others"]

Affiliations

Rutgers, The State University of New Jersey

Publications


Fast MCMC sampling for hidden markov models to determine copy number variations

Mahmud

Schliep

2011

BMC Bioinformatics


Background: Hidden Markov Models (HMM) are often used for analyzing Comparative Genomic Hybridization (CGH) data to identify chromosomal aberrations or copy number variations by segmenting observation sequences. For efficiency reasons the parameters of an HMM are often estimated with maximum likelihood and a segmentation is obtained with the Viterbi algorithm. This introduces considerable uncertainty in the segmentation, which can be avoided with Bayesian approaches integrating out parameters using Markov Chain Monte Carlo (MCMC) sampling. While the advantages of Bayesian approaches have been clearly demonstrated, the likelihood-based approaches are still preferred in practice for their lower running times; datasets coming from high-density arrays and next generation sequencing amplify these problems. Results: We propose an approximate sampling technique, inspired by compression of discrete sequences in HMM computations and by kd-trees to leverage spatial relations between data points in typical data sets, to speed up the MCMC sampling. Conclusions: We test our approximate sampling method on simulated and biological ArrayCGH datasets and high-density SNP arrays, and demonstrate speed-ups of 10 to 60 and 90, respectively, while achieving competitive results with the state-of-the-art Bayesian approaches. Availability: An implementation of our method will be made available as part of the open source GHMM library from http://ghmm.org.


Speeding Up Bayesian HMM by the Four Russians Method

Mahmud

Schliep

2011


Bayesian computations with Hidden Markov Models (HMMs) are often avoided in practice. Instead, due to reduced running time, point estimates (maximum likelihood (ML) or maximum a posteriori (MAP)) are obtained and observation sequences are segmented based on the Viterbi path, even though the lack of accuracy and dependency on starting points of the local optimization are well known. We propose a method to speed up Bayesian computations which addresses this problem for regular and time-dependent HMMs with discrete observations. In particular, we show that by exploiting sequence repetitions, using the four Russians method, and the conditional dependency structure, it is possible to achieve a Θ(log T) speed-up, where T is the length of the observation sequence. Our experimental results on identification of segments of homogeneous nucleic acid composition, known as the DNA segmentation problem, show that the speed-up is also observed in practice. Availability: An implementation of our method will be available as part of the open source GHMM library from http://ghmm.org.


Indel-tolerant read mapping with trinucleotide frequencies using cache-oblivious kd-trees

Mahmud

Wiedenhoeft

Schliep

2012


Motivation: Mapping billions of reads from next generation sequencing experiments to reference genomes is a crucial task, which can require hundreds of hours of running time on a single CPU even for the fastest known implementations. Traditional approaches have difficulties dealing with matches of large edit distance, particularly in the presence of frequent or large insertions and deletions (indels). This is a serious obstacle both in determining the spectrum and abundance of genetic variations and in personal genomics. Results: For the first time, we adopt the approximate string matching paradigm of geometric embedding to read mapping, thus rephrasing it to nearest neighbor queries in a q-gram frequency vector space. Using the L1 distance between frequency vectors has the benefit of providing lower bounds for an edit distance with affine gap costs. Using a cache-oblivious kd-tree, we realize running times that match the state-of-the-art. Additionally, running time and memory requirements are about constant for read lengths between 100 and 1000 bp. We provide a first proof-of-concept that geometric embedding is a promising paradigm for read mapping and that L1 distance might serve to detect structural variations. TreQ, our initial implementation of that concept, performs more accurately than many popular read mappers over a wide range of structural variants. Availability and implementation: TreQ will be released under the GNU Public License (GPL), and precomputed genome indices will be provided for download at http://treq.sf.net. Contact: pavelm@cs.rutgers.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
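The lower-bound idea in the abstract above can be illustrated with a minimal sketch. It is not TreQ's implementation: it uses plain unit-cost edit distance rather than the affine gap costs the abstract mentions, and it relies on the standard q-gram bound (one edit operation changes the L1 distance between q-gram count vectors by at most 2q). The function names and the two example sequences are hypothetical.

```python
from collections import Counter


def qgram_profile(seq, q=3):
    """Count every overlapping q-gram in seq (q=3 gives trinucleotides)."""
    return Counter(seq[i:i + q] for i in range(len(seq) - q + 1))


def l1_distance(a, b):
    """L1 distance between two q-gram count vectors (missing keys count as 0)."""
    return sum(abs(a[k] - b[k]) for k in set(a) | set(b))


def edit_distance(s, t):
    """Unit-cost Levenshtein distance via the standard row-by-row DP."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # match / substitution
        prev = curr
    return prev[-1]


# Hypothetical example: a read and a reference window differing by one substitution.
q = 3
read = "ACGTACGTGA"
window = "ACGTTCGTGA"

l1 = l1_distance(qgram_profile(read, q), qgram_profile(window, q))
ed = edit_distance(read, window)

# Each edit operation disturbs at most q q-grams on each side, so
# l1 <= 2 * q * ed, i.e. l1 / (2 * q) is a lower bound on the edit distance.
print(l1, ed, l1 / (2 * q) <= ed)
```

Because the L1 distance is cheap to compute and never overestimates the scaled edit distance, candidate reference windows whose frequency vectors are far from the read's can be discarded without running the quadratic alignment.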


Reduced representations for efficient analysis of genomic data

Mahmud

2014


