On the Minimum Error Correction Problem for Haplotype Assembly in Diploid and Polyploid Genomes

Bonizzoni, Paola; Dondi, Riccardo; Klau, Gunnar W.; Pirola, Yuri; Pisanti, Nadia; Zaccaria, Simone

doi:10.1089/cmb.2015.0220

Cited by 36 publications

(39 citation statements)

References 45 publications


“…The vast majority of existing haplotype assembly methods attempt to remove the aforementioned ambiguity by altering or even discarding the data, leading to minimum SNP removal (Lancia 2001), maximum fragments cut (Duitama 2010), and minimum error correction (MEC) score optimization criteria. Majority of haplotype assembly methods developed in recent years are focused on optimizing the MEC score, i.e., determining the smallest possible number of nucleotides in sequencing reads that should be altered such that the resulting dataset is consistent with having originated from k haplotypes (k denotes the ploidy of an organism) (Xie 2016; Pirola 2015; Kuleshov 2014; Patterson 2015; Bonizzoni 2016). These include the branch-and-bound scheme (Wang 2005), an integer linear programming formulation in (Chen 2013), and a dynamic programming framework in (Kuleshov 2014).…”

Section: Introduction (mentioning)

confidence: 99%

A Graph Auto-Encoder for Haplotype Assembly and Viral Quasispecies Reconstruction

Vikalo

2019

Preprint


Reconstructing components of a genomic mixture from data obtained by means of DNA sequencing is a challenging problem encountered in a variety of applications including single individual haplotyping and studies of viral communities. High-throughput DNA sequencing platforms oversample mixture components to provide massive amounts of reads whose relative positions can be determined by mapping the reads to a known reference genome; assembly of the components, however, requires discovery of the reads' origin, an NP-hard problem that the existing methods struggle to solve with the required level of accuracy. In this paper, we present a learning framework based on a graph auto-encoder designed to exploit structural properties of sequencing data. The algorithm is a neural network which essentially trains to ignore sequencing errors and infers the posterior probabilities of the origin of sequencing reads. Mixture components are then reconstructed by finding consensus of the reads determined to originate from the same genomic component. Results on realistic synthetic as well as experimental data demonstrate that the proposed framework reliably assembles haplotypes and reconstructs viral communities, often significantly outperforming state-of-the-art techniques.
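
The MEC score quoted in the citation statement and the consensus step described in this abstract can be illustrated together. The following is only a minimal sketch, not the method of any cited paper: it assumes reads are equal-length strings over SNP sites with `-` marking an uncovered site, and that a read-to-haplotype assignment is already given. The names `consensus` and `mec_score` are hypothetical.

```python
from collections import Counter


def consensus(reads, n_sites):
    """Majority allele per site over one cluster of reads ('-' if no read covers the site)."""
    hap = []
    for j in range(n_sites):
        counts = Counter(r[j] for r in reads if r[j] != '-')
        hap.append(counts.most_common(1)[0][0] if counts else '-')
    return ''.join(hap)


def mec_score(reads, assignment, k):
    """Count read entries that disagree with the consensus of their assigned cluster.

    reads      -- equal-length strings, '-' = site not covered (assumed encoding)
    assignment -- assignment[i] in range(k) is the cluster of reads[i]
    k          -- number of haplotypes (the ploidy)
    """
    n_sites = len(reads[0])
    clusters = [[r for r, a in zip(reads, assignment) if a == c] for c in range(k)]
    haps = [consensus(cluster, n_sites) for cluster in clusters]
    score = sum(
        1
        for r, a in zip(reads, assignment)
        for allele, h in zip(r, haps[a])
        if allele != '-' and allele != h
    )
    return score, haps


reads = ["010-", "0100", "-011", "1011", "0110"]
score, haps = mec_score(reads, [0, 0, 1, 1, 0], k=2)
print(score, haps)
```

For a fixed assignment, the per-site majority allele minimizes the number of corrections, so the hard part of the problem is choosing the assignment itself.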


“…Beginning with Hapcompass [1], there has been some work on polyploid phasing using algorithms based on branch-and-extend [5], belief propagation [32] and semi-definite programming [14]. In a recent theoretical work [7], the hardness of optimizing the MEC for S > 2 has also been proven, indicating that algorithms for this problem need to be necessarily approximate or tailored to some assumptions. A major drawback of existing works is that they consider only S = 3, 4 and none have been developed, optimized, or tested for the high ploidy that is encountered in segmental duplications, where S can be potentially larger than 10, and to the low error-rate in Illumina sequencers.…”

Section: Introduction (mentioning)

confidence: 99%

Resolving Multicopy Duplications de novo Using Polyploid Phasing

Chaisson, Mukherjee, Kannan, et al.

2017

Lecture Notes in Computer Science


While the rise of single-molecule sequencing systems has enabled an unprecedented rise in the ability to assemble complex regions of the genome, long segmental duplications in the genome still remain a challenging frontier in assembly. Segmental duplications are at the same time both gene rich and prone to large structural rearrangements, making the resolution of their sequences important in medical and evolutionary studies. Duplicated sequences that are collapsed in mammalian de novo assemblies are rarely identical; after a sequence is duplicated, it begins to acquire paralog-specific variants. In this paper, we study the problem of resolving the variations in multicopy long segmental duplications by developing and utilizing algorithms for polyploid phasing. We develop two algorithms: the first one is targeted at maximizing the likelihood of observing the reads given the underlying haplotypes using discrete matrix completion. The second algorithm is based on correlation clustering and exploits an assumption, which is often satisfied in these duplications, that each paralog has a sizable number of paralog-specific variants. We develop a detailed simulation methodology, and demonstrate the superior performance of the proposed algorithms on an array of simulated datasets. We measure the likelihood score as well as reconstruction accuracy, i.e., what fraction of the reads are clustered correctly. In both performance metrics, we find that our algorithms dominate existing algorithms on more than 93% of the datasets. While the discrete matrix completion performs better on likelihood score, the correlation clustering algorithm performs better on reconstruction accuracy due to the stronger regularization inherent in the algorithm. We also show that our correlation-clustering algorithm can reconstruct on average 7.0 haplotypes in 10-copy duplication datasets whereas existing algorithms reconstruct less than 1 copy on average.


“…For this reason, the vast majority of haplotype assembly techniques attempts to remove the aforementioned ambiguities by either discarding or altering sequencing data; this has led to the minimum fragment removal, minimum SNP removal [26], maximum fragments cut [16], and minimum error correction formulations of the assembly problem [29]. Most of the recent haplotype assembly methods (see, e.g., [7,25,31,32,40]) focus on the minimum error correction (MEC) formulation where the goal is to find the smallest number of nucleotides in reads that need to be changed so that any read partitioning ambiguities would be resolved. It has been shown that finding optimal solution to the MEC formulation of the haplotype assembly problem is NP-hard [7,10,26].…”

Section: Introduction (mentioning)

confidence: 99%

“…Most of the recent haplotype assembly methods (see, e.g., [7,25,31,32,40]) focus on the minimum error correction (MEC) formulation where the goal is to find the smallest number of nucleotides in reads that need to be changed so that any read partitioning ambiguities would be resolved. It has been shown that finding optimal solution to the MEC formulation of the haplotype assembly problem is NP-hard [7,10,26]. In [39], the authors used a branch-and-bound scheme to minimize the MEC objective over the space of reads; to reduce the search space, they relied on a bound on the objective obtained by a random partition of the reads.…”

Section: Introduction (mentioning)

confidence: 99%

Sparse Tensor Decomposition for Haplotype Assembly of Diploids and Polyploids

Hashemi, Zhu, Vikalo

2017

Preprint


A framework that formulates haplotype assembly as sparse tensor decomposition is proposed. The problem is cast as that of decomposing a tensor having special structural constraints and missing a large fraction of its entries into a product of two factors, U and V; tensor V reveals haplotype information while U is a sparse matrix encoding the origin of erroneous sequencing reads. An algorithm, AltHap, which reconstructs haplotypes of either diploid or polyploid organisms by solving this decomposition problem is proposed. Starting from a judiciously selected initial point, AltHap alternates between two optimization tasks to recover U and V by relying on a modified gradient descent search that exploits salient structural properties of U and V. The performance and convergence properties of AltHap are theoretically analyzed and, in doing so, guarantees on the achievable minimum error correction scores and correct phasing rate are established. AltHap was tested in a number of different scenarios and was shown to compare favorably to state-of-the-art methods in applications to haplotype assembly of diploids, and significantly outperform existing techniques when applied to haplotype assembly of polyploids.
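
The citation statements above note that optimal MEC is NP-hard, which is why the cited methods rely on branch-and-bound, approximation, or heuristics. To make the search space concrete, here is a hedged sketch of exhaustive search over all k^n read assignments. It is practical only for a handful of reads and is not the algorithm of any cited work. The read encoding (equal-length strings, `-` for an uncovered site) and the names `mec_cost` and `exhaustive_mec` are assumptions made for illustration.

```python
from itertools import product


def mec_cost(reads, assignment, k):
    """Corrections needed for one assignment of reads to k clusters.

    At each site of each cluster, every covering allele except the most
    frequent one has to be changed.
    """
    n_sites = len(reads[0])
    cost = 0
    for c in range(k):
        cluster = [r for r, a in zip(reads, assignment) if a == c]
        for j in range(n_sites):
            alleles = [r[j] for r in cluster if r[j] != '-']
            if alleles:
                cost += len(alleles) - max(alleles.count(x) for x in set(alleles))
    return cost


def exhaustive_mec(reads, k):
    """Try every one of the k**len(reads) assignments and keep a cheapest one."""
    best_cost, best_assignment = None, None
    for assignment in product(range(k), repeat=len(reads)):
        cost = mec_cost(reads, assignment, k)
        if best_cost is None or cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return best_cost, best_assignment


reads = ["010-", "0100", "-011", "1011", "0110"]
print(exhaustive_mec(reads, 2))  # diploid model
print(exhaustive_mec(reads, 3))  # triploid model
```

The loop visits k^n assignments, so each additional read multiplies the running time by k; this growth is the practical motivation for the bounding and relaxation strategies mentioned in the statements.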
