Chaining of Maximal Exact Matches in Graphs

Rizzo, Nicola; Cáceres, Manuel; Mäkinen, Veli

doi:10.1007/978-3-031-43980-3_29

Cited by 4 publications

(5 citation statements)

References 21 publications

“…Doing this will also require improvements in Minichain’s seeding and chaining implementation to allow the use of anchors that cover small bubbles in a graph. This is possible by considering a more flexible definition of anchors that can span multiple vertices [50,49,52].…”

Section: Results (mentioning)

confidence: 99%

“…Doing this will also require improvements in Minichain's seeding and chaining implementation to allow the use of anchors which cover small bubbles in a DAG. This should be possible by considering a more flexible definition of anchors which can span multiple vertices [47,46,49]. The runtime of Minichain ranged from 5 to 11 minutes for aligning an MHC sequence to the DAG.…”

Section: Results (mentioning)

confidence: 99%

“…Several versions of co-linear chaining problems have been studied for aligning two sequences [1,22,34,39,11,12,42]. Recent works have further studied the extension of chaining on acyclic [35,32,6,49] and cyclic pangenome graphs [3,45] but these formulations do not consider the haplotype paths.…”

Section: Methods For Haplotype-aware Chaining On Graphs (mentioning)

confidence: 99%

“…It is used to identify a coherent subset of anchors (short exact matches) that can be joined together to produce an alignment. The existing formulations for chaining on graphs share the same limitation of not considering the associations between genetic variants [6,32,35,45,49]. Some of these chaining algorithms run in O(KN log KN) time after graph preprocessing [6,32], where K denotes the minimum number of paths required to cover all the vertices and N denotes the count of input anchors.…”

Section: Introduction (mentioning)

confidence: 99%
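
The chaining step described in the statement above can be illustrated with a minimal sketch. It assumes two plain sequences rather than a graph, and it uses a simple quadratic dynamic program instead of the O(KN log KN) algorithms the statement refers to. The anchor layout `(query_start, target_start, length)` and the function name `chain_anchors` are hypothetical choices made for this example only.

```python
from typing import List, Tuple

# Hypothetical anchor layout: (query_start, target_start, length), length >= 1.
Anchor = Tuple[int, int, int]


def chain_anchors(anchors: List[Anchor]) -> List[Anchor]:
    """Return a chain of non-overlapping, co-linear anchors with maximum total length.

    Quadratic-time illustration on two linear sequences (not a graph).
    """
    if not anchors:
        return []
    # Sorting by query start guarantees every valid predecessor comes earlier.
    order = sorted(anchors)
    best = [a[2] for a in order]   # best[j]: best chain score ending at anchor j
    prev = [-1] * len(order)       # back-pointers for reconstructing the chain
    for j, (qj, tj, lj) in enumerate(order):
        for i in range(j):
            qi, ti, li = order[i]
            # Anchor i may precede j only if it ends before j starts in both sequences.
            if qi + li <= qj and ti + li <= tj and best[i] + lj > best[j]:
                best[j] = best[i] + lj
                prev[j] = i
    # Trace back from the best-scoring end anchor.
    j = max(range(len(order)), key=best.__getitem__)
    chain = []
    while j != -1:
        chain.append(order[j])
        j = prev[j]
    return chain[::-1]


example = [(0, 0, 5), (3, 20, 4), (6, 7, 4), (12, 13, 3)]
chain = chain_anchors(example)
```

In this example the anchor `(3, 20, 4)` is dropped because it overlaps the first anchor in the query and cannot be followed by the later anchors in the target.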

“…These formulations do not consider the associations between genetic variants and may lead to alignments with spurious recombinations in variant-dense regions of the graph [42]. The existing formulations for co-linear chaining on graphs share the same limitation [6,33,46,30,43]. Chaining on DAGs can be solved in O(KN log KN) time, where K is the minimum number of paths covering all the vertices and N is the number of exact matches between the query and the DAG [6,30].…”

Section: Introduction (mentioning)

confidence: 99%

Haplotype-aware sequence alignment to pangenome graphs

Chandra, Gibney, Jain (2023), preprint

Modern pangenome graphs are built using high-quality phased haplotype sequences such that each haplotype sequence corresponds to a path in the graph. Prioritizing the alignment of reads to these paths improves genotyping accuracy (Sirén et al., Science 2021). However, rigorous formulations for sequence-to-graph chaining and alignment do not consider the haplotype paths. As a result, the search space increases combinatorially as more variants are augmented in the graph. This limitation affects the effectiveness of the algorithms. In this paper, we propose novel formulations and provably good algorithms for haplotype-aware pattern matching of sequences to directed acyclic graphs (DAGs). Our work considers both sequence-to-DAG chaining and sequence-to-DAG alignment problems. Drawing inspiration from the commonly used models for genotype imputation, we assume that a query sequence is an imperfect mosaic of the reference haplotypes. Accordingly, our formulations extend previous chaining and alignment formulations by introducing a recombination penalty for a haplotype switch. First, we solve the haplotype-aware sequence-to-DAG alignment in O(|Q||E||ℋ|) time, where Q is the query sequence, E is the set of edges, and ℋ is the set of haplotypes represented in the graph. Second, we prove that an algorithm significantly faster than O(|Q||E||ℋ|) is unlikely. Third, we propose a haplotype-aware chaining algorithm that uses O(|ℋ|N log |ℋ|N) time, where N is the count of exact matches. As a proof-of-concept, we implemented the chaining algorithm in the Minichain aligner (https://github.com/at-cg/minichain). Using simulated human major histocompatibility complex (MHC) query sequences and a pangenome graph of 60 publicly available MHC haplotypes, we show that the proposed algorithm offers a much better consistency between the ground-truth recombinations and the recombinations in the output chains when compared to a haplotype-agnostic algorithm.
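
The idea of a recombination penalty for a haplotype switch can be sketched as follows. This is a heavily simplified illustration and not the algorithm from the abstract or the Minichain implementation: it assumes anchors live on a single shared linear coordinate instead of a DAG, it runs in quadratic time rather than O(|ℋ|N log |ℋ|N), and the anchor layout, the function name `chain_with_recombination`, and the `penalty` parameter are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical anchor layout: (query_start, ref_start, length, haplotypes).
# "haplotypes" is the set of haplotype labels that contain this match.
HapAnchor = Tuple[int, int, int, FrozenSet[str]]


def chain_with_recombination(anchors: List[HapAnchor], penalty: int) -> int:
    """Best chain score when each haplotype switch between anchors costs `penalty`.

    Simplified illustration: linear coordinates, quadratic time.
    """
    order = sorted(anchors, key=lambda a: (a[0], a[1]))
    # best[j][h]: best score of a chain ending at anchor j while following haplotype h.
    best: List[Dict[str, int]] = []
    for j, (qj, tj, lj, haps_j) in enumerate(order):
        scores = {h: lj for h in haps_j}  # a chain may start at anchor j
        for i in range(j):
            qi, ti, li, _ = order[i]
            if qi + li <= qj and ti + li <= tj:  # anchor i precedes anchor j
                for h in haps_j:
                    for g, s in best[i].items():
                        # Staying on the same haplotype is free; switching is penalized.
                        cand = s + lj - (0 if g == h else penalty)
                        if cand > scores[h]:
                            scores[h] = cand
        best.append(scores)
    return max((s for sc in best for s in sc.values()), default=0)


h1, h2 = frozenset({"h1"}), frozenset({"h2"})
toy = [(0, 0, 5, h1), (6, 6, 5, h2), (12, 12, 5, h1)]
score_strict = chain_with_recombination(toy, penalty=3)
score_free = chain_with_recombination(toy, penalty=0)
```

With a penalty of 3, the best chain in this toy input skips the middle anchor and stays on `h1`; with no penalty, all three anchors are chained, which mimics haplotype-agnostic behaviour.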

Co-linear chaining on pangenome graphs

Rajput, Chandra, Jain (2024), Algorithms Mol Biol

Pangenome reference graphs are useful in genomics because they compactly represent the genetic diversity within a species, a capability that linear references lack. However, efficiently aligning sequences to these graphs with complex topology and cycles can be challenging. The seed-chain-extend based alignment algorithms use co-linear chaining as a standard technique to identify a good cluster of exact seed matches that can be combined to form an alignment. Recent works show how the co-linear chaining problem can be efficiently solved for acyclic pangenome graphs by exploiting their small width and how incorporating gap cost in the scoring function improves alignment accuracy. However, it remains open on how to effectively generalize these techniques for general pangenome graphs which contain cycles. Here we present the first practical formulation and an exact algorithm for co-linear chaining on cyclic pangenome graphs. We rigorously prove the correctness and computational complexity of the proposed algorithm. We evaluate the empirical performance of our algorithm by aligning simulated long reads from the human genome to a cyclic pangenome graph constructed from 95 publicly available haplotype-resolved human genome assemblies. While the existing heuristic-based algorithms are faster, the proposed algorithm provides a significant advantage in terms of accuracy. Implementation (https://github.com/at-cg/PanAligner).

Finding maximal exact matches in graphs

Rizzo, Cáceres, Mäkinen (2024), Algorithms Mol Biol (self-citation)

Background: We study the problem of finding maximal exact matches (MEMs) between a query string Q and a labeled graph G. MEMs are an important class of seeds, often used in seed-chain-extend type of practical alignment methods because of their strong connections to classical metrics. A principled way to speed up chaining is to limit the number of MEMs by considering only MEMs of length at least $\kappa$ ($\kappa$-MEMs). However, on arbitrary input graphs, the problem of finding MEMs cannot be solved in truly sub-quadratic time under SETH (Equi et al., TALG 2023) even on acyclic graphs. Results: In this paper we show an $O(n \cdot L \cdot d^{L-1} + m + M_{\kappa,L})$-time algorithm finding all $\kappa$-MEMs between Q and G spanning exactly L nodes in G, where n is the total length of node labels, d is the maximum degree of a node in G, $m = |Q|$, and $M_{\kappa,L}$ is the number of output MEMs. We use this algorithm to develop a $\kappa$-MEM finding solution on indexable Elastic Founder Graphs (Equi et al., Algorithmica 2022) running in time $O(nH^2 + m + M_\kappa)$, where H is the maximum number of nodes in a block, and $M_\kappa$ is the total number of $\kappa$-MEMs. Our results generalize to the analysis of multiple query strings (MEMs between G and any of the strings). Additionally, we provide some experimental results showing that the number of graph MEMs is an order of magnitude smaller than the number of string MEMs of the corresponding concatenated collection. Conclusions: We show that seed-chain-extend type of alignment methods can be implemented on top of indexable Elastic Founder Graphs by providing an efficient way to produce the seeds between a set of queries and the graph. The code is available in https://github.com/algbio/efg-mems.
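
To make the notion of a κ-MEM concrete, here is a brute-force sketch for two plain strings. It is not the graph algorithm from the abstract and does not reflect the efg-mems code: it assumes a linear text instead of a labeled graph and takes quadratic time or worse. The function name `find_mems` and the output layout `(query_start, text_start, length)` are hypothetical.

```python
from typing import List, Tuple


def find_mems(query: str, text: str, kappa: int = 1) -> List[Tuple[int, int, int]]:
    """Return all maximal exact matches of length >= kappa as (query_start, text_start, length).

    Brute-force illustration on two strings; a match is maximal when it cannot be
    extended by one character to the left or to the right.
    """
    mems = []
    for i in range(len(query)):
        for j in range(len(text)):
            if query[i] != text[j]:
                continue
            # Skip starts that could be extended to the left (not left-maximal).
            if i > 0 and j > 0 and query[i - 1] == text[j - 1]:
                continue
            # Extend to the right as far as the characters agree (right-maximal).
            k = 0
            while i + k < len(query) and j + k < len(text) and query[i + k] == text[j + k]:
                k += 1
            if k >= kappa:
                mems.append((i, j, k))
    return mems


mems = find_mems("GATTACA", "TTACAGAT", kappa=3)
```

Raising `kappa` shrinks the output, which is the seed-filtering effect the abstract describes as a way to speed up chaining.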
