AStarix: Fast and Optimal Sequence-to-Graph Alignment

Ivanov, Pesho; Bichsel, Benjamin; Mustafa, Harun; Kahles, André; Rätsch, Gunnar; Vechev, Martin

doi:10.1101/2020.01.22.915496

Cited by 11 publications

(23 citation statements)

References 36 publications


“…Considering all nodes v ∈ V_r as possible starting points for the alignment means that the A* algorithm would explore all states of the form ⟨v, 0⟩, which immediately induces a high overhead of |V_r|. In line with previous works [10,12], we avoid this overhead by complementing the reference graph with a trie index.…”

Section: Trie Index (mentioning)

confidence: 97%


Fast and optimal sequence-to-graph alignment guided by seeds

Ivanov, Bichsel, Vechev

2021

Preprint

Self Cite


We present a novel A* seed heuristic enabling fast and optimal sequence-to-graph alignment, guaranteed to minimize the edit distance of the alignment assuming non-negative edit costs. We phrase optimal alignment as a shortest path problem and solve it by instantiating the A* algorithm with our novel seed heuristic. The key idea of the seed heuristic is to extract seeds from the read, locate them in the reference, mark preceding reference positions by crumbs, and use the crumbs to direct the A* search. We prove admissibility of the seed heuristic, thus guaranteeing alignment optimality. Our implementation extends the free and open source AStarix aligner and demonstrates that the seed heuristic outperforms all state-of-the-art optimal aligners including GraphAligner, Vargas, PaSGAL, and the prefix heuristic previously employed by AStarix. Specifically, we achieve a consistent speedup of >60× on both short Illumina reads and long HiFi reads (up to 25kbp), on both the E. coli linear reference genome (1Mbp) and the MHC variant graph (5Mbp). Our speedup is enabled by the seed heuristic consistently skipping >99.99% of the table cells that optimal aligners based on dynamic programming compute.
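The shortest-path formulation in this abstract can be illustrated with a small sketch. The code below is not the AStarix implementation and does not reproduce the crumb-based seed heuristic; it assumes a toy graph whose edges carry single letters, unit edit costs, and a deliberately coarser admissible bound (the number of remaining non-overlapping seeds that occur nowhere in the graph, each of which forces at least one edit). The names `graph_kmers`, `align`, and `seed_len` are hypothetical.

```python
import heapq
from itertools import count


def graph_kmers(graph, k):
    """Return every string of length k spelled by some path in the graph."""
    kmers = set()

    def walk(node, prefix):
        if len(prefix) == k:
            kmers.add(prefix)
            return
        for nxt, letter in graph.get(node, []):
            walk(nxt, prefix + letter)

    for node in graph:
        walk(node, "")
    return kmers


def align(graph, query, seed_len=None):
    """A* over states (node, i), where i query letters are already aligned.

    graph maps node -> list of (next_node, letter) edges.
    Returns (edit_cost, expanded_states). With seed_len=None the
    heuristic is zero and the search degenerates to Dijkstra.
    """
    n = len(query)
    # missing_from[i] = seeds starting at position >= i that are absent from
    # the graph. Assumption: a simplified stand-in for the seed heuristic.
    missing_from = [0] * (n + 1)
    if seed_len:
        kmers = graph_kmers(graph, seed_len)
        absent = {s for s in range(0, n - seed_len + 1, seed_len)
                  if query[s:s + seed_len] not in kmers}
        for i in range(n - 1, -1, -1):
            missing_from[i] = missing_from[i + 1] + (i in absent)

    nodes = set(graph) | {v for edges in graph.values() for v, _ in edges}
    dist, heap, tie = {}, [], count()
    for u in nodes:  # the alignment may start at any node for free
        dist[(u, 0)] = 0
        heapq.heappush(heap, (missing_from[0], 0, next(tie), u, 0))

    expanded = 0
    while heap:
        _, g, _, u, i = heapq.heappop(heap)
        if g > dist[(u, i)]:
            continue  # stale queue entry
        if i == n:
            return g, expanded  # whole query aligned
        expanded += 1
        moves = [(u, i + 1, 1)]  # insertion: extra query letter
        for v, letter in graph.get(u, []):
            moves.append((v, i, 1))  # deletion: skip a reference letter
            moves.append((v, i + 1, 0 if letter == query[i] else 1))
        for v, j, cost in moves:
            ng = g + cost
            # States may be re-opened, so admissibility alone suffices.
            if ng < dist.get((v, j), float("inf")):
                dist[(v, j)] = ng
                heapq.heappush(heap, (ng + missing_from[j], ng, next(tie), v, j))
    return None


if __name__ == "__main__":
    # Toy variant graph spelling ACTA or AGTA (two parallel edges 1 -> 2).
    toy = {0: [(1, "A")], 1: [(2, "C"), (2, "G")], 2: [(3, "T")], 3: [(4, "A")]}
    print(align(toy, "AGTA", seed_len=2))
    print(align(toy, "AGTT", seed_len=2))
```

On the toy graph, the read `AGTA` aligns at cost 0 and `AGTT` at cost 1, with or without the heuristic. A heuristic that also uses seed positions, as the abstract describes with crumbs, would prune far more states than this position-agnostic bound.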


“…Implementation. The seed heuristic and prefix heuristic reuse the same free and open source C++ codebase of the AStarix aligner [10]. It includes a simple implementation of a graph and trie data structure which is not optimized for memory usage.…”

Section: Implementation and Parameter Choices (mentioning)

confidence: 99%

Fast and optimal sequence-to-graph alignment guided by seeds

Ivanov, Bichsel, Vechev

2021

Preprint

Self Cite


“…Alongside these, there has also been growing interest in the use of de Bruijn graph-based indexes for alignment tasks as a way to accelerate alignment to repeat-prone reference genomes [41] or to unassembled read sets [40, 28]. More recent work has focused on improving the scalability of these approaches, either through strategies using more rigorous early cut-off criteria [33], or via the introduction of heuristics [55]. A major challenge faced by all existing methods is to unite the ability to efficiently operate on petabase scale input data with the capacity for fast and versatile query operations.…”

Section: Introduction (mentioning)

confidence: 99%

Indexing All Life’s Known Biological Sequences

Karasikov, Mustafa, Danciu et al. 2020

Preprint

Self Cite



The amount of biological sequencing data available in public repositories is growing exponentially, forming an invaluable biomedical research resource. Yet, making all this sequencing data searchable and easily accessible to life science and data science researchers is an unsolved problem. We present MetaGraph, a versatile framework for the scalable analysis of extensive sequence repositories. MetaGraph efficiently indexes vast collections of sequences to enable fast search and comprehensive analysis. A wide range of underlying data structures offer different practically relevant trade-offs between the space taken by an index and its query performance. Achieving compression ratios of up to 1,000-fold over the already compressed raw input data, MetaGraph indexes can represent the content of large sequencing archives in the working memory of a single compute server. We demonstrate our framework's scalability by indexing over 1.4 million whole genome sequencing (WGS) records from NCBI's Sequence Read Archive, representing a total input of more than three petabases. MetaGraph provides a flexible methodological framework allowing for index construction to be scaled from consumer laptops to distribution onto a cloud compute cluster for processing terabases to petabases of input data. Notably, processing of data sets ranging from 1 TB of raw WGS reads to 20 TB of human RNA-sequencing data results in indexes whose memory footprints are small enough to host on standard desktop workstations. Besides demonstrating the utility of MetaGraph indexes on key applications, such as experiment discovery, sequence alignment, error correction, and differential assembly, we make a wide range of indexes available as a community resource, including indexes of over 450,000 microbial WGS records, more than 110,000 fungi WGS records, and more than 40,000 whole metagenome sequencing records. A subset of these indexes is made available online for interactive queries. 
All indexes will be available for download and in the cloud. In total, indexes comprising more than 1 million sequencing records are available for download. As an example of our indexes' integrative analysis capabilities, we introduce the concept of differential assembly, which allows for the extraction of sequences present in a foreground set of samples but absent in a given background set. We apply this technique to differentially assemble contigs to identify pathogenic agents transfected via human kidney transplants. In a second example, we indexed more than 20,000 human RNA-Seq records from the TCGA and GTEx cohorts and use them to extract transcriptome features that are hard to characterize using a classical linear reference. We discovered over 200 trans-splicing events in GTEx and found broad evidence for tissue-specific non-A-to-I RNA-editing in GTEx and TCGA.


“…Graph representations more accurately reflect the sampled individuals within a population, and their use in genome mapping algorithms reduces reference bias and increases mapping accuracy when sequencing a new individual (Ballouz et al., 2019). There is abundant research on data structures designed for graph representations of genomes and pan-genomes (Garrison et al., 2018; Li et al., 2020), their space-efficient indexing (Chang et al., 2020; Ghaffaari and Marschall, 2019; Holley et al., 2016; Jain et al., 2019b; Kuhnle et al., 2020; Marcus et al., 2014; Sirén et al., 2014) and alignment algorithms (Darby et al., 2020; Ivanov et al., 2020; Jain et al., 2020; Kuosmanen et al., 2018; Rautiainen and Marschall, 2020) to map sequences to reference graphs. For review papers summarizing these developments, see Computational Pan-Genomics Consortium (2018), Eizenga et al. (2020), and Paten et al. (2017).…”

Section: Introduction (mentioning)

confidence: 99%

A variant selection framework for genome graphs

2021


Motivation: Variation graph representations are projected to either replace or supplement conventional single genome references due to their ability to capture population genetic diversity and reduce reference bias. Vast catalogues of genetic variants for many species now exist, and it is natural to ask which among these are crucial to circumvent reference bias during read mapping. Results: In this work, we propose a novel mathematical framework for variant selection, by casting it in terms of minimizing variation graph size subject to preserving paths of length α with at most δ differences. This framework leads to a rich set of problems based on the types of variants [e.g. single nucleotide polymorphisms (SNPs), indels or structural variants (SVs)], and whether the goal is to minimize the number of positions at which variants are listed or to minimize the total number of variants listed. We classify the computational complexity of these problems and provide efficient algorithms along with their software implementation when feasible. We empirically evaluate the magnitude of graph reduction achieved in human chromosome variation graphs using multiple α and δ parameter values corresponding to short and long-read resequencing characteristics. When our algorithm is run with parameter settings amenable to long-read mapping (α = 10 kbp, δ = 1000), 99.99% SNPs and 73% SVs can be safely excluded from human chromosome 1 variation graph. The graph size reduction can benefit downstream pan-genome analysis. Availability and implementation: https://github.com/AT-CG/VF. Supplementary information: Supplementary data are available at Bioinformatics online.
