Validating Paired-end Read Alignments in Sequence Graphs

Jain, Chirag; Zhang, Haowen; Dilthey, Alexander; Aluru, Srinivas

doi:10.1101/682799

Cited by 10 publications

(9 citation statements)

References 44 publications

“…Some research has been done on finding solutions for more specific distance queries in sequence graphs. PairG [8] is a method for determining the validity of independent mappings of reads in a pair by deciding whether there is a path between the mappings whose distance is within a given range. This algorithm uses an index to determine if there is a valid path between two vertices in a single O(1) lookup.…”

Section: Prior Research (mentioning)

confidence: 99%

Distance Indexing and Seed Clustering in Sequence Graphs

Chang, Eizenga, Novak, et al. 2019

Preprint

Graph representations of genomes are capable of expressing more genetic variation and can therefore better represent a population than standard linear genomes. However, due to the greater complexity of genome graphs relative to linear genomes, some functions that are trivial on linear genomes become more difficult in genome graphs. Calculating distance is one such function that is simple in a linear genome but much more complicated in a graph context. In read mapping algorithms, distance calculations are commonly used in a clustering step to determine if seed alignments could belong to the same mapping. Clustering algorithms are a bottleneck for some mapping algorithms due to the cost of repeated distance calculations. We have developed an algorithm for quickly calculating the minimum distance between positions on a sequence graph using a minimum distance index. We have also developed an algorithm that uses the distance index to cluster seeds on a graph. We demonstrate that our implementations of these algorithms are efficient and practical to use for mapping algorithms.
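The citation statement above describes the paired-end validation question as deciding whether some path between two mapped positions has a length within a given range. The following is a minimal sketch of that question only, assuming a directed graph with unit-length edges stored as an adjacency dictionary. The function name `has_path_in_range` and the example graph are hypothetical, and the code expands a frontier one step at a time rather than using a precomputed index, so it does not reproduce PairG's constant-time lookup or the minimum distance index described in the abstract.

```python
from typing import Dict, List, Set


def has_path_in_range(
    adj: Dict[int, List[int]], src: int, dst: int, d_min: int, d_max: int
) -> bool:
    """Return True if some walk from src to dst uses between d_min and d_max edges.

    Assumption: every edge has length 1. In a graph with cycles, a walk may
    revisit vertices.
    """
    if d_min < 0 or d_max < d_min:
        return False
    frontier: Set[int] = {src}  # vertices reachable in exactly `length` steps
    for length in range(d_max + 1):
        if length >= d_min and dst in frontier:
            return True
        # Advance every vertex in the frontier by one edge.
        frontier = {nxt for v in frontier for nxt in adj.get(v, [])}
        if not frontier:
            break
    return False


# Hypothetical graph with a bubble: 0->1->3->5 (3 edges) and 0->2->4->3->5 (4 edges).
example_graph = {0: [1, 2], 1: [3], 2: [4], 4: [3], 3: [5]}
print(has_path_in_range(example_graph, 0, 5, 3, 4))
print(has_path_in_range(example_graph, 0, 5, 5, 6))
```

Each query here costs time proportional to the edges explored over `d_max` steps, which is the per-query cost that an index is meant to avoid.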

“…Pairwise alignment dominates our runtime, while sparse matrix construction, which include the creation of both A and A^T, and multiplication take only a tiny percentage of our computation, proving the efficiency of our approach for overlap detection. Interestingly, sparse matrix multiplication and semiring abstraction could offer a path for efficient parallelization of many applications in computational biology other than overlap detection (Jain et al, 2019). Figure 8 shows the strong scaling curves of BELLA for the representative P. aeruginosa 30X data set to measure its parallel performance.…”

Section: S9 Experimental Setting (mentioning)

confidence: 99%

BELLA: Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper

Guidi, Ellis, Rokhsar, et al. 2018

Preprint

Recent advances in long-read sequencing enable the characterization of genome structure and its intra- and inter-species variation at a resolution that was previously impossible. Detecting overlaps between reads is integral to many long-read genomics pipelines, such as de novo genome assembly. While longer reads simplify genome assembly and improve the contiguity of the reconstruction, current long-read technologies come with high error rates. We present Berkeley Long-Read to Long-Read Aligner and Overlapper (BELLA), a novel algorithm for computing overlaps and alignments via sparse matrix-matrix multiplication that balances the goals of recall and precision, performing well on both. We present a probabilistic model that demonstrates the feasibility of using short k-mers for detecting candidate overlaps. We then introduce a notion of reliable k-mers based on our probabilistic model. Combining reliable k-mers with our binning mechanism eliminates both the k-mer set explosion that would otherwise occur with highly erroneous reads, and the spurious overlaps from k-mers originating in repetitive regions. Finally, we present a new method based on Chernoff bounds for separating true overlaps from false positives using a combination of alignment techniques and probabilistic modeling. Our methodologies aim at maximizing the balance between precision and recall. On both real and synthetic data, BELLA performs amongst the best in terms of F1 score, showing a performance stability which is often missing for competitor software. BELLA's F1 score is consistently within 1.7% of the top entry. Notably, we show improved de novo assembly results on synthetic data when coupling BELLA with the Miniasm assembler.

Long-read technologies (Eid et al., 2009; Goodwin et al., 2015) generate long reads with average lengths reaching and often exceeding 10,000 base pairs (bp). These allow the resolution of complex genomic repetitions, enabling more accurate ensemble views that were not possible with previous short-read technologies (Phillippy et al., 2008; Nagarajan and Pop, 2009). However, the improved read length of these technologies comes at the cost of lower accuracy, with error rates ranging from 5% to 35%. Nevertheless, errors are more random and more evenly distributed within Pacific Biosciences long-read data (Giordano et al., 2017) compared to short-read technologies.

The majority of the state-of-the-art long-read assemblers uses the Overlap-Layout-Consensus (OLC) paradigm (Berlin et al., 2015). The first step in OLC assembly consists of detecting overlaps between reads to construct an overlap (or string) graph. The OLC paradigm benefits from longer reads as significantly fewer reads are required to cover the genome, limiting the size of the overlap graph. Highly-accurate overlap detection is a major computational bottleneck in OLC assembly (Myers, 2014), mainly due to the compute-intensive nature of pairwise alignment.

At present, several algorithms are capable of overlapping error-prone long-read data with varying accuracy. The prevaili...
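The abstract describes finding candidate overlaps through sparse matrix-matrix multiplication. As a rough illustration of that idea, and not BELLA's actual implementation, the sketch below treats a binary reads-by-k-mers matrix A implicitly: the entry (i, j) of A·A^T is the number of distinct k-mers that reads i and j share. The function name `shared_kmer_counts` and the toy reads are hypothetical, and the reliable k-mer selection, binning, and alignment steps mentioned in the abstract are left out.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple


def shared_kmer_counts(reads: List[str], k: int) -> Dict[Tuple[int, int], int]:
    """Count distinct shared k-mers for every read pair (i, j) with i < j.

    This equals the strictly upper-triangular part of A * A^T, where
    A[i][m] = 1 if read i contains k-mer m (an assumed binary encoding).
    """
    # Column view of A (equivalently, rows of A^T): k-mer -> reads containing it.
    kmer_to_reads: Dict[str, Set[int]] = defaultdict(set)
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            kmer_to_reads[read[pos:pos + k]].add(read_id)

    # Each k-mer contributes 1 to every pair of reads that contain it.
    counts: Dict[Tuple[int, int], int] = defaultdict(int)
    for read_ids in kmer_to_reads.values():
        for i, j in combinations(sorted(read_ids), 2):
            counts[(i, j)] += 1
    return dict(counts)


# Toy example: reads 0 and 1 share the 3-mers ACG, GTA and TAC; read 2 shares none.
toy_reads = ["ACGTAC", "GTACGG", "TTTTTT"]
print(shared_kmer_counts(toy_reads, 3))
```

Only pairs with a nonzero count appear in the result, which mirrors the sparsity that makes the matrix formulation attractive: pairs with no shared k-mer are never touched.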

“…Graph representations more accurately reflect the sampled individuals within a population, and their use in genome mapping algorithms reduces reference bias and increases mapping accuracy when sequencing a new individual (Ballouz et al, 2019). There is abundant research on data structures designed for graph representations of genomes and pan-genomes (Garrison et al, 2018; Li et al, 2020), their space-efficient indexing (Chang et al, 2020; Ghaffaari and Marschall, 2019; Holley et al, 2016; Jain et al, 2019b; Kuhnle et al, 2020; Marcus et al, 2014; Sirén et al, 2014) and alignment algorithms (Darby et al, 2020; Ivanov et al, 2020; Jain et al, 2020; Kuosmanen et al, 2018; Rautiainen and Marschall, 2020) to map sequences to reference graphs. For review papers summarizing these developments, see Computational Pan-Genomics Consortium (2018), Eizenga et al (2020), and Paten et al (2017).…”

Section: Introduction (mentioning)

confidence: 99%

A variant selection framework for genome graphs

2021

Self Cite

Motivation: Variation graph representations are projected to either replace or supplement conventional single genome references due to their ability to capture population genetic diversity and reduce reference bias. Vast catalogues of genetic variants for many species now exist, and it is natural to ask which among these are crucial to circumvent reference bias during read mapping. Results: In this work, we propose a novel mathematical framework for variant selection, by casting it in terms of minimizing variation graph size subject to preserving paths of length α with at most δ differences. This framework leads to a rich set of problems based on the types of variants [e.g. single nucleotide polymorphisms (SNPs), indels or structural variants (SVs)], and whether the goal is to minimize the number of positions at which variants are listed or to minimize the total number of variants listed. We classify the computational complexity of these problems and provide efficient algorithms along with their software implementation when feasible. We empirically evaluate the magnitude of graph reduction achieved in human chromosome variation graphs using multiple α and δ parameter values corresponding to short and long-read resequencing characteristics. When our algorithm is run with parameter settings amenable to long-read mapping (α = 10 kbp, δ = 1000), 99.99% SNPs and 73% SVs can be safely excluded from human chromosome 1 variation graph. The graph size reduction can benefit downstream pan-genome analysis. Availability and implementation: https://github.com/AT-CG/VF. Supplementary information: Supplementary data are available at Bioinformatics online.
