Aligning reads to a variation graph is a standard task in pangenomics, with downstream applications in e.g., improving variant calling. While the vg toolkit (Garrison et al., Nature Biotechnology, 2018) is a popular aligner of short reads, GraphAligner (Rautiainen and Marschall, Genome Biology, 2020) is the state-of-the-art aligner of long reads. GraphAligner works by finding candidate read occurrences based on individually extending the best seeds of the read in the variation graph. However, a more principled approach recognized in the community is to co-linearly chain multiple seeds. We present a new algorithm to co-linearly chain a set of seeds in an acyclic variation graph, together with the first efficient implementation of such a co-linear chaining algorithm into a new aligner of long reads to variation graphs, GraphChainer. Compared to GraphAligner, at a normalized edit distance threshold of 40%, it aligns 9% to 12% more reads, and 15% to 19% more total read length, on real PacBio reads from human chromosomes 1 and 22. On both simulated and real data, GraphChainer aligns between 97% and 99% of all reads, and of total read length. At the more stringent normalized edit distance threshold of 30%, GraphChainer aligns up to 29% more total real read length than GraphAligner. GraphChainer is freely available at https://github.com/algbio/GraphChainer
Decomposing a network flow into weighted paths is a problem with numerous applications, ranging from networking, transportation planning, to bioinformatics. In some applications we look for a decomposition that is optimal with respect to some property, such as the number of paths used, robustness to edge deletion, or length of the longest path. However, in many bioinformatic applications, we seek a specific decomposition where the paths correspond to some underlying data that generated the flow. In these cases, no optimization criteria guarantee the identification of the correct decomposition. Therefore, we propose to instead report the safe paths, which are subpaths of at least one path in every flow decomposition. In this work, we give the first local characterization of safe paths for flow decompositions in directed acyclic graphs, leading to a practical algorithm for finding the complete set of safe paths. In addition, we evaluate our algorithm on RNA transcript data sets against a trivial safe algorithm (extended unitigs), the recently proposed safe paths for path covers (TCBB 2021) and the popular heuristic greedy-width . On the one hand, we found that besides maintaining perfect precision, our safe and complete algorithm reports a significantly higher coverage ( 50 more) compared with the other safe algorithms. On the other hand, the greedy-width algorithm although reporting a better coverage, it also reports a significantly lower precision on complex graphs (for genes expressing a large number of transcripts). Overall, our safe and complete algorithm outperforms (by 20 ) greedy-width on a unified metric (F-score) considering both coverage and precision when the evaluated data set has a significant number of complex graphs. Moreover, it also has a superior time ( 5 ) and space performance ( 2 2 2 ), resulting in a better and more practical approach for bioinformatic applications of flow decomposition.
A multi-assembly problem asks to reconstruct multiple genomic sequences from mixed reads sequenced from all of them. Standard formulations of such problems model a solution as a path cover in a directed acyclic graph, namely a set of paths that together cover all vertices of the graph. Since multi-assembly problems admit multiple solutions in practice, we consider an approach commonly used in standard genome assembly: output only partial solutions (contigs, or safe paths), that appear in all path cover solutions. We study constrained path covers, a restriction on the path cover solution that incorporate practical constraints arising in multi-assembly problems. We give efficient algorithms finding all maximal safe paths for constrained path covers. We compute the safe paths of splicing graphs constructed from transcript annotations of different species. Our algorithms run in less than 15 seconds per species and report RNA contigs that are over 99% precise and are up to 8 times longer than unitigs. Moreover, RNA contigs cover over 70% of the transcripts and their coding sequences in most cases. With their increased length to unitigs, high precision, and fast construction time, maximal safe paths can provide a better base set of sequences for transcript assembly programs.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.