Towards pan-genome read alignment to improve variation calling

Valenzuela, Daniel; Norri, Tuukka; Välimäki, Niko; Pitkänen, Esa; Mäkinen, Veli

doi:10.1186/s12864-018-4465-8

Cited by 33 publications

(38 citation statements)

References 29 publications


“…We considered to include running BWA-MEM on an index created by CHOP [9], which is a tool for indexing paths through known haplotypes in the graph, but we were unable to build the CHOP index for a whole human genome human graph in reasonable run-time. We also considered to include PanVC [11] in the comparison, but also this method does not seem to scale to a whole human genome, for which it takes weeks to run on [11]. When using vg to simulate reads, we used base pair substitution rate 0.01 and indel rate 0.02.…”

Section: Assessment Of Mapping Methods (mentioning)

confidence: 99%

Assessing graph-based read mappers against a novel baseline approach highlights strengths and weaknesses of the current generation of methods

Grytten, Rand, Nederbragt et al. 2019

Preprint


Graph-based reference genomes have become popular as they allow read mapping and follow-up analyses in settings where the exact haplotypes underlying a high-throughput sequencing experiment are not precisely known. Two recent papers show that mapping to graph-based reference genomes can improve accuracy as compared to methods using linear references. Both of these methods index the sequences for most paths up to a certain length in the graph in order to enable direct mapping of reads containing common variants. However, the combinatorial explosion of possible paths through nearby variants also leads to a huge search space and an increased chance of false positive alignments to highly variable regions. We here assess three prominent graph-based read mappers against a novel hybrid baseline approach that combines an initial path determination with a tuned linear read mapping method. We show, using a previously proposed benchmark, that this simple approach is able to improve accuracy of read-mapping to graph-based reference genomes. Our method is implemented in a tool, Two-step Graph Mapper, which is available at https://github.com/uio-bmi/two_step_graph_mapper along with data and scripts for reproducing the experiments.


“…In contrast, the r-index, of which we provide an implementation in this work, has no such length limitation. The most recent implementation of the hybrid index is CHIC (Valenzuela et al, 2018; based on CHICO; Valenzuela, 2016). Although CHIC has support for counting multiple occurrences of a pattern within a genomic database, it is an expensive operation, namely O(ℓ log log n), where ℓ is the number of occurrences in the databases and n is the length of the database.…”

Section: Related Work (mentioning)

confidence: 99%

“…This implies that the FM-index will become much slower and/or much larger as the number of genomes in the databases grows significantly. This bottleneck has forced researchers to consider variations of FM-indexes adapted for massive genomic data sets, such as the Valenzuela et al (2018) pan-genomic index or the Garrison et al (2018) variation graph. Some of these proposals use elements of the FM-index, but all deviate in substantial ways from the description above.…”

Section: Introduction (mentioning)

confidence: 99%

Efficient Construction of a Complete Index for Pan-Genomics Read Alignment

Kuhnle, Mun, Boucher et al. 2020

Journal of Computational Biology


Short-read aligners predominantly use the FM-index, which is easily able to index one or a few human genomes. However, it does not scale well to indexing collections of thousands of genomes. Driving this issue are the two chief components of the index: (1) a rank data structure over the Burrows-Wheeler Transform (BWT) of the string that will allow us to find the interval in the string's suffix array (SA), and (2) a sample of the SA that, when used with the rank data structure, allows us to access the SA. The rank data structure can be kept small even for large genomic databases, by run-length compressing the BWT, but until recently there was no means known to keep the SA sample small without greatly slowing down access to the SA. Now that (SODA 2018) has defined an SA sample that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018, we showed how to build the BWT of large genomic databases efficiently (WABI 2018), but the problem of building the sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample, and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method for indexing partial and whole human genomes and show that it improves over the FM-index-based Bowtie method with respect to both memory and time and over the hybrid index-based CHIC method with respect to query time and memory required for indexing.
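The two components named in this abstract, a rank structure over the BWT and access to the suffix array, can be illustrated with a toy sketch. This is not the construction from the cited paper: it assumes a naive suffix sort, an uncompressed rank table, and a full (unsampled) suffix array, and the function names `build_fm_index` and `locate` are hypothetical.

```python
def build_fm_index(text):
    """Naive FM-index for a toy string (assumption: input is tiny)."""
    text += "$"  # sentinel, sorts before A, C, G, T
    # Full suffix array by direct suffix sorting (quadratic space, toy only).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the symbol preceding each sorted suffix (text[-1] is the sentinel).
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c] = number of symbols in text that are strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = occurrences of c in bwt[:i]; an uncompressed rank table.
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, C, occ


def locate(pattern, sa, C, occ):
    """Backward search: narrow the SA interval one pattern symbol at a time."""
    lo, hi = 0, len(sa)  # half-open interval [lo, hi) over the suffix array
    for c in reversed(pattern):
        if c not in C:
            return []
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])  # text positions of all exact occurrences


sa, C, occ = build_fm_index("GATTACAT")
hits = locate("AT", sa, C, occ)
```

In this sketch the rank table and the suffix array both grow linearly with the text, which is the scaling problem the abstract describes; run-length compressing the BWT and sampling the SA are the remedies it discusses.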


“…Aligning reads against whole genomic databases is called pan-genomic alignment [8] and should help genomic processing and storage catch up with sequencing. Unfortunately, although several authors have proposed other kinds of indexes (see, e.g., [9,10] and references therein), they lack the complete functionality of the FM-index and have not achieved the same popularity. In particular, they often limit the maximum length of a pattern, which will become problematic … The read ATAC does not match exactly against the reference GATTACAT but does against the second genome GATACAT we assembled, so if we add that genome to the index then we can avoid using approximate pattern matching to align that read.…”

Section: Introduction (mentioning)

confidence: 99%
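The ATAC example in the statement above can be checked directly. The following is a minimal sketch using plain substring search rather than any index; the helper name `exact_hits` and the genome labels are hypothetical.

```python
def exact_hits(read, genomes):
    """Return, per genome, every start position where the read matches exactly."""
    return {
        name: [i for i in range(len(seq) - len(read) + 1) if seq.startswith(read, i)]
        for name, seq in genomes.items()
    }


# The two sequences quoted in the citation statement.
genomes = {"reference": "GATTACAT", "assembled": "GATACAT"}
hits = exact_hits("ATAC", genomes)
# hits["reference"] is empty; hits["assembled"] contains position 1.
```

With only the reference indexed, the read would need approximate matching; once the assembled genome is added to the collection, it aligns exactly.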