Compressed Spaced Suffix Arrays

Gagie, Travis; Manzini, Giovanni; Valenzuela, Daniel

doi:10.1007/s11786-016-0283-z

Cited by 2 publications

(3 citation statements)

References 36 publications


“…The success of the mainstream mappers BWA-MEM (Li, 2013) and Bowtie2 (Langmead and Salzberg, 2012) is due in part to the FM-index, which only supports contiguous seeds. Some workarounds are available for spaced seeds (Horton et al, 2008; Gagie et al, 2017) but they increase the memory footprint, explaining that short reads are typically mapped using contiguous seeds. More generally, computing the sensitivity of spaced seeds is challenging (Kucherov et al, 2006; Li et al, 2006; Martin and Noé, 2017).…”

Section: Seeds (mentioning)

confidence: 99%

Calibrating Seed-Based Heuristics to Map Short Reads With Sesame

2020


The increasing throughput of DNA sequencing technologies creates a need for faster algorithms. The fate of most reads is to be mapped to a reference sequence, typically a genome. Modern mappers rely on heuristics to gain speed at a reasonable cost for accuracy. In the seeding heuristic, short matches between the reads and the genome are used to narrow the search to a set of candidate locations. Several seeding variants used in modern mappers show good empirical performance but they are difficult to calibrate or to optimize for lack of theoretical results. Here we develop a theory to estimate the probability that the correct location of a read is filtered out during seeding, resulting in mapping errors. We describe the properties of simple exact seeds, skip seeds and MEM seeds (Maximal Exact Match seeds). The main innovation of this work is to use concepts from analytic combinatorics to represent reads as abstract sequences, and to specify their generative function to estimate the probabilities of interest. We provide several algorithms, which together give a workable solution for the problem of calibrating seeding heuristics for short reads. We also provide a C implementation of these algorithms in a library called Sesame. These results can improve current mapping algorithms and lay the foundation of a general strategy to tackle sequence alignment problems. The Sesame library is open source and available for download at https://github.com/gui11aume/sesame.


“…Even if a full suffix array would be too large, we can consider a sparse index, and/or distributing the reference sequences into separately-indexed volumes. Furthermore, these indexes usually compress standard suffix arrays, and it is unclear how effectively they can be extended to subset seeding, minimizers, etc [8], [31].…”

Section: Compact / Succinct / Compressed Indexes (mentioning)

confidence: 99%

“…seeds" [34]. It is also possible to compress a spaced-seed suffix array relative to a normal suffix array [31].…”

Section: Multiple Seed Patterns (mentioning)

confidence: 99%

A Simplified Description of Child Tables for Sequence Similarity Search

Frith; Shrestha

2018

IEEE/ACM Trans. Comput. Biol. and Bioinf.


Finding related nucleotide or protein sequences is a fundamental, diverse, and incompletely-solved problem in bioinformatics. It is often tackled by seed-and-extend methods, which first find "seed" matches of diverse types, such as spaced seeds, subset seeds, or minimizers. Seeds are usually found using an index of the reference sequence(s), which stores seed positions in a suffix array or related datastructure. A child table is a fundamental way to achieve fast lookup in an index, but previous descriptions have been overly complex. This paper aims to provide a more accessible description of child tables, and demonstrate their generality: they apply equally to all the above-mentioned seed types and more. We also show that child tables can be used without LCP (longest common prefix) tables, reducing the memory requirement.
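As a rough illustration of the idea in this abstract (seed positions stored in a suffix-array-like index), the sketch below sorts the suffixes of a reference by the characters that survive a spaced-seed mask, then looks up a query by binary search. It is a naive, uncompressed toy and not the method of any of the papers listed here. The function names, the `1`/`0` mask notation, and the choice to repeat the mask periodically along each suffix are assumptions made for the example.

```python
from bisect import bisect_left


def spaced_key(text, pos, pattern):
    """Characters of text[pos:] kept by the mask ('1' = keep, '0' = skip).

    Assumption for this sketch: the mask repeats periodically along the suffix.
    """
    k = len(pattern)
    return "".join(
        ch for offset, ch in enumerate(text[pos:]) if pattern[offset % k] == "1"
    )


def build_spaced_suffix_array(text, pattern):
    """Positions of text, ordered by their masked suffixes (naive sort)."""
    return sorted(range(len(text)), key=lambda p: spaced_key(text, p, pattern))


def find_seed_hits(text, ssa, pattern, query):
    """Positions where query matches text at every '1' position of the mask."""
    if len(query) != len(pattern):
        raise ValueError("query must have the same length as the mask")
    target = "".join(q for q, m in zip(query, pattern) if m == "1")
    # Keys are recomputed here for clarity; a real index would avoid this.
    keys = [spaced_key(text, p, pattern) for p in ssa]
    i = bisect_left(keys, target)
    hits = []
    # All keys that start with target form one contiguous run in sorted order.
    while i < len(keys) and keys[i].startswith(target):
        if ssa[i] + len(pattern) <= len(text):  # the whole seed must fit
            hits.append(ssa[i])
        i += 1
    return sorted(hits)


reference = "ACGTACCTAGGT"
mask = "101"  # match first and third characters, ignore the middle one
ssa = build_spaced_suffix_array(reference, mask)
hits = find_seed_hits(reference, ssa, mask, "ATG")
print(hits)
```

With the mask `101`, the query `ATG` asks only for an `A` followed two positions later by a `G`, so a position can be a hit even if the middle character differs from the query.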
