Indexing Arbitrary-Length k-Mers in Sequencing Reads

Kowalski, Tomasz Marek; Grabowski, Szymon; Deorowicz, Sebastian

doi:10.1371/journal.pone.0133198

Cited by 18 publications

(10 citation statements)

References 33 publications


“…HG-CoLoR: Similar to FMLRC, it avoids using a fixed k-mer size for the de Bruijn graph. Accordingly, it relies on a variable-order de Bruijn graph structure [42]. It also uses a seed-and-extend approach to align long reads to the graph.…”

Section: Short-read-assembly-based Methods; citation type: mentioning

confidence: 99%

A comprehensive evaluation of long read error correction methods

Zhang

Jain

Aluru

2019

Preprint


Background: Third-generation single molecule sequencing technologies can sequence long reads, which is advancing the frontiers of genomics research. However, their high error rates prohibit accurate and efficient downstream analysis. This difficulty has motivated the development of many long read error correction tools, which tackle this problem through sampling redundancy and/or leveraging accurate short reads of the same biological samples. Existing studies to assess these tools use simulated data sets, and are not sufficiently comprehensive in the range of software covered or diversity of evaluation measures used. Results: In this paper, we present a categorization and review of long read error correction methods, and provide a comprehensive evaluation of the corresponding long read error correction tools. Leveraging recent real sequencing data, we establish benchmark data sets and set up evaluation criteria for a comparative assessment which includes quality of error correction as well as run-time and memory usage. We study how trimming and long read sequencing depth affect error correction in terms of length distribution and genome coverage post-correction, and the impact of error correction performance on an important application of long reads, genome assembly. We provide guidelines for practitioners for choosing among the available error correction tools and identify directions for future research. Conclusions: Despite the high error rate of long reads, the state-of-the-art correction tools can achieve high correction quality. When short reads are available, the best hybrid methods outperform non-hybrid methods in terms of correction quality and computing resource usage. When choosing tools, practitioners are advised to be careful with the few correction tools that discard reads, and to check the effect of error correction tools on downstream analysis. Our evaluation code is available as open-source at https://github.com/haowenz/LRECE.


“…After the alignment step, a variable order de Bruijn graph is built from the solid k-mers of the corrected short reads. Unlike FMLRC, this graph is built with the help of PgSA [37]. Moreover, HG-CoLoR allows to explore every order of the graph, between a minimum order k and a maximum order K, instead of limiting the graph explorations to two different orders.…”

Section: HG-CoLoR (2018); citation type: mentioning

confidence: 99%

Long-read error correction: a survey and qualitative comparison

Morisse

Lecroq

Lefèbvre

2020

Preprint


Third generation sequencing technologies Pacific Biosciences and Oxford Nanopore Technologies were respectively made available in 2011 and 2014. In contrast with second generation sequencing technologies such as Illumina, these new technologies allow the sequencing of long reads of tens to hundreds of kbps. These so called long reads are particularly promising, and are especially expected to solve various problems such as contig and haplotype assembly or scaffolding, for instance. However, these reads are also much more error prone than second generation reads, and display error rates reaching 10 to 30%, according to the sequencing technology and to the version of the chemistry. Moreover, these errors are mainly composed of insertions and deletions, whereas most errors are substitutions in Illumina reads. As a result, long reads require efficient error correction, and a plethora of error correction tools, directly targeted at these reads, were developed in the past nine years. These methods can adopt a hybrid approach, using complementary short reads to perform correction, or a self-correction approach, only making use of the information contained in the long reads sequences. Both these approaches make use of various strategies such as multiple sequence alignment, de Bruijn graphs, hidden Markov models, or even combine different strategies. In this paper, we describe a complete survey of long-read error correction, reviewing all the different methodologies and tools existing up to date, for both hybrid and self-correction. Moreover, the long reads characteristics, such as sequencing depth, length, error rate, or even sequencing technology, can have an impact on how well a given tool or strategy performs, and can thus drastically reduce the correction quality. 
We thus also present an in-depth benchmark of available long-read error correction tools, on a wide variety of datasets, composed of both simulated and real data, with various error rates, coverages, and read lengths, ranging from small bacterial to large mammal genomes.

2. Alignment of long reads and contigs obtained from short reads assembly. In the same fashion, long reads can also be corrected with the help of the contig they align to, by computing consensus sequences from these contigs. ECTools [43], HALC [6], and MiRCA [30] adopt this methodology.
3. Use of de Bruijn graphs, built from the short reads' k-mers. Once built, the long reads can indeed be anchored to the graph. It can then be traversed, in order to find paths allowing to link together anchored regions of the long reads, and thus correct unanchored regions. LoRDEC [59], Jabba [52], FMLRC [67], and ParLECH [16] rely on this strategy.
4. Use of Hidden Markov Models. These can indeed be used in order to represent the long reads. The models can then be trained with the help of short reads, in order to extract consensus sequences, representing the corrected long reads. Hercules [21] is based on this approach.

Other methods, such as NaS [49] and HG-CoLoR [53], combine different of the aforementioned ...


“…The proposed PgRC is based on a few ideas. The key one is an approximation of the shortest common superstring over a set of the given reads, which we call a "pseudogenome" (hence the name of our tool), an idea basically described in (Kowalski et al, 2015). In this work, however, we modify the procedure from our earlier research, by partitioning the read set into groups, related to their quality and the existence of N symbols in them.…”

Section: Overview; citation type: mentioning

confidence: 99%

“…More concretely, we followed the pseudogenome construction algorithm (Kowalski et al, 2015). Given a read array…”

Section: Read Partitioning and Pseudogenome Generation; citation type: mentioning

confidence: 99%
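
The pseudogenome idea quoted above, an approximation of the shortest common superstring over a set of reads, can be illustrated with the classic greedy overlap-merging heuristic. This is only a minimal sketch under stated assumptions: it is a generic textbook heuristic, not the construction procedure of Kowalski et al. (2015) or of PgRC, and the function names `overlap` and `greedy_superstring` are hypothetical. It is quadratic in the number of reads and meant for toy inputs only.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads: list[str]) -> str:
    """Greedy approximation of a shortest common superstring (illustrative only)."""
    # Drop duplicates, then drop reads fully contained in another read.
    strings = list(dict.fromkeys(reads))
    strings = [s for s in strings if not any(s != t and s in t for t in strings)]

    while len(strings) > 1:
        # Find the ordered pair (i, j) with the largest suffix-prefix overlap.
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the pair, keeping the overlapping part only once.
        merged = strings[best_i] + strings[best_j][best_k:]
        strings = [s for idx, s in enumerate(strings) if idx not in (best_i, best_j)]
        strings.append(merged)

    return strings[0] if strings else ""


if __name__ == "__main__":
    toy_reads = ["ACGT", "GTAC", "TACG"]
    print(greedy_superstring(toy_reads))
```

Every input read occurs as a substring of the returned string, so each read can be represented by its offset in the superstring rather than by its own characters, which is the intuition behind indexing or compressing reads through a pseudogenome.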

PgRC: Pseudogenome based Read Compressor

Kowalski

Grabowski

2019

Preprint

Self Cite


Motivation: The amount of sequencing data from High-Throughput Sequencing technologies grows at a pace exceeding the one predicted by Moore's law. One of the basic requirements is to efficiently store and transmit such huge collections of data. Despite significant interest in designing FASTQ compressors, they are still imperfect in terms of compression ratio or decompression resources. Results: We present Pseudogenome-based Read Compressor (PgRC), an in-memory algorithm for compressing the DNA stream, based on the idea of building an approximation of the shortest common superstring over high-quality reads. Experiments show that PgRC wins in compression ratio over its main competitors, SPRING and Minicom, by up to 18 and 21 percent on average, respectively, while being at least comparably fast in decompression. Availability: PgRC can be downloaded from https://github.com/kowallus/PgRC. Contact: tomasz.kowalski@p.lodz.pl

[...] and distributes them into buckets. Its key concept, however, is to use so-called minimizers (Roberts et al., 2004) for the bucket labels. A minimizer of length ℓ for a read R of length m is the lexicographically smallest of the (m − ℓ + 1) ℓ-mers of R. A canonical minimizer, which is actually used by ORCOM, is a minimizer taken over the read and its reverse-complemented form. Two reads with a large overlap are likely to share the same (canonical or non-canonical) minimizer and thus the same bucket. The contents of each bucket are compressed separately, with sorting the reads from their minimizer's position, careful modeling of mismatches and other minor improvements, combined with arithmetic coding or PPMd (context-based) compression applied to several resulting data streams. The compression ratio on a 134 Gbp human genome sequencing data achieved by ORCOM was 0.317 bits per base, improving the BEETL's result of 0.518 bits per base.
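
The minimizer definition in the excerpt above can be made concrete with a short sketch. It assumes a plain DNA alphabet (A, C, G, T, plus N), plain lexicographic string ordering, and a length parameter named `ell`; the function names are hypothetical and are not taken from ORCOM or PgRC.

```python
# Complement table for the assumed alphabet; N is mapped to itself.
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(read: str) -> str:
    """Reverse the read and complement every base."""
    return "".join(_COMPLEMENT[base] for base in reversed(read))


def minimizer(read: str, ell: int) -> str:
    """Lexicographically smallest of the (m - ell + 1) substrings of length ell."""
    if not 0 < ell <= len(read):
        raise ValueError("ell must satisfy 0 < ell <= len(read)")
    return min(read[i:i + ell] for i in range(len(read) - ell + 1))


def canonical_minimizer(read: str, ell: int) -> str:
    """Minimizer taken over the read and its reverse complement."""
    return min(minimizer(read, ell), minimizer(reverse_complement(read), ell))


if __name__ == "__main__":
    read = "GATTACA"
    print(minimizer(read, 3))            # ACA
    print(canonical_minimizer(read, 3))  # AAT
```

Because the canonical minimizer is the same for a read and its reverse complement, reads sequenced from either strand of the same region tend to fall into the same bucket.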
Mince (Patro and Kingsford, 2015) is a related algorithm, but its distribution of reads into buckets is based on the number of shared k-mers. More precisely, a read R is thrown to the bucket which maximizes the number of k-mers of R occurring in any read the bucket contains. Its compression ratio is in most cases a few percent higher than ORCOM's (see, e.g., extensive comparisons in (Liu et al., 2018)), but it is less efficient in terms of time and memory usage. FaStore (Roguski et al., 2018) also follows the ORCOM approach, but improves its compression ratio (by a factor of about 1.2 typically) mostly thanks to re-distribution of reads from the buckets and assembling reads into contigs; in other words, it allows merging similar clusters of reads. FaStore also boasts good performance (decompression speed exceeding 100 MB/s, and even 250 MB/s in one of the modes, using 8 threads) and several lossy modes for the quality and header streams. HARC (Chandak et al., 2018a) abandons disk-based bucketing in favor of succinct in-memory hash tables. Its basic idea is to find maximum overlaps between reads and create consensus sequences, using majority vot...

