deSALT: fast and accurate long transcriptomic read alignment with de Bruijn graph-based index

Liu, Bo; Liu, Yadong; Li, Junyi; Guo, Hongzhe; Zang, Tianyi; Wang, Yadong

doi:10.1186/s13059-019-1895-9

Cited by 47 publications

(36 citation statements)

References 42 publications


“…To this end, the de Bruijn graph has become an object of central importance in many genomic analysis tasks. While it was initially used mostly in the context of genome (and transcriptome) assembly (EULER [42], Velvet [51,52], ALLPATHS [9,30], EULER-SR [10], ABySS [46], SOAPdenovo [25,29], Trans-AByss [43], SPAdes [5], Minia [13]), it has seen increasing use in comparative genomics (Cortex [19], DISCOSNP [50], Scalpel [15], BubbZ [34]) and has also been used increasingly in the context of indexing genomic data, either from raw sequencing reads (Mantis [40,1], Vari [37], VariMerge [36], MetaGraph [20]), or from assembled reference sequences (deBGA [27], Pufferfish [2], deSALT [28]), or from both (BLight [32], Bifrost [17]). These latter applications most frequently make use of the (colored) compacted de Bruijn graph, a variant of the de Bruijn graph in which maximal non-branching paths (unitigs) are condensed into single vertices in the underlying graph structure.…”

Section: Introduction (mentioning)

confidence: 99%

“…To this end, the de Bruijn graph has become an object of central importance in many genomic analysis tasks. While it was initially used mostly in the context of genome (and transcriptome) assembly (EULER (Pevzner et al, 2001), EULER-SR (Chaisson and Pevzner, 2008), Velvet (Zerbino and Birney, 2008; Zerbino et al, 2009), ALLPATHS (Butler et al, 2008; MacCallum et al, 2009), ABySS (Simpson et al, 2009), Trans-AByss (Robertson et al, 2010), SPAdes (Bankevich et al, 2012), Minia (Chikhi and Rizk, 2013), SOAPdenovo (Li et al, 2010; Luo et al, 2015)), it has seen increasing use in comparative genomics (Cortex (Iqbal et al, 2012), DISCOSNP (Uricaru et al, 2014), Scalpel (Fang et al, 2016), BubbZ (Minkin and Medvedev, 2020)), and has also been used increasingly in the context of indexing genomic data, either from raw sequencing reads (Vari (Muggli et al, 2017), Mantis (Pandey et al, 2018; Almodaresi et al, 2019), VariMerge (Muggli et al, 2019), MetaGraph (Karasikov et al, 2020)), or from assembled reference sequences (deBGA (Liu et al, 2016), Pufferfish (Almodaresi et al, 2018), deSALT (Liu et al, 2019)), or from both (BLight (Marchet et al, 2019), Bifrost (Holley and Melsted, 2020)). These latter applications most frequently make use of the (colored) compacted de Bruijn graph, a variant of the de Bruijn graph in which the maximal non-branching paths (also referred to as unitigs) are condensed into single vertices in the underlying graph structure.…”

Section: Introduction (mentioning)

confidence: 99%
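
The compaction described in the statements above (maximal non-branching paths condensed into single vertices) can be illustrated with a minimal Python sketch. This is not how deSALT, deBGA, or Cuttlefish build their graphs; it is a toy, in-memory version that assumes a node-centric graph, ignores reverse complements and colors, and uses a hypothetical function name, `compacted_unitigs`.

```python
def compacted_unitigs(seqs, k):
    """Return the unitig strings of a node-centric de Bruijn graph of `seqs`.

    Toy assumptions: forward strand only, no colors, everything held in memory.
    """
    # Nodes are the distinct k-mers; an edge joins two k-mers that overlap by k-1.
    nodes = {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}
    succ = {n: [n[1:] + c for c in "ACGT" if n[1:] + c in nodes] for n in nodes}
    pred = {n: [c + n[:-1] for c in "ACGT" if c + n[:-1] in nodes] for n in nodes}

    def continues_path(n):
        # True when n has one predecessor (not itself) whose only successor is n,
        # i.e. n sits in the interior or at the end of a non-branching path.
        return (len(pred[n]) == 1 and pred[n][0] != n
                and len(succ[pred[n][0]]) == 1)

    visited, unitigs = set(), []

    def walk(start):
        # Extend to the right while the path stays non-branching.
        path, cur = [start], start
        visited.add(start)
        while len(succ[cur]) == 1:
            nxt = succ[cur][0]
            if len(pred[nxt]) != 1 or nxt in visited:
                break
            path.append(nxt)
            visited.add(nxt)
            cur = nxt
        # Spell the unitig: first k-mer plus the last base of each later k-mer.
        unitigs.append(path[0] + "".join(p[-1] for p in path[1:]))

    for n in sorted(nodes):          # paths with a well-defined first node
        if not continues_path(n):
            walk(n)
    for n in sorted(nodes):          # whatever is left lies on isolated cycles
        if n not in visited:
            walk(n)
    return sorted(unitigs)


if __name__ == "__main__":
    # Two sequences sharing the prefix AAGGC, then diverging.
    print(compacted_unitigs(["AAGGCGT", "AAGGCAT"], 3))
    # → ['AAGGC', 'GCAT', 'GCGT']
```

The shared prefix collapses into one vertex and each diverging branch into another, so seven k-mers are represented by three unitigs.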


Cuttlefish: Fast, parallel, and low-memory compaction of de Bruijn graphs from large-scale genome collections

Khan, Patro. 2020. Preprint.

Motivation: The construction of the compacted de Bruijn graph from a large collection of reference genomes is a task of increasing interest in genomic analyses. For example, compacted colored reference de Bruijn graphs are increasingly used as sequence indices for the purposes of alignment of short and long reads. Also, as we sequence and assemble a greater diversity of individual genomes, the compacted colored de Bruijn graph can be used as the basis for methods aiming to perform comparative genomic analyses on these genomes. While algorithms have been developed to construct the compacted colored de Bruijn graph from reference sequences, there is still room for improvement, especially in the memory and the runtime performance as the number and the scale of the genomes over which the de Bruijn graph is built grow. Results: We introduce a new algorithm, implemented in the tool Cuttlefish, to construct the colored compacted de Bruijn graph from a collection of one or more genome references. Cuttlefish introduces a novel modeling scheme of the de Bruijn graph vertices as finite-state automata, and constrains the state-space for the automata to enable tracking of their transitioning states with very low memory usage. Cuttlefish is also fast and highly parallelizable. Experimental results demonstrate that the algorithm scales much better than existing approaches, especially as the number and scale of the input references grow. For example, on a typical shared-memory machine, Cuttlefish constructed the compacted graph for 100 human genomes in less than 7 hours, using ~29 GB of memory; no other tested tool successfully completed this task on the testing hardware. We also applied Cuttlefish on 11 diverse conifer plant genomes, and the compacted graph was constructed in under 11 hours, using ~84 GB of memory, while the only other tested tool able to complete this compaction on our hardware took more than 16 hours and ~289 GB of memory. 
Availability: Cuttlefish is written in C++14, and is available under an open source license at https://github.com/COMBINE-lab/cuttlefish.



“…Making RNAseq read aligners aware of these sequence features (as is the case for the commonly used spliced aligners STAR (16), HISAT2 (17) and minimap2 (18)) can significantly improve the alignment of reads at splice junctions. In addition, where genome and transcriptome annotations exist, many alignment tools allow users to provide sets of correct splice junctions to guide alignment (16)(17)(18)(19). Introns containing these guide splice junctions are penalised less than novel introns, resulting in fewer alignment errors.…”

Section: Introduction (mentioning)

confidence: 99%

“…Two-pass alignment has also been used to improve splice junction detection and quantification (16,19,20). In a two-pass alignment approach, splice junctions detected in a first round of alignment are scored less negatively in a second round, thereby allowing information sharing between alignments.…”

Section: Introduction (mentioning)

confidence: 99%
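
The two-pass idea quoted above can be sketched in a few lines of Python. This is a toy model under loud assumptions: each read comes with a hypothetical list of candidate alignments given as `(base_score, junctions)`, the penalty values are invented, and a simple read-support threshold stands in for the junction filtering step. It does not reproduce the scoring of STAR, minimap2, deSALT, or 2passtools.

```python
from collections import Counter

NOVEL_PENALTY = 8    # assumed cost of an intron with no guide junction
GUIDED_PENALTY = 2   # assumed, smaller cost of an intron matching a guide


def score(candidate, guides):
    """Base alignment score minus one penalty per splice junction."""
    base, junctions = candidate
    return base - sum(
        GUIDED_PENALTY if j in guides else NOVEL_PENALTY for j in junctions
    )


def align_pass(reads, guides=frozenset()):
    """Pick the best-scoring candidate for every read."""
    return {
        name: max(cands, key=lambda c: score(c, guides))
        for name, cands in reads.items()
    }


def two_pass(reads, min_support=2):
    """First pass without guides, then a second pass guided by junctions
    that at least `min_support` first-pass alignments agree on."""
    first = align_pass(reads)
    support = Counter(j for _, juncs in first.values() for j in juncs)
    guides = {j for j, n in support.items() if n >= min_support}
    return first, align_pass(reads, guides)


if __name__ == "__main__":
    reads = {
        "r1": [(60, ((100, 200),))],
        "r2": [(58, ((100, 200),))],
        # r3 is ambiguous: a slightly better base score with a shifted junction.
        "r3": [(50, ((100, 200),)), (51, ((100, 203),))],
    }
    first, second = two_pass(reads)
    print(first["r3"][1], second["r3"][1])
```

In the first pass the ambiguous read `r3` takes the shifted junction because both candidates pay the same novel-intron penalty; in the second pass the junction supported by `r1` and `r2` is penalised less, so `r3` switches to it.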

Two-pass alignment using machine-learning-filtered splice junctions increases the accuracy of intron detection in long-read RNA sequencing

Parker, Knop, Barton et al. 2020. Preprint.

Transcription of eukaryotic genomes involves complex alternative processing of RNAs. Sequencing of full-length RNAs using long-reads reveals the true complexity of processing, however the relatively high error rates of long-read technologies can reduce the accuracy of intron identification. Here we present a two-pass approach, combining alignment metrics and machine-learning-derived sequence information to filter spurious examples from splice junctions identified in long-read alignments. The remaining junctions are then used to guide realignment. This method, available in the software package 2passtools (https://github.com/bartongroup/2passtools), improves the accuracy of spliced alignment and transcriptome annotation without requiring orthogonal information from short read RNAseq or existing annotations.

The development of technologies for sequencing full-length RNA molecules makes the identification of authentic processing events possible in principle, but software tools are also needed to interpret the RNA processing complexity. PacBio and ONT sequencing reads have a higher error rate than Illumina (10-14). Consequently, alignment accuracy for long sequence reads at splice junctions is often compromised (9-11). This is a problem for genome-guided transcriptome annotation because the incorrect identification of splice junctions leads to mis-annotated open reading frames and incorrectly truncated protein predictions. In addition, if alignment errors are systematic (i.e. occur for transcripts with specific characteristics), then quantification of transcripts will be compromised. Even with completely error-free reads, alignment at splice junctions is often confounded by multiple equally plausible alternatives (15). Accordingly, computational methods for improving the splice-aware alignment of long reads are required.
Software tools for long and short RNAseq data analysis incorporate several approaches to address the…


“…In many cases these higher error rates can prevent the correct identification of isoforms (11)(12)(13). Although several alignment software (14)(15)(16)(17)(18) are optimized to handle these errors, their shortcomings confound transcript identification and annotation. Many reads cannot be aligned and regions where the sequencing error rates are higher such as UTRs frequently produce ambiguous alignments.…”

Section: Introduction (mentioning)

confidence: 99%

TALC: Transcript-level Aware Long Read Correction

Broseus, Thomas, Oldfield et al. 2020. Preprint.

Long-read sequencing technologies are invaluable for determining complex RNA transcript architectures but are error-prone. Numerous "hybrid correction" algorithms have been developed for genomic data that correct long reads by exploiting the accuracy and depth of short reads sequenced from the same sample. These algorithms are not suited for correcting more complex transcriptome sequencing data. We have created a novel algorithm called TALC (Transcription Aware Long Read Correction) which models changes in RNA expression and isoform representation in a weighted De-Bruijn graph to correct long reads from transcriptome studies. We tested TALC on a dataset of short and long reads generated for this study. TALC correction results in more accurate reads with less structural errors than existing methods. TALC is implemented in C++ and available at https://gitlab.igh.cnrs.fr/lbroseus/TALC.

INTRODUCTION
Recent advances in RNA sequencing (RNA-seq) technologies have revealed that transcription is more pervasive (1), more diverse (2) and more cryptic (3) than expected. Given the major role that RNA processing plays in disease and normal biology, it is crucial to ascertain the existence of novel isoforms and to accurately quantify their abundance. Second generation RNA-seq technologies such as Illumina are well suited to the tasks of assessing gene expression levels and determining proximal exon connectivity. They produce numerous sequencing reads at a low cost ensuring sufficient representation of most transcripts. However, because the RNA or cDNA is fragmented during short RNA-seq protocols, long range connectivity can only be computationally inferred. These predictions based on short reads struggle to correctly identify transcript isoforms that contain multiple alternative exons (4) or that contain retained introns (5).
In these cases, long-read (LR) sequencing technologies are invaluable because they can sequence entire molecules in one pass and thus capture long-range connectivity of complex isoforms. LR technologies however produce less reads than short read (SR) sequencing approaches (6) for similar costs and have higher error rates. In many cases these higher error rates can prevent the correct identification of isoforms. Although several alignment software (7-9) are optimized to handle long erroneous sequences, they still display several issues which confound transcript identification and annotation. Many reads fail to be aligned and regions where the sequencing error rates are higher such as UTRs frequently produce ambiguous alignments. Secondly, they struggle to identify splice junctions, notably those flanking small structural variants such as small exons or alternative 5' and 3' splice sites. This impacts the evaluation of exon skipping events and worsens the quality of…

