2019
DOI: 10.1007/s40484-019-0181-x

Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era

Abstract: Background: De novo genome assembly relies on two kinds of graphs: de Bruijn graphs and overlap graphs. Overlap graphs are the basis for the Celera assembler, while de Bruijn graphs have become the dominant technical device in the last decade. Those two kinds of graphs are collectively called assembly graphs. Results: In this review, we discuss the most recent advances in the problem of constructing, representing and navigating assembly graphs, focusing on very large datasets. We will also explore some computa…

Cited by 35 publications (26 citation statements)
References 96 publications
“…For the overlap-layout-consensus assembly approach, elevated coverage depth leads to a quadratic increase in the time necessary to compute overlaps (and in the number of overlaps that need to be computed). For de Bruijn graph assemblers, the very high coverage depth amplifies the effect of errors on the assembly graph and may even confuse error correction algorithms (simply by chance multiple random errors can “confirm” each other; Rizzi et al, 2019). Also the level of complexity (the number of different species) of a metagenome could be very different, from tens to ten thousands even further complicating the assembly process.…”
Section: Challenges Of Metagenome Assemblies (mentioning)
confidence: 99%
“…Canu, on the other hand, applies an overlapping (OLC) strategy for de novo assembly. There are differences on how these two approaches (DBG and OLC) work, which have been extensively studied (Rizzi et al, 2019). OLC tends to be computationally demanding due to the fact that it performs an all-vs-all alignment of the reads to find overlapping regions and call a consensus, while DBG has a more relaxed computer requirement and therefore it has been widely used for SR assembly.…”
Section: Raw Read Filtering and Assembly Of Metagenomic Samples (mentioning)
confidence: 99%
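The all-vs-all comparison described in the statement above can be illustrated with a toy sketch. It is not how Canu or any other assembler computes overlaps: it assumes error-free reads and exact suffix-prefix matches, and the names `overlap_length`, `all_vs_all_overlaps` and the `min_len` threshold are hypothetical, chosen only for this illustration. The point is that every ordered pair of reads is examined, so n reads cost n(n-1) comparisons.

```python
from itertools import permutations


def overlap_length(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    # Try the longest candidate overlap first and shrink until one matches.
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def all_vs_all_overlaps(reads, min_len=3):
    """Compare every ordered pair of reads: n * (n - 1) comparisons."""
    overlaps = {}
    for i, j in permutations(range(len(reads)), 2):
        n = overlap_length(reads[i], reads[j], min_len)
        if n:
            overlaps[(i, j)] = n  # edge: read i is followed by read j
    return overlaps


# Toy, error-free reads (made up for this example).
reads = ["ACGTAC", "GTACGG", "ACGGTT"]
overlaps = all_vs_all_overlaps(reads)
print(overlaps)
```

With three reads the loop runs six comparisons; doubling the number of reads roughly quadruples that count, which is the cost the statement attributes to OLC.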
“…Within transcriptome reference sets, such as the cDNA databases available from Ensembl representing various species [5], or those that are de novo assembled from short-read RNA-Seq data, non-chimeric sequences are direct representations of transcribed genes, while artificially generated chimeric ones are mosaics of two or more pieces of DNA incorrectly pieced together. The latter occurring during library preparation [6,7], or during the de novo assembly process [8,9], where there is a requirement to traverse paths across graphs constructed from read data that ranges in complexity depending on the nature of the gene families being represented [10][11][12]. Chimeras also occur at a genomic level during de novo assembly, such as when inferring haplotypes [13,14], but the causes, and consequences, at a genomic level are different [15][16][17].…”
Section: Introduction (mentioning)
confidence: 99%
“…A crucial part of de novo transcriptome assembly of short-read data is the arrangement of information present within reads into structures that represent full or partial gene families. These take the form of graphs, mostly de Bruijn [9, 24], but may also be created from overlap consensus approaches [9, 51]. In the de Bruijn based approach millions of fragments of specified length, termed kmers, are extracted from reads and used as nodes.…”
Section: Introduction (mentioning)
confidence: 99%
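The k-mer extraction step mentioned in the statement above can be sketched as follows. This is a minimal illustration, assuming the node-centric formulation (k-mers as nodes, an edge between k-mers that are adjacent in some read) and error-free toy reads; the function names `kmers` and `build_de_bruijn` are hypothetical, and real assemblers add reverse complements, abundance filtering and compact representations that are omitted here.

```python
def kmers(read: str, k: int):
    """All substrings of length k, in the order they occur in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads, k):
    """Node-centric de Bruijn graph: nodes are k-mers, and an edge links
    two k-mers that occur at consecutive positions in some read (so they
    overlap by k - 1 bases)."""
    graph = {}
    for read in reads:
        ks = kmers(read, k)
        for node in ks:
            graph.setdefault(node, set())  # register every k-mer as a node
        for left, right in zip(ks, ks[1:]):
            graph[left].add(right)
    return graph


# Toy, error-free reads (made up for this example).
reads = ["ACGTAC", "GTACGG"]
graph = build_de_bruijn(reads, k=4)
for node, successors in graph.items():
    print(node, "->", sorted(successors))
```

The two reads share the k-mer `GTAC`, so they merge into a single path without any read-to-read comparison, which is one reason the approach scales to short-read data.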