Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era

Rizzi, Raffaella; Beretta, Stefano; Patterson, Murray; Pirola, Yuri; Previtali, Marco; Della Vedova, Gianluca; Bonizzoni, Paola

doi:10.1007/s40484-019-0181-x

Cited by 35 publications

(26 citation statements)

References 96 publications

“…For the overlap-layout-consensus assembly approach, elevated coverage depth leads to a quadratic increase in the time necessary to compute overlaps (and in the number of overlaps that need to be computed). For de Bruijn graph assemblers, the very high coverage depth amplifies the effect of errors on the assembly graph and may even confuse error correction algorithms (simply by chance multiple random errors can “confirm” each other; Rizzi et al, 2019). Also the level of complexity (the number of different species) of a metagenome could be very different, from tens to tens of thousands, further complicating the assembly process.…”

Section: Challenges Of Metagenome Assemblies (mentioning)

confidence: 99%
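
The quadratic growth mentioned in the statement above can be illustrated with a small calculation. This is a minimal sketch under simplifying assumptions (uniform read length, a naive all-pairs comparison); the function names `num_reads` and `candidate_pairs` are hypothetical and are not taken from any assembler.

```python
def num_reads(genome_len, read_len, coverage):
    """Approximate number of reads needed to reach a given coverage depth."""
    return coverage * genome_len // read_len


def candidate_pairs(n):
    """Number of unordered read pairs a naive all-vs-all comparison inspects."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    # Hypothetical toy genome: 1,000 bp sequenced with 100 bp reads.
    for coverage in (10, 20, 40):
        n = num_reads(1000, 100, coverage)
        print(coverage, n, candidate_pairs(n))
```

Because the number of reads grows linearly with coverage, doubling the coverage roughly quadruples the number of pairs to examine. Practical overlappers avoid the full pairwise comparison through indexing, but the number of true overlaps still grows with the square of the coverage.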

Metagenomic Data Assembly – The Way of Decoding Unknown Microorganisms

Lapidus

Korobeynikov

2021

Front. Microbiol.

Metagenomics is a segment of conventional microbial genomics dedicated to the sequencing and analysis of combined genomic DNA of entire environmental samples. The most critical step of the metagenomic data analysis is the reconstruction of individual genes and genomes of the microorganisms in the communities using metagenomic assemblers – computational programs that put together small fragments of sequenced DNA generated by sequencing instruments. Here, we describe the challenges of metagenomic assembly, a wide spectrum of applications in which metagenomic assemblies were used to better understand the ecology and evolution of microbial ecosystems, and present one of the most efficient microbial assemblers, SPAdes that was upgraded to become applicable for metagenomics.

“…Canu, on the other hand, applies an overlapping (OLC) strategy for de novo assembly. There are differences on how these two approaches (DBG and OLC) work, which have been extensively studied (Rizzi et al, 2019). OLC tends to be computationally demanding due to the fact that it performs an all-vs-all alignment of the reads to find overlapping regions and call a consensus, while DBG has a more relaxed computer requirement and therefore it has been widely used for SR assembly.…”

Section: Raw Read Filtering and Assembly Of Metagenomic Samples (mentioning)

confidence: 99%
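
The all-vs-all step that makes OLC demanding, as described in the statement above, can be sketched as follows. This is an illustrative toy that assumes error-free reads and exact suffix-prefix matching; the names `overlap_len` and `all_vs_all` are hypothetical, and real OLC assemblers rely on indexing and error-tolerant alignment rather than this brute-force loop.

```python
from itertools import permutations


def overlap_len(a, b, min_len):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no exact overlap of at least `min_len` characters exists.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def all_vs_all(reads, min_len=3):
    """Compare every ordered pair of reads; return {(i, j): overlap length}."""
    edges = {}
    for i, j in permutations(range(len(reads)), 2):
        n = overlap_len(reads[i], reads[j], min_len)
        if n:
            edges[(i, j)] = n
    return edges


if __name__ == "__main__":
    # Hypothetical toy reads, chosen so that read 0 -> 1 -> 2 overlap in order.
    print(all_vs_all(["ACGTAC", "TACGGA", "GGATTT"]))
```

The loop over `permutations` visits every ordered pair of reads, which is where the quadratic cost comes from.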

Enhanced Recovery of Microbial Genes and Genomes From a Marine Water Column Using Long-Read Metagenomics

2021

Third-generation sequencing has penetrated little in metagenomics due to the high error rate and dependence for assembly on short-read designed bioinformatics. However, second-generation sequencing metagenomics (mostly Illumina) suffers from limitations, particularly in the assembly of microbes with high microdiversity and retrieval of the flexible (adaptive) fraction of prokaryotic genomes. Here, we have used a third-generation technique to study the metagenome of a well-known marine sample from the mixed epipelagic water column of the winter Mediterranean. We have compared PacBio Sequel II with the classical approach using Illumina Nextseq short reads followed by assembly to study the metagenome. Long reads allow for efficient direct retrieval of complete genes avoiding the bias of the assembly step. Besides, the application of long reads on metagenomic assembly allows for the reconstruction of much more complete metagenome-assembled genomes (MAGs), particularly from microbes with high microdiversity such as Pelagibacterales. The flexible genome of reconstructed MAGs was much more complete containing many adaptive genes (some with biotechnological potential). PacBio Sequel II CCS appears particularly suitable for cellular metagenomics due to its low error rate. For most applications of metagenomics, from community structure analysis to ecosystem functioning, long reads should be applied whenever possible. Specifically, for in silico screening of biotechnologically useful genes, or population genomics, long-read metagenomics appears presently as a very fruitful approach and can be analyzed from raw reads before a computationally demanding (and potentially artifactual) assembly step.

“…Within transcriptome reference sets, such as the cDNA databases available from Ensembl representing various species [5], or those that are de novo assembled from short-read RNA-Seq data, non-chimeric sequences are direct representations of transcribed genes, while artificially generated chimeric ones are mosaics of two or more pieces of DNA incorrectly pieced together. The latter occurring during library preparation [6,7], or during the de novo assembly process [8,9], where there is a requirement to traverse paths across graphs constructed from read data that ranges in complexity depending on the nature of the gene families being represented [10][11][12]. Chimeras also occur at a genomic level during de novo assembly, such as when inferring haplotypes [13,14], but the causes, and consequences, at a genomic level are different [15][16][17].…”

Section: Introduction (mentioning)

confidence: 99%

“…A crucial part of de novo transcriptome assembly of short-read data is the arrangement of information present within reads into structures that represent full or partial gene families. These take the form of graphs, mostly de Bruijn [9, 24], but may also be created from overlap consensus approaches [9, 51]. In the de Bruijn based approach millions of fragments of specified length, termed kmers, are extracted from reads and used as nodes.…”

Section: Introduction (mentioning)

confidence: 99%
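
The k-mer extraction described in the statement above can be sketched in a few lines. This is a minimal illustration that assumes the node-centric convention (k-mers as nodes, with an edge between k-mers that are consecutive in a read); the names `kmers` and `de_bruijn` are hypothetical, and the sketch is not CStone's implementation.

```python
def kmers(read, k):
    """All substrings of length k, in the order they occur in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn(reads, k):
    """Map each k-mer (node) to the set of k-mers that directly follow it."""
    graph = {}
    for read in reads:
        ks = kmers(read, k)
        for node in ks:
            graph.setdefault(node, set())  # register nodes with no successor too
        for a, b in zip(ks, ks[1:]):
            graph[a].add(b)  # consecutive k-mers overlap by k - 1 characters
    return graph


if __name__ == "__main__":
    # Hypothetical toy reads sharing the k-mer "CGT".
    print(de_bruijn(["ACGT", "CGTA"], k=3))
```

Identical k-mers from different reads collapse into one node, so the graph size depends on the number of distinct k-mers rather than on the number of reads.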

CStone: A de novo transcriptome assembler for short-read data that identifies non-chimeric contigs based on underlying graph structure

Linheiro

Archer

2021

PLoS Comput Biol

With the exponential growth of sequence information stored over the last decade, including that of de novo assembled contigs from RNA-Seq experiments, quantification of chimeric sequences has become essential when assembling read data. In transcriptomics, de novo assembled chimeras can closely resemble underlying transcripts, but patterns such as those seen between co-evolving sites, or mapped read counts, become obscured. We have created a de Bruijn based de novo assembler for RNA-Seq data that utilizes a classification system to describe the complexity of underlying graphs from which contigs are created. Each contig is labelled with one of three levels, indicating whether or not ambiguous paths exist. A by-product of this is information on the range of complexity of the underlying gene families present. As a demonstration of CStone's ability to assemble high-quality contigs, and to label them in this manner, both simulated and real data were used. For simulated data, ten million read pairs were generated from cDNA libraries representing four species, Drosophila melanogaster, Panthera pardus, Rattus norvegicus and Serinus canaria. These were assembled using CStone, Trinity and rnaSPAdes; the latter two being high-quality, well-established, de novo assemblers. For real data, two RNA-Seq datasets, each consisting of ≈30 million read pairs, representing two adult D. melanogaster whole-body samples were used. The contigs that CStone produced were comparable in quality to those of Trinity and rnaSPAdes in terms of length, sequence identity of aligned regions and the range of cDNA transcripts represented, whilst providing additional information on chimerism. Here we describe the details of CStone's assembly and classification process, and propose that similar classification systems can be incorporated into other de novo assembly tools.
Within a related side study, we explore the effects that chimeras within reference sets have on the identification of differentially expressed genes. CStone is available at: https://sourceforge.net/projects/cstone/.
