The highly anticipated transition from next generation sequencing (NGS) to third generation sequencing (3GS) has been difficult primarily due to high error rates and excessive sequencing cost. The high error rates make the assembly of long erroneous reads of large genomes challenging because existing software solutions are often overwhelmed by error correction tasks. Here we report a hybrid assembly approach that simultaneously utilizes NGS and 3GS data to address both issues. We gain advantages from three general and basic design principles: (i) Compact representation of the long reads leads to efficient alignments. (ii) Base-level errors can be skipped; structural errors need to be detected and corrected. (iii) Structurally correct 3GS reads are assembled and polished. In our implementation, preassembled NGS contigs are used to derive the compact representation of the long reads, motivating an algorithmic conversion from a de Bruijn graph to an overlap graph, the two major assembly paradigms. Moreover, since NGS and 3GS data can compensate for each other, our hybrid assembly approach reduces both of their sequencing requirements. Experiments show that our software is able to assemble mammalian-sized genomes orders of magnitude more quickly than existing methods without consuming a lot of memory, while saving about half of the sequencing cost.
Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. It has also been widely used to study structural variants, phase haplotypes and more. Here, we introduce SMARTdenovo, a single-molecule sequencing (SMS) assembler that follows the overlap-layout-consensus (OLC) paradigm. SMARTdenovo (RRID: SCR_017622) was designed to be a rapid assembler, which, unlike contemporaneous SMS assemblers, does not require highly accurate raw reads for error correction. It has performed well in the evaluation of congeneric assemblers and has been successfully used for various assembly projects. It is compatible with Canu for assembling high-quality genomes, and several of the assembly strategies in this program have been incorporated into subsequent popular assemblers. The assembler has been in use since 2015; here we provide information on the development of SMARTdenovo and how to implement its algorithms into current projects.
A persistent enigma is the rarity of polyploidy in animals, compared to its prevalence in plants. Although animal polyploids are thought to experience deleterious genomic chaos during initial polyploidization and subsequent rediploidization processes, this hypothesis has not been tested. We provide an improved reference-quality de novo genome for allotetraploid goldfish whose origin dates to ~15 million years ago. Comprehensive analyses identify changes in subgenomic evolution from asymmetrical oscillation in goldfish and common carp to diverse stabilization and balanced gene expression during continuous rediploidization. The homoeologs are coexpressed in most pathways, and their expression dominance shifts temporally during embryogenesis. Homoeolog expression correlates negatively with alternation of DNA methylation. The results show that allotetraploid cyprinids have a unique strategy for balancing subgenomic stabilization and diversification. The rediploidization process in these fishes provides intriguing insights into genome evolution and function in allopolyploid vertebrates.
Domesticated buffaloes have been integral to rice-paddy agro-ecosystems for millennia, yet relatively little is known about buffalo genomics. Here, we sequenced and assembled reference genomes for both swamp and river buffaloes, and we re-sequenced 230 individuals (132 swamp buffaloes and 98 river buffaloes) sampled from across Asia and Europe. Beyond the many actionable insights that our study revealed about the domestication, basic physiology and breeding of buffalo, we made the striking discovery that the divergent domestication traits between swamp and river buffaloes can be explained by recent selection on genes related to social behavior, digestive metabolism, strength and milk production.
Background: The advent of third-generation sequencing (TGS) technologies opens the door to improving genome assembly. Long reads are promising for enhancing the quality of fragmented draft assemblies constructed from next-generation sequencing (NGS) technologies. To date, a few algorithms capable of improving draft assemblies have been released, including SSPACE-LongRead, OPERA-LG, SMIS, npScarf, DBG2OLC, Unicycler, and LINKS. Hybrid assembly of large genomes remains challenging, however.

Results: We develop a scalable and computationally efficient scaffolder, Long Reads Scaffolder (LRScaf, https://github.com/shingocat/lrscaf), that is capable of significantly boosting assembly contiguity using long reads. In this study, we summarise a comprehensive performance assessment of state-of-the-art scaffolders and LRScaf on seven organisms, i.e., E. coli, S. cerevisiae, A. thaliana, O. sativa, S. pennellii, Z. mays, and H. sapiens. LRScaf significantly improves the contiguity of draft assemblies, e.g., increasing the NGA50 value of CHM1 from 127.1 kbp to 9.4 Mbp using a 20-fold coverage PacBio dataset and the NGA50 value of NA12878 from 115.3 kbp to 12.9 Mbp using a 35-fold coverage Nanopore dataset. In addition, LRScaf achieves the best NGA50 on A. thaliana, S. pennellii, Z. mays, and H. sapiens. Moreover, LRScaf has the shortest run time compared with other scaffolders, and the peak RAM of LRScaf remains practical for large genomes (e.g., 20.3 and 62.6 GB on CHM1 and NA12878, respectively).

Conclusions: The new algorithm, LRScaf, yields the best or, at least, moderate scaffold contiguity and accuracy in the shortest run time compared with other scaffolding algorithms. Furthermore, LRScaf provides a cost-effective way to improve the contiguity of draft assemblies of large genomes.
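The contiguity figures quoted above are N50-family statistics. As a minimal sketch (not taken from LRScaf or any evaluation tool), the function below computes N50 from a list of sequence lengths, or NG50 when a genome size is supplied. The name `nx50` and its parameters are hypothetical. NGA50 follows the same calculation, but it is applied to aligned blocks (contigs or scaffolds broken at misassemblies after alignment to a reference) and uses the reference genome length as the denominator; the alignment step is not shown here.

```python
from typing import Iterable, Optional


def nx50(lengths: Iterable[int], genome_size: Optional[int] = None) -> int:
    """Return N50 of the lengths, or NG50 if genome_size is given.

    Hypothetical helper for illustration: N50 is the length L such that
    sequences of length >= L together cover at least half of the total
    assembly length; NG50 uses half of the genome size instead.
    """
    ordered = sorted(lengths, reverse=True)
    total = genome_size if genome_size is not None else sum(ordered)
    target = total / 2

    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    # The sequences cover less than half of genome_size (or the list is empty).
    return 0


if __name__ == "__main__":
    contigs = [100, 80, 60, 40, 20]          # total length 300
    print(nx50(contigs))                      # N50 against the assembly total
    print(nx50(contigs, genome_size=400))     # NG50 against an assumed genome size
```

Passing the lengths of aligned blocks together with the reference length as `genome_size` would give an NGA50-style value, assuming the blocks were produced by a separate alignment step.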
LRScaf is 300 times faster for S. cerevisiae and 2,300 times faster for D. melanogaster. LRScaf is also more memory-efficient than LINKS in our test. For the rice case, the peak RAM of LINKS (877.72 Gb) is about 196 times higher than that of LRScaf. For the human assembly experiment, the peak RAM of LINKS is beyond the capacity of system memory (1 Tb), whereas LRScaf takes 20.28 and 41.20 Gb on the CHM1 and NA12878 datasets, respectively.

With the advent of Next Generation Sequencing (NGS) technologies, the genomics community has made significant contributions to de novo genome assembly. Although many studies and tools aim to reconstruct NGS data into complete de novo genome assemblies, this goal is difficult to achieve because of an intrinsic limitation of NGS data, i.e., read lengths are shorter than most of the repetitive sequences [1]. Repeats make it difficult to reconstruct complete genomes; assemblers instead generate a large set of contiguous sequences (contigs), even when the sequencing coverage is high [2]. Thus, attention is focused on the so-called genomic scaffolding procedure, which aims at reducing the number of contigs by using fragments of moderate lengths whose ends are sequenced (double-barreled data) [3,4]. Nevertheless, major genomic regions still hinder genomic assemblies, primarily because of large repeats and low coverage. In response, Third Generation Sequencing (TGS) technologies have been developed. TGS offers alternatives for solving genome assembly problems by providing very long reads, e.g., Single Molecule Real Time (SMRT) sequencing delivers read lengths of up to 50 Kb [5] and the nanopore sequencing technology of Oxford Nanopore Technologies® (ONT) delivers read lengths greater than 800 Kb [6]. These long reads suffer from high sequencing error rates, however, which necessitates high coverage during genome assembly [7]. In addition, TGS technologies have a higher cost per base than NGS methods. Consequently, long reads are more commonly used for scaffolding draft assemblies generated from NGS data than for de novo assembly [8].

The process of genome assembly is typically divided into two major steps. The first step is to piece overlapping reads together into contigs, which is commonly done using a de Bruijn or overlap graph [1]. The second step is to assemble scaffolds, which consist of ordered sequences of oriented contigs with estimated distances between them. Scaffolding, which was first introduced by Huson [3], is a critical part of the genome assembly process, especially for NGS data. Yet scaffolding is a research area that remains largely open because of its NP-hard complexity [9]. By using linking information from paired-end and/or mate-pair reads, a number of standalone scaffolders, e.g., Bambus [4], MIP [10], Opera [11], SCARPA [12], SOPRA [13], SSPACE [14], BESST [15], and BOSS [16], have been developed. Nevertheless, a recent comprehensive evaluation showed that scaffolding was stil...