Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes

Shafin, Kishwar; Pesout, Trevor; Lorig-Roach, Ryan; Haukness, Marina; Olsen, Hugh E.; Bosworth, Colleen M.; Armstrong, Joel; Tigyi, Kristof; Maurer, Nicholas; Koren, Sergey; Sedlazeck, Fritz J.; Marschall, Tobias; Mayes, Simon; Costa, Vânia; Zook, Justin M.; Liu, Kelvin; Kilburn, Duncan; Sorensen, Melanie; Munson, Katherine M.; Vollger, Mitchell R.; Monlong, Jean; Garrison, Erik; Eichler, Evan E.; Salama, Sofie R.; Haussler, David; Green, Edward; Akeson, Mark; Phillippy, Adam M.; Miga, Karen H.; Carnevali, P.; Jain, Miten; Paten, Benedict

doi:10.1038/s41587-020-0503-6

Cited by 380 publications

(389 citation statements)

References 65 publications

Supporting

Mentioning

383

Contrasting


“…2a). ONT data were not considered for further benchmarking due to practical issues concerning systematic base call errors, consistency, and scalability at the time (early 2017) 39 ; however the technology has since improved in these areas 40 and will be reconsidered in future phases of the VGP, as will PacBio's recently released HiFi circular consensus sequencing (CCS) 41 . assembly pipeline applied across multiple species.…”

Section: Iterative Assembly Pipeline (mentioning)

confidence: 99%

Towards complete and error-free genome assemblies of all vertebrate species

Rhie, McCarthy, Fédrigo et al. 2020

Preprint

Self Cite



High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are only available for a few non-microbial species 1-4 . To address this issue, the international Genome 10K (G10K) consortium 5,6 has worked over a five-year period to evaluate and develop cost-effective methods for assembling the most accurate and complete reference genomes to date. Here we summarize these developments, introduce a set of quality standards, and present lessons learned from sequencing and assembling 16 species representing major vertebrate lineages (mammals, birds, reptiles, amphibians, teleost fishes and cartilaginous fishes). We confirm that long-read sequencing technologies are essential for maximizing genome quality and that unresolved complex repeats and haplotype heterozygosity are major sources of error in assemblies. Our new assemblies identify and correct substantial errors in some of the best historical reference genomes. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an effort to generate high-quality, complete reference genomes for all ~70,000 extant vertebrate species and help enable a new era of discovery across the life sciences.


“…Collaborators evaluated 5 False Positive SNVs, 5 False Positive Indels, 5 False Negative SNVs, 5 False Negative Indels both inside and outside v3.3.2 along with 5 False Positive SNVs, 5 False Positive Indels, 5 False Negative SNVs, 5 False Negative Indels in the MHC for GRCh37. We generated IGV sessions with BAM files for Illumina HiSeq, 10x Genomics, PacBio HiFi 15kb & 20 kb merged, and ONT Ultralong 11 , then asked that the evaluators identify for each site if both alleles in the benchmark were correct and if both alleles in the query call set were correct.…”

Section: Evaluation Of the Benchmark (mentioning)

confidence: 99%

“…5 These benchmarks and benchmarking tools helped enable the development and optimization of new technologies and bioinformatics approaches, including linked reads, 6 highly accurate long reads, 7 deep learning-based variant callers, 8,9 graph-based variant callers, 10 and de novo assembly. 11,12 However, these benchmarks did not cover some challenging regions that these new methods could access, including many known medically relevant genes. 13,14 This limitation highlighted the need for improved benchmarks covering segmental duplications, the Major Histocompatibility Complex (MHC), and other challenging regions.…”

Section: Introduction (mentioning)

confidence: 99%

Benchmarking challenging small variants with linked and long reads

Wagner, Olson, Harris et al. 2020

Preprint

Self Cite


Genome in a Bottle (GIAB) benchmarks have been widely used to validate clinical sequencing pipelines and develop new variant calling and sequencing methods. Here we use accurate long and linked reads to expand the prior benchmark to include difficult-to-map regions and segmental duplications that are not readily accessible to short reads. Our new benchmark adds more than 300,000 SNVs, 50,000 indels, and 16 % new exonic variants, many in challenging, clinically relevant genes not previously covered (e.g., PMS2). We increase coverage of the GRCh38 assembly from 85 % to 92 %, while excluding problematic regions for benchmarking small variants (e.g., copy number variants and assembly errors) that should not have been in the previous version. Our new benchmark reliably identifies both false positives and false negatives across multiple short-, linked-, and long-read based variant calling methods. As an example of its utility, this benchmark identifies eight times more false negatives in a short read variant call set relative to our previous benchmark, mostly in difficult-to-map regions. To enable robust small variant benchmarking, we still exclude 3.6% of GRCh37 and 5.0% of GRCh38 in (1) highly repetitive regions such as large, highly similar segmental duplications and the centromere not accessible to our data and (2) regions where our sample is highly divergent from the reference due to large indels, structural variation, copy number variation, and/or errors in the reference (e.g., some KIR genes that have duplications in HG002). We have demonstrated the utility of this benchmark to assess performance in more challenging regions, which enables benchmarking in more difficult genes and continued technology and bioinformatics development. 
The benchmarks are available at: ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NISTv4.1/ and ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_v4.2_SmallVariantDraftBenchmark_07092020/


“…Canu version 2.0, Flye version 2.7, Miniasm/Minipolish version 0.1.3 (35), Raven version 1.1.10 (36), NECAT version 0.01 (37), wtdbg2 version 2.5 (38), and shasta version 0.5.1 (39). All assemblers were run with default parameters (flagging raw or corrected reads depending on read input, Raven was run with the weaken flag when corrected reads were used).…”

Section: Validation Of Assembly and Comparison Of Long Read Assembler (mentioning)

confidence: 99%

“…Different isolates (variants) of the same species have been found to vary greatly in their phenotypes (16), but due to the relatively small number of isolates sequenced, the extent of genomic variation between strains is poorly understood. Owing to their genomes having multiple chromosomes that contribute to their relatively large genome sizes (30-45) in comparison to bacterial microbes (around 5 Mb), de novo genome assemblies of Metarhizium spp. using first generation sequencing is very costly, and second-generation sequencing results in assemblies that are highly contiguous, falling apart around repeat rich and homologous regions of the genome.…”

Section: Introduction (mentioning)

confidence: 99%

Telomere Length De Novo Assembly of all 7 Chromosomes and Mitogenome Sequencing of the Model Entomopathogenic Fungus, Metarhizium Brunneum, by Means of a Novel Assembly Pipeline

Saud, Kortsinoglou, Kouvelis et al. 2020

Preprint


More accurate and complete reference genomes have improved understanding of gene function, biology, and evolutionary mechanisms. Hybrid genome assembly approaches leverage benefits of both long, relatively error-prone reads from third-generation sequencing technologies and short, accurate reads from second-generation sequencing technologies, to produce more accurate and contiguous de novo genome assemblies in comparison to using either technology independently. In this study, we present a novel hybrid assembly pipeline that allowed for both mitogenome de novo assembly and telomere length de novo assembly of all 7 chromosomes of the model entomopathogenic fungus, Metarhizium brunneum. The improved assembly allowed for better ab initio gene prediction and a more BUSCO complete proteome set has been generated in comparison to the eight current NCBI reference Metarhizium spp. genomes. Remarkably, we note that including the mitogenome in ab initio gene prediction training improved overall gene prediction. The assembly was further validated by comparing contig assembly agreement across various assemblers, assessing the assembly performance of each tool. Genomic synteny and orthologous protein clusters were compared between Metarhizium brunneum and three other Hypocreales species with complete genomes, identifying core proteins, and listing orthologous protein clusters shared uniquely between the two entomopathogenic fungal species, so as to further facilitate the understanding of molecular mechanisms underpinning fungal-insect pathogenesis. The novel assembly pipeline may be used for other haploid fungal species, facilitating the need to produce high-quality reference fungal genomes, leading to better understanding of fungal genomic evolution, chromosome structuring and gene regulation.

