ntEdit: scalable genome sequence polishing

Warren, René L.; Coombe, Lauren; Mohamadi, Hamid; Zhang, Jessica; Jaquish, Barry; Isabel, Nathalie; Jones, Steven J.M.; Bousquet, Jean; Bohlmann, Jörg; Birol, İnanç

doi:10.1093/bioinformatics/btz400

Cited by 77 publications

(80 citation statements)

References 13 publications

(14 reference statements)

“…Using the polished output from Pilon, we repeated the short read polishing two additional times and observed moderate improvements in BUSCO scores. Finally, due to the low Illumina sequencing coverage, we employed an additional polishing step utilizing ntEdit [27], which functions well in low sequence coverage situations. We observed a slight improvement recovering an additional 81 complete BUSCOs, ultimately obtaining 90% BUSCO completeness with ~5% listed as fragmented or missing, respectively.…”

Section: Results (mentioning)

confidence: 99%

“…Left indicates the number of rounds of each program, and bars display BUSCO notation. (D) ntEdit [27] was performed using the eHAP1 short reads on the 5x Racon/3x Pilon (eHAP1 Only) polished assembly and using eHAP1 then HAP1 (eHAP1-HAP1) or HAP1 then eHAP1 (HAP1-eHAP1) short reads. BUSCO scores were calculated after each round.…”

Section: Data Accession (mentioning)

confidence: 99%

“…The resulting Pilon-polished assembly was used as input for ntEdit (v1.2.2) [27]. Briefly, we ran Structural variant detection: Structural variants were detected using Sniffles [29] after alignment of the long reads against the hg19 reference genome using NGMLR [29] (-t 24, -x ont).…”

mentioning

confidence: 99%


Establishment of an eHAP1 Human Haploid Cell Line Hybrid Reference Genome Assembled from Short and Long Reads

Law, Warren, McCallion. 2019. Preprint.

Background: Haploid cell lines are a valuable research tool with broad applicability for genetic assays. As such the fully haploid human cell line, eHAP1, has been used in a wide array of studies. However, the absence of a corresponding reference genome sequence for this cell line has limited the potential for more widespread applications to experiments dependent on available sequence, like capture-clone methodologies.

Results: We generated ~15x coverage Nanopore long reads from ten GridION flowcells. We utilized this data to assemble a de novo draft genome using minimap and miniasm and subsequently polished using Racon. This assembly was further polished using previously generated, low-coverage, Illumina short reads with Pilon and ntEdit. This resulted in a hybrid eHAP1 assembly with >90% complete BUSCO scores. We further assessed the eHAP1 long read data for structural variants using Sniffles and identify a variety of rearrangements, including a previously established Philadelphia translocation. Finally, we demonstrate how some of these variants overlap open chromatin regions, potentially impacting regulatory regions.

Conclusions: By integrating both long and short reads, we generated a high-quality reference assembly for eHAP1 cells. We identify structural variants using long reads, including some that may impact putative regulatory elements. The union of long and short reads demonstrates the utility in combining sequencing platforms to generate a high-quality reference genome de novo solely from low coverage data. We expect the resulting eHAP1 genome assembly to provide a useful resource to enable novel experimental applications in this important model cell line.

Introduction: The vast majority of eukaryotic cells are diploid and many cellular models used experimentally are either diploid or polyploid. The presence of additional alleles, while evolutionarily beneficial, ...

Methods: eHAP1 cell culture: eHAP1 cells were purchased from Horizon Discovery (SKU: c669). The cells were cultured using the following growth media: 445 mL IMDM media (Gibco: 12440-053), 50 mL FBS, and 5 mL 100x Pen/Strep. Cells were passaged every 2-3 days at a ratio of 1:5. The cells were rapidly expanded post purchase to reduce the number of passages and possible ploidy changes, prior to genomic DNA isolation.

Genomic DNA isolation, library prep, and sequencing: Genomic DNA was harvested from 5 million cells using the Circulomics Nanobind CBB Big DNA kit (Part #NB-900-001-01). The DNA was extracted following the included handbook (v1.7) protocol for "Cultured Mammalian Cells - HMW" with minor modifications. Specifically, cells were vortexed intensively (1 second pulses, 10x pulses), the final DNA was pipetted 10 times through a p200 tip, and immediately prior to library preparation, the DNA was run through a 28G needle five times. This was done to help the DNA into solution with minimal effect on length. The genomic DNA was prepared using the Nanopore Ligation Sequencing Kit (SQK-LSK109) following the manufacturer's protocol (GDE_9063_v109_revD...


“…It can be applied during and after the genome assembly. Usually, long-read assemblers perform a single round of long-read polishing [14,16,17], that is followed by several rounds of polishing with long [15,17,19,21] and short [15,20,22] reads using third-party tools [15,17,19-22]. Currently, polishing large genomes, such as the human genome, can take much more computational time than the long-read assembly itself [14,16,17].…”

mentioning

confidence: 99%

“…This new assembly approach gave rise to some criticisms because even after several rounds of polishing, a substantial fraction of consensus errors remains, hampering the subsequent genome analyses such as gene and protein prediction [23]. When the aforementioned assembly approach employs short-read polishing [15,20,22], then it corresponds to a long-read-first hybrid assembly strategy [24,25]. Another hybrid assembly strategy consists in starting the assembly process with short reads [26].…”

mentioning

confidence: 99%

WENGAN: Efficient and high quality hybrid de novo assembly of human genomes

Buena-Atienza, Ossowski, et al. 2019. Preprint.

The continuous improvement of long-read sequencing technologies along with the development of ad hoc algorithms has launched a new de novo assembly era that promises high-quality genomes. However, it has proven difficult to use only long reads to generate accurate genome assemblies of large, repeat-rich human genomes. To date, most of the human genomes assembled from long error-prone reads add accurate short reads to further polish the consensus quality. Here, we report the development of a novel algorithm for hybrid assembly, WENGAN, and the de novo assembly of four human genomes using a combination of sequencing data generated on ONT PromethION, PacBio Sequel, Illumina and MGI technology. WENGAN implements efficient algorithms that exploit the sequence information of short and long reads to tackle assembly contiguity as well as consensus quality. The resulting genome assemblies have high contiguity (contig NG50: ... Mb), few assembly errors (contig NGA50: 10.9-45.91 Mb), good consensus quality (QV: 27.79-33.61), and high gene completeness (BUSCO complete: 94.6-95.1%), while consuming low computational resources (CPU hours: 153-1027). In particular, the WENGAN assembly of the haploid CHM13 sample achieved a contig NG50 of 62.06 Mb (NGA50: 45.91 Mb), which surpasses the contiguity of the current human reference genome (GRCh38 contig NG50: 57.88 Mb). Providing highest quality at low computational cost, WENGAN is an important step towards the democratization of the de novo assembly of human genomes. The WENGAN assembler is available at https://github.com/adigenova/wengan

Introduction: Genome assembly is the process by which an unknown genome sequence is constructed by detecting overlaps between a set of redundant genomic reads. Most genome assemblers represent the overlap information using different kinds of assembly graphs [1,2]. The main idea behind these algorithms is to reduce the genome assembly problem to a path problem where the genome is reconstructed by finding "the" true genome path in a tangled assembly graph [1,2]. The tangledness comes from the complexity that repetitive genomic regions induce in the assembly graphs [1,2]. The first graph-based genome assemblers used overlaps of variable length to construct an overlap graph [2]. In such a graph, the reads are the vertices and the edges represent the pairwise alignments [2]. The main goal of the overlap graph approach and of its subsequent evolution, namely the string graph [2], is to preserve as much as possible the read information [2]. However, the read-level graph construction requires an expensive all-vs-all read comparison [2]. The read-level nature implies that a path in such a graph represents a read layout, and a subsequent consensus step must be performed in order to improve the quality of bases called along the path [2]. These graph properties are the foundation of the overlap-layout-consensus (OLC) paradigm [2-4]. A seemingly counterintuitive idea is to fix the overlap length to a given size (k) to build a de Bruijn gra...
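The overlap graph described in the introduction above (reads as vertices, pairwise overlaps as edges, built by an all-vs-all comparison) can be illustrated with a toy sketch. This is not WENGAN's code or any assembler's actual implementation: the function names, the exact suffix-prefix matching (real assemblers use error-tolerant alignments), and the `min_len` threshold are all assumptions made for illustration.

```python
from itertools import permutations


def suffix_prefix_overlap(a: str, b: str, min_len: int) -> int:
    """Return the length of the longest suffix of `a` that exactly matches
    a prefix of `b`, or 0 if no match of at least `min_len` exists."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def build_overlap_graph(reads, min_len=3):
    """Hypothetical helper: vertices are reads, a directed edge a -> b is
    weighted by the overlap length. The loop over all ordered pairs is the
    all-vs-all comparison that makes read-level graphs expensive."""
    graph = {read: {} for read in reads}
    for a, b in permutations(reads, 2):
        n = suffix_prefix_overlap(a, b, min_len)
        if n:
            graph[a][b] = n
    return graph


# Toy reads (made up for this example), each overlapping the next by 4 bases.
reads = ["ATGCGT", "GCGTAC", "GTACCA"]
graph = build_overlap_graph(reads)
```

The quadratic number of read pairs examined in `build_overlap_graph` is the cost the text refers to; fixing the overlap length to k, as in a de Bruijn graph, is one way to avoid it.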


ntEdit+Sealer: Efficient Targeted Error Resolution and Automated Finishing of Long‐Read Genome Assemblies

Coombe, Wong, et al. 2022. Current Protocols.

High-quality genome assemblies are crucial to many biological studies, and utilizing long sequencing reads can help achieve higher assembly contiguity. While long reads can resolve complex and repetitive regions of a genome, their relatively high associated error rates are still a major limitation. Long reads generally produce draft genome assemblies with lower base quality, which must be corrected with a genome polishing step. Hybrid genome polishing solutions can greatly improve the quality of long-read genome assemblies by utilizing more accurate short reads to validate bases and correct errors. Currently available hybrid polishing methods rely on read alignments, and are therefore memory-intensive and do not scale well to large genomes. Here we describe ntEdit+Sealer, an alignment-free, k-mer-based genome finishing protocol that employs memory-efficient Bloom filters. The protocol includes ntEdit for correcting base errors and small indels, and for marking potentially problematic regions, then Sealer for filling both assembly gaps and problematic regions flagged by ntEdit. ntEdit+Sealer produces highly accurate, error-corrected genome assemblies, and is available as a Makefile pipeline from https://github.com/bcgsc/ntedit_sealer_protocol.
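To make the idea of alignment-free, k-mer-based validation with a Bloom filter more concrete, here is a minimal sketch. It is not ntEdit's implementation and does not use its interface: the `BloomFilter` class, the SHA-256-based hashing, the filter size, and the helper names are all assumptions chosen for readability. The sketch loads k-mers from accurate short reads into a Bloom filter, then reports draft positions whose k-mers are absent, which is one simple way to flag bases that the reads do not support.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter (illustrative only): a bit array plus several hash
    functions. Lookups can give false positives but never false negatives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive several bit positions by salting one cryptographic hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))


def kmers(seq: str, k: int):
    """Yield every k-mer of `seq` in order of start position."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def unsupported_kmer_starts(draft: str, bf: BloomFilter, k: int):
    """Start positions of draft k-mers that are absent from the filter."""
    return [i for i, kmer in enumerate(kmers(draft, k)) if kmer not in bf]


# Made-up short reads and drafts for illustration.
k = 5
bf = BloomFilter()
for read in ["ACGTACGTGA", "GTACGTGACC"]:
    for kmer in kmers(read, k):
        bf.add(kmer)

draft_ok = "ACGTACGTGACC"    # consistent with the reads
draft_err = "ACGTACCTGACC"   # one substitution (G -> C) at index 6
flagged_ok = unsupported_kmer_starts(draft_ok, bf, k)
flagged_err = unsupported_kmer_starts(draft_err, bf, k)
```

A single wrong base leaves a run of up to k consecutive unsupported k-mers, all of which cover the erroneous position; a polisher can use such runs to locate a candidate edit without aligning any reads. Because the filter stores only bits rather than the k-mers themselves, memory use stays fixed regardless of how many reads are loaded, at the cost of occasional false positives.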
