Background: Long sequencing reads allow increasing contiguity and completeness of fragmented, short-read–based genome assemblies by closing assembly gaps, ideally at high accuracy. While several gap-closing methods have been developed, these methods often close an assembly gap with sequence that does not accurately represent the true sequence.

Findings: Here, we present DENTIST, a sensitive, highly accurate, and automated pipeline method to close gaps in short-read assemblies with long error-prone reads. DENTIST comprehensively determines repetitive assembly regions to identify reliable and unambiguous alignments of long reads to the correct loci, integrates a consensus sequence computation step to obtain a high base accuracy for the inserted sequence, and validates the accuracy of closed gaps. Unlike previous benchmarks, we generated test assemblies that have gaps at the exact positions where real short-read assemblies have gaps. Generating such realistic benchmarks for Drosophila (134 Mb genome), Arabidopsis (119 Mb), hummingbird (1 Gb), and human (3 Gb) and using simulated or real PacBio continuous long reads, we show that DENTIST consistently achieves a substantially higher accuracy compared to previous methods, while having a similar sensitivity.

Conclusion: DENTIST provides an accurate approach to improve the contiguity and completeness of fragmented assemblies with long reads. DENTIST's source code including a Snakemake workflow, conda package, and Docker container is available at https://github.com/a-ludi/dentist. All test assemblies as a resource for future benchmarking are at https://bds.mpi-cbg.de/hillerlab/DENTIST/.
Long sequencing reads can increase the contiguity and completeness of fragmented, short-read-based genome assemblies by closing assembly gaps, ideally at high accuracy. While several gap-closing methods have been developed, these methods often close an assembly gap with sequence that does not accurately represent the true sequence. Here, we developed DENTIST, a sensitive, highly accurate, and automated pipeline method to close gaps in short-read assemblies with long reads. DENTIST comprehensively determines repetitive assembly regions to identify reliable and unambiguous alignments of long reads to the correct loci, integrates a consensus sequence computation step to obtain a high base accuracy for the inserted sequence, and validates the accuracy of closed gaps. Unlike previous benchmarks, we generated test assemblies that have gaps at the exact positions where real short-read assemblies have gaps. Generating such realistic benchmarks for Drosophila (134 Mb genome), Arabidopsis (119 Mb), hummingbird (1 Gb), and human (3 Gb) and using simulated or real PacBio reads, we show that DENTIST consistently achieves a substantially higher accuracy than previous methods, while having a similar sensitivity. As another distinguishing feature, DENTIST can accurately scaffold the given contigs with long reads in addition to closing gaps, extending its application range to contig-only assemblies. In summary, DENTIST provides an accurate approach to improve the contiguity and completeness of fragmented assemblies with long reads. DENTIST's source code, including a Snakemake workflow and Docker container, is available at https://github.com/a-ludi/dentist. All test assemblies, provided as a resource for future benchmarking, are at https://bds.mpi-cbg.de/hillerlab/DENTIST/.