2020
DOI: 10.1101/2020.01.26.920173
Preprint

Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank

Abstract: Metagenomic sequencing allows researchers to investigate organisms sampled from their native environments by sequencing their DNA directly, and then quantifying the abundance and taxonomic composition of the organisms thus captured. However, these types of analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here we describe Conterminator, an efficient method to detect and remove incorrectly labelled sequences by an exhaustive all-against-all se…

Cited by 20 publications (29 citation statements)
References 33 publications
“…Eukaryotic reference genomes and proteomes have many sequences that are derived from bacteria, which have entered these genomes either spuriously through contamination during sequencing and assembly, or represent true biology of horizontal transfer from bacteria to eukaryotes [22,23]. In either case, these bacterial-derived sequences overwhelm the ability of either k-mer matching or read mapping-based approaches to detect eukaryotes from microbiome sequencing, as bacteria represent the majority of the sequencing library from many microbiomes.…”
Section: Results (mentioning)
confidence: 99%
“…However, databases of eukaryotic genomes and proteins have widespread contamination from bacterial sequences, and these methods therefore frequently misattribute bacterial reads to eukaryotic species [22,23]. Gene-based taxonomic profilers, such as Metaphlan3, have been developed to detect eukaryotic species, but these target a small subset of microbial eukaryotes (122 eukaryotic species as of the mpa_v30 release) [24,25].…”
Section: Introduction (mentioning)
confidence: 99%
“…Conterminator [35] is implemented in C++ and its open source licensed as GPLv3 and available at https://github.com/martin-steinegger/conterminator. The version to reproduce the results is available under https://doi.org/10.5281/zenodo.3750825 [36]. Commands to rerun the analysis of RefSeq and NR are in Additional file 3: Listing S1.…”
Section: Supplementary Information (mentioning)
confidence: 99%
“…For sequences to be entered into RefSeq, curators at NCBI perform both automated and manual checks to ensure minimal contamination and high sequence quality. Despite these efforts, multiple studies have identified contamination in RefSeq and other publicly available genome databases [3,4,5,6,7]. NCBI requires Refseq assemblies to have an appropriate genome length as compared to existing genomes from the same species, and it labels assemblies as "complete" if the genome exists in one contiguous sequence per chromosome, with no unplaced scaffolds and with all chromosomes present.…”
Section: Introduction (mentioning)
confidence: 99%