2020
DOI: 10.1186/s13059-020-02023-1

Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank

Abstract: Genomic analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here, we describe Conterminator, an efficient method to detect and remove incorrectly labeled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination of 2,161,746, 114,035, and 14,148 sequences in the RefSeq, GenBank, and NR databases, respectively, spanning the whole range from draft to "complete" model organism genomes. Our method scales linearly wi…

Cited by 170 publications (212 citation statements)
References 38 publications
“…Although the inference of GCCs is using very sensitive methods to compare profile HMMs, low sequence diversity in GCs can limit its effectiveness. Our approach is affected by the presence and propagation of contamination in reference databases, a significant problem in 'omics 66,67. In our case, we only use Pfam as a source for annotation owing to its high-quality and manual curation process.…”
Section: Discussion
Citation type: mentioning
confidence: 99%
“…Taxonomy was assigned to each of the tags by comparing tags to the GenBank database (28) (Fig 2). A large proportion of the reads did not have taxonomy assigned (ApeKI: 61.92%, PstI: 70.73%) which is partially due to the absence of genome assemblies in the GenBank database, particularly for uncultured microbes [15], but may also be due to the contamination of sequences in the GenBank database [36], whereby a sequence is assigned to the incorrect organism (e.g. microbial genome incorrectly inserted into a "host" genome assembly); in these cases the sequence may match to reads in two different kingdoms (one correct, and one incorrect) and therefore be unassigned.…”
Section: Comparison Of Tags Against the Genbank Database
Citation type: mentioning
confidence: 99%
“…Eukaryotic reference genomes and proteomes have many sequences that are derived from bacteria, which have entered these genomes either spuriously through contamination during sequencing and assembly, or represent true biology of horizontal transfer from bacteria to eukaryotes (Lu and Salzberg 2018; Steinegger and Salzberg 2020). In either case, these bacterial-derived sequences overwhelm the ability of either k-mer matching or read mapping-based approaches to detect eukaryotes from microbiome sequencing, as bacteria represent the majority of the sequencing library from most microbiomes.…”
Section: Bacterial Sequences Are Ubiquitous In Eukaryotic Genomes
Citation type: mentioning
confidence: 99%
“…Research questions about eukaryotes in microbiomes have primarily been addressed by using eukaryotic reference genomes, genes, or proteins for either k-mer matching or read mapping approaches (Nash et al 2017; Chehoud et al 2015). However, databases of eukaryotic genomes and proteins have widespread contamination from bacterial sequences, and these methods therefore frequently misattribute bacterial reads to eukaryotic species (Lu and Salzberg 2018; Steinegger and Salzberg 2020). Some gene-based taxonomic profilers, such as Metaphlan2, can detect eukaryotic species, but these target a small subset of microbial eukaryotes (122 eukaryotes detectable at the species level as of the mpa_v30 release) (Truong et al 2015).…”
Section: Introduction
Citation type: mentioning
confidence: 99%