2013
DOI: 10.1093/database/bat053

MisPred: a resource for identification of erroneous protein sequences in public databases

Abstract: Correct prediction of the structure of protein-coding genes of higher eukaryotes is still a difficult task; therefore, public databases are heavily contaminated with mispredicted sequences. The high rate of misprediction has serious consequences because it significantly affects the conclusions that may be drawn from genome-scale sequence analyses of eukaryotic genomes. Here we present the MisPred database and computational pipeline that provide efficient means for the identification of erroneous sequences in p…

Cited by 18 publications (24 citation statements)
References 23 publications
“…The problem of gene prediction errors appears to be even more severe in the case of lancelet genomes: our analysis of the predicted proteomes of various metazoa with MisPred tools have shown that the rate of misprediction of lancelet genes is significantly higher than in the case of vertebrate genomes [22, 23]. Accordingly, in the case of the genomes of B. floridae and B. belcheri a high proportion of gene models is mispredicted.…”
Section: Results (mentioning)
confidence: 99%
“…There are several reasons why the presence of a Peptidase_M2 domain in XP_002602990.1 was likely to reflect gene prediction error and not innovation. First, Peptidase_M2 domains are usually present in extracellular proteins, whereas Mito_carr domains are restricted to the intracellular space; their co-occurrence violates one of the basic dogmas that MisPred uses to detect gene prediction errors [22, 23]. Second, the average length of Peptidase_M2 domains is 470 amino acid residues with little variation [32], but it is only ~230 residue in the B. floridae protein.…”
Section: Results (mentioning)
confidence: 99%
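
The two checks described in the statement above (a conflict between extracellular and intracellular domains, and a domain much shorter than its family average) can be sketched as follows. This is only an illustrative sketch, not MisPred's actual implementation: the `DomainHit` type, the lookup tables, the function names, the 0.6 length threshold, and the example coordinates are all assumptions introduced here. Only the domain names, the localization labels, and the 470-residue average come from the quoted text.

```python
from dataclasses import dataclass

# Localization labels as stated in the quoted passage. A real pipeline
# would load these from a curated table (assumption).
DOMAIN_LOCALIZATION = {
    "Peptidase_M2": "extracellular",
    "Mito_carr": "intracellular",
}

# Average family length in residues, as stated in the quoted passage.
TYPICAL_LENGTH = {"Peptidase_M2": 470}


@dataclass
class DomainHit:
    """Hypothetical record for one domain match on a protein."""
    family: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def localization_conflict(hits):
    """True if the hits include both extracellular and intracellular domains."""
    labels = {DOMAIN_LOCALIZATION.get(h.family) for h in hits}
    return {"extracellular", "intracellular"} <= labels


def truncated_domains(hits, min_fraction=0.6):
    """Families whose hit is shorter than min_fraction of the family average.

    The 0.6 default is an arbitrary illustrative threshold.
    """
    flagged = []
    for h in hits:
        typical = TYPICAL_LENGTH.get(h.family)
        if typical is not None and h.length < min_fraction * typical:
            flagged.append(h.family)
    return flagged


# Made-up coordinates chosen so the Peptidase_M2 hit spans 230 residues,
# roughly matching the "~230" figure in the quoted passage.
example_hits = [
    DomainHit("Mito_carr", 10, 105),
    DomainHit("Peptidase_M2", 300, 529),
]
print(localization_conflict(example_hits))
print(truncated_domains(example_hits))
```

Either flag on its own would only mark a sequence as suspect; the quoted authors combine both lines of evidence before concluding that the gene model is mispredicted.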
“…Unfortunately, the quality of the sequences is not always high, partly due to limitations in sequencing technologies. Moreover, at the amino acid sequence level, a number of errors can be introduced due to difficulty in gene prediction (Brent, 2005; Gotoh et al., 2014; Nagy and Patthy, 2013; Yandell and Ence, 2012). With incorrect reading frames, unrelated amino acid segments can appear in a set of homologous sequences.…”
Section: Introduction (mentioning)
confidence: 99%
“…Have you ever wondered how many high quality eukaryotic genome sequences have been produced so far? There are legions of partially finished genomes [61] but the good ones will fit on the fingers of one hand (see also [62]). The way science is set up currently, once the grant has finished, the genome (in whatever state) gets published, usually in a flagship journal, and that is the end of it.…”
Section: Multiple Alignments and The Choppy State Of Public Sequence (mentioning)
confidence: 99%