2020
DOI: 10.1186/s12859-020-03855-1

Understanding the causes of errors in eukaryotic protein-coding gene prediction: a case study of primate proteomes

Abstract: Background Recent advances in sequencing technologies have led to an explosion in the number of genomes available, but accurate genome annotation remains a major challenge. The prediction of protein-coding genes in eukaryotic genomes is especially problematic, due to their complex exon–intron structures. Even the best eukaryotic gene prediction algorithms can make serious errors that will significantly affect subsequent analyses. Results We first investigated the prevalence of gene prediction errors in a lar…

Citations: cited by 22 publications (20 citation statements)
References: 41 publications (40 reference statements)
“…The accurate classification of these features provides the basis for questions focused on species evolution, population dynamics, and functional genomics. Errors in genome annotation are frequent, even among well-studied models, and are propagated through downstream analyses (Deutekom et al, 2019; Meyer et al, 2020; Salzberg, 2019). In most eukaryotes, genome annotation is challenged by partial conservation of sequence patterns, variable lengths of introns, variable distances between genes, alternative splicing, and higher densities of TEs and pseudogenes (Kersey, 2019; Salzberg, 2019).…”
Section: Introduction (mentioning)
confidence: 99%
“…Sequence homology lies at the heart of numerous protein and transcript predictions. However, there is still room for improvement in the underlying comparative genomics and spliced alignment methods [22]. The latter work shows recurrent challenges in accurately identifying intron-exon boundaries, and in handling non canonical GT and AG splice sites.…”
Section: Discussion (mentioning)
confidence: 99%
“…One can thus make the case that homology-based methods are only as good as the databases they rely upon (Dimonaco et al., 2021). Unfortunately, misannotation has not only become commonplace in public databases, as it is also an ongoing problem (Arkhipova, 2020; Girardi, Thoden, Holden, 2020; Impey, Lee, Hawkins, Sutton, Panjikar, Perugini, Soares da Costa, 2020; Meyer, Scalzitti, Jeannin-Girardon, Collet, Poch, Thompson, 2020; Nobre, Campos, Lucic-Mercy, Arnholdt-Schmitt, 2016; Rembeza, Engqvist, 2021; Schnoes, Brown, Dodevski, Babbitt, 2009). To make matters worse, misannotations are seldomly confined to the database where the error first occurred, they often percolate throughout several databases as well (Promponas et al., 2015).…”
Section: Progress and Pitfalls (mentioning)
confidence: 99%
“…Previous reports showed that the percentage of incorrect functional assignments in public databases soared from less than 5% in 1998 to as high as 40% in 2005 (Schnoes et al., 2009). More recently, it was reported that up to 50% of protein sequences from public databases contain at least one error (Meyer et al., 2020). As a case in point, some authors express particular concern towards error propagation issuing from draft genome assemblies (e.g., MAGs) (Arkhipova, 2020; Koonin, Makarova, Wolf, 2021; Salzberg, 2019).…”
Section: Progress and Pitfalls (mentioning)
confidence: 99%