2023
DOI: 10.1002/aps3.11533

Welcome to the big leaves: Best practices for improving genome annotation in non‐model plant genomes

Abstract: Premise: Robust standards to evaluate quality and completeness are lacking in eukaryotic structural genome annotation, as genome annotation software is developed using model organisms and typically lacks benchmarking to comprehensively evaluate the quality and accuracy of the final predictions. The annotation of plant genomes is particularly challenging due to their large sizes, abundant transposable elements, and variable ploidies. This study investigates the impact of genome quality, complexity, sequence read …

Cited by 15 publications (14 citation statements)
References 72 publications
“…In order to estimate the quality of annotation, we performed several tests, in particular, calculated BUSCO metrics for the set of predicted proteins (for all proteins together and in a subgenome-wise manner and performed comparison with A. thaliana in terms of such parameters as CDS length, the number of exons, and the percent of single-exon genes. This approach follows recently published recommendations for the improvement and quality control of plant genome annotations [21]. The BUSCO completeness score for all proteins together was in the range of 97.7–94.1%, for subgenomes—from 90.4 to 96.8% (Additional file 2: Table S1).…”
Section: Results (citation type: mentioning)
confidence: 99%
“…Similar approaches have been discussed and applied before to reduce large numbers of predicted gene models based on transcription evidence to the most important ones (Liang et al., 2009; Dohm et al., 2014; McGrath et al., 2022). Given that structural annotation workflows do not provide perfectly accurate results, evaluation and filtering is recommended (Vuruputoor et al., 2023). Although RNA-seq samples from a range of different plant organs and treatments were included in this filtering based on transcript evidence, it is possible that transcripts of bona fide genes were not detected in the RNA-seq data leading to a mis-classification of the corresponding genes as non-expressed pseudogene.…”
Section: Discussion (citation type: mentioning)
confidence: 99%
“…From this, StringTie generated 76,471 transcripts while PsiCLASS produced 116,777. Before filtering, the consensus of transcripts totaled 334 K with 197 K genes, containing many false positives as indicated by the high mono to multi-exonic rate (0.69) (Vuruputoor et al 2023) and low sequence similarity rate within EnTAP (0.65). False positives were well resolved by filtering, with only 69,563 final transcripts and 41,039 genes, with a more typical mono/multi-exonic rate of 0.24 and a high 0.92 sequence similarity rate reported by EnTAP (Table 3).…”
Section: Gene Annotations (citation type: mentioning)
confidence: 99%