State of the art in eukaryotic gene prediction

Alioto, Tyler; Guigó, Roderic

doi:10.1007/978-3-211-75123-7_2

Cited by 2 publications

(4 citation statements)

References 79 publications

“…Accurate computational methods are needed to classify these transcripts and the corresponding genomic exons as protein coding or non-coding, even if the transcript models are incomplete or if they only reveal novel exons of already-known genes. In addition to classifying novel transcript models, such methods also have applications in evaluating and revising existing gene annotations (Butler et al., 2009; Clamp et al., 2007; Kellis et al., 2003; Lin et al., 2007; Pruitt et al., 2009), and as input features for de novo gene structure predictors (Alioto and Guigó, 2009; Brent, 2008). We have previously (Lin et al., 2008) compared numerous methods for determining whether an exon-length nucleotide sequence is likely to be protein coding or non-coding, including single-sequence metrics that analyze the genome of interest only and comparative genomics metrics that use alignments of orthologous regions in the genomes of related species.…”

Section: Introduction

confidence: 99%

PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions

2011

Motivation: As high-throughput transcriptome sequencing provides evidence for novel transcripts in many species, there is a renewed need for accurate methods to classify small genomic regions as protein coding or non-coding. We present PhyloCSF, a novel comparative genomics method that analyzes a multispecies nucleotide sequence alignment to determine whether it is likely to represent a conserved protein-coding region, based on a formal statistical comparison of phylogenetic codon models.

Results: We show that PhyloCSF's classification performance in 12-species Drosophila genome alignments exceeds all other methods we compared in a previous study. We anticipate that this method will be widely applicable as the transcriptomes of many additional species, tissues and subcellular compartments are sequenced, particularly in the context of ENCODE and modENCODE, and as interest grows in long non-coding RNAs, often initially recognized by their lack of protein coding potential rather than conserved RNA secondary structures.

Availability and Implementation: The Objective Caml source code and executables for GNU/Linux and Mac OS X are freely available at http://compbio.mit.edu/PhyloCSF

Contact: mlin@mit.edu; manoli@mit.edu

“…[34] Finally, codon and amino acid preference metrics, codon pair preferences, and hidden Markov models could be used (Figure 1B).[34,36] Comparative genomics approaches might be combined with ab initio coding region identification using nucleotide periodicities, codon, and codon pair frequencies to increase the precision of prediction.[37] Experimental methods are needed to support translated sORF predictions made purely by computational sequence analysis.…”

Section: Methods of sORF Detection

confidence: 99%
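
The codon preference metric mentioned in the statement above can be illustrated with a small log-odds score. This is only a minimal sketch under stated assumptions: the function names (codon_frequencies, coding_log_odds), the pseudocount, and the toy training sequences are hypothetical and are not taken from the cited papers, and real tools train on large annotated sets of coding and noncoding sequence.

```python
import math
from collections import Counter
from itertools import product

# All 64 codons over the unambiguous DNA alphabet.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_frequencies(seqs, pseudocount=1.0):
    """Estimate codon frequencies (frame 0) from training sequences.

    A pseudocount (an assumption of this sketch) keeps every codon's
    frequency above zero so the log ratio below is always defined.
    """
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if set(codon) <= set("ACGT"):  # skip codons with N or other symbols
                counts[codon] += 1
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def coding_log_odds(seq, coding_freq, noncoding_freq):
    """Sum of log(P_coding / P_noncoding) over the codons of seq in frame 0.

    Positive values favour the coding model, negative values the
    noncoding model.
    """
    seq = seq.upper()
    score = 0.0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in coding_freq:
            score += math.log(coding_freq[codon] / noncoding_freq[codon])
    return score


# Toy training data, purely illustrative.
coding_model = codon_frequencies(["ATGGCCGCCGCC"])
noncoding_model = codon_frequencies(["ATATATATATAT"])
```

With these toy models, coding_log_odds("GCCGCC", coding_model, noncoding_model) is positive and the same call on "ATATAT" is negative. A score like this could be one of the single-sequence features combined with comparative evidence, as the statement suggests.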

“…Another approach to predict the functionality of a putative sORF is to analyze its nucleotide and codon composition (Figure 1B). Because a fragment of meaningful written text differs from a random set of letters, a similar difference in compositional statistics applies to functional protein coding sequences. Parameters as simple as nucleotide frequencies are different for coding and noncoding DNA; for example, human coding sequences are generally more GC-rich than noncoding.…”

Section: Methods of sORF Detection

confidence: 99%

“…Parameters as simple as nucleotide frequencies are different for coding and noncoding DNA; for example, human coding sequences are generally more GC-rich than noncoding. The likelihood of a sequence to be coding may be assessed more precisely using periodicity in nucleotide frequencies because the first, the second, and the third nucleotides of codons have different average composition. Finally, codon and amino acid preference metrics, codon pair preferences, and hidden Markov models could be used (Figure 1B). Comparative genomics approaches might be combined with ab initio coding region identification using nucleotide periodicities, codon, and codon pair frequencies to increase the precision of prediction…”

Section: Methods of sORF Detection

confidence: 99%
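
The two compositional signals described in the statement above, overall GC content and the different base composition at the three codon positions, can be computed in a few lines. The following is a minimal sketch, assuming an unambiguous DNA string read in a known frame; the function names (gc_content, position_composition, gc3) are hypothetical and do not come from the cited papers.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C bases in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def position_composition(seq, frame=0):
    """Base frequencies at codon positions 1, 2 and 3.

    Returns a list of three dicts, one per codon position, mapping each
    of A, C, G, T to its frequency at that position. Bases other than
    A, C, G, T are ignored.
    """
    seq = seq.upper()[frame:]
    counts = [Counter(), Counter(), Counter()]
    for i, base in enumerate(seq):
        if base in "ACGT":
            counts[i % 3][base] += 1
    composition = []
    for c in counts:
        total = sum(c.values())
        composition.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return composition


def gc3(seq, frame=0):
    """GC fraction at the third codon position, a common periodicity summary."""
    third = position_composition(seq, frame)[2]
    return third["G"] + third["C"]
```

For example, gc_content("ATGC") gives 0.5, and position_composition("ATGATG") reports only A at the first position, only T at the second, and only G at the third. In a noncoding sequence the three positions would be expected to look alike, so a large difference between them is weak evidence for coding potential.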

Mining for Small Translated ORFs

et al. 2017

Peptides encoded by short open reading frames (sORFs) are usually defined as peptides ≤100 aa long. Usually sORFs were ignored by automatic genome annotation programs due to the high probability of false discovery. However, improved computational tools along with a high-throughput RIBO-seq approach identified a myriad of translated sORFs. Their importance becomes evident as we are gaining experimental validation of their diverse cellular functions. This Review examines various computational and experimental approaches of sORFs identification as well as provides the summary of our current knowledge of their functional roles in cells.
