Background: The quality of automated gene prediction in microbial organisms has improved steadily over the past decade, but there is still room for improvement. Increasing the number of correct identifications, both of genes and of the translation initiation sites for each gene, and reducing the overall number of false positives, are all desirable goals. Results: With our years of experience in manually curating genomes for the Joint Genome Institute, we developed a new gene prediction algorithm called Prodigal (PROkaryotic DYnamic programming Gene-finding ALgorithm). With Prodigal, we focused specifically on the three goals of improved gene structure prediction, improved translation initiation site recognition, and reduced false positives. We compared the results of Prodigal to existing gene-finding methods to demonstrate that it met each of these objectives. Conclusion: We built a fast, lightweight, open source gene prediction program called Prodigal (http://compbio.ornl.gov/prodigal/). Prodigal achieved good results compared to existing methods, and we believe it will be a valuable asset to automated microbial annotation pipelines.
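The abstract names dynamic programming as the core of the gene finder but does not describe the recurrence. As a rough illustration only, the sketch below shows one generic way dynamic programming can pick a set of genes from scored candidates: weighted interval scheduling over non-overlapping open reading frames. The function name `select_genes`, the `(start, end, score)` tuple layout, and the no-overlap rule are assumptions made for this example; they are not taken from Prodigal, whose actual scoring and overlap handling are not given here.

```python
from bisect import bisect_left


def select_genes(candidates):
    """Pick non-overlapping candidates with the highest total score.

    candidates: list of (start, end, score) with inclusive coordinates
    and start <= end. This layout is a hypothetical one for illustration.
    Returns (best_total, chosen), where chosen is ordered by position.
    """
    cands = sorted(candidates, key=lambda c: c[1])  # order by end coordinate
    ends = [c[1] for c in cands]
    n = len(cands)

    best = [0.0] * (n + 1)   # best[i]: best total using the first i candidates
    take = [False] * (n + 1) # take[i]: whether candidate i is used in best[i]
    prev = [0] * (n + 1)     # prev[i]: how many candidates end before i starts

    for i, (start, _end, score) in enumerate(cands, start=1):
        # Number of candidates that end strictly before this one starts.
        p = bisect_left(ends, start)
        with_i = best[p] + score
        if with_i > best[i - 1]:
            best[i], take[i], prev[i] = with_i, True, p
        else:
            best[i] = best[i - 1]

    # Trace back through the table to recover the chosen candidates.
    chosen = []
    i = n
    while i > 0:
        if take[i]:
            chosen.append(cands[i - 1])
            i = prev[i]
        else:
            i -= 1
    chosen.reverse()
    return best[n], chosen


if __name__ == "__main__":
    demo = [(1, 300, 5.0), (250, 600, 6.0), (310, 900, 4.0), (650, 1000, 3.0)]
    total, genes = select_genes(demo)
    print(total, genes)
```

A real prokaryotic gene finder would also need to handle both strands, permitted overlaps between neighboring genes, and competing start sites for the same stop codon, none of which this sketch attempts.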
We report the draft genome of the black cottonwood tree, Populus trichocarpa. Integration of shotgun sequence assembly with genetic mapping enabled chromosome-scale reconstruction of the genome. More than 45,000 putative protein-coding genes were identified. Analysis of the assembled genome revealed a whole-genome duplication event; about 8000 pairs of duplicated genes from that event survived in the Populus genome. A second, older duplication event is indistinguishably coincident with the divergence of the Populus and Arabidopsis lineages. Nucleotide substitution, tandem gene duplication, and gross chromosomal rearrangement appear to proceed substantially more slowly in Populus than in Arabidopsis. Populus has more protein-coding genes than Arabidopsis, ranging on average from 1.4 to 1.6 putative Populus homologs for each Arabidopsis gene. However, the relative frequency of protein domains in the two genomes is similar. Overrepresented exceptions in Populus include genes associated with lignocellulosic wall biosynthesis, meristem development, disease resistance, and metabolite transport.
MicroRNAs are important regulators of gene expression, acting primarily by binding to sequence-specific locations on already transcribed messenger RNAs (mRNAs) and typically down-regulating their stability or translation. Recent studies indicate that microRNAs may also play a role in up-regulating mRNA transcription levels, although a definitive mechanism has not been established. Double-helical DNA is capable of forming triple-helical structures through Hoogsteen and reverse Hoogsteen interactions in the major groove of the duplex. We show physical evidence (NMR, FRET, SPR) that purine- or pyrimidine-rich microRNAs of appropriate length and sequence form triple-helical structures with purine-rich sequences of duplex DNA, and we identify microRNA sequences that favor triplex formation. We developed an algorithm (Trident) to search genome-wide for potential triplex-forming sites and show that several mammalian and non-mammalian genomes are enriched for strong microRNA triplex binding sites. We show that genes containing sequences favoring microRNA triplex formation are markedly enriched (3.3-fold, p < 2.2 × 10^-16) for genes whose expression is positively correlated with expression of microRNAs targeting triplex binding sequences. This work has thus revealed a new mechanism by which microRNAs could interact with gene promoter regions to modify gene transcription.
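The abstract states that triplex formation involves purine-rich sequences of duplex DNA but gives no detail on how Trident scores candidate sites. As a loose illustration of only the first step such a search might take, the sketch below scans a DNA string for uninterrupted purine runs. The function name `find_purine_tracts`, the `min_len` threshold, and the requirement of an unbroken run are assumptions for this example and are not taken from Trident.

```python
import re


def find_purine_tracts(seq, min_len=10):
    """Return (start, end, strand) for purine runs of at least min_len bases.

    Coordinates are 0-based and half-open on the given sequence.
    A run of A/G is reported on the '+' strand. A run of C/T is reported
    on the '-' strand, because its complement is a purine run.
    The threshold and the no-interruption rule are illustrative assumptions.
    """
    seq = seq.upper()
    hits = []
    for strand, alphabet in (("+", "AG"), ("-", "CT")):
        pattern = "[%s]{%d,}" % (alphabet, min_len)
        for match in re.finditer(pattern, seq):
            hits.append((match.start(), match.end(), strand))
    return sorted(hits)


if __name__ == "__main__":
    demo = "ccttAGGAGGAAGAGGtcCTCTTCCTCCTTga"
    for start, end, strand in find_purine_tracts(demo):
        print(start, end, strand, demo[start:end])
```

A genome-wide search of the kind described would additionally have to pair each tract with candidate microRNA sequences and score the resulting Hoogsteen or reverse Hoogsteen base triplets, which this sketch does not do.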
In the Fourth Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP4), we predicted all 43 targets using our threading application PROSPECT. PROSPECT is guaranteed to find an optimal alignment between a protein sequence and a structural fold for a general energy function with pairwise contact potential. For each prediction, it gives a reliability assessment based on a neural network approach. In addition, PROSPECT has been added to the Genomic Integrated Supercomputing Toolkit (GIST) and is deployed on terascale computing resources. Structural predictions in CASP4 fell into three categories: comparative modeling, fold recognition, and prediction for structures with new folds. In the fold recognition category, PROSPECT correctly identified 8 of 22 targets and finished sixth in total score among 127 assessed groups. In the "new fold" category, it found important structural features for most targets, and its overall performance was among the best of all prediction methods. Our CASP4 performance demonstrates that PROSPECT is a powerful tool to quickly characterize structures with new folds, and it may provide useful structural restraints for ab initio prediction methods.