Background: A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging. Results: We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1, with regard to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2. Conclusions: The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and the relative diversity of predictions in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent. Electronic supplementary material: The online version of this article (doi:10.1186/s13059-016-1037-6) contains supplementary material, which is available to authorized users.
Single nucleotide polymorphisms (SNPs) are the simplest and most frequent form of human DNA variation, and are also valuable as genetic markers of disease susceptibility. The most investigated SNPs are missense mutations resulting in residue substitutions in the protein. Here we propose SNPs&GO, an accurate method that, starting from a protein sequence, can predict whether a mutation is disease related or not by exploiting the protein functional annotation. The scoring efficiency of SNPs&GO is as high as 82%, with a Matthews correlation coefficient equal to 0.63 over a wide set of annotated nonsynonymous mutations in proteins, including 16,330 disease-related and 17,432 neutral polymorphisms. SNPs&GO combines, in a single framework, information derived from protein sequence, evolutionary information, and function as encoded in the Gene Ontology terms, and outperforms other available predictive methods.
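For readers unfamiliar with the two figures quoted above, the following is a minimal sketch of how overall accuracy and the Matthews correlation coefficient are computed from a binary confusion matrix. The function name `confusion_metrics` and the counts in the usage line are hypothetical illustrations; they are not taken from the SNPs&GO benchmark.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Return (accuracy, MCC) for a binary confusion matrix.

    tp/tn/fp/fn are counts of true positives, true negatives,
    false positives and false negatives (hypothetical inputs).
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when any marginal sum is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


# Hypothetical counts, for illustration only.
accuracy, mcc = confusion_metrics(tp=80, tn=85, fp=15, fn=20)
print(f"accuracy={accuracy:.2f}, MCC={mcc:.2f}")
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix, so it stays informative when the two classes differ in size.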
Background: Several eukaryotic proteins associated with the extracellular leaflet of the plasma membrane carry a glycosylphosphatidylinositol (GPI) anchor, which is linked to the C-terminal residue after a proteolytic cleavage occurring at the so-called ω-site. Computational methods were developed to discriminate proteins that undergo this post-translational modification starting from their amino acid sequences. However, more accurate methods are needed for a reliable annotation of whole proteomes.
Summary: Consensus linkage maps are important tools in crop genomics. We have assembled a high‐density tetraploid wheat consensus map by integrating 13 data sets from independent biparental populations involving durum wheat cultivars (Triticum turgidum ssp. durum), cultivated emmer (T. turgidum ssp. dicoccum) and their ancestor (wild emmer, T. turgidum ssp. dicoccoides). The consensus map harboured 30 144 markers (including 26 626 SNPs and 791 SSRs), half of which were present in at least two component maps. The final map spanned 2631 cM of all 14 durum wheat chromosomes and, unlike the individual component maps, all markers fell within the 14 linkage groups. Marker density per genetic distance unit peaked at centromeric regions, likely due to a combination of low recombination rate in the centromeric regions and even gene distribution along the chromosomes. Comparisons with bread wheat indicated fewer regions with recombination suppression, making this consensus map valuable for mapping in the A and B genomes of both durum and bread wheat. Sequence similarity analysis allowed us to relate mapped gene‐derived SNPs to chromosome‐specific transcripts. Dense patterns of homeologous relationships have been established between the A‐ and B‐genome maps and between nonsyntenic homeologous chromosome regions as well, the latter tracing to ancient translocation events. The gene‐based homeologous relationships are valuable to infer the map location of homeologs of target loci/QTLs. Because most SNP and SSR markers were previously mapped in bread wheat, this consensus map will facilitate a more effective integration and exploitation of genes and QTL for wheat breeding purposes.
Alternative premessenger RNA splicing enables genes to generate more than one gene product. Splicing events that occur within protein coding regions have the potential to alter the biological function of the expressed protein and even to create new protein functions. Alternative splicing has been suggested as one explanation for the discrepancy between the number of human genes and functional complexity. Here, we carry out a detailed study of the alternatively spliced gene products annotated in the ENCODE pilot project. We find that alternative splicing in human genes is more frequent than has commonly been suggested, and we demonstrate that many of the potential alternative gene products will have markedly different structure and function from their constitutively spliced counterparts. For the vast majority of these alternative isoforms, little evidence exists to suggest they have a role as functional proteins, and it seems unlikely that the spectrum of conventional enzymatic or structural functions can be substantially extended through alternative splicing. Keywords: function | human | isoforms | splice | structure. Alternative mRNA splicing, the generation of a diverse range of mature RNAs, has considerable potential to expand the cellular protein repertoire (1-3), and recent studies have estimated that 40-80% of multiexon human genes can produce differently spliced mRNAs (4, 5). The importance of alternative splicing in processes such as development (6) has long been recognized, and proteins coded by alternatively spliced transcripts have been implicated in a number of cellular pathways (7-9).
The extent of alternative splicing in eukaryotic genomes has led to suggestions that alternative splicing is key to understanding how human complexity can be encoded by so few genes (10). The pilot project of the Encyclopedia of DNA Elements (ENCODE) (11), which aims to identify all the functional elements in the human genome, has undertaken a comprehensive analysis of 44 selected regions that make up 1% of the human genome. One valuable element of the project has been the detailing of a reference set of manually annotated splice variants by the GENCODE consortium (12). The annotation by the GENCODE consortium is an extension of the manually curated annotation by the Havana team at The Sanger Institute. Although a full understanding of the functional implications of alternative splicing is still a long way off, the GENCODE set has provided us with the material to make an in-depth assessment of a systematically collected reference set of splice variants. Results: Alternative Splicing Frequency. The GENCODE set is made up of 2,608 annotated transcripts for 487 distinct loci. A total of 1,097 transcripts from 434 loci are predicted to be protein coding. There are on average 2.53 protein coding variants per locus; 182 loci have only one variant, whereas one locus, RP1-309K20.2 (CPNE1), has 17 coding variants. A total of 57.8% of the loci are annotated with alternatively spliced transcripts, although there are differences between target re...
Here, we present BUSCA (http://busca.biocomp.unibo.it), a novel web server that integrates different computational tools for predicting protein subcellular localization. BUSCA combines methods for identifying signal and transit peptides (DeepSig and TPpred3), GPI-anchors (PredGPI) and transmembrane domains (ENSEMBLE3.0 and BetAware) with tools for discriminating subcellular localization of both globular and membrane proteins (BaCelLo, MemLoci and SChloro). Outcomes from the different tools are processed and integrated for annotating subcellular localization of both eukaryotic and bacterial protein sequences. We benchmark BUSCA against protein targets derived from recent CAFA experiments and other specific data sets, reporting performance at the state-of-the-art. BUSCA scores better than all other evaluated methods on 2732 targets from CAFA2, with an F1 value equal to 0.49, and is among the best methods when predicting targets from CAFA3. We propose BUSCA as an integrated and accurate resource for the annotation of protein subcellular localization.
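As a reminder of what the F1 value above measures, here is a minimal sketch of the F1 score as the harmonic mean of precision and recall. The function name `f1_score` and the example counts are hypothetical; the sketch does not reproduce the CAFA evaluation protocol, which is not described in this abstract.

```python
def f1_score(tp, fp, fn):
    """Return the F1 score from counts of true positives,
    false positives and false negatives (hypothetical inputs)."""
    # Precision: fraction of predicted annotations that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of true annotations that were predicted.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
print(round(f1_score(tp=40, fp=35, fn=45), 2))
```

Because the harmonic mean is pulled towards the smaller of its two inputs, a method cannot reach a high F1 by maximising precision or recall alone.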
We carried out a cross-species cattle-sheep array comparative genome hybridization experiment to identify copy number variations (CNVs) in the sheep genome, analysing ewes of Italian dairy or dual-purpose breeds (Bagnolese, Comisana, Laticauda, Massese, Sarda, and Valle del Belice) using a tiling oligonucleotide array with ~385,000 probes designed on the bovine genome. We identified 135 CNV regions (CNVRs; 24 reported in more than one animal) covering ~10.5 Mb of the virtual sheep genome referred to the bovine genome (0.398%) with a mean and a median equal to 77.6 and 55.9 kb, respectively. A comparative analysis between the identified sheep CNVRs and those reported in cattle and goat genomes indicated that overlaps between sheep CNVRs and those of both other species are highly significant (P<0.0001), suggesting that several chromosome regions might contain recurrent interspecies CNVRs. Many sheep CNVRs include genes with important biological functions. Further studies are needed to evaluate their functional relevance.