The Database of Protein Disorder (DisProt, URL: https://disprot.org) provides manually curated annotations of intrinsically disordered proteins from the literature. Here we report recent developments with DisProt (version 8), including the doubling of protein entries, a new disorder ontology, improvements of the annotation format and a completely new website. The website includes a redesigned graphical interface, a better search engine, a clearer API for programmatic access and a new annotation interface that integrates text mining technologies. The new entry format provides greater flexibility, simplifies maintenance and allows the capture of more information from the literature. The new disorder ontology has been formalized and made interoperable by adopting the OWL format, and its structure and term definitions have been improved. The new annotation interface has made the curation process faster and more effective. We recently showed that new DisProt annotations can be effectively used to train and validate disorder predictors. We believe the growth of DisProt will accelerate, contributing to the improvement of function and disorder predictors and thereby helping to illuminate the ‘dark’ proteome.
There are multiple definitions for low complexity regions (LCRs) in protein sequences, with all of them broadly considering LCRs as regions with fewer amino acid types compared to an average composition. Following this view, LCRs can also be defined as regions showing composition bias. In this critical review, we focus on the definition of sequence complexity of LCRs and their connection with structure. We present statistics and methodological approaches that measure low complexity (LC) and related sequence properties. Composition bias is often associated with LC and disorder, but repeats, while compositionally biased, might also induce ordered structures. We illustrate this dichotomy, and more generally the overlaps between different properties related to LCRs, using examples. We argue that statistical measures alone cannot capture all structural aspects of LCRs and recommend the combined usage of a variety of predictive tools and measurements. While the methodologies available to study LCRs are already very advanced, we foresee that a more comprehensive annotation of sequences in the databases will enable the improvement of predictions and a better understanding of the evolution and the connection between structure and function of LCRs. This will require the use of standards for the generation and exchange of data describing all aspects of LCRs.
Short abstract: There are multiple definitions for low complexity regions (LCRs) in protein sequences. In this critical review, we focus on the definition of sequence complexity of LCRs and their connection with structure. We present statistics and methodological approaches that measure low complexity (LC) and related sequence properties. Composition bias is often associated with LC and disorder, but repeats, while compositionally biased, might also induce ordered structures. We illustrate this dichotomy, plus overlaps between different properties related to LCRs, using examples.
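As an illustration of what a statistical measure of low complexity can look like, the sketch below computes the Shannon entropy of residue composition over a sliding window and flags low-entropy windows. This is one commonly used measure and not necessarily one the review itself evaluates. The function names, window size, threshold and example sequence are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    # p * log2(1/p) summed over the residue types present in the segment
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def low_complexity_windows(seq: str, window: int = 12, threshold: float = 2.2):
    """Return (start, entropy) for every window whose entropy is below threshold.

    The window size and threshold are illustrative defaults, not tuned values.
    """
    hits = []
    for start in range(len(seq) - window + 1):
        h = shannon_entropy(seq[start:start + window])
        if h < threshold:
            hits.append((start, h))
    return hits


if __name__ == "__main__":
    # Toy sequence: a homopolymeric Q run followed by a diverse stretch
    toy = "QQQQQQQQACDEFGHIKLMN"
    for start, h in low_complexity_windows(toy, window=8, threshold=1.0):
        print(start, round(h, 3))
```

A low entropy value only indicates composition bias. As the abstract notes, it does not by itself say whether the region is disordered or forms an ordered repeat structure.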
To assess the role of core metabolism genes in bacterial virulence, independently of their effect on growth, we correlated the genome, the transcriptome and the pathogenicity in flies and mice of 30 fully sequenced Pseudomonas strains. Gene presence correlates robustly with pathogenicity differences among all Pseudomonas species, but not among the P. aeruginosa strains. However, gene expression differences are evident between highly and lowly pathogenic P. aeruginosa strains in multiple virulence factors and a few metabolism genes. Moreover, a noticeable fraction (16.5%) of the core metabolism genes of P. aeruginosa strain PA14, compared to 8.5% of the non-metabolic genes tested, appear necessary for full virulence when mutated. Most of these virulence-defective core metabolism mutants are compromised in at least one key virulence mechanism independently of auxotrophy. A pathway-level analysis of PA14 core metabolism uncovers beta-oxidation and the biosynthesis of amino acids, succinate, citramalate, and chorismate to be important for full virulence. Strikingly, the relative expression among P. aeruginosa strains of genes belonging to these metabolic pathways is indicative of their pathogenicity. Thus, P. aeruginosa strain-to-strain virulence variation remains largely obscure at the genome level, but can be dissected at the pathway level via functional transcriptomics of core metabolism.
Haemoglobinopathies are common monogenic disorders with diverse clinical manifestations, partly attributed to the influence of modifier genes. Recent years have seen enormous growth in the amount of genetic data, instigating the need for ranking methods to identify candidate genes with strong modifying effects. Here, we present the first evidence-based gene ranking metric (IthaScore) for haemoglobinopathy-specific phenotypes by utilising curated data in the IthaGenes database. IthaScore successfully reflects current knowledge for well-established disease modifiers, while it can be dynamically updated with emerging evidence. Protein–protein interaction (PPI) network analysis and functional enrichment analysis were employed to identify new potential disease modifiers and to evaluate the biological profiles of selected phenotypes. The most relevant gene ontology (GO) and pathway gene annotations for (a) haemoglobin (Hb) F levels/Hb F response to hydroxyurea included urea cycle, arginine metabolism and vascular endothelial growth factor receptor (VEGFR) signalling, (b) response to iron chelators included xenobiotic metabolism and glucuronidation, and (c) stroke included cytokine signalling and inflammatory reactions. Our findings demonstrate the capacity of IthaGenes, together with dynamic gene ranking, to expand knowledge on the genetic and molecular basis of phenotypic variation in haemoglobinopathies and to identify additional candidate genes to potentially inform and improve diagnosis, prognosis and therapeutic management.
Haemoglobinopathies are the commonest monogenic diseases worldwide and are caused by variants in the globin gene clusters. With over 2400 variants detected to date, their interpretation using the ACMG/AMP guidelines is challenging and computational evidence can provide valuable input about their functional annotation. While many in silico predictors have already been developed, their performance varies for different genes and diseases. In this study, we evaluate 31 in silico predictors using a dataset of 1627 variants in HBA1, HBA2, and HBB. By varying the decision threshold for each tool, we analyse their performance (a) as binary classifiers of pathogenicity, and (b) by using different non-overlapping pathogenic and benign thresholds for their optimal use in the ACMG/AMP framework. Our results show that CADD, Eigen-PC, and REVEL are the overall top performers, with the first of these reaching a moderate strength level for pathogenic prediction. Eigen-PC and REVEL achieve the highest accuracies for missense variants, while CADD is also a reliable predictor of non-missense variants. Moreover, SpliceAI is the top-performing splicing predictor, reaching a strong level of evidence, while GERP++ and phyloP are the most accurate conservation tools. This study provides evidence about the optimal use of computational tools in globin gene clusters under the ACMG/AMP framework.
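The idea of varying a decision threshold to evaluate a score-based predictor as a binary classifier can be sketched as follows. This is a minimal illustration, not the study's pipeline: the function names, the use of accuracy as the selection criterion, and the toy scores and labels are all assumptions.

```python
def confusion(scores, labels, threshold):
    """Count (TP, FP, TN, FN) when a score >= threshold is called pathogenic.

    labels use 1 for pathogenic and 0 for benign.
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_pathogenic = score >= threshold
        if predicted_pathogenic and label:
            tp += 1
        elif predicted_pathogenic and not label:
            fp += 1
        elif not predicted_pathogenic and label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def best_threshold(scores, labels):
    """Try every observed score as a threshold; return (threshold, accuracy).

    Ties are resolved in favour of the lowest threshold.
    """
    best = None
    for t in sorted(set(scores)):
        tp, fp, tn, fn = confusion(scores, labels, t)
        accuracy = (tp + tn) / len(labels)
        if best is None or accuracy > best[1]:
            best = (t, accuracy)
    return best


if __name__ == "__main__":
    # Toy data only: hypothetical predictor scores and known classifications
    toy_scores = [0.1, 0.2, 0.35, 0.6, 0.8, 0.9]
    toy_labels = [0, 0, 1, 0, 1, 1]
    print(best_threshold(toy_scores, toy_labels))
```

Using two separate, non-overlapping thresholds (one for pathogenic, one for benign calls) would follow the same pattern, with scores falling between the two thresholds left unclassified.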
Several types of haemoglobinopathies are caused by copy number variants (CNVs). While diagnosis is often based on haematological and biochemical parameters, a definitive diagnosis requires molecular DNA analysis. In some cases, the molecular characterisation of large deletions/duplications is challenging and inconclusive and often requires the use of specific diagnostic procedures, such as multiplex ligation-dependent probe amplification (MLPA). Herein, we collected and comprehensively analysed all known CNVs associated with haemoglobinopathies. The dataset of 291 CNVs was retrieved from the IthaGenes database and was further manually annotated to specify genomic locations, breakpoints and MLPA probes relevant for each CNV. We developed IthaCNVs, a publicly available and easy-to-use online tool that can facilitate the diagnosis of rare and diagnostically challenging haemoglobinopathy cases attributed to CNVs. Importantly, it facilitates the filtering of available entries based on the type of breakpoint information, on specific chromosomal and locus positions, on MLPA probes, and on affected gene(s). IthaCNVs brings together manually curated information about CNV genomic locations, functional effects, and information that can facilitate CNV characterisation through MLPA. It can help laboratory staff and clinicians confirm a suspected diagnosis of CNVs based on molecular DNA screening and analysis.
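Filtering CNV entries by chromosomal position amounts to an interval-overlap query. The sketch below shows that idea only; it is not the IthaCNVs implementation or data model. The `Cnv` record, its field names, and the entries and coordinates are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Cnv:
    """Hypothetical CNV record; the fields are assumptions for illustration."""
    name: str
    chrom: str
    start: int  # 1-based inclusive start coordinate
    end: int    # 1-based inclusive end coordinate


def overlapping(cnvs, chrom, start, end):
    """Return the CNVs on `chrom` whose interval overlaps [start, end]."""
    return [
        c for c in cnvs
        if c.chrom == chrom and c.start <= end and c.end >= start
    ]


if __name__ == "__main__":
    # Placeholder entries with made-up coordinates, not real variants
    entries = [
        Cnv("CNV-A", "16", 100, 500),
        Cnv("CNV-B", "16", 600, 900),
        Cnv("CNV-C", "11", 100, 500),
    ]
    print([c.name for c in overlapping(entries, "16", 450, 650)])
```

Two closed intervals overlap exactly when each one starts no later than the other ends, which is the condition tested in `overlapping`.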
Bacterial virulence may rely on secondary metabolism, but core metabolism genes are assumed to be necessary primarily for bacterial growth. To assess this assumption, we correlated the genome, the transcriptome and the pathogenicity of 30 fully sequenced Pseudomonas strains using two Drosophila and one mouse infection assays. In accordance with previous studies, gene presence-absence does not explain differences in virulence among P. aeruginosa strains, but merely between P. aeruginosa and other Pseudomonas species. Similarly, classical gene expression analysis of highly vs. lowly pathogenic P. aeruginosa strains identifies many virulence factors, and only a few metabolism genes related to virulence. Nevertheless, assessing the virulence of 553 core metabolic and 95 random non-metabolic gene mutants of P. aeruginosa PA14, we found 16.5% of the core metabolic and 8.5% of the non-metabolic genes to be necessary for full virulence. Strikingly, 11.8% of the core metabolism genes exhibit defects in virulence that cannot be attributed to auxotrophy. The virulence-compromised metabolic gene mutants were mapped in multiple pathways and exhibited further defects in acute virulence phenotypes and in a mouse lung infection model. Functional transcriptomics re-analysis of core metabolism at the pathway level reveals amino-acid, succinate, citramalate, and chorismate biosynthesis and beta-oxidation as important for full virulence, and expression of these pathways as indicative of virulence in various strains. Thus, P. aeruginosa virulence variation, which to this point remains unpredictably combinatorial at the gene level, can be dissected at the pathway level via combinatorial transcriptome and functional core metabolism analysis.
... pathogens to a host-associated lifestyle 6, 16, 17.
Besides the identification of many virulence factors and some of their immediate regulators, P. aeruginosa pathogenicity appears not to follow the same paths for each strain. P. aeruginosa strains can evolve, nevertheless, in the lung of cystic fibrosis patients, and some strains, such as CF5, are only weakly pathogenic in models of acute infection 8, 14. Despite extensive efforts to link gene content with pathogenicity of P. aeruginosa, virulence cannot be explained by the presence or absence of single virulence genes 3, 8. Accordingly, P. aeruginosa pathogenicity has been characterized as context dependent, that is, genes required for pathogenicity in one strain may not necessarily contribute to virulence in other strains. We hypothesized that, if not gene presence, then gene expression of virulence factors and their regulators should be able to explain differences in pathogenicity among Pseudomonas strains.
In the current study, we compared 30 fully sequenced Pseudomonas genomes using large-scale comparisons at the protein sequence level. Primarily we sought to identify differences in the presence and absence of genes that may explain t...