The discovery of biomolecular condensates transformed our understanding of intracellular compartmentalization of molecules. To integrate interdisciplinary scientific knowledge about the function and composition of biomolecular condensates, we developed the crowdsourcing condensate database and encyclopedia (cd-code.org). CD-CODE is a community-editable platform, which includes a database of biomolecular condensates based on the literature, an encyclopedia of relevant scientific terms and a crowdsourcing web application. Our platform will accelerate the discovery and validation of biomolecular condensates, and facilitate efforts to understand their role in disease and as therapeutic targets.
Intrinsically disordered regions (IDRs) are structurally flexible protein segments with regulatory functions in multiple contexts, such as in the assembly of biomolecular condensates. Since IDRs evolve more rapidly than ordered regions, identifying homology between such poorly conserved regions remains challenging for state-of-the-art alignment-based methods that rely on position-specific conservation of residues. Thus, systematic functional annotation and evolutionary analysis of IDRs have been limited, even though IDRs comprise ~21% of proteins. To accurately assess homology between unalignable sequences, we developed an alignment-free sequence comparison algorithm, SHARK (Similarity/Homology Assessment by Relating K-mers). We trained SHARK-dive, a machine learning homology classifier, which outperformed standard alignment in assessing homology between unalignable sequences and correctly identified dissimilar IDRs capable of functional rescue in IDR-replacement experiments reported in the literature. SHARK-dive not only predicts functionally similar IDRs, but also identifies cryptic sequence properties and motifs that drive remote homology, thereby facilitating systematic analysis and functional annotation of the unalignable protein universe.
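To illustrate the general idea of alignment-free comparison, here is a minimal sketch that scores two sequences by the overlap of their k-mer sets (Jaccard similarity). This is not the SHARK scoring scheme, which the abstract does not specify; the function names `kmers` and `jaccard_kmer_similarity` and the default `k=3` are assumptions made for illustration only.

```python
def kmers(seq: str, k: int = 3) -> set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_kmer_similarity(a: str, b: str, k: int = 3) -> float:
    """Alignment-free similarity: shared k-mers / all distinct k-mers.

    Illustrative stand-in only; it ignores residue positions entirely,
    which is the property that makes it applicable to unalignable sequences.
    """
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 0.0  # both sequences shorter than k: no evidence either way
    return len(ka & kb) / len(ka | kb)


if __name__ == "__main__":
    # Two toy sequences sharing the 2-mers "BC" and "CD".
    print(jaccard_kmer_similarity("ABCD", "BCDE", k=2))
```

Because the score depends only on k-mer content, reordering shared motifs within a sequence leaves it largely unchanged, whereas a position-based alignment score would drop.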
Biomolecular condensates are membrane-less organelles that selectively concentrate biomolecules to perform a multitude of fundamental biochemical functions. Despite their biological and therapeutic interest, experimental identification of condensate-forming proteins at the scale of the entire proteome remains challenging. To enable systematic detection of condensate proteins, we developed an algorithm to recognize proteins involved in in vivo biomolecular condensates regardless of the mechanism of condensate formation. Using a curated dataset of condensates, we trained a machine learning classifier based on sequence- and structure-based features to predict whether a protein is part of a condensate. Our model, PICNIC (Proteins Involved in CoNdensates In Cells), outperforms other prediction tools, and although it was trained on human data, it generalizes well to various organisms. Experimental validation of 24 proteins spanning a wide range of functions, structural content and disease relevance confirmed that 18 of them localize to condensates with high confidence, while 3 form condensates with low confidence. Thus, our experimental validation suggests an ∼87.5% success rate (75% with high confidence and 12.5% with low confidence) in identifying condensate-forming proteins. Proteome-wide predictions by PICNIC estimate that ∼40% of proteins partition into condensates across different organisms, from bacteria to humans, with no apparent correlation with organismal complexity or disordered protein content. Our model will shed light on the evolution of biomolecular condensates and will help identify potential protein targets to modulate biomolecular condensate formation.
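The reported success rates follow directly from the validation counts given above (24 proteins tested, 18 high-confidence and 3 low-confidence hits). The short check below reproduces that arithmetic; the variable names are illustrative and not taken from the PICNIC code.

```python
# Counts as stated in the abstract.
tested = 24
high_confidence = 18
low_confidence = 3

rate_high = high_confidence / tested                    # fraction with high confidence
rate_low = low_confidence / tested                      # fraction with low confidence
rate_total = (high_confidence + low_confidence) / tested

print(f"high: {rate_high:.1%}, low: {rate_low:.1%}, total: {rate_total:.1%}")
```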