Motivation: Millions of protein sequences have been generated by numerous genome and transcriptome sequencing projects. However, experimentally determining the function of these proteins is still a time-consuming, low-throughput, and expensive process, leading to a large protein sequence-function gap. Therefore, it is important to develop computational methods that accurately predict protein function to fill this gap. Even though many methods have been developed to predict function from protein sequences, far fewer methods leverage protein structures, because accurate structures were unavailable for most proteins until recently.
Results: We developed TransFun, a method using a transformer-based protein language model and 3D-equivariant graph neural networks to distill information from both protein sequences and structures to predict protein function. It extracts feature embeddings from protein sequences using a pre-trained protein language model (ESM) via transfer learning and combines them with 3D structures of proteins predicted by AlphaFold2 through equivariant graph neural networks. Benchmarked on the CAFA3 test dataset and a new test dataset, TransFun outperforms several state-of-the-art methods, indicating that the language model and 3D-equivariant graph neural networks are effective ways to leverage protein sequences and structures to improve protein function prediction. Combining TransFun predictions with sequence similarity-based predictions can further increase prediction accuracy.
Availability: The source code of TransFun is available at https://github.com/jianlin-cheng/TransFun.
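The pipeline described above (per-residue language-model embeddings combined with a structure-derived graph, then pooled into per-function scores) can be illustrated with a minimal, hedged sketch. This is not TransFun's actual architecture: the random arrays stand in for ESM embeddings and AlphaFold2 Cα coordinates, the 10 Å contact threshold is an assumed cutoff, the single distance-invariant message-passing step is a simplification of an equivariant GNN layer, and the untrained linear "GO head" is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, G = 50, 32, 5  # residues, embedding dim, number of GO terms (toy sizes)

# Stand-ins: per-residue ESM embeddings and a predicted Calpha trace
emb = rng.normal(size=(L, D))
coords = np.cumsum(rng.normal(size=(L, 3)), axis=0)

# Build a contact-map graph from the structure: edge if Calpha-Calpha
# distance is under an assumed 10 Angstrom cutoff (no self-loops)
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
adj = (dist < 10.0) & ~np.eye(L, dtype=bool)

# One message-passing step: add the mean of each residue's neighbor
# features. Using only pairwise distances keeps the result invariant
# to rotations/translations of the input structure.
deg = np.maximum(adj.sum(axis=1, keepdims=True), 1)
h = emb + (adj @ emb) / deg

# Mean-pool residue features and apply an untrained linear layer with
# a sigmoid to get one score per GO term (multi-label setting)
W = rng.normal(size=(D, G))
scores = 1.0 / (1.0 + np.exp(-(h.mean(axis=0) @ W)))
```

A real implementation would use learned equivariant layers (which also propagate coordinate information) and train the GO head against annotated proteins; the sketch only shows how sequence embeddings and a structure graph feed one prediction.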
As an aneuploidy, trisomy is associated with mammalian embryonic and postnatal abnormalities. Understanding the mechanisms underlying mutant phenotypes is broadly important and may lead to new strategies to treat clinical manifestations in individuals with trisomies, such as trisomy 21 (Down syndrome). While increased gene dosage effects due to a trisomy may account for the mutant phenotypes, phenotypic consequences of a trisomy can also arise from the presence of a freely segregating extra chromosome with its own centromere, i.e. a ‘free trisomy’, independent of gene dosage effects. Presently, there are no reports of attempts to functionally separate these two types of effects in mammals. To fill this gap, here we describe a strategy that employed two new mouse models of Down syndrome, Ts65Dn;Df(17)2Yey/+ and Dp(16)1Yey/Df(16)8Yey. Both models carry triplications of the same 103 human chromosome 21 gene orthologs; however, only Ts65Dn;Df(17)2Yey/+ mice carry a free trisomy. Comparison of these models revealed, for the first time, the gene dosage-independent impacts of an extra chromosome at the phenotypic and molecular levels. These impacts are reflected in the impaired performance of Ts65Dn;Df(17)2Yey/+ males in T-maze tests when compared with Dp(16)1Yey/Df(16)8Yey males. Results from the transcriptomic analysis suggest that the extra chromosome plays a major role in trisomy-associated expression alterations of disomic genes beyond gene dosage effects. This model system can now be used to deepen our mechanistic understanding of this common human aneuploidy and to obtain new insights into the effects of free trisomies in other human diseases such as cancers.