The recent emergence of deep learning to characterize complex patterns in protein big data reveals its potential to address the classic challenges in the field of protein data mining. Much research has revealed the promise of deep learning as a powerful tool to transform protein big data into valuable knowledge, leading to scientific discoveries and practical solutions. In this review, we summarize recent publications on deep learning predictive approaches in the field of mining protein data. The architectures applied in these methods include multilayer perceptrons, stacked autoencoders, deep belief networks, two- or three-dimensional convolutional neural networks, recurrent neural networks, graph neural networks, and complex neural networks, and they are described from five perspectives: residue-level prediction, sequence-level prediction, three-dimensional structural analysis, interaction prediction, and mass spectrometry data mining. The advantages and deficiencies of these architectures are presented in relation to various tasks in protein data mining. Additionally, some practical issues and their future directions are discussed, such as robust deep learning for noisy protein data, architecture optimization for specific tasks, efficient deep learning for limited protein data, multimodal deep learning for heterogeneous protein data, and interpretable deep learning for protein understanding. This review provides comprehensive perspectives on general deep learning techniques for protein data analysis.
Introduction: The ocean microbiome represents one of the largest microbiomes and produces nearly half of the primary energy on the planet through photosynthesis or chemosynthesis. Using recent advances in marine genomics, we explore new applications of oceanic metagenomes for protein structure and function prediction. Results: By processing 1.3 TB of high-quality reads from the Tara Oceans data, we obtain 97 million non-redundant genes. Of the 5721 Pfam families that lack experimental structures, 2801 have at least one member associated with the oceanic metagenomics dataset. We apply C-QUARK, a deep-learning contact-guided ab initio structure prediction pipeline, to model 27 families, where 20 are predicted to have a reliable fold with an estimated template modeling score (TM-score) of at least 0.5. Detailed analyses reveal that the abundance of microbial genera in the ocean is highly correlated with the frequency of occurrence in the modeled Pfam families, suggesting the significant role of the Tara Oceans genomes in the contact-map prediction and subsequent ab initio folding simulations. Notably, PF15461, which has a majority of members coming from ocean-related bacteria, is identified as an important photosynthetic protein by structure-based function annotations. The pipeline is extended to a set of 417 Pfam families, built on the combination of Tara with other metagenomics datasets, which results in 235 families with an estimated TM-score over 0.5. Conclusions: These results demonstrate a new avenue to improve the capacity of protein structure and function modeling through marine metagenomics, especially for difficult proteins with few homologous sequences.
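For readers unfamiliar with the 0.5 threshold quoted above, the following is a minimal sketch of the standard TM-score definition. It is not C-QUARK's estimation procedure (which predicts the score without a native structure). It assumes the model has already been superposed on a reference structure and that the per-residue distances of aligned pairs are given in Å; the full TM-score additionally maximizes over superpositions, which this sketch omits. The function names `tm_d0` and `tm_score` are illustrative.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 (in angstroms) used by TM-score."""
    # Assumption: short chains use a fixed floor of 0.5, as in common implementations.
    if target_length <= 21:
        return 0.5
    return max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)


def tm_score(distances, target_length):
    """TM-score for one fixed superposition.

    distances: distances (angstroms) between aligned residue pairs.
    target_length: number of residues in the reference (target) protein.
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    d0 = tm_d0(target_length)
    # Each aligned pair contributes a value in (0, 1]; unaligned residues contribute 0.
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length
```

A perfect superposition of every residue gives a score of 1, and a score of at least 0.5 is conventionally read as the model sharing the same fold as the reference.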
Motivation: Accurate delineation of protein domain boundaries plays an important role in protein engineering and structure prediction. Although machine-learning methods are widely used to predict domain boundaries, these approaches often ignore long-range interactions among residues, which have been proven to improve prediction performance. However, how to simultaneously model the local and global interactions to further improve domain boundary prediction is still a challenging problem. Results: This article employs a hybrid deep learning method that combines convolutional neural network and gated recurrent unit models for domain boundary prediction. It not only captures the local and non-local interactions, but also fuses these features for prediction. Additionally, we adopt a balanced Random Forest for classification to deal with the high class imbalance of the samples and the high dimensionality of the deep features. Experimental results show that our proposed approach (DNN-Dom) outperforms existing machine-learning-based methods for boundary prediction. We expect that DNN-Dom can be useful for assisting protein structure and function prediction. Availability and implementation: The method is available as DNN-Dom Server at http://isyslab.info/DNN-Dom/. Supplementary information: Supplementary data are available at Bioinformatics online.
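The class-imbalance handling mentioned above can be illustrated with the sampling step commonly used in balanced random forests: each tree is trained on a bootstrap sample that contains equally many examples of each class. The sketch below shows only that sampling step, under the assumption that boundary residues are the rare class; it is not taken from the DNN-Dom implementation, and the name `balanced_bootstrap` is hypothetical.

```python
import random


def balanced_bootstrap(labels, rng):
    """Return sample indices with equal counts per class, drawn with replacement.

    labels: sequence of class labels (e.g. 1 = boundary residue, 0 = non-boundary).
    rng: a random.Random instance, passed in so results are reproducible.
    """
    # Group sample indices by class label.
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    # Every class is sampled down (or up) to the size of the rarest class.
    n_per_class = min(len(indices) for indices in by_class.values())
    sample = []
    for indices in by_class.values():
        sample.extend(rng.choices(indices, k=n_per_class))
    return sample
```

In a full forest, one such sample would be drawn per tree, so that each tree sees a balanced training set while the ensemble as a whole still covers most of the majority class.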
We develop a hierarchical pipeline, ThreaDomEx, for both continuous domain (CD) and discontinuous domain (DCD) structure predictions. Starting from a query sequence, ThreaDomEx first threads it through the PDB to identify multiple structure templates, from which a profile of domain conservation score (DC-score) is derived for domain-segment assignment. To further detect DCDs that consist of separated segments along the sequence, a boundary-clustering algorithm is used to refine the DCD-linker locations. In cases where the templates do not contain DCDs, a domain-segment assembly process, guided by symmetry comparison, is applied for further DCD detection. ThreaDomEx was tested on a set of 1111 proteins and achieved a normalized domain overlap score of 89.3% compared to experimental data, which is significantly higher than other state-of-the-art methods. It also recalls 26.7% of DCDs with 72.7% precision on the proteins for which threading failed to detect any DCDs. The server provides facilities for users to interactively refine the domain models by adjusting the DC-score threshold, deleting and adding domain linkers, and assembling domain segments. These are particularly helpful for hard targets on which current methods have low accuracy but human-expert knowledge and experimental insights can be used to refine models. The ThreaDomEx server is available at http://zhanglab.ccmb.med.umich.edu/ThreaDomEx.