Given the close relationship between protein structure and function, protein structure searches have long played an established role in bioinformatics. Despite their maturity, existing protein structure searches either use simplifying assumptions or compromise between fast response times and quality of results. These limitations can prevent the easy and efficient exploration of relationships between protein structures, which is the norm in other areas of inquiry. To address these limitations we have developed RUPEE, a fast and accurate purely geometric structure search combining techniques from information retrieval and big data with a novel approach to encoding sequences of torsion angles. Comparing our results to the output of mTM, SSM, and the CATHEDRAL structural scan, it is clear that RUPEE has set a new bar for purely geometric big data approaches to protein structure searches. RUPEE in top-aligned mode produces equal or better results than the best available protein structure searches, and RUPEE in fast mode demonstrates the fastest response times coupled with high quality results. The RUPEE protein structure search is available at https://ayoubresearch.com. Code and data are available at https://github.com/rayoub/rupee.
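The abstract refers to encoding sequences of torsion angles but does not spell out the encoding. As a minimal sketch of the underlying geometry only, the following computes a signed torsion (dihedral) angle from four atom positions and then maps a (phi, psi) pair to an integer with a uniform grid. The names `torsion_angle` and `toy_descriptor`, and the 30-degree bin width, are assumptions made for illustration; they are not RUPEE's actual residue descriptors.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def torsion_angle(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four 3D points."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def toy_descriptor(phi, psi, bin_width=30):
    """Hypothetical discretization: map (phi, psi) in degrees to one integer.

    This uniform grid is an illustrative assumption, not the encoding
    described in the paper.
    """
    n_bins = 360 // bin_width
    phi_bin = int((phi + 180) // bin_width) % n_bins
    psi_bin = int((psi + 180) // bin_width) % n_bins
    return phi_bin * n_bins + psi_bin
```

Under these assumptions a backbone becomes a sequence of small integers, one per residue, which is the kind of linear representation that sequence-style indexing and alignment techniques can operate on.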
Given the close relationship between protein structure and function, protein structure searches have long played an established role in bioinformatics. Despite their maturity, existing protein structure searches either use simplifying assumptions or compromise between fast response times and quality of results. These limitations can prevent the easy and efficient exploration of relationships between protein structures, which is the norm in other areas of inquiry. We have developed RUPEE, a fast, scalable, and purely geometric structure search combining techniques from information retrieval and big data with a novel approach to encoding sequences of torsion angles. Comparing our results to the output of mTM, SSM, and the CATHEDRAL structural scan, it is clear that RUPEE has set a new bar for purely geometric big data approaches to protein structure searches. RUPEE in top-aligned mode produces equal or better results than the best available protein structure searches, and RUPEE in fast mode demonstrates the fastest response times coupled with high quality results.
The RUPEE protein structure search is available at http://www.ayoubresearch.com. Code and data are available at https://github.com/rayoub/rupee.
Proteins represent the functional end-product within the central dogma of molecular biology [1]. As such, understanding protein structure is a central goal within structural bioinformatics. Protein structure determination, prediction, alignment, and search all serve to advance this understanding. Below, we present our approach to a fast, scalable, and purely geometric protein structure search we refer to with the acronym of RUn Position Encoded Encodings of residue descriptors (RUPEE).
Given a protein domain identifier, whole chain identifier or an uploaded PDB file, RUPEE can search for matches among domains defined in SCOPe 2.07 [2], CATH v4.2 [3], ECOD develop210 [4], or among whole chains defined in the PDB.
RUPEE is able to search any of these databases using any identifier. For instance, you can search SCOPe using a CATH domain identifier.
RUPEE has two modes of operation, fast and top-aligned. Fast mode is significantly faster than all other protein structure searches discussed below but at the expense of accuracy. Despite this, we will show that the accuracy of RUPEE in fast mode is not far below that of the best available structure searches. On the other hand, the accuracy and response times of RUPEE in top-aligned mode are comparable to currently available protein structure searches that are commonly considered fast.
November 16, 2018
... the most popular. Nonetheless, these searches are slow in comparison to mTM, SSM, and CATHEDRAL when pre-calculated results are not used. If given a known protein domain, VAST can return structural neighbors in seconds using pre-calculated results. However, if uploading a PDB file where pre-calculated results are not used, response times for VAST can exceed 30 minutes. Similarly, the FATCAT server, that does n...
Protein structure prediction is a long-standing unsolved problem in molecular biology that has seen renewed interest with the recent success of deep learning with AlphaFold at CASP13. While developing and evaluating protein structure prediction methods, researchers may want to identify the most similar known structures to their predicted structures. These predicted structures often have low sequence and structure similarity to known structures. We show how RUPEE, a purely geometric protein structure search, is able to identify the structures most similar to structure predictions, regardless of how they vary from known structures, something existing protein structure searches struggle with. RUPEE accomplishes this through the use of a novel linear encoding of protein structures as a sequence of residue descriptors. Using a fast Needleman-Wunsch algorithm, RUPEE is able to perform alignments on the sequences of residue descriptors for every available structure. This is followed by a series of increasingly accurate structure alignments from TM-align alignments initialized with the Needleman-Wunsch residue descriptor alignments to standard TM-align alignments of the final results. By using alignment normalization effectively at each stage, RUPEE can also execute containment searches in addition to full-length searches to identify structural motifs within proteins. We compare the results of RUPEE to the protein structure searches mTM-align, SSM, CATHEDRAL, and VAST using a benchmark derived from the protein structure predictions submitted to CASP13. RUPEE identifies better alignments on average with respect to TM-score as well as scores specific to SSM and CATHEDRAL, Q-score and SSAP-score, respectively.
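The first alignment stage described above can be sketched as a textbook Needleman-Wunsch global alignment over integer residue descriptors, followed by a length normalization that differs between full-length and containment searches. The scoring values, the `mode` names, and the choice of denominators are assumptions made for illustration; the source does not give RUPEE's actual parameters or its optimized implementation.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of two descriptor sequences.

    Keeps only two rows of the dynamic programming table, so memory
    is linear in len(b).
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
        prev = curr
    return prev[len(b)]


def normalized_score(query, target, mode="full_length", **scoring):
    """Hypothetical normalization of the raw alignment score.

    full_length:  divide by the longer sequence, so size mismatch is penalized.
    contained_in: divide by the query length, so a short query fully found
                  inside a long target can still score highly.
    """
    if not query or not target:
        raise ValueError("both sequences must be non-empty")
    raw = needleman_wunsch(query, target, **scoring)
    if mode == "full_length":
        denom = max(len(query), len(target))
    elif mode == "contained_in":
        denom = len(query)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return raw / denom
```

For example, with free gaps (`gap=0`, a simplification chosen only for this toy case), the query `[1, 2, 3]` against the target `[9, 1, 2, 3, 9]` has a raw score of 3. Normalizing by the query length gives 1.0, while normalizing by the longer sequence gives 0.6, which shows how the choice of denominator separates a containment match from a full-length match.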
Protein structure prediction is a long-standing unsolved problem in molecular biology that has seen renewed interest with the recent success of deep learning with AlphaFold at CASP13. While developing and evaluating protein structure prediction methods, researchers may want to identify the most similar known structures to their predicted structures. These predicted structures often have low sequence and structure similarity to known structures. We show how RUPEE, a purely geometric protein structure search, is able to identify the structures most similar to structure predictions, regardless of how they vary from known structures, something existing protein structure searches struggle with. RUPEE accomplishes this through the use of a novel linear encoding of protein structures as a sequence of residue descriptors. Using a fast Needleman-Wunsch algorithm, RUPEE is able to perform alignments on the sequences of residue descriptors for every available structure. This is followed by a series of increasingly accurate structure alignments from TM-align alignments initialized with the Needleman-Wunsch residue descriptor alignments to standard TM-align alignments of the final results. By using alignment normalization effectively at each stage, RUPEE can also execute containment searches in addition to full-length searches to identify structural motifs within proteins. We compare the results of RUPEE to mTM-align, SSM, CATHEDRAL and VAST using a benchmark derived from the protein structure predictions submitted to CASP13. RUPEE identifies better alignments on average with respect to RMSD and TM-score as well as Q-score and SSAP-score, scores specific to SSM and CATHEDRAL, respectively. Finally, we show a sample of the top-scoring alignments that RUPEE identified that none of the other protein structure searches we compared to were able to identify. The RUPEE protein structure search is available at https://ayoubresearch.com.
Code and data are available at https://github.com/rayoub/rupee.
Determining the structure of a protein is an important step toward understanding its function. There are approximately 150,000 solved protein structures currently stored in the Protein Data Bank (PDB) [1], the global repository for experimentally determined protein structures. On the other hand, UniProt [2], the universal protein knowledgebase, currently provides over 60 million protein sequences. From this, it is apparent that protein structure determination is moving at a slower pace than protein sequencing and may be serving as a bottleneck in a variety of research efforts from protein design to drug discovery. Being able to predict a protein structure from its amino acid sequence would address this problem. However, protein structure prediction remains a central unsolved problem in molecular biology [3].
CASP is a biennial blind competition for protein structure prediction that began in 1994 [4]. Progress had been slow until the success of coevolutionary methods in contact prediction demonstrated in CASP11...