Abstract: Given the close relationship between protein structure and function, protein structure searches have long played an established role in bioinformatics. Despite their maturity, existing protein structure searches either use simplifying assumptions or compromise between fast response times and quality of results. These limitations can prevent the easy and efficient exploration of relationships between protein structures, which is the norm in other areas of inquiry. We have developed RUPEE, a fast, scalable, and …
“…Recognizing the need for a structure search with more sensitivity than top-aligned search mode and still having no dependence on sequences or clustering, we have added an additional search mode to RUPEE with increased sensitivity called all-aligned search mode. Like our previous work on RUPEE [7], again we compare the results of RUPEE against mTM-align [8], SSM [9] and CATHEDRAL [10], but this time we do so for all-aligned search mode. Additionally, this time we also compare to the VAST protein structure search [13].…”
“…We first give a brief outline of our linear encoding of protein structures described in more detail in our previous work on RUPEE [7], which still remains at the core of the RUPEE protein structure search. Then, we describe our approach to top-aligned to provide context followed by the addition of all-aligned.…”
“…Linear encoding of protein structure. Previously [7], we introduced a linear encoding of protein structures based on torsion angle regions. We determined these regions by plotting a random sampling of torsion angles.…”
“…The underlined elements in (1) correspond to the underlined elements in (2), (4), and (5) below to help illustrate the subsequent transformations from descriptors to shingles and finally to hashes. [ 5,5,5,5,5,5,7,5,11,11,5,5,5,5,5,5 ] (1)…”
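The descriptor-to-shingle-to-hash transformation mentioned in the statement above can be illustrated with a minimal sketch. The shingle length `k = 3` and the SHA-1-based 32-bit hash are assumptions chosen for illustration only; the quoted text does not state which shingle length or hash function RUPEE uses, and the function names `shingles` and `shingle_hash` are hypothetical.

```python
import hashlib
from typing import List, Tuple


def shingles(descriptors: List[int], k: int = 3) -> List[Tuple[int, ...]]:
    """Return every overlapping window of k consecutive descriptors.

    k = 3 is an assumed value for illustration, not RUPEE's setting.
    """
    return [tuple(descriptors[i:i + k]) for i in range(len(descriptors) - k + 1)]


def shingle_hash(shingle: Tuple[int, ...]) -> int:
    """Map a shingle to a stable 32-bit integer.

    The hash choice (first 4 bytes of SHA-1) is an assumption; any
    deterministic hash would serve the same illustrative purpose.
    """
    text = ",".join(str(d) for d in shingle)
    digest = hashlib.sha1(text.encode("ascii")).digest()
    return int.from_bytes(digest[:4], "big")


# The descriptor sequence labelled (1) in the quoted statement.
descriptors = [5, 5, 5, 5, 5, 5, 7, 5, 11, 11, 5, 5, 5, 5, 5, 5]

windows = shingles(descriptors, k=3)
hashes = [shingle_hash(w) for w in windows]
```

Because the windows overlap, a single changed descriptor affects only the `k` shingles that contain it, so two similar descriptor sequences still share most of their hashes.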
Protein structure prediction is a long-standing unsolved problem in molecular biology that has seen renewed interest with the recent success of deep learning with AlphaFold at CASP13. While developing and evaluating protein structure prediction methods, researchers may want to identify the most similar known structures to their predicted structures. These predicted structures often have low sequence and structure similarity to known structures. We show how RUPEE, a purely geometric protein structure search, is able to identify the structures most similar to structure predictions, regardless of how they vary from known structures, something existing protein structure searches struggle with. RUPEE accomplishes this through the use of a novel linear encoding of protein structures as a sequence of residue descriptors. Using a fast Needleman-Wunsch algorithm, RUPEE is able to perform alignments on the sequences of residue descriptors for every available structure. This is followed by a series of increasingly accurate structure alignments from TM-align alignments initialized with the Needleman-Wunsch residue descriptor alignments to standard TM-align alignments of the final results. By using alignment normalization effectively at each stage, RUPEE also can execute containment searches in addition to full-length searches to identify structural motifs within proteins. We compare the results of RUPEE to mTM-align, SSM, CATHEDRAL and VAST using a benchmark derived from the protein structure predictions submitted to CASP13. RUPEE identifies better alignments on average with respect to RMSD and TM-score as well as Q-score and SSAP-score, scores specific to SSM and CATHEDRAL, respectively. Finally, we show a sample of the top-scoring alignments that RUPEE identified that none of the other protein structure searches we compared to were able to identify. The RUPEE protein structure search is available at https://ayoubresearch.com.
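The abstract above describes aligning sequences of residue descriptors with Needleman-Wunsch. A minimal sketch of textbook global alignment over integer descriptors follows. The scoring values (`match=1`, `mismatch=-1`, `gap=-1`) and the function name `needleman_wunsch` are assumptions for illustration; the abstract does not specify RUPEE's scoring scheme or the optimizations that make its version fast.

```python
from typing import List, Sequence, Tuple


def needleman_wunsch(
    a: Sequence[int],
    b: Sequence[int],
    match: int = 1,      # assumed score for equal descriptors
    mismatch: int = -1,  # assumed score for unequal descriptors
    gap: int = -1,       # assumed linear gap penalty
) -> Tuple[int, List[Tuple[int, int]]]:
    """Globally align two descriptor sequences.

    Returns the optimal score and the list of aligned index pairs (i, j).
    """
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover aligned pairs.
    pairs: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


best_score, aligned = needleman_wunsch([5, 7, 5], [5, 5])
```

In a pipeline like the one the abstract outlines, the aligned index pairs would be the natural input for initializing a subsequent structure alignment, although how RUPEE passes them to TM-align is not shown here.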
Code and data are available at https://github.com/rayoub/rupee.
Determining the structure of a protein is an important step toward understanding its function. There are approximately 150,000 solved protein structures currently stored in the Protein Data Bank (PDB) [1], the global repository for experimentally determined protein structures. On the other hand, UniProt [2], the universal protein knowledgebase, currently provides over 60 million protein sequences. From this, it is apparent that protein structure determination is moving at a slower pace than protein sequencing and may be serving as a bottleneck in a variety of research efforts from protein design to drug discovery. Being able to predict a protein structure from its amino acid sequence would address this problem. However, protein structure prediction remains a central unsolved problem in molecular biology [3].
CASP is a biennial blind competition for protein structure prediction that began in 1994 [4]. Progress had been slow until the success of coevolutionary methods in contact prediction demonstrated in CASP11...
An evolutionary-based definition and classification of target evaluation units (EUs) is presented for the 14th round of the critical assessment of structure prediction (CASP14). CASP14 targets included 84 experimental models submitted by various structural groups (designated T1024-T1101). Targets were split into EUs based on the domain organization of available templates and performance of server groups. Several targets required splitting (19 out of 25 multidomain targets) due in part to observed conformation changes. All in all, 96 CASP14 EUs were defined and assigned to tertiary structure assessment categories (Topology-based FM or High Accuracy-based TBM-easy and TBM-hard) considering their evolutionary relationship to existing ECOD fold space: 24 family level, 50 distant homologs (H-group), 12 analogs (X-group), and 10 new folds. Principal component analysis and heatmap visualization of sequence and structure similarity to known templates as well as performance of servers highlighted trends in CASP14 target difficulty. The assigned evolutionary levels (i.e., H-groups) and assessment classes (i.e., FM) displayed overlapping clusters of EUs. Many viral targets diverged considerably from their template homologs and thus were more difficult for prediction than other homology-related targets. On the other hand, some targets did not have sequence-identifiable templates, but were predicted better than expected due to relatively simple arrangements of secondary structural elements. An apparent improvement in overall server performance in CASP14 further complicated traditional classification, which ultimately assigned EUs into high-accuracy modeling (27 TBM-easy and 31 TBM-hard), topology (23 FM), or both (15 FM/TBM).