Abstract. Structural studies of proteins for motif mining and other pattern recognition techniques require abstracting the structure into simpler elements for robust matching. In this study, we propose the use of bond-orientational order parameters, a well-established metric usually employed to compare atom packing in crystals and liquids. Creating a vector of orientational order parameters of residue centers in a sliding-window fashion provides a descriptor of local structure and connectivity around each residue that is easy to calculate and compare. To test whether this representation is feasible and applicable to protein structures, we predicted the secondary structure of protein segments from these descriptors, obtaining an AUC (area under the ROC curve) of 0.99. Clustering the descriptors into 6 clusters also yields an AUC of 0.93, showing that these descriptors can capture and distinguish local structural information.

Keywords: bond-orientational order, secondary structure, machine learning, structural alphabet.
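As an illustration of the descriptor outlined in the abstract, the following is a minimal sketch and not the implementation used in this study. It assumes that the "bonds" are the vectors from the central residue of a window to the other residue centers in that window, and that the degrees l = 2, 4, 6 are used; the function names, the window half-width and the choice of degrees are hypothetical. The rotationally invariant order parameter Q_l is computed through the spherical-harmonic addition theorem, Q_l^2 = (1/N^2) sum_{i,j} P_l(u_i . u_j), which is equivalent to the usual definition in terms of spherical harmonics.

```python
import math


def legendre(l, x):
    """Legendre polynomial P_l(x), evaluated with Bonnet's recurrence."""
    p_prev, p = 1.0, x
    if l == 0:
        return p_prev
    for n in range(1, l):
        p_prev, p = p, ((2 * n + 1) * x * p - n * p_prev) / (n + 1)
    return p


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def order_parameter(bonds, l):
    """Rotationally invariant Q_l of a set of bond vectors.

    Uses Q_l^2 = (1/N^2) * sum_{i,j} P_l(u_i . u_j), where u_i are the
    unit bond vectors (addition theorem for spherical harmonics).
    """
    units = [unit(b) for b in bonds]
    total = 0.0
    for u in units:
        for w in units:
            dot = sum(a * b for a, b in zip(u, w))
            dot = max(-1.0, min(1.0, dot))  # guard against rounding
            total += legendre(l, dot)
    return math.sqrt(max(total, 0.0)) / len(units)


def window_descriptors(centers, half_width=3, degrees=(2, 4, 6)):
    """One descriptor vector per residue that has a full window.

    ASSUMPTION: bonds run from the central residue center to every
    other residue center in the window; the source does not specify this.
    """
    descriptors = []
    for i in range(half_width, len(centers) - half_width):
        bonds = [
            tuple(c - c0 for c, c0 in zip(centers[j], centers[i]))
            for j in range(i - half_width, i + half_width + 1)
            if j != i
        ]
        descriptors.append([order_parameter(bonds, l) for l in degrees])
    return descriptors
```

Calling `window_descriptors(centers)` on a list of (x, y, z) residue centers returns one short vector per interior residue. Because each Q_l depends only on the angles between bonds, the vectors are unchanged by rotating or translating the structure, which is what makes them directly comparable across proteins.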
Introduction

In the analysis of protein structures, different representation models at various levels of structural detail are used. From coarse-grained to all-atom models, and from simplified lattice to continuous representations, each model serves different areas of research. The need for abstraction is especially high in computational methods such as structure search and comparison, fold matching, structural motif mining and other areas of pattern recognition. The large amount of data and the high precision of the 3D coordinates make computational analysis complex and rigid in its applicability. Simplified models capture relevant information and hide unimportant details through abstraction, making it possible to group complex 3D information into manageable clusters that can be searched for, compared and "learned" by machine-learning algorithms in a flexible fashion.

The most common simplified representation of the protein state is the assignment of secondary structure to the coordinates, which can be overlaid onto the sequence to create a 1D representation.

Other studies have aimed to create local structural alphabets that represent the structure as a 1D sequence of structural blocks [1]. A structural alphabet