PD-(D/E)XK nucleases, initially represented only by Type II restriction enzymes, now comprise a large and extremely diverse superfamily of proteins. They participate in many different nucleic acid transactions, including DNA degradation, recombination, repair and RNA processing. Different PD-(D/E)XK families, although sharing a structurally conserved core, typically display little or no detectable sequence similarity except for the active site motifs. This makes it challenging to identify new superfamily members using standard homology search techniques. To tackle this problem, we developed a method for the detection of PD-(D/E)XK families based on the binary classification of profile–profile alignments using support vector machines (SVMs). Using a number of both superfamily-specific and general features, SVMs were trained to identify true positive alignments of PD-(D/E)XK representatives. With this method we identified several PFAM families of uncharacterized proteins as putative new members of the PD-(D/E)XK superfamily. In addition, we assigned several unclassified restriction enzymes to the PD-(D/E)XK type. The results show that the new method is able to make confident assignments even for alignments that have statistically insignificant scores. We also implemented the method as a freely accessible web server at http://www.ibt.lt/bioinformatics/software/pdexk/.
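The idea of classifying alignments as true or false positives from a feature vector can be illustrated with a minimal sketch. The abstract does not specify the features, the kernel or the SVM implementation, so everything below is an assumption made for illustration: a linear soft-margin SVM trained by plain batch subgradient descent, two hypothetical standardized features (a general alignment score and a superfamily-specific active-site motif match), and invented toy data.

```python
# Toy sketch only: a linear soft-margin SVM trained by batch subgradient descent.
# Feature names and all numbers are hypothetical, not taken from the paper.

def decision_value(w, b, x):
    """Signed distance-like score of feature vector x from the separating plane."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Minimize lam/2 * |w|^2 + mean hinge loss; labels y must be +1 or -1."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]   # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            if yi * decision_value(w, b, xi) < 1.0:   # margin violated
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Each alignment: (standardized alignment score, standardized motif-match score).
# Assumed toy data; +1 = true positive alignment, -1 = false positive.
X = [(1.0, 1.0), (0.5, 0.8), (-0.5, 0.9),
     (-1.0, -1.0), (-0.5, -0.8), (0.5, -0.9)]
y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X, y)

def is_true_positive(x):
    """Classify an alignment's feature vector with the trained toy model."""
    return decision_value(w, b, x) > 0
```

In this toy setting, an alignment with a below-average score but a well-matched motif, such as `(-0.4, 0.9)`, is still accepted, which mirrors the abstract's point that superfamily-specific features can rescue alignments with statistically insignificant scores.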
Along with over 150 other groups, we tested our template-based protein structure prediction approach by submitting models for 30 target proteins to the sixth round of the Critical Assessment of Protein Structure Prediction Methods (CASP6, http://predictioncenter.org). Most of our modeled proteins fall into the comparative or homology modeling (CM) category, and some are fold recognition (FR) targets. The key feature of our structure prediction strategy in CASP6 was an attempt to select structural templates optimally and to make accurate sequence-structure alignments. Template selection was based mainly on consensus results of multiple sequence searches. Likewise, the consensus of multiple alignment variants (or lack of it) was used to initially delineate reliable and unreliable alignment regions. Structure evaluation approaches were then used to identify the correct sequence-structure mapping. Our results suggest that in many cases the use of multiple templates is advantageous. Selecting correct alignments, even within the context of a three-dimensional structure, remains a challenge. Clear progress on this problem will likely require more effective energy evaluation methods together with simultaneous relaxation/refinement of the "frozen" backbone inherited from the template. Our analysis also suggests that human input adds little to automatic methods in modeling high-homology targets. On the other hand, human expertise can be very valuable in modeling distantly related proteins and critical in cases of unexpected evolutionary changes in protein structure.
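The use of alignment consensus to delineate reliable and unreliable regions can be sketched as follows. This is a minimal illustration, not the procedure used in CASP6: the data representation (one list per alignment variant, giving the template index aligned to each query position), the function name and the `min_agreement` threshold are all assumptions.

```python
from collections import Counter

def consensus_regions(variants, min_agreement=1.0):
    """Return (start, end) query ranges, inclusive, where alignment variants agree.

    variants[k][i] is the template index aligned to query position i in
    variant k, or None if the query residue is aligned to a gap.
    A position counts as reliable when at least `min_agreement` of the
    variants map it to the same template residue (hypothetical criterion).
    """
    n = len(variants[0])
    reliable = []
    for i in range(n):
        partner, votes = Counter(v[i] for v in variants).most_common(1)[0]
        reliable.append(partner is not None
                        and votes / len(variants) >= min_agreement)

    # Merge consecutive reliable positions into regions.
    regions, start = [], None
    for i, ok in enumerate(reliable):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, n - 1))
    return regions

# Three assumed alignment variants of a six-residue query.
variants = [
    [0, 1, 2, 3, None, 5],
    [0, 1, 2, 4, None, 5],
    [0, 1, 2, 3, None, 5],
]
strict = consensus_regions(variants)        # full agreement required
relaxed = consensus_regions(variants, 0.6)  # majority agreement suffices
```

Positions outside the returned regions are the ones that would then be passed on to structure-based evaluation to choose among the competing alignment variants.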
Background: Detection of common evolutionary origin (homology) is a primary means of inferring protein structure and function. At present, comparison of protein families represented as sequence profiles is arguably the most effective homology detection strategy. However, finding the best way to represent evolutionary information of a protein sequence family in the profile, to compare profiles and to estimate the biological significance of such comparisons, remains an active area of research.

Results: Here, we present a new homology detection method based on sequence profile-profile comparison. The method has a number of new features including position-dependent gap penalties and a global score system. Position-dependent gap penalties provide a more biologically relevant way to represent and align protein families as sequence profiles. The global score system enables an analytical solution of the statistical parameters needed to estimate the statistical significance of profile-profile similarities. The new method, together with other state-of-the-art profile-based methods (HHsearch, COMPASS and PSI-BLAST), is benchmarked in all-against-all comparison of a challenging set of SCOP domains that share at most 20% sequence identity. For benchmarking, we use a reference ("gold standard") free model-based evaluation framework. Evaluation results show that at the level of protein domains our method compares favorably to all other tested methods. We also provide examples of the new method outperforming structure-based similarity detection and alignment. The implementation of the new method both as a standalone software package and as a web server is available at http://www.ibt.lt/bioinformatics/coma.

Conclusion: Due to a number of developments, the new profile-profile comparison method shows an improved ability to match distantly related protein domains. Therefore, the method should be useful for annotation and homology modeling of uncharacterized proteins.
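To make the notion of position-dependent gap penalties concrete, here is a minimal sketch of a dynamic-programming alignment in which the cost of a gap depends on the profile position it skips. It is not the method's actual algorithm or scoring scheme: the linear (non-affine) penalties, the global Needleman-Wunsch-style recursion, the function name and the toy numbers are all assumptions chosen for brevity.

```python
def align_score(score, gap_a, gap_b):
    """Best global alignment score of two profiles A and B.

    score[i][j] : similarity of position i of A and position j of B
    gap_a[i]    : penalty for leaving position i of A unaligned
    gap_b[j]    : penalty for leaving position j of B unaligned
    (Linear penalties for simplicity; an assumption of this sketch.)
    """
    n, m = len(gap_a), len(gap_b)
    # dp[i][j] = best score for the first i positions of A and first j of B.
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] - gap_a[i - 1]
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] - gap_b[j - 1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score[i - 1][j - 1],  # align i with j
                dp[i - 1][j] - gap_a[i - 1],             # skip position i of A
                dp[i][j - 1] - gap_b[j - 1],             # skip position j of B
            )
    return dp[n][m]

# Assumed toy similarity matrix: A has 3 positions, B has 2.
score = [[2, -1],
         [-1, -1],
         [-1, 2]]

# Uniform penalties: skipping the middle position of A is cheap.
uniform = align_score(score, gap_a=[1, 1, 1], gap_b=[1, 1])

# Position-dependent penalties: the middle position of A is treated as
# conserved, so leaving it unaligned is expensive.
conserved_core = align_score(score, gap_a=[1, 5, 1], gap_b=[1, 1])
```

With uniform penalties the best alignment skips the middle position of A; raising the penalty at that position lowers the best achievable score and forces a different alignment, which is the kind of behavior position-specific penalties are meant to encode for conserved regions of a family.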