While template-free protein structure prediction protocols now produce good-quality models for many targets, modelling failure remains common. For these methods to be useful, users must be able both to choose the best model from the hundreds to thousands of models commonly generated for a target and to determine whether this model is likely to be correct. We have developed Random Forest Quality Assessment (RFQAmodel), which assesses whether models produced by a protein structure prediction pipeline have the correct fold. RFQAmodel combines existing quality assessment scores with two predicted contact map alignment scores. These alignment scores are able to identify correct models for targets that are not otherwise captured. Our classifier was trained on a large set of protein domains that are structurally diverse and evenly balanced in terms of protein features known to affect modelling success, and then tested on a second set of 244 protein domains with a similar spread of properties. When models for each target in this second set were ranked according to the RFQAmodel score, the highest-ranking model had a high-confidence RFQAmodel score for 67 modelling targets, of which 52 had the correct fold. At the other end of the scale, RFQAmodel correctly predicted that for 59 targets the highest-ranked model was incorrect. In comparisons to other methods we found that RFQAmodel is better able to identify correct models for targets where only a few of the models are correct. RFQAmodel achieved similar performance on the model sets for CASP12 and CASP13 free-modelling targets. Finally, by iteratively generating models and running RFQAmodel until a model is produced that is predicted to be correct with high confidence, we demonstrate how such a protocol can be used to focus computational effort on difficult modelling targets.
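The iterative protocol described at the end of the abstract (generate a batch of models, score them, and stop once the top-ranked model is predicted correct with high confidence) can be sketched as follows. The model-generation and scoring functions here are hypothetical stand-ins, and the 0.8 cutoff is an assumed threshold rather than the value used by RFQAmodel.

```python
import random

# Assumed high-confidence cutoff; the actual RFQAmodel threshold is not
# given in this excerpt.
HIGH_CONFIDENCE = 0.8

def generate_models(n, seed):
    """Stand-in for one round of template-free modelling: each 'model' is
    just a name paired with a mock quality score in [0, 1]."""
    rng = random.Random(seed)
    return [(f"model_{seed}_{i}", rng.random()) for i in range(n)]

def rfqa_score(model):
    """Stand-in for the RFQAmodel classifier score of a single model."""
    return model[1]

def iterative_modelling(max_rounds=10, batch=100):
    """Generate models in rounds, keeping the best seen so far, and stop
    as soon as it clears the high-confidence threshold."""
    best = None
    for round_no in range(max_rounds):
        models = generate_models(batch, seed=round_no)
        top = max(models, key=rfqa_score)
        if best is None or rfqa_score(top) > rfqa_score(best):
            best = top
        if rfqa_score(best) >= HIGH_CONFIDENCE:
            return best, round_no + 1  # stop early: confident prediction
    return best, max_rounds  # budget exhausted without high confidence

best, rounds = iterative_modelling()
print(best, rounds)
```

The design point is simply that assessment gates further computation: cheap targets terminate after one round, while difficult targets consume the full modelling budget.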
Introduction

Template-free protein structure prediction protocols routinely produce hundreds to thousands of models for a given target [1]. Users need to be able to identify whether a good model exists in this ensemble. The final step in a typical structure prediction pipeline is therefore to select a representative subset of five or fewer models as output [2]. This model selection step is critical, and the community's ability to select good models is assessed as part of the Critical Assessment of protein Structure Prediction (CASP) experiments [3].

Protocols for model quality assessment can be divided into three classes: single-model methods, quasi-single-model methods, and consensus methods [2]. Single-model methods calculate a score for each model independently; this score does not take into account any of the other models generated for a particular target. The objective function optimised during protein structure prediction can usually be used as a single-model quality estimator, but better results have been reported when different scores are used for modelling and ranking [2]. Examples of single...