The structure of a protein is of great importance in determining
its functionality, and this characteristic can be leveraged to train
data-driven prediction models. However, the small number of available
protein structures severely limits the performance of these models.
AlphaFold2 and its open-source data set of predicted protein structures
have provided a promising solution to this problem, and these predicted
structures are expected to improve model performance by increasing
the number of training samples. In this work, we constructed a new
benchmark data set and implemented a state-of-the-art structure-based
approach to determine whether the performance of a function prediction
model can be improved by adding AlphaFold-predicted structures to the
training set. We further compared the performance of two models trained
separately on real structures only and on AlphaFold-predicted structures
only.
Experimental results indicated that structure-based protein function
prediction models could benefit from virtual training data consisting
of AlphaFold-predicted structures. First, model performance
improved in all three categories of Gene Ontology terms (GO terms)
after adding predicted structures as training samples. Second, the
model trained only on AlphaFold-predicted virtual samples achieved
performance comparable to that of the model trained on experimentally
solved real structures, suggesting that predicted structures were almost
equally effective in predicting protein functionality.