Thirty-eight percent of protein structures in the Protein Data
Bank contain at least one metal ion. However, not all these metal
sites are biologically relevant. Cations present as impurities during
sample preparation or in the crystallization buffer can cause the
formation of protein–metal complexes that do not exist in vivo.
We implemented a deep learning approach to build a classifier able
to distinguish between physiological and adventitious zinc-binding
sites in the 3D structures of metalloproteins. We trained the classifier
using manually annotated sites extracted from the MetalPDB database.
In a 10-fold cross-validation procedure, the classifier achieved
an accuracy of about 90%. The same neural classifier could predict
the physiological relevance of non-heme mononuclear iron sites with
an accuracy of nearly 80%, suggesting that the rules learned on zinc
sites have general relevance. By quantifying how strongly the network
relies on each of the features describing the input zinc sites
and by analyzing the characteristics of the MetalPDB datasets, we
inferred some common principles. Physiological sites present a low
solvent accessibility of the amino acids forming coordination bonds
with the metal ion (the metal ligands), a relatively large number
of residues in the metal environment (≥20), and a distinct
pattern of conservation of Cys and His residues in the site. Adventitious
sites, on the other hand, tend to have a low number of donor atoms
from the polypeptide chain (often one or two). These observations
support the evaluation of the physiological relevance of novel metal-binding
sites in protein structures.
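
As an illustration of the 10-fold cross-validation procedure mentioned above, the following minimal sketch estimates accuracy by averaging over ten held-out folds. It is not the classifier described here: the neural network is replaced by a toy threshold rule on the number of donor atoms, the data are synthetic, and all names (`k_fold_indices`, `cross_val_accuracy`, `fit_threshold`, `predict_threshold`) are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Shuffle indices 0..n-1 and return k (train, test) index pairs."""
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # fold sizes differ by at most one
    splits = []
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train, test))
    return splits


def cross_val_accuracy(X: Sequence, y: Sequence[int],
                       fit: Callable, predict: Callable,
                       k: int = 10, seed: int = 0) -> float:
    """Train on k-1 folds, test on the held-out fold, and average the accuracies."""
    accuracies = []
    for train, test in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in test)
        accuracies.append(hits / len(test))
    return sum(accuracies) / k


# Toy stand-in for the neural classifier (assumption, not the published model):
# label a site physiological (1) if it has at least `t` protein donor atoms.
def fit_threshold(donors: Sequence[int], labels: Sequence[int]) -> int:
    def train_accuracy(t: int) -> float:
        return sum(int(d >= t) == lab for d, lab in zip(donors, labels)) / len(labels)
    return max(range(1, 6), key=train_accuracy)


def predict_threshold(t: int, donor_count: int) -> int:
    return int(donor_count >= t)


# Synthetic, perfectly separable data: sites with 1-2 donor atoms are labelled
# adventitious (0), sites with 3-4 donor atoms physiological (1).
donor_counts = [1, 2, 3, 4] * 10
labels = [int(d >= 3) for d in donor_counts]

accuracy = cross_val_accuracy(donor_counts, labels, fit_threshold, predict_threshold, k=10)
print(f"mean 10-fold accuracy on toy data: {accuracy:.2f}")
```

Because every site is held out exactly once, the averaged accuracy reflects performance on sites not seen during training. With real data, stratifying the folds by class would be a reasonable refinement when the two classes are imbalanced.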