Small molecules play a critical role in modulating biological systems.
Knowledge of chemical–protein interactions helps address fundamental
and practical questions in biology and medicine. However, with the
rapid emergence of newly sequenced genes, the endogenous or surrogate
ligands of a vast number of proteins remain unknown. Homology modeling
and machine learning are two major approaches for assigning new ligands
to a protein, but both mostly fail when sequence homology between an
unannotated protein and proteins with known functions or structures is low. In this
study, we develop a new deep learning framework to predict chemical
binding to evolutionarily divergent unannotated proteins, whose ligands
cannot be reliably predicted by existing methods. By incorporating
evolutionary information into self-supervised learning on unlabeled
protein sequences, we develop a novel method, distilled sequence alignment
embedding (DISAE), for protein sequence representation. DISAE
can utilize all protein sequences and their multiple sequence alignment
(MSA) to capture functional relationships between proteins without
knowledge of their structure or function. Following DISAE
pretraining, we devise a module-based fine-tuning strategy for the
supervised learning of chemical–protein interactions. In the
benchmark studies, DISAE significantly improves the generalizability
of machine learning models and outperforms state-of-the-art methods
by a large margin. Comprehensive ablation studies suggest that the
use of MSA, sequence distillation, and triplet pretraining critically
contributes to the success of DISAE. The interpretability analysis
of DISAE suggests that it learns biologically meaningful information.
We further use DISAE to assign ligands to human orphan G-protein-coupled
receptors (GPCRs) and to cluster the human GPCRome by integrating
their phylogenetic and ligand relationships. The promising results
of DISAE open an avenue for exploring the chemical landscape of entire
sequenced genomes.