Activity prediction plays an essential role in drug discovery by
directing the search for drug candidates in the relevant chemical space.
Despite being applied successfully to image recognition and semantic
similarity, the Siamese neural network has rarely been explored in
drug discovery, where modelling faces challenges such as insufficient
data and class imbalance. Here, we present a Siamese recurrent neural
network model (SiameseCHEM) based on bidirectional long short-term
memory architecture with a self-attention mechanism, which can automatically
learn discriminative features from the SMILES representations of small
molecules. Subsequently, it is used to categorize bioactivity of small
molecules via N-shot learning. Trained on random
SMILES strings, it proves robust across five different datasets for
the task of binary or categorical classification of bioactivity. Benchmarked
against two baseline machine learning models that use the chemistry-rich
ECFP fingerprints as input, the deep learning model outperforms them
on three datasets and achieves comparable performance on the other
two. The failure of both baseline methods on SMILES strings suggests
that the deep learning model may learn task-specific chemistry features
encoded in SMILES strings.
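
The N-shot classification step can be illustrated with a minimal sketch: a query molecule is compared with N labelled support molecules per class, and it is assigned to the class with the highest mean similarity. This is an illustration under stated assumptions, not the SiameseCHEM implementation. The function names (`embed`, `cosine`, `n_shot_predict`) and the example molecules are hypothetical, and `embed` is a toy character-bigram count that stands in for the learned bidirectional LSTM encoder with self-attention.

```python
import math
from collections import Counter


def embed(smiles: str) -> Counter:
    """Toy stand-in for the learned encoder (assumption): character-bigram counts."""
    padded = f"^{smiles}$"  # mark start and end of the string
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def n_shot_predict(query: str, support: dict[str, list[str]]) -> str:
    """Return the class whose N support molecules are most similar to the query on average."""
    q = embed(query)
    scores = {
        label: sum(cosine(q, embed(s)) for s in examples) / len(examples)
        for label, examples in support.items()
    }
    return max(scores, key=scores.get)


# Hypothetical 2-shot support set: two labelled SMILES strings per class.
support = {
    "active": ["c1ccccc1O", "c1ccccc1N"],
    "inactive": ["CCCCCC", "CCCCO"],
}
print(n_shot_predict("c1ccccc1C", support))
```

In a Siamese setting, both inputs of each pair pass through the same encoder, so replacing `embed` with a trained network leaves the comparison logic unchanged.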