2022
DOI: 10.48550/arxiv.2201.12498
Preprint
Investigating Why Contrastive Learning Benefits Robustness Against Label Noise

Abstract: Self-supervised contrastive learning has recently been shown to be very effective in preventing deep networks from overfitting noisy labels. Despite its empirical success, the theoretical understanding of how contrastive learning boosts robustness is very limited. In this work, we rigorously prove that the representation matrix learned by contrastive learning boosts robustness by having: (i) one prominent singular value corresponding to every sub-class in the data, and remaining singular values that are significantly smaller…
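The abstract's claim can be checked empirically by looking at the singular value spectrum of the matrix of learned representations. The following is a minimal sketch (not code from the paper), assuming a trained encoder and a data loader are available; the names `encoder` and `loader` are hypothetical placeholders.

```python
# Sketch: inspect the singular value spectrum of a representation matrix.
# Per the abstract's claim, a contrastive-learned encoder should yield one
# prominent singular value per sub-class, with the rest much smaller.
import numpy as np
import torch

@torch.no_grad()
def singular_spectrum(encoder, loader, device="cpu"):
    """Stack encoder outputs into a representation matrix and return its
    singular values in descending order."""
    encoder.eval().to(device)
    feats = []
    for x, _ in loader:                       # labels are not needed here
        z = encoder(x.to(device))             # (batch, d) representations
        feats.append(torch.nn.functional.normalize(z, dim=1).cpu())
    R = torch.cat(feats).numpy()              # (n_samples, d) representation matrix
    return np.linalg.svd(R, compute_uv=False)

# Usage (hypothetical): a few dominant singular values followed by a sharp
# drop would indicate the low-rank structure described in the abstract.
# svals = singular_spectrum(encoder, loader)
# print(svals[:10] / svals[0])
```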

Cited by 1 publication (1 citation statement)
References 24 publications (50 reference statements)
“…These pretrained, generalizable encoders have become a popular molecular design tool in recent years. However, these models may operate on different chemical representations with no clear optimal choice. Contrastive learning approaches are able to integrate several data modalities, can boost robustness on downstream tasks, and have been shown to be successful in multiple fields. We explore a scheme that uses contrastive learning of multiple molecular modalities, and our experiments show that this strategy leads to broadly applicable and robust representations. More generally, we seek a generative foundation model of small molecules that decouples conditional generation from fine-tuning of the foundation model and provides a path forward for future multimodal representation learning advances.…”
Section: Introduction (mentioning)
confidence: 98%
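The citing work describes using contrastive learning to integrate several molecular modalities. A common way to do this is a symmetric InfoNCE-style objective that aligns paired embeddings from two modality encoders; the sketch below is illustrative only and is not the cited paper's implementation, with all names hypothetical.

```python
# Sketch: CLIP-style symmetric contrastive loss between two molecular
# modalities (e.g. a SMILES encoder and a graph encoder).
import torch
import torch.nn.functional as F

def multimodal_contrastive_loss(z_a, z_b, temperature=0.07):
    """z_a, z_b: (batch, d) embeddings of the same molecules produced by two
    different modality encoders; matching rows are positive pairs."""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature            # (batch, batch) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Symmetric cross-entropy: modality A -> B and modality B -> A.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```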