Differentiable sampling of molecular geometries with uncertainty-based adversarial attacks

Schwalbe-Koda, Daniel; Tan, Aik Rui; Gómez‐Bombarelli, Rafael

doi:10.1038/s41467-021-25342-8

Cited by 57 publications

(61 citation statements)

References 72 publications

(100 reference statements)


“…We use the SchNet [59], PaiNN [36], Allegro [10], and SpookyNet [37] models. Model implementations are from the NeuralForceField repository [34,60,61] and the Allegro repository [10]. Model sizes (w in Equation 6) were varied between 16, 64, and 256, while the number of layers/convolutions (d in Equation 6) was chosen to be 2, 3, or 4.…”

Section: Methods (mentioning)

confidence: 99%


Neural Scaling of Deep Chemical Models

Frey

Soklaski

Axelrod

et al. 2022

Preprint


Massive scale, both in terms of data availability and computation, enables significant breakthroughs in key application areas of deep learning such as natural language processing (NLP) and computer vision. There is emerging evidence that scale may be a key ingredient in scientific deep learning, but the importance of physical priors in scientific domains makes the strategies and benefits of scaling uncertain. Here, we investigate neural scaling behavior in large chemical models by varying model and dataset sizes over many orders of magnitude, studying models with over one billion parameters, pre-trained on datasets of up to ten million datapoints. We consider large language models for generative chemistry and graph neural networks for machine-learned interatomic potentials. To enable large-scale scientific deep learning studies under resource constraints, we develop the Training Performance Estimation (TPE) framework to reduce the costs of scalable hyperparameter optimization by up to 90%. Using this framework, we discover empirical neural scaling relations for deep chemical models and investigate the interplay between physical priors and scale. Potential applications of large, pre-trained models for "prompt engineering" and unsupervised representation learning of molecules are shown.


“…where α E and α F are coefficients that determine the relative weighting of energy and force predictions during training [34]. For scaling experiments we use the L1 loss or mean absolute error,…”

Section: Main (mentioning)

confidence: 99%
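
The excerpt above names the weights α_E and α_F and the L1 loss but does not show the full equation. As a minimal sketch, assuming the loss is a weighted sum of the mean absolute errors on energies and forces, it could look like the following. The function name, argument names, and default weights are illustrative and do not come from the cited paper.

```python
import numpy as np


def energy_force_l1_loss(e_pred, e_true, f_pred, f_true, alpha_e=1.0, alpha_f=1.0):
    """Weighted sum of energy and force mean absolute errors (assumed form)."""
    e_pred = np.asarray(e_pred, dtype=float)
    e_true = np.asarray(e_true, dtype=float)
    f_pred = np.asarray(f_pred, dtype=float)
    f_true = np.asarray(f_true, dtype=float)

    # L1 loss (mean absolute error) on per-structure energies.
    energy_mae = np.mean(np.abs(e_pred - e_true))
    # L1 loss on force components, averaged over all atoms and directions.
    force_mae = np.mean(np.abs(f_pred - f_true))

    # alpha_e and alpha_f set the relative weighting of the two terms.
    return alpha_e * energy_mae + alpha_f * force_mae


# Two structures (energies) and one atom with three force components.
loss = energy_force_l1_loss(
    e_pred=[1.0, 2.0], e_true=[1.5, 2.5],
    f_pred=[[0.0, 0.0, 1.0]], f_true=[[0.0, 0.0, 0.0]],
    alpha_e=2.0, alpha_f=3.0,
)
```

Raising alpha_f relative to alpha_e makes the fit prioritize forces over energies.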


“…To bypass these forward simulations, we developed an inverse sampling strategy that chooses the most informative geometries to annotate with ground-truth calculations.[40] The approach is based on adversarial attacks, a concept developed in ML for image classification.[41] By computing the gradient of the error with respect to the input and performing gradient ascent to modify the input, one generates a new image with maximal model error.…”

Section: Differentiable Uncertainty For Active Learning (mentioning)

confidence: 99%
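
The gradient-ascent idea in the statement above can be illustrated with a toy sketch. It rests on several assumptions that are not in the source: the "model" is an ensemble of harmonic energy functions, the ensemble variance stands in for the model error, and the gradient is taken by central finite differences rather than the automatic differentiation a differentiable model would provide. All names and values are hypothetical.

```python
import numpy as np


def ensemble_variance(x, stiffness):
    """Toy uncertainty: variance of harmonic energies 0.5*k*|x|^2 over the ensemble."""
    energies = 0.5 * stiffness * np.dot(x, x)
    return np.var(energies)


def numerical_gradient(fn, x, eps=1e-5):
    """Central finite-difference gradient of a scalar function fn at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (fn(x + step) - fn(x - step)) / (2.0 * eps)
    return grad


def adversarial_attack(x0, stiffness, lr=0.05, n_steps=20):
    """Gradient ascent on a displacement delta that increases the uncertainty."""
    delta = np.zeros_like(x0)
    for _ in range(n_steps):
        grad = numerical_gradient(
            lambda d: ensemble_variance(x0 + d, stiffness), delta
        )
        delta = delta + lr * grad  # ascent step: move toward higher uncertainty
    return x0 + delta


stiffness = np.array([0.9, 1.0, 1.1])  # hypothetical ensemble members
x0 = np.array([0.5, -0.2, 0.1])        # hypothetical starting coordinates
x_adv = adversarial_attack(x0, stiffness)
```

In this toy setting the perturbed input x_adv has higher ensemble variance than x0, which is what would make it a candidate for ground-truth annotation in an active-learning loop.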

“…Particularly for high-dimensional systems, this computational overhead might negate some of the benefits provided by the NNPs. To bypass these forward simulations, we developed an inverse sampling strategy that chooses the most informative geometries to annotate with ground-truth calculations. The approach is based on adversarial attacks, a concept developed in ML for image classification.…”

Section: Enhanced Atomistic Simulation (mentioning)

confidence: 99%

Learning Matter: Materials Design with Machine Learning and Atomistic Simulations

et al. 2022

Self Cite


CONSPECTUS: Designing new materials is vital for addressing pressing societal challenges in health, energy, and sustainability. The combination of physicochemical laws and empirical trial and error has long guided material design, but this approach is limited by the cost of experiments and the difficulty of deriving complex guiding principles. The space of hypothetical materials to be considered is incredibly large, and only a small fraction of possible compounds can ever be tested experimentally. The computational techniques of atomistic simulation and machine learning (ML) offer an avenue to rapidly invent new materials and navigate this enormous space. Together, they can be used to infer complex design principles and identify high-quality candidates more rapidly than trial-and-error experimentation. In this Account, we review our group's recent contributions to simulation and ML for materials design. We begin by discussing the numerical representation of materials for use in ML. Representations can be produced through deterministic algorithms, learnable encodings, or physics-based methods and lead to vector, graph, and matrix outputs. We describe how these different approaches offer distinct material- and application-specific advantages. We provide demonstrations from our own work on small-molecule drugs, macromolecules, dyes, electrolytes, and zeolites. In several cases, we show how the appropriate representation led to guiding principles that facilitated experimental materials design. Next, we highlight the development of ML methods for enhancing atomistic simulation. These advances help to improve simulation accuracy and expand the time and length scales that can be explored. They include differentiable atomistic simulations in which ensemble-averaged quantities are differentiated with respect to system parameters, and novel autoregressive methods for enhanced sampling of challenging physical distributions.
Other developments include learnable coarse-grained models, which can accelerate molecular dynamics while minimizing the loss of all-atom information, and ML interatomic potentials, which can be trained on maximally informative quantum chemistry data through active learning and adversarial uncertainty attacks. Next, we show how these combined computational advances have enabled high-throughput virtual screening. This has led to the discovery of low-cost organic structure-directing agents for zeolite synthesis, polymer electrolytes, and efficient photoswitches for targeted medicine. We conclude by discussing the limitations of ML and simulation. These include the large data requirements and limited chemical transferability of the former and the speed−accuracy trade-offs of the latter. We predict that advancements in quantum chemistry will further accelerate simulations, while the incorporation of physical principles will improve the reliability of ML.


“…Assessing and benchmarking the robustness of ML or DL approaches by a series of adversarial attacks are popular in the image classification domain [20], but there are others that are closer to the domain of molecular data. In [21], the authors provide a series of realistic adversarial attacks to benchmark methods that predict chemical properties from atomistic simulations, e.g., molecular conformation, reactions, and phase transitions. Even closer to the subject of our paper - protein sequences - the authors of [22] show that methods, such as AlphaFold [23] and RoseTTAFold [24] which employ deep neural networks to predict protein conformation are not robust: producing drastically different protein structures as a result of very small biologically meaningful perturbations in the protein sequence.…”

Section: Related Work (mentioning)

confidence: 99%

Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification

Ali

Sahoo

Zelikovskiy

et al. 2022

Preprint


The rapid spread of the COVID-19 pandemic has resulted in an unprecedented amount of sequence data of the SARS-CoV-2 genome - millions of sequences and counting. This amount of data, while being orders of magnitude beyond the capacity of traditional approaches to understanding the diversity, dynamics, and evolution of viruses, is nonetheless a rich resource for machine learning (ML) approaches as alternatives for extracting such important information from these data. It is hence of utmost importance to design a framework for testing and benchmarking the robustness of these ML models. This paper makes the first effort (to our knowledge) to benchmark the robustness of ML models by simulating biological sequences with errors. In this paper, we introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show from experiments on a wide array of ML models that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods to certain adversarial attacks to the input sequences. Our benchmarking framework may assist researchers in properly assessing different ML models and help them understand the behavior of the SARS-CoV-2 virus or avoid possible future pandemics.

