Infrared (IR) spectroscopy remains an important tool for chemical characterization and identification. Chemprop-IR is a software package for predicting IR spectra with machine learning. This work serves a dual purpose: providing an easy-to-use, general-purpose trained model for IR spectral prediction, and providing the Chemprop-IR software framework for training new models. In Chemprop-IR, molecules are encoded using a directed message passing neural network, allowing latent molecular representations to be learned and optimized for the task of spectral prediction. Model training incorporates spectral metrics and normalization techniques that outperform standard regression-model practice for spectral prediction. The model makes use of pretraining on quantum chemistry calculations and ensembling of multiple submodels to improve generalizability and performance. The resulting spectral predictions are of high quality, capturing the extreme diversity of spectral forms across chemical space and representing complex peak structures.
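The "spectral metrics and normalization techniques" mentioned above can be illustrated with a minimal sketch. One metric commonly used for comparing spectra is the spectral information divergence (SID), a symmetric Kullback–Leibler divergence computed on spectra normalized to sum to one. The snippet below is an illustrative implementation, not Chemprop-IR's actual code; the function names and the epsilon floor are assumptions for this sketch.

```python
import numpy as np

def normalize_spectrum(s, eps=1e-8):
    """Floor intensities at eps and normalize so they sum to 1."""
    s = np.clip(np.asarray(s, dtype=float), eps, None)
    return s / s.sum()

def sid(pred, target, eps=1e-8):
    """Spectral information divergence: symmetric KL between normalized spectra."""
    p = normalize_spectrum(pred, eps)
    q = normalize_spectrum(target, eps)
    return float(np.sum(p * np.log(p / q) + q * np.log(q / p)))

a = np.array([0.1, 0.5, 0.3, 0.1])
b = np.array([0.4, 0.2, 0.2, 0.2])
print(sid(a, a))  # identical spectra give zero divergence
print(sid(a, b))  # positive for differing spectra
```

Because both spectra are normalized before comparison, SID is insensitive to an overall intensity scale, which is one reason divergence-style metrics suit spectra better than plain mean-squared error.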
Modular oxacyclophanes featuring m-terphenyl units scaffold inter-π-system interactions in face-to-face stacked or orthogonal orientations, leading to distinct photophysical properties.
Characterizing uncertainty in machine learning models has recently gained interest in the context of machine learning reliability, robustness, safety, and active learning. Here, we separate the total uncertainty into contributions from noise in the data (aleatoric) and shortcomings of the model (epistemic), further dividing epistemic uncertainty into model bias and variance contributions. We systematically address the influence of noise, model bias, and model variance in the context of chemical property predictions, where the diverse nature of target properties and the vastness of chemical space give rise to many distinct sources of prediction error. We demonstrate that different sources of error can each be significant in different contexts and must be individually addressed during model development. Through controlled experiments on datasets of molecular properties, we show important trends in model performance associated with the level of noise in the dataset, size of the dataset, model architecture, molecule representation, ensemble size, and dataset splitting. In particular, we show that 1) noise in the test set can limit a model's observed performance when the actual performance is much better, 2) using size-extensive model aggregation structures is crucial for extensive property prediction, 3) ensembling is a reliable tool for uncertainty quantification and improvement, specifically for the contribution of model variance, and 4) evaluations of cross-validation models understate their performance. We develop general guidelines on how to improve an underperforming model in each of these uncertainty contexts.
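The ensemble-based separation of model variance from data noise described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; it assumes each ensemble member predicts both a mean and a noise variance (a mean-variance-estimation setup, one common choice), and all array names are hypothetical.

```python
import numpy as np

def decompose_uncertainty(member_means, member_vars):
    """Split predictive uncertainty from an ensemble of regressors.

    member_means : (n_models, n_samples) per-member predicted means
    member_vars  : (n_models, n_samples) per-member predicted noise variances

    The spread of means across members estimates model variance (epistemic);
    the average predicted noise variance estimates data noise (aleatoric).
    """
    epistemic = member_means.var(axis=0)
    aleatoric = member_vars.mean(axis=0)
    return epistemic, aleatoric

# toy ensemble: 5 members, 4 test inputs
rng = np.random.default_rng(0)
means = rng.normal(size=(5, 4))
noise_vars = np.abs(rng.normal(size=(5, 4)))
epistemic, aleatoric = decompose_uncertainty(means, noise_vars)
total = epistemic + aleatoric  # common estimate of total predictive variance
```

When all members agree exactly, the epistemic term vanishes, which matches the abstract's point that ensembling specifically targets the model-variance contribution rather than noise or bias.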
Deep learning has become a powerful and frequently employed tool for the prediction of molecular properties, creating a need for open-source, versatile software that can be operated by non-experts. Among current approaches, directed message-passing neural networks (D-MPNNs) have proven to perform well on a variety of property prediction tasks. The software package Chemprop implements the D-MPNN architecture and offers simple, fast access to machine-learned molecular properties. Relative to its initial version, we present a multitude of new Chemprop functionalities, such as support for multi-molecule properties, reactions, atom/bond-level properties, and spectra. Further, we incorporate various uncertainty quantification and calibration methods along with related metrics, as well as pretraining and transfer learning workflows, improved hyperparameter optimization, and other customization options concerning loss functions or atom/bond features. We benchmark D-MPNN models trained using Chemprop with the new reaction, atom-level, and spectra functionality on a variety of property prediction datasets, including MoleculeNet and SAMPL, and observe state-of-the-art performance on the prediction of water-octanol partition coefficients, reaction barrier heights, atomic partial charges, and absorption spectra. Chemprop enables out-of-the-box training of D-MPNN models for a variety of problem settings in fast, user-friendly, open-source software.
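The defining feature of the D-MPNN architecture is that messages are passed along directed bonds rather than atoms, so a bond's update aggregates incoming messages at its source atom while excluding the message flowing back along its own reverse bond. The numpy sketch below is illustrative only; the array names, ReLU choice, and single shared weight matrix are assumptions for this sketch, not Chemprop's actual code.

```python
import numpy as np

def dmpnn_update(h0, h, src, dst, rev, W, n_atoms):
    """One directed message-passing step.

    h0, h   : (n_bonds, d) initial and current directed-bond hidden states
    src/dst : source/destination atom index of each directed bond
    rev     : rev[e] is the index of bond e's reverse bond
    W       : (d, d) shared message-passing weight matrix
    """
    d = h.shape[1]
    # total message arriving at each atom (scatter-add over destinations)
    atom_in = np.zeros((n_atoms, d))
    np.add.at(atom_in, dst, h)
    # messages into each bond's source atom, excluding its own reverse bond
    m = atom_in[src] - h[rev]
    return np.maximum(h0 + m @ W, 0.0)  # ReLU activation

# toy 3-atom chain 0-1-2 with directed bonds 0->1, 1->0, 1->2, 2->1
src = np.array([0, 1, 1, 2])
dst = np.array([1, 0, 2, 1])
rev = np.array([1, 0, 3, 2])
h = np.abs(np.random.default_rng(1).normal(size=(4, 8)))
h_new = dmpnn_update(np.zeros_like(h), h, src, dst, rev, np.eye(8), n_atoms=3)
# bond 1->2 receives only the message from bond 0->1, never its own reverse
```

Excluding the reverse bond prevents the "message echo" that atom-centered message passing suffers from, which is the usual motivation given for the directed-bond formulation.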