Machine learning (ML) models can potentially accelerate the discovery of tailored materials by learning a function that maps chemical compounds to their respective target properties. In this realm, a crucial step is encoding the molecular systems for the ML model, in which the molecular representation plays a key role. Most representations are based on atomic coordinates (structure); however, this can increase the computational cost of ML training and prediction. Herein, we investigate the impact of choosing coordinate-free descriptors based on the Simplified Molecular Input Line Entry System (SMILES) representation, which can substantially reduce the computational cost of ML predictions. To this end, we evaluate the prediction performance of a feed-forward neural network (FNN) model over five feature selection methods and nine ground-state properties (including energetic, electronic, and thermodynamic properties) from a public data set composed of ∼130k organic molecules. Our best results reached a mean absolute error of ∼0.05 eV, close to chemical accuracy, for the atomization energies (internal energy at 0 K, internal energy at 298.15 K, enthalpy at 298.15 K, and free energy at 298.15 K). Moreover, for the atomization energies, the out-of-sample error was nine times lower than that of the same FNN model trained with the Coulomb matrix, a traditional coordinate-based descriptor. Furthermore, our results showed how the model's accuracy is limited by employing such a low-computational-cost representation, which carries less information about the molecular structure than most state-of-the-art methods.
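To make the comparison with chemical accuracy concrete, the short sketch below computes a mean absolute error and expresses it in units of 1 kcal/mol (about 0.0434 eV), the value conventionally taken as chemical accuracy. The function names and the toy numbers are illustrative assumptions and are not taken from the paper.

```python
# Chemical accuracy is conventionally taken as 1 kcal/mol.
# 1 kcal/mol = 4.184 kJ/mol and 1 eV = 96.485 kJ/mol, so 1 kcal/mol ~ 0.0433641 eV.
CHEMICAL_ACCURACY_EV = 0.0433641


def mean_absolute_error(predicted, reference):
    """Mean absolute error between two equally long sequences of energies (eV)."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


def in_units_of_chemical_accuracy(mae_ev):
    """Express an MAE given in eV as a multiple of 1 kcal/mol."""
    return mae_ev / CHEMICAL_ACCURACY_EV


# Toy values (hypothetical, for illustration only).
toy_mae = mean_absolute_error([1.0, 2.0], [1.1, 1.9])

# The MAE of ~0.05 eV quoted in the abstract, relative to chemical accuracy.
reported_ratio = in_units_of_chemical_accuracy(0.05)
```

Under this convention, an MAE of 0.05 eV is slightly above 1 kcal/mol, which matches the abstract's wording of "close to" chemical accuracy.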
Machine learning as a tool for chemical space exploration broadens
horizons to work with known and unknown molecules. At its core lies
molecular representation, an essential key to improve learning about
structure–property relationships. Recently, contrastive frameworks
have been showing impressive results for representation learning in
diverse domains. Therefore, this paper proposes a contrastive framework
that embraces multimodal molecular data. Specifically, our approach
jointly trains a graph encoder and an encoder for the simplified molecular-input
line-entry system (SMILES) string to perform the contrastive learning
objective. Since SMILES is the basis of our method, i.e., we build
the molecular graph from the SMILES string, we call our framework SMILES
Contrastive Learning (SMICLR). When stacking a nonlinear regressor
on the SMICLR’s pretrained encoder and fine-tuning the entire
model, we reduced the prediction error by, on average, 44% and 25%
for the energetic and electronic properties of the QM9 data set, respectively,
over the supervised baseline. We further improved our framework’s
performance by applying data augmentations to each molecular-input
representation. Moreover, SMICLR demonstrated competitive representation
learning results in an unsupervised setting.
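As a rough illustration of a contrastive objective over two views of the same molecule, the sketch below implements the NT-Xent loss popularized by SimCLR in plain Python. The abstract does not state which loss SMICLR uses, so the choice of NT-Xent, the temperature value, and the names `graph_emb` and `smiles_emb` are assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(graph_emb, smiles_emb, temperature=0.1):
    """NT-Xent loss for a batch of paired embeddings.

    graph_emb[i] and smiles_emb[i] are assumed to be embeddings of the same
    molecule produced by two encoders (hypothetical inputs). Each embedding
    is pulled toward its partner and pushed away from all other embeddings.
    """
    n = len(graph_emb)
    z = list(graph_emb) + list(smiles_emb)  # 2n embeddings in one list
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of the positive partner of i
        # Similarities to every other embedding (the positive is included).
        logits = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        positive = cosine(z[i], z[j]) / temperature
        # Numerically stable log-sum-exp of the logits.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - positive
    return total / (2 * n)


# Toy batch: two molecules with 2-dimensional embeddings.
aligned = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
misaligned = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is near zero when the two encoders agree on each molecule and much larger when the pairs are swapped, which is the behavior a contrastive pretraining objective is meant to reward.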
Most machine learning applications on quantum chemistry (QC) data sets rely on a single statistical error parameter, such as the mean square error (MSE), to evaluate their performance. However, this approach has limitations and can even yield incorrect interpretations. Here, we report a systematic investigation of the two components of the MSE, i.e., the bias and the variance, using the QM9 data set. To this end, we experiment with three descriptors, namely (i) symmetry functions (SF, with two-body and three-body functions), (ii) many-body tensor representation (MBTR, with two- and three-body terms), and (iii) smooth overlap of atomic positions (SOAP), to evaluate the prediction performance for different numbers of molecules in the training samples and the effect of bias and variance on the final MSE. Overall, small sample sizes are associated with higher MSE. Moreover, the bias component strongly influences the larger MSEs. Furthermore, there is little agreement among molecules with higher errors (outliers) across different descriptors. However, there is a high prevalence among the outliers intersection set and the convex hull volume of geometric coordinates (VCH). Based on the distribution of the MSE (and its bias and variance components) and the appearance of outliers, we suggest using ensembles of low-bias models to minimize the MSE, particularly when the training set contains a small number of molecules. Article pubs.acs.org/jcim
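The split of the MSE into bias and variance can be sketched as follows. For one molecule with reference value y and predictions from several models, the MSE over the models equals the squared bias (squared distance between the mean prediction and y) plus the variance of the predictions around their mean. The abstract does not say how the models were obtained, so the sketch assumes, hypothetically, predictions from models trained on different training samples, and it treats the reference values as noise-free.

```python
from statistics import fmean


def bias_variance(predictions, targets):
    """Per-molecule decomposition MSE = bias^2 + variance.

    predictions: list with one entry per model, each a list of predicted
                 values (one per molecule). Hypothetical input layout.
    targets:     list of reference values, one per molecule.
    """
    results = []
    for i, y in enumerate(targets):
        preds_i = [model_preds[i] for model_preds in predictions]
        mean_pred = fmean(preds_i)
        bias_sq = (mean_pred - y) ** 2
        # Population variance of the predictions around their mean.
        variance = fmean([(p - mean_pred) ** 2 for p in preds_i])
        mse = fmean([(p - y) ** 2 for p in preds_i])
        results.append({"bias_sq": bias_sq, "variance": variance, "mse": mse})
    return results


# Toy example: two models, two molecules.
toy = bias_variance(predictions=[[1.0, 2.0], [3.0, 2.0]], targets=[1.0, 2.0])
```

For the first toy molecule the two models predict 1.0 and 3.0 against a reference of 1.0, so the squared bias and the variance are both 1.0 and the MSE is 2.0. Averaging several low-bias models reduces the variance term, which is consistent with the recommendation to use ensembles.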
Ionic liquids have attracted the attention of researchers as possible
electrolytes for electrochemical energy storage devices. However,
their properties, such as the electrochemical stability window (ESW),
ionic conductivity, and diffusivity, are influenced both by the chemical
structures of cations and anions and by their combinations. Most studies
in the literature focus on the understanding of common ionic liquids,
and little effort has been made to find ways to improve our atomistic
understanding of those systems. The goal of this paper is to explore
the structural characteristics of cations and anions that form ionic
liquids that can expand the HOMO/LUMO gap, a property directly linked
to the ESW of the electrolyte. For that, we design a framework for
randomly generating new ions by combining their fragments. Within
this framework, we generate about 10⁴ cations and 10⁴ anions and fully optimize their structures using density
functional theory. Our calculations show that aromatic cations form
less stable ionic liquids than aliphatic ones, a result expected
from chemical reasoning. More importantly, we can improve the
gap by adding electron-donating and electron-withdrawing functional
groups to the cations and anions, respectively. The increase can be
about 2 V, depending on the case. This improvement is reflected in
a wider ESW.
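The idea of randomly generating ions by combining fragments can be sketched as below. The abstract does not describe the fragment libraries or the joining rules, so everything here is a hypothetical stand-in: the fragment lists, the use of `*` as an attachment point in a SMILES-like string, and the function names are assumptions for illustration only.

```python
import random

# Hypothetical fragment libraries (not from the paper).
# "*" marks the single attachment point on each core.
CATION_CORES = ["C[N+](C)(C)*", "C[P+](C)(C)*"]
CATION_GROUPS = ["C", "CC", "COC"]
ANION_CORES = ["[O-]S(=O)(=O)*", "[O-]C(=O)*"]
ANION_GROUPS = ["C(F)(F)F", "C#N", "C"]


def combine(core, group):
    """Attach a substituent string at the core's '*' placeholder."""
    return core.replace("*", group)


def generate_ions(cores, groups, n, seed=0):
    """Draw n random core/substituent combinations (duplicates may occur)."""
    rng = random.Random(seed)  # local generator, so runs are reproducible
    return [combine(rng.choice(cores), rng.choice(groups)) for _ in range(n)]


cations = generate_ions(CATION_CORES, CATION_GROUPS, 5, seed=1)
anions = generate_ions(ANION_CORES, ANION_GROUPS, 5, seed=1)
```

In a workflow like the one described, each generated string would then be converted to a 3D structure and optimized with density functional theory before its HOMO/LUMO gap is evaluated; those steps are outside the scope of this sketch.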