Machine learning (ML) algorithms were explored for the classification of the UV–Vis absorption spectra of organic molecules based on molecular descriptors and fingerprints generated from 2D chemical structures. Training and test data (~75k molecules and associated UV–Vis data) were assembled from a database with lists of experimental absorption maxima. Molecules were labeled as positive (related to photoreactive potential) if an absorption maximum is reported in the range between 290 and 700 nm (UV/Vis) with a molar extinction coefficient (MEC) above 1000 L mol−1 cm−1, and as negative if no such peak is in the list. Random forests were selected from among several algorithms. The models were validated with two external test sets comprising 998 organic molecules, obtaining a global accuracy of up to 0.89, a sensitivity of 0.90, and a specificity of 0.88. The ML output (UV–Vis spectrum class) was explored as a predictor of the 3T3 NRU phototoxicity in vitro assay for a set of 43 molecules. Comparable results were observed with the classification based directly on experimental UV–Vis data in the same format.
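The labeling rule and the reported validation metrics can be illustrated with a minimal Python sketch. The thresholds (290 to 700 nm, MEC above 1000 L mol−1 cm−1) come from the abstract; the function names, the peak representation as (wavelength, MEC) pairs, and the inclusive treatment of the wavelength bounds are assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, Iterable, Sequence, Tuple

# Thresholds from the abstract; all names below are hypothetical.
WAVELENGTH_MIN_NM = 290.0
WAVELENGTH_MAX_NM = 700.0
MEC_THRESHOLD = 1000.0  # L mol^-1 cm^-1


def label_spectrum(peaks: Iterable[Tuple[float, float]]) -> int:
    """Return 1 (positive) if any (wavelength_nm, mec) peak lies within
    290-700 nm with MEC above the threshold, else 0 (negative).
    Inclusive wavelength bounds are an assumption."""
    return int(any(
        WAVELENGTH_MIN_NM <= wavelength <= WAVELENGTH_MAX_NM and mec > MEC_THRESHOLD
        for wavelength, mec in peaks
    ))


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, sensitivity (true positive rate), and
    specificity (true negative rate) for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num: int, den: int) -> float:
        # Undefined ratios (empty class) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }
```

For example, a molecule with peaks at 250 nm (MEC 5000) and 320 nm (MEC 1500) would be labeled positive under this rule because of the second peak, while a molecule whose only peak in range has an MEC of 800 would be labeled negative.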
In this study, machine learning algorithms were investigated for the classification of organic molecules with one carbon chiral center according to the sign of optical rotation. Diverse heterogeneous data sets comprising up to 13,080 compounds and their corresponding optical rotation were retrieved from Reaxys and processed independently for three solvents: dichloromethane, chloroform, and methanol. The molecular structures were represented by chiral descriptors based on the physicochemical and topological properties of ligands attached to the chiral center. The sign of optical rotation was predicted by random forests (RF) and artificial neural networks for independent test sets with an accuracy of up to 75% for dichloromethane, 82% for chloroform, and 82% for methanol. RF probabilities and the availability of structures in the training set with the same spheres of atom types around the chiral center defined applicability domains in which the accuracy is higher.
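One way to picture a probability-based applicability domain is to keep only predictions whose class probability is sufficiently far from 0.5 and to report accuracy on that subset together with its coverage. The sketch below assumes a binary encoding (1 for positive rotation, 0 for negative) and a hypothetical confidence cutoff of 0.7; neither value nor the function name comes from the abstract, and the second criterion mentioned there (matching spheres of atom types around the chiral center) is not covered.

```python
from typing import Optional, Sequence, Tuple


def in_domain_accuracy(
    probs: Sequence[float],
    y_true: Sequence[int],
    min_confidence: float = 0.7,  # hypothetical cutoff, not from the source
) -> Tuple[Optional[float], float]:
    """Return (accuracy, coverage) for predictions inside the domain.

    probs: predicted probability of the positive class per molecule.
    A prediction is inside the domain when max(p, 1 - p) >= min_confidence.
    Accuracy is None when no prediction passes the cutoff.
    """
    kept = [
        (p, t) for p, t in zip(probs, y_true)
        if max(p, 1.0 - p) >= min_confidence
    ]
    if not kept:
        return None, 0.0
    # Predicted class is positive when p >= 0.5.
    correct = sum(1 for p, t in kept if (p >= 0.5) == (t == 1))
    return correct / len(kept), len(kept) / len(probs)
```

Raising `min_confidence` typically trades coverage for accuracy, which matches the abstract's observation that accuracy is higher inside the applicability domain.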