Machine learning algorithms were explored for the fast estimation of HOMO and LUMO orbital energies calculated by DFT B3LYP, using molecular descriptors based exclusively on connectivity. The project involved the retrieval and generation of molecular structures, quantum chemical calculations for a database of >111 000 structures, development of new molecular descriptors, and training and validation of machine learning models. Several machine learning algorithms were screened, and an applicability domain was defined based on Euclidean distances to the training set. Random forest models predicted an external test set of 9989 compounds with mean absolute errors (MAE) as low as 0.15 and 0.16 eV for the HOMO and LUMO orbitals, respectively. The impact of the quantum chemical calculation protocol was assessed with a subset of compounds. Inclusion of the orbital energy calculated by PM7 as an additional descriptor significantly improved the quality of the estimations (reducing the MAE by >30%).
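As an illustration of the two quantities named above, the following minimal sketch computes a distance-based applicability domain and an MAE. It is not the authors' protocol: the nearest-neighbour criterion, the `threshold` parameter, and the function names are assumptions made for the example, and the abstract does not say how the Euclidean distances were aggregated or what cutoff was used.

```python
import math

def nearest_training_distance(train, query):
    """Euclidean distance from each query vector to its nearest training vector.

    Assumption: the nearest neighbour defines the distance to the training set;
    the source only states that Euclidean distances were used.
    """
    return [min(math.dist(q, t) for t in train) for q in query]

def in_domain(train, query, threshold):
    """Flag query vectors whose nearest-training distance is within `threshold`.

    `threshold` is a hypothetical cutoff chosen by the user.
    """
    return [d <= threshold for d in nearest_training_distance(train, query)]

def mean_absolute_error(y_true, y_pred):
    """Mean absolute error between reference and predicted values (e.g. in eV)."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

# Toy descriptor vectors, not real molecular descriptors.
train = [(0.0, 0.0), (1.0, 1.0)]
query = [(0.0, 0.5), (5.0, 5.0)]
flags = in_domain(train, query, threshold=1.0)
```

Here the first query vector lies 0.5 from its nearest training vector and is flagged as inside the domain, while the second lies far from both training vectors and is flagged as outside.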
A highly discriminating topological index, EAID, has been developed in our laboratory. A systematic search for degeneracy was performed on over 14 million structures, and no duplicates occurred. These structures comprised over 3.8 million alkane trees with 1-22 carbon atoms; over 0.38 million structures containing heteroatoms; over 4 million benzenoids with 1-13 benzene rings; and over 5.9 million compounds from three databases of real compounds. However, in a search of over 20 million alkane trees with 23 and 24 carbon atoms, five and 13 duplicates occurred, respectively, and for over 20 million compounds from the ZINC database, 10 duplicates occurred. To increase the discriminating power of the index, EAID has been extended, and the resulting index is termed 2-EAID. All of the over 55 million structures mentioned above were uniquely identified by 2-EAID, except for two duplicates that occurred in the ZINC database. EAID and 2-EAID are the most highly discriminating indices examined to date. Thus, the two indices possess not only theoretical significance but also potential applications. For example, they could be used as a supplementary reference to CAS Registry Numbers for structure documentation.
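A degeneracy search of this kind can be sketched as grouping distinct structures by index value and reporting every group with more than one member. The sketch below does not implement EAID or 2-EAID, whose definitions are not given here. It uses a deliberately weak stand-in index (the sorted degree sequence of the carbon skeleton), and the function names and edge-list input format are assumptions made for the example.

```python
from collections import Counter, defaultdict

def degree_sequence_index(edges):
    """Toy stand-in index: sorted vertex degrees of a hydrogen-suppressed graph.

    This is NOT EAID; it is chosen because it is known to be degenerate.
    """
    degrees = Counter()
    for a, b in edges:
        degrees[a] += 1
        degrees[b] += 1
    return tuple(sorted(degrees.values()))

def find_degenerate_groups(structures, index_fn):
    """Return groups of structure names that share the same index value.

    Assumes the input structures are pairwise non-isomorphic.
    """
    groups = defaultdict(list)
    for name, edges in structures.items():
        groups[index_fn(edges)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]

# Carbon skeletons of three hexane isomers as edge lists.
structures = {
    "n-hexane": [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)],
    "2-methylpentane": [(1, 2), (2, 3), (3, 4), (4, 5), (2, 6)],
    "3-methylpentane": [(1, 2), (2, 3), (3, 4), (4, 5), (3, 6)],
}
duplicates = find_degenerate_groups(structures, degree_sequence_index)
```

With this toy index, 2-methylpentane and 3-methylpentane collide because both skeletons have the degree sequence (1, 1, 1, 2, 2, 3). A highly discriminating index is one for which such groups are rare or absent.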