2019
DOI: 10.1186/s13321-019-0391-2

Dataset’s chemical diversity limits the generalizability of machine learning predictions

Abstract: The QM9 dataset has become the golden standard for Machine Learning (ML) predictions of various chemical properties. QM9 is based on the GDB, which is a combinatorial exploration of the chemical space. ML molecular predictions have been recently published with an accuracy on par with Density Functional Theory calculations. Such ML models need to be tested and generalized on real data. PC9, a new QM9 equivalent dataset (only H, C, N, O and F and up to 9 “heavy” atoms) of the PubChemQC project is presented in th…

Cited by 94 publications (145 citation statements)
References 53 publications (57 reference statements)
“…Creating a chemical space that reflects the similarity in the physical properties as well as molecular structures will drastically improve the efficiency of molecular design. And a wide diversity of molecular property in the dataset is important to generalize prediction [2]. In this paper, we introduce a model to embed molecules into a latent space using deep learning to reproduce the distance between the properties of chemicals based on their molecular structures.…”
Section: Figure (mentioning)
confidence: 99%
“…In a previous article we studied the QM9 and PC9 datasets that together encompass more than 200k different molecular calculations with up to 9 heavy atoms of C, N, O and F types [31]. Initially computed with different methods, we relaunched them using our BOINC collaborative computing project, called QuChemPedIA@home, in order to have a homogeneous and clean dataset.…”
Section: Results (mentioning)
confidence: 99%
“…One could hope for a fast evaluation of quantum mechanics properties thanks to machine learning predictions to limit the cost of computation. However, we have demonstrated that the currently available datasets of molecular quantum chemistry results, like QM9 and PC9, are not diverse enough to train a general predictor [31]. It is clear that solving this issue in the future would significantly accelerate the generation of molecules with such objectives.…”
Section: Results (mentioning)
confidence: 99%
“…including ANI-1 [14, 15], PC-9 [69, 70], and ISO-17 [71–73], and more are continuously added. Datasets are ingested into a common format of structured metadata and are available for download in text format, or in structured HDF5 format.…”
Section: Machine Learning Datasets Web App (mentioning)
confidence: 99%