2020
DOI: 10.26434/chemrxiv.11879193
Preprint

Is Domain Knowledge Necessary for Machine Learning Materials Properties?

Abstract: New methods for describing materials as vectors in order to predict their properties using machine learning are common in the field of materials informatics. However, little is known about the comparative efficacy of these methods. This work sets out to make clear which featurization methods should be used across various circumstances. Our findings include, surprisingly, that simple one-hot encoding of elements can be as effective as traditional and new descriptors when using large amounts of data. H…


Cited by 6 publications (8 citation statements)
References 0 publications
“…Band gap, formation energy, shear modulus, bulk modulus, Debye temperature, thermal expansion, and thermal conductivity data were then collected from the ICSD catalogue of the AFLOW database [16]. Duplicate entries were removed, and each material property's formulae and ground-truth values were randomly partitioned into training, validation, and test sets (the full code is available in the GitHub repository [22]). Note that, for this work, the associated Crystallographic Information Files (CIF) were discarded.…”
Section: Data Acquisition
confidence: 99%
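The partitioning step in the statement above can be sketched as follows. This is a minimal illustration under assumed inputs, not the authors' actual code from repository [22]; the toy formulae, target values, and split fractions are hypothetical.

```python
# Hypothetical sketch: deduplicate (formula, value) records, then randomly
# partition them into training, validation, and test sets.
# All data and fractions here are illustrative assumptions.
import random

def split_data(records, val_frac=0.1, test_frac=0.1, seed=42):
    """Remove duplicate formulae, shuffle, and split into train/val/test."""
    seen, unique = set(), []
    for formula, value in records:
        if formula not in seen:           # keep first occurrence only
            seen.add(formula)
            unique.append((formula, value))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = int(len(unique) * test_frac)
    n_val = int(len(unique) * val_frac)
    test = unique[:n_test]
    val = unique[n_test:n_test + n_val]
    train = unique[n_test + n_val:]
    return train, val, test

# Toy dataset of (formula, property value) pairs -- purely illustrative.
data = [("SiO2", 8.9), ("GaN", 3.4), ("NaCl", 5.0), ("TiO2", 3.0),
        ("ZnO", 3.3), ("MgO", 7.8), ("AlN", 6.2), ("CdS", 2.4),
        ("GaAs", 1.4), ("InP", 1.3)]
train, val, test = split_data(data)
```

Splitting on formulae (rather than on structure files) matches the statement's note that the CIFs were discarded: only composition and target value enter the pipeline.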
“…The model was then tested on these withheld formulae. The code for these methods is also available on GitHub [22].…”
Section: Model Training
confidence: 99%
“…The preprocessing step of featurizing data is crucial for successful implementation of machine learning algorithms. Improper featurization of data can impact prediction and classification errors [30].…”
Section: The Featurization and Curation of AM Data
confidence: 99%
“…In general, a good ML project should do one or more of the following: screen or downselect candidate materials from a pool of known compounds for a given application or property, [1][2][3] acquire and process data to gain new insights, 4,5 conceptualize new modeling approaches, [6][7][8][9][10] or explore ML in materials-specific applications. 1,[11][12][13] Consider these points when you judge the applicability of ML for your project.…”
Section: Meaningful Machine Learning
confidence: 99%
“…For sufficiently large datasets and for more "capable" learning architectures like very deep, fully-connected networks 7,122 or novel attention-based architectures such as CrabNet, 6 feature engineering and the integration of domain knowledge (such as through the use of CBFVs) in the input data becomes irrelevant and does not contribute to better model performance compared to a simple one-hot encoding. 11 Therefore, due to the effort required to curate and evaluate domain knowledge-informed features specific to your research, you may find it more beneficial to seek out additional sources of data, use already-established featurization schemes, or use learning methods that don't require domain-derived features 6 instead.…”
Section: Choosing Appropriate Models and Features
confidence: 99%
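The one-hot element encoding that the statement above contrasts with CBFVs can be sketched in a few lines. This is an illustrative assumption of how such an encoding works, not code from the paper; the abbreviated element vocabulary and the simple formula parser (which handles only unnested formulas like "SiO2") are simplifications.

```python
# Minimal sketch of fractional one-hot element encoding for a chemical
# formula: each position in the vector corresponds to one element, and the
# value is that element's fraction of the composition.
# The element list and parser are simplified illustrative assumptions.
import re

# Abbreviated vocabulary for illustration; a real encoder would cover
# the full periodic table.
ELEMENTS = ["H", "C", "N", "O", "Na", "Mg", "Al", "Si", "Cl", "Ti",
            "Zn", "Ga", "As"]

def parse_formula(formula):
    """Return {element: count} for simple formulas like 'SiO2' or 'GaAs'."""
    counts = {}
    for el, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if el:
            counts[el] = counts.get(el, 0) + (int(num) if num else 1)
    return counts

def one_hot(formula):
    """Fractional one-hot vector over the fixed element vocabulary."""
    counts = parse_formula(formula)
    total = sum(counts.values())
    return [counts.get(el, 0) / total for el in ELEMENTS]

vec = one_hot("SiO2")  # Si -> 1/3, O -> 2/3, all other entries 0
```

Note that this representation carries no chemical knowledge at all (no electronegativity, radii, or valence features, as CBFVs do), which is exactly why its competitiveness at large data sizes is the surprising result highlighted in the abstract.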