This Editorial is intended for materials scientists interested in performing machine learning-centered research. We cover broad guidelines and best practices regarding the obtaining and treatment of data, feature engineering, model training, validation, evaluation and comparison, popular repositories for materials data and benchmarking datasets, model and architecture sharing, and finally publication. In addition, we include interactive Jupyter notebooks with example Python code to demonstrate some of the concepts, workflows, and best practices discussed. Overall, the data-driven methods and machine learning workflows and considerations are presented in a simple way, allowing interested readers to more intelligently guide their machine learning research using the suggested references, best practices, and their own materials domain expertise.
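As a taste of what the accompanying notebooks cover, the sketch below shows one of the core practices discussed: splitting data into train, validation, and test sets, and touching the test set only once, at the very end. The data, model choice, and split sizes are illustrative placeholders, not taken from the Editorial's notebooks.

```python
# Minimal sketch of a leakage-free ML evaluation workflow (placeholder data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))             # placeholder feature matrix
y = X[:, 0] * 2.0 + rng.normal(size=500)   # placeholder property values

# Split once into train/validation/test; the test set is reserved for the end.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print("validation MAE:", mean_absolute_error(y_val, model.predict(X_val)))
print("test MAE:      ", mean_absolute_error(y_test, model.predict(X_test)))
```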
Predicting crystal structure has long been a challenging problem in the physical sciences. Recently, computational methods have been developed that predict crystal structure with some success, but they have been limited in scope and by long computation times. In this paper, we review computational methods, such as density functional theory and machine learning, used to predict crystal structure. We also explore the trade-off between breadth and accuracy when building a machine learning model to predict across all crystal structures. We extracted 24 913 unique chemical formulas reported between 290 and 310 K from the Pearson Crystal Database. Among these 24 913 formulas, there exist 10 711 unique crystal structures, referred to as entry prototypes. Common entry prototypes may have hundreds of chemical compositions, while the vast majority are represented by fewer than ten unique compositions. To include all data in our predictions, entry prototypes that lacked a minimum number of representatives were relabeled as "Other". By setting the minimum number to 150, 100, 70, 40, 20, and 10, we explored how limiting class sizes affected performance. For each minimum number used to reorganize the data, we report the classification performance metrics accuracy, precision, and recall. Accuracy ranged from 97 ± 2 to 85 ± 2%, average precision ranged from 86 ± 2 to 79 ± 2%, and average recall ranged from 73 ± 2 to 54 ± 2% for minimum-class representatives from 150 to 10, respectively.
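A minimal sketch of the relabeling-and-scoring procedure described above, using synthetic placeholder data in place of the Pearson Crystal Database entries (the label names, feature matrix, and threshold below are illustrative only):

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

rng = np.random.default_rng(0)
# Placeholder stand-ins for composition features and prototype labels.
prototypes = rng.choice(["NaCl-type", "CsCl-type", "ZnS-type", "rare-A", "rare-B"],
                        size=400, p=[0.4, 0.3, 0.2, 0.05, 0.05]).tolist()
X = rng.normal(size=(400, 8))

def relabel_rare(labels, min_count):
    """Relabel any prototype with fewer than min_count examples as 'Other'."""
    counts = Counter(labels)
    return ["Other" if counts[label] < min_count else label for label in labels]

y = relabel_rare(prototypes, min_count=70)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
pred = clf.predict(X_test)
print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred, average="macro", zero_division=0))
print("recall   :", recall_score(y_test, pred, average="macro", zero_division=0))
```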
In this paper, we demonstrate an application of the Transformer self-attention mechanism in the context of materials science. Our network, the Compositionally Restricted Attention-Based network (CrabNet), explores structure-agnostic materials property prediction when only a chemical formula is provided. Our results show that CrabNet's performance matches or exceeds current best-practice methods on nearly all of 28 total benchmark datasets. We also demonstrate how CrabNet's architecture lends itself to model interpretability by showing different visualization approaches made possible by its design. We feel confident that CrabNet and its attention-based framework will be of keen interest to future materials informatics researchers.
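To make the idea concrete, here is a minimal sketch of composition-only self-attention in the spirit of CrabNet, not the authors' implementation: each element in a formula becomes a token, attention lets the element tokens exchange information, and a pooled representation is regressed onto the property. All layer sizes and names below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CompositionAttention(nn.Module):
    def __init__(self, n_elements=103, d_model=64, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(n_elements, d_model)  # one vector per element
        self.frac_proj = nn.Linear(1, d_model)          # encode fractional amount
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, 1)               # property regressor

    def forward(self, elem_idx, fractions):
        # elem_idx: (batch, n_tokens) atomic numbers; fractions: (batch, n_tokens)
        x = self.embed(elem_idx) + self.frac_proj(fractions.unsqueeze(-1))
        x, _ = self.attn(x, x, x)        # element tokens attend to one another
        return self.head(x.mean(dim=1))  # average-pool tokens, predict property

# e.g. SiO2 as tokens (Si = 14, O = 8) with fractions (1/3, 2/3)
model = CompositionAttention()
pred = model(torch.tensor([[14, 8]]), torch.tensor([[1 / 3, 2 / 3]]))
```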
New methods for describing materials as vectors, in order to predict their properties using machine learning, are common in the field of materials informatics. However, little is known about the comparative efficacy of these methods. This work sets out to make clear which featurization methods should be used across various circumstances. Our findings include, surprisingly, that a simple one-hot encoding of elements can be as effective as traditional and newer descriptors when large amounts of data are available. However, we show that in the absence of large datasets, or when the data are not fully representative, domain knowledge offers advantages in predictive ability.
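For concreteness, a minimal sketch of the one-hot (fractional) composition encoding compared in this work; the element table is truncated and the formula parser is deliberately simplistic:

```python
import re

# Truncated element table for brevity; a real featurizer would cover the
# full periodic table.
ELEMENTS = ["H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne"]

def one_hot_composition(formula):
    """Fixed-length vector with one slot per element, holding that element's
    fractional amount in the formula (simplified parser, illustrative only)."""
    pairs = [(el, float(n) if n else 1.0)
             for el, n in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula)]
    total = sum(n for _, n in pairs)
    vec = [0.0] * len(ELEMENTS)
    for el, n in pairs:
        vec[ELEMENTS.index(el)] = n / total  # fractional occupancy
    return vec

print(one_hot_composition("H2O"))  # H slot = 2/3, O slot = 1/3, rest zero
```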
Many thermodynamic calculations and engineering applications require the temperature-dependent heat capacity (Cp) of a material to be known a priori. First-principles calculations of heat capacities can stand in place of experimental information, but these calculations are computationally costly and time-consuming. Here, we report on our creation of a high-throughput supervised machine learning-based tool to predict temperature-dependent heat capacity. We demonstrate that material heat capacity can be correlated with a number of elemental and atomic properties. The machine learning method predicts heat capacity for thousands of compounds in seconds, suggesting facile implementation into integrated computational materials engineering (ICME) processes. In this context, we consider its use to replace Neumann-Kopp predictions as a high-throughput screening tool to help identify new candidate materials for engineering processes. Also promising are its enhanced speed and performance compared with cation/anion contribution methods at elevated temperatures, as well as its ability to improve future predictions as more data become available. The machine learning method requires only a chemical formula as input when calculating heat capacity and can be completely automated. This is an improvement over common best-practice methods, such as cation/anion contributions or mixed-oxide approaches, which are limited to specific classes of materials and require case-by-case consideration.
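For context, a minimal sketch of the Neumann-Kopp baseline that the machine learning tool is positioned to replace: the compound's molar heat capacity is approximated as the stoichiometric sum of its elements' molar heat capacities. The elemental values below are rough room-temperature figures for illustration, not a vetted reference table.

```python
# Assumed room-temperature molar heat capacities (J mol^-1 K^-1), illustrative only.
ELEMENT_CP = {"Fe": 25.1, "O": 14.7, "Al": 24.2}

def neumann_kopp_cp(composition):
    """Neumann-Kopp estimate: stoichiometric sum of elemental heat capacities.
    composition: dict mapping element -> moles per formula unit."""
    return sum(n * ELEMENT_CP[el] for el, n in composition.items())

print(neumann_kopp_cp({"Fe": 2, "O": 3}))  # rough estimate for Fe2O3
```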
Batteries are a critical component of modern society. The growing demand for new battery materials, coupled with historically long materials development times, highlights the need for advances in battery materials development. Progress in understanding battery systems has been frustratingly slow for the materials science community; in particular, the discovery of more abundant battery materials has proven difficult. In this paper, we describe how machine learning tools can be exploited to predict the properties of battery materials, and we report the challenges associated with a data-driven investigation of battery systems. Using a dataset of cathode materials and various statistical models, we predicted the specific discharge capacity at 25 cycles. We discuss the present limitations of this approach and propose a paradigm shift in the materials research process that would better allow data-driven approaches to excel in aiding the discovery of battery materials.
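A sketch of the kind of statistical-modeling step described above, with random placeholder data standing in for the cathode dataset (the descriptors, model, and units are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))                                      # descriptor matrix (placeholder)
y = 120 + X @ rng.normal(size=12) + rng.normal(scale=5, size=200)   # capacity in mAh/g (placeholder)

# Cross-validated R^2 gives an honest view of out-of-sample predictive ability.
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print("R^2 per fold:", np.round(scores, 3))
```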
In this paper, we evaluate an attention-based neural network architecture for predicting inorganic materials properties given access to nothing but each material's chemical composition. We demonstrate that this novel application of self-attention for material property prediction strikingly outperforms both statistical and ensemble machine learning methods, as well as a fully-connected neural network. This Compositionally-Restricted Attention-Based network, referred to as CrabNet, achieves improved test metrics on six of seven tested materials properties from the AFLOW database. Moreover, we show that CrabNet, despite having access to no chemical information beyond the composition, outperforms the other methods even when the statistical and ensemble learning techniques are given domain-specific chemical knowledge about the materials. Given its impressive improvement in predictive accuracy compared to previous methods, as well as its minimal hardware requirements for training and prediction, we feel confident that CrabNet, and the ideas explored within, will be central to future materials informatics research.
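The head-to-head comparisons reported here follow a familiar pattern: several model families evaluated on an identical held-out split. The sketch below illustrates that pattern with placeholder data and generic scikit-learn models, not the paper's AFLOW benchmarks or CrabNet itself:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))                                       # placeholder descriptors
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300) # placeholder property
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit each candidate model on the same training data and score it on the
# same held-out test set, so the MAE values are directly comparable.
for name, model in [("ridge", Ridge()),
                    ("gradient boosting", GradientBoostingRegressor(random_state=0)),
                    ("fully-connected net", MLPRegressor(hidden_layer_sizes=(64, 64),
                                                         max_iter=2000, random_state=0))]:
    mae = mean_absolute_error(y_te, model.fit(X_tr, y_tr).predict(X_te))
    print(f"{name:>20s}  MAE = {mae:.3f}")
```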