A very active area of materials research is devising methods that use machine learning to automatically extract predictive models from existing materials data. While prior examples have demonstrated successful models for some applications, many more applications exist where machine learning can make a strong impact. To enable faster development of machine-learning-based models for such applications, we have created a framework that can be applied to a broad range of materials data. Our method works by using a chemically diverse list of attributes, which we demonstrate are suitable for describing a wide variety of properties, together with a novel method for partitioning the data set into groups of similar materials in order to boost predictive accuracy. In this manuscript, we demonstrate how this new method can be used to predict diverse properties of crystalline and amorphous materials, such as band gap energy and glass-forming ability.
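The partitioning idea above (split the data into groups of similar materials, then fit a separate model per group) can be sketched in plain Python. The group labels, target values, and the per-group "model" (a simple group mean) below are illustrative assumptions, not the framework's actual algorithm:

```python
# Minimal sketch of the group-then-fit idea: partition training samples
# by a grouping key (here, a made-up material-class label), fit one
# simple model (the group mean) per partition, and fall back to a
# global model for unseen groups. Data are illustrative only.
from collections import defaultdict

def fit_grouped_means(samples):
    """samples: list of (group_label, target_value) pairs."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for group, y in samples:
        sums[group] += y
        counts[group] += 1
    return {g: sums[g] / counts[g] for g in sums}

def predict(models, group, fallback):
    """Use the matching group's model, or the fallback for unseen groups."""
    return models.get(group, fallback)

train = [("oxide", 3.0), ("oxide", 3.4), ("metal", 0.0), ("metal", 0.2)]
models = fit_grouped_means(train)
global_mean = sum(y for _, y in train) / len(train)
print(predict(models, "oxide", global_mean))   # prediction from the oxide group
print(predict(models, "halide", global_mean))  # unseen group: global fallback
```

Real partitioned models would replace the group mean with a full regressor per group; the structure of the dispatch stays the same.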
As materials data sets grow in size and scope, the role of data mining and statistical learning methods to analyze these materials data sets and build predictive models is becoming more important. This manuscript introduces matminer, an open-source, Python-based software platform to facilitate data-driven methods of analyzing and predicting materials properties. Matminer provides modules for retrieving large data sets from external databases such as the Materials Project, Citrination, Materials Data Facility, and Materials Platform for Data Science. It also provides implementations for an extensive library of feature extraction routines developed by the materials community, with 44 featurization classes that can generate thousands of individual descriptors and combine them into mathematical functions. Finally, matminer provides a visualization module for producing interactive, shareable plots. These functions are designed in a way that integrates closely with machine learning and data analysis packages already developed and in use by the Python data science community. We explain the structure and logic of matminer, provide a description of its various modules, and showcase several examples of how matminer can be used to collect data, reproduce data mining studies reported in the literature, and test new methodologies.
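The core pattern a featurization library like this standardizes is a class that maps one input object to a fixed list of named descriptors. A minimal pure-Python sketch of that pattern follows; the class name, property table, and method names here are illustrative assumptions, not matminer's actual API:

```python
# Illustrative featurizer in the style a featurization library
# standardizes: one input (a composition given as element -> amount)
# becomes a fixed, labeled descriptor vector. The property table holds
# rounded Pauling electronegativities; this is a sketch, not real
# matminer code.
ELECTRONEGATIVITY = {"Fe": 1.83, "O": 3.44, "Si": 1.90}

class MeanElectronegativity:
    def feature_labels(self):
        """Names for each descriptor this featurizer produces."""
        return ["mean electronegativity"]

    def featurize(self, composition):
        """composition: dict mapping element symbol -> atomic amount."""
        total = sum(composition.values())
        mean = sum(ELECTRONEGATIVITY[el] * amt
                   for el, amt in composition.items()) / total
        return [mean]

feat = MeanElectronegativity()
print(feat.feature_labels())
print(feat.featurize({"Fe": 2, "O": 3}))  # composition-weighted mean for Fe2O3
```

Because every featurizer exposes the same interface, descriptor vectors from many featurizers can be concatenated into one design matrix for any downstream ML library.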
While high-throughput Density Functional Theory (DFT) has become a prevalent tool for materials discovery, it is limited by its relatively large computational cost. In this paper, we explore using DFT data from high-throughput calculations to create faster, surrogate models with machine learning (ML) that can be used to guide new searches. Our method works by using decision tree models to map DFT-calculated formation enthalpies to a set of attributes consisting of two distinct types: (i) composition-dependent attributes of elemental properties (as have been used in previous ML models of DFT formation energies), combined with (ii) attributes derived from the Voronoi tessellation of the compound's crystal structure. ML models created using this method have half the cross-validation error of, and similar training and evaluation speeds to, models created with the Coulomb matrix and Pair Radial Distribution Function (PRDF) methods. For a dataset of 435,000 formation energies taken from the Open Quantum Materials Database (OQMD), our model achieves a mean absolute error (MAE) of 80 meV/atom in cross-validation, which is lower than the approximate error between DFT-computed and experimentally-measured formation enthalpies and below 15% of the mean absolute deviation of the training set. We also demonstrate that our method can accurately estimate the formation energy of materials outside of the training set and be used to identify materials with especially large formation enthalpies. We propose that our models can be used to accelerate the discovery of new materials by identifying the most promising materials to study with DFT at little additional computational cost.
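The surrogate-model idea (map attributes to DFT-calculated values with a decision tree) can be illustrated at its smallest scale with a depth-1 regression tree, i.e., a single split on one scalar attribute. The data and the stump fitter below are a sketch under that simplification; the paper's models use many attributes and full tree ensembles:

```python
# Sketch of a depth-1 regression tree ("stump"): choose the split
# threshold on a scalar attribute that minimizes squared error, and
# predict the mean of each side. Data are made up for illustration.
def fit_stump(xs, ys):
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not left or not right:
            continue  # skip degenerate splits
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((y - lm) ** 2 for y in left)
               + sum((y - rm) ** 2 for y in right))
        if best is None or err < best[0]:
            best = (err, threshold, lm, rm)
    return best[1:]  # (threshold, left_mean, right_mean)

def predict_stump(model, x):
    threshold, lm, rm = model
    return lm if x <= threshold else rm

# Toy attribute (x) vs. target (y); a real model would use hundreds of
# composition- and Voronoi-derived attributes.
model = fit_stump([1, 2, 3, 4], [0.0, 0.0, 1.0, 1.0])
print(model)
```

Tree ensembles (random forests, gradient boosting) are built by combining many such splits, which is why they train and evaluate quickly relative to recomputing DFT.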
Conventional machine learning approaches for predicting material properties from elemental compositions have emphasized the importance of leveraging domain knowledge when designing model inputs. Here, we demonstrate that by using a deep learning approach, we can bypass such manual feature engineering requiring domain knowledge and achieve much better results, even with only a few thousand training samples. We present the design and implementation of a deep neural network model referred to as ElemNet; it automatically captures the physical and chemical interactions and similarities between different elements, which allows it to predict materials properties with better accuracy and speed. The speed and best-in-class accuracy of ElemNet enable us to perform a fast and robust screening for new material candidates in a huge combinatorial space, where we predict hundreds of thousands of chemical systems that could contain yet-undiscovered compounds.
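The key contrast with hand-engineered attributes is the input encoding: a raw, fixed-length vector of elemental fractions, from which the network learns its own representations. A sketch of that encoding follows; the (truncated) element ordering is an illustrative assumption, since the actual model uses a vector spanning the full set of elements in its training data:

```python
# ElemNet-style input encoding sketch: a composition becomes a
# fixed-length vector of elemental fractions, with no hand-crafted
# attributes. The element list here is truncated for illustration.
ELEMENTS = ["H", "C", "N", "O", "Fe", "Si"]

def fraction_vector(composition):
    """composition: dict mapping element symbol -> atomic amount."""
    total = sum(composition.values())
    return [composition.get(el, 0.0) / total for el in ELEMENTS]

print(fraction_vector({"Fe": 2, "O": 3}))  # -> [0.0, 0.0, 0.0, 0.6, 0.4, 0.0]
```

Because the vector length is fixed regardless of the compound, the same network can score any composition in a combinatorial search space with one forward pass each.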
Coupling artificial intelligence with high-throughput experimentation accelerates discovery of amorphous alloys.
Traditional machine learning (ML) metrics overestimate model performance for materials discovery. We introduce (1) leave-one-cluster-out cross-validation (LOCO CV) and (2) a simple nearest-neighbor benchmark to show that model performance in discovery applications strongly depends on the problem, data sampling, and extrapolation. Our results suggest that ML-guided iterative experimentation may outperform standard high-throughput screening for discovering breakthrough materials like high-Tc superconductors with ML.

Materials informatics (MI), or the application of data-driven algorithms to materials problems, has grown quickly as a field in recent years.9 Across all of these applications, a training database of simulated or experimentally measured materials properties serves as input to an ML algorithm that predictively maps features (i.e., materials descriptors) to target materials properties. Ideally, the result of training such models would be the experimental realization of new materials with promising properties. The MI community has produced several such success stories, including thermoelectric compounds,10,11 shape-memory alloys,12 superalloys,13 and 3d-printable high-strength aluminum alloys.14 However, in many cases, a model is itself the output of a study, and the question becomes: to what extent could the model be used to drive materials discovery?

Typically, the performance of ML models of materials properties is quantified via cross-validation (CV). CV can be performed either as a single division of the available data into a training set (to build the model) and a test set (to evaluate its performance), or as an ensemble process known as k-fold CV, wherein the data are partitioned into k nonoverlapping subsets of nearly equal size (folds) and model performance is averaged across each combination of k-1 training folds and one test fold. Leave-one-out cross-validation (LOOCV) is the limit where k is the number of total examples in the dataset.
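LOCO CV replaces the random folds described above with folds defined by clusters of similar materials, so each test fold probes extrapolation to an unseen family. A minimal sketch of the split generator, with illustrative cluster labels (the paper clusters materials algorithmically), is:

```python
# Sketch of leave-one-cluster-out CV: instead of random k-fold splits,
# hold out one whole cluster of similar materials at a time, so the
# model is always tested on a family it never saw during training.
# Cluster labels below are illustrative.
def loco_splits(cluster_labels):
    """Yield (train_indices, test_indices) with one cluster held out."""
    for held_out in sorted(set(cluster_labels)):
        test = [i for i, c in enumerate(cluster_labels) if c == held_out]
        train = [i for i, c in enumerate(cluster_labels) if c != held_out]
        yield train, test

labels = ["A", "A", "B", "B", "B", "C"]
splits = list(loco_splits(labels))
for train, test in splits:
    print(train, test)
```

Random k-fold CV would scatter each cluster across both sides of every split, which is exactly the interpolation-only evaluation the authors argue overestimates discovery performance.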
Table 1 summarizes some examples of model performance statistics as reported in the aforementioned studies (some studies involved testing multiple algorithms across multiple properties). In Table 1, the reported model performance is uniformly excellent across all studies. A tempting conclusion is that any of these models could be used for one-shot high-throughput screening of large numbers of materials for desired properties. However, as we discuss below, traditional CV has critical shortcomings in terms of quantifying ML model performance for materials discovery.

Issues with traditional cross-validation for materials discovery

Many ML benchmark problems consist of data classification into discrete bins, i.e., pattern matching. For example, the ...

Design, System, Application

Machine learning (ML) has become a widely-adopted predictive tool for materials design and discovery. Random k-fold cross-validation (CV), the traditional gold-standard approach for evaluating the quality of ML models, is fundamentally mismatched to the nature of materials discovery, and leads to ...
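The nearest-neighbor benchmark this work proposes amounts to predicting each test target as the target of its closest training point; a model that cannot beat this memorization baseline has learned little beyond pattern matching. A minimal 1-NN sketch, with made-up feature vectors and targets, is:

```python
# Sketch of a 1-nearest-neighbor baseline: predict a query's target as
# the target of the closest training point in feature space (squared
# Euclidean distance). Data below are illustrative only.
def nn_predict(train_x, train_y, query):
    dists = [sum((a - b) ** 2 for a, b in zip(x, query)) for x in train_x]
    return train_y[dists.index(min(dists))]

train_x = [(0.0, 0.0), (1.0, 1.0)]
train_y = [10.0, 20.0]
print(nn_predict(train_x, train_y, (0.2, 0.1)))  # closest to the first point
```

Under random k-fold CV, dense sampling of each material family means the nearest neighbor is usually a near-duplicate, which is one mechanism behind the uniformly excellent numbers in Table 1.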