We estimate a statistical model to predict the superconducting critical temperature based on the features extracted from the superconductor's chemical formula. The statistical model gives reasonable out-of-sample predictions: ±9.5 K based on root-mean-squared-error. Features extracted based on thermal conductivity, atomic radius, valence, electron affinity, and atomic mass contribute the most to the model's predictive accuracy. It is crucial to note that our model does not predict whether a material is a superconductor or not; it only gives predictions for superconductors.

Out-of-Sample RMSE Estimation Procedure:

1. At random, divide the data into 2/3 train data and 1/3 test data.
2. Fit the model using the train data.
3. Predict $T_c$ of the test data.
4. Obtain an estimate of the out-of-sample mean-squared-error (mse) by using the predictions from the last step and the observed $T_c$ values in the test data: out-of-sample mse = Average of (observed - predicted)$^2$.
5. Repeat steps 1 through 4 a total of 25 times to collect 25 out-of-sample mse's.
6. Take the mean of the 25 collected out-of-sample mse's and report the square root of this average as the final estimate of the out-of-sample rmse.
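The six-step procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the names `estimate_oos_rmse`, `fit_ols`, and `predict_ols` are hypothetical, the data are synthetic, and an ordinary least-squares fit stands in for whichever model is being evaluated.

```python
import numpy as np


def estimate_oos_rmse(X, y, fit, predict, n_repeats=25, test_fraction=1 / 3, seed=0):
    """Repeated random-split estimate of the out-of-sample rmse.

    `fit(X_train, y_train)` returns a model object and
    `predict(model, X_test)` returns predictions (both are placeholders
    supplied by the caller, not part of the original paper).
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = int(round(test_fraction * n))
    mses = []
    for _ in range(n_repeats):
        # Step 1: random 2/3 train, 1/3 test split.
        perm = rng.permutation(n)
        test_idx, train_idx = perm[:n_test], perm[n_test:]
        # Step 2: fit on the train data.
        model = fit(X[train_idx], y[train_idx])
        # Step 3: predict the test data.
        pred = predict(model, X[test_idx])
        # Step 4: out-of-sample mse for this split.
        mses.append(np.mean((y[test_idx] - pred) ** 2))
    # Steps 5-6: average the collected mse's, then take the square root.
    return float(np.sqrt(np.mean(mses)))


def fit_ols(X, y):
    """Least-squares fit with an intercept (stand-in model, an assumption)."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef


# Synthetic demonstration data (not the superconductor data set).
demo_rng = np.random.default_rng(1)
X_demo = demo_rng.normal(size=(300, 3))
y_demo = 2.0 + X_demo @ np.array([1.0, -2.0, 0.5]) + demo_rng.normal(scale=0.5, size=300)

rmse_demo = estimate_oos_rmse(X_demo, y_demo, fit_ols, predict_ols)
print(f"estimated out-of-sample rmse: {rmse_demo:.3f}")
```

Note that the mse's are averaged before the square root is taken, as in step 6; averaging the 25 per-split rmse values instead would give a slightly different number.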
The Multiple Regression Model

The multiple regression model's out-of-sample rmse estimated by the procedure above is about 17.6 K. The out-of-sample $R^2$ is about 0.74. Figure (8) shows the predicted $T_c$ versus the observed $T_c$ when we use all the data to fit the model. The line has an intercept of zero and a slope of 1. The plot indicates that the multiple regression model under-predicts $T_c$ of high temperature superconductors, since many predicted points are below the line for the high temperature superconductors. The model over-predicts the $T_c$ of low temperature superconductors. The multiple regression model simply serves as a benchmark model and should not be used for prediction. There would be no use in predicting $T_c$ using a sophisticated model such as XGBoost if a commonly used multiple regression model did a good job. Here, the XGBoost model vastly improves the prediction accuracy.
The XGBoost Model

Before we go on, we give a brief description of the XGBoost setup. XGBoost is described in detail in Chen and Guestrin (2016). A readable summary is given at https://xgboost.readthedocs.io/en/latest/model.html. Hastie et al. (2009) and Izenman (2008) give general overviews on boosting as well. The functional form of XGBoost is:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i),$$