In the field of polymer informatics, utilizing machine learning
(ML) techniques to evaluate the glass transition temperature Tg
and other properties of polymers has attracted
extensive attention. This data-centric approach is much more efficient
and practical than laborious experimental measurements when faced with
a daunting number of polymer structures. Various ML models have been shown
to perform well for Tg prediction. Nevertheless,
they are trained on different data sets, using different structure
representations, and based on different feature engineering methods.
Thus, a critical question arises: how to select an ML model
that handles Tg prediction with good
generalization ability. To provide a fair comparison of different
ML techniques and examine the key factors that affect the model performance,
we carry out a systematic benchmark study by compiling 79 different
ML models and training them on a large and diverse data set. The three
major components in setting up an ML model are structure representations,
feature representations, and ML algorithms. In terms of polymer structure
representation, we consider the polymer monomer, repeat unit, and
oligomer with a longer chain structure. Based on the chosen structure, a feature
representation is calculated, such as Morgan fingerprints with or without substructure
frequency, RDKit descriptors, molecular embeddings, or molecular graphs.
Afterward, the obtained feature input is used to train different
ML algorithms, such as deep neural networks, convolutional neural
networks, random forest, support vector machine, LASSO regression,
and Gaussian process regression. We evaluate the performance of these
ML models using a holdout test set and an extra unlabeled data set
from high-throughput molecular dynamics simulation. Particular attention
is paid to each ML model’s generalization ability on the unlabeled data set,
and the model’s sensitivity to the topology and molecular weight
of polymers is also taken into consideration. This benchmark study
provides not only a guideline for the Tg prediction task
but also a useful reference for other polymer informatics
tasks.
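
As an illustration of the distinction drawn above between fingerprints with and without substructure frequency, the following minimal sketch folds a list of substructure identifiers into a fixed-length vector, either as presence/absence bits or as occurrence counts. It is not the featurization code used in this study: the token strings, the `fold` and `fingerprint` helpers, the CRC32 hashing, and the vector length of 64 are all assumptions made for illustration, and a real Morgan fingerprint would derive its identifiers from atom environments (for example with RDKit) rather than from hand-written labels.

```python
import zlib
from collections import Counter


def fold(token: str, n_bits: int) -> int:
    """Map a substructure identifier to a stable index in [0, n_bits)."""
    return zlib.crc32(token.encode("utf-8")) % n_bits


def fingerprint(tokens, n_bits=64, use_counts=False):
    """Fold substructure tokens into a fixed-length feature vector.

    use_counts=False: presence/absence bits (0 or 1).
    use_counts=True:  number of occurrences of each folded substructure.
    """
    counts = Counter(fold(t, n_bits) for t in tokens)
    if use_counts:
        return [counts.get(i, 0) for i in range(n_bits)]
    return [1 if i in counts else 0 for i in range(n_bits)]


# Hypothetical substructure labels for a repeat unit; the "dimer" simply
# repeats them and ignores the new environments a real linkage would create.
repeat_unit = ["C-C", "C-C", "C(=O)O", "aromatic-C"]
dimer = repeat_unit * 2

bits_unit = fingerprint(repeat_unit)
bits_dimer = fingerprint(dimer)
counts_unit = fingerprint(repeat_unit, use_counts=True)
counts_dimer = fingerprint(dimer, use_counts=True)

print("binary vectors identical:", bits_unit == bits_dimer)
print("count vectors identical: ", counts_unit == counts_dimer)
```

In this simplified setting the binary vectors of the repeat unit and the longer chain coincide, whereas the count vectors differ, which suggests why the choice between the two encodings may interact with the choice of structure representation.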