2023
DOI: 10.1039/d2dd00107a

SolvBERT for solvation free energy and solubility prediction: a demonstration of an NLP model for predicting the properties of molecular complexes

Abstract: Deep learning models based on NLP, mainly the Transformer family, have been successfully applied to solve many chemistry-related problems, but their applications are mostly limited to chemical reactions. Meanwhile, solvation...
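The abstract describes a BERT-style NLP model applied to molecular complexes (a solute together with its solvent). As a rough illustration of how such a model could be fine-tuned for property regression, the sketch below passes a tokenized solute–solvent SMILES pair through a generic pre-trained encoder and a small regression head. The encoder, tokenization, hidden size, and the "solute.solvent" pairing are illustrative assumptions, not the published SolvBERT implementation.

```python
# Illustrative sketch only; the encoder passed in is a hypothetical stand-in
# for whichever pre-trained BERT-style SMILES encoder is actually used.
import torch
import torch.nn as nn

class SolvationRegressor(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int = 256):
        super().__init__()
        self.encoder = encoder                  # pre-trained without property labels
        self.head = nn.Sequential(              # regression head added for fine-tuning
            nn.Linear(hidden_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids encode a "solute.solvent" SMILES pair, shape (batch, seq_len)
        hidden = self.encoder(token_ids)        # (batch, seq_len, hidden_dim)
        pooled = hidden[:, 0, :]                # first-token ([CLS]-style) pooling
        return self.head(pooled).squeeze(-1)    # predicted solvation free energy
```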

Cited by 17 publications (26 citation statements)
References 66 publications
“…Since the BERT models have a self-contained unsupervised pre-training stage, 63 it is pre-trained by clustering the molecular structures which will not be affected by the distribution of property data. 8,42 In contrast, as the pre-training phase of the D-MPNN models is supervised, the significant gap between the property distribution of pre-training data and fine-tuning data may have a negative impact on the prediction accuracy for D-MPNN-based models, as we observed that the D-MPNN model without pre-training showed better performance in LUMO than the fully pre-trained PorphyDMPNN. On the other hand, a supervised pre-training may benefit more when the property data in the pre-training set has a similar distribution to the fine-tuning set.…”
Section: Performance of PorphyBERT and PorphyDMPNN (mentioning)
confidence: 69%
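The statement above contrasts BERT's self-contained unsupervised pre-training with the supervised pre-training of D-MPNN models. As a minimal sketch of the unsupervised side, the generic masked-language-model step below reconstructs only masked SMILES tokens, so no property labels (and hence no property distribution) enter this stage; a supervised pre-training step would instead regress directly on property labels. This is an illustrative assumption, not code from either cited model.

```python
# Generic BERT-style masked-token objective (illustrative, not the cited models' code).
import torch

def mask_tokens(token_ids: torch.Tensor, mask_id: int,
                mask_prob: float = 0.15) -> tuple[torch.Tensor, torch.Tensor]:
    """Return (masked inputs, labels) for a masked-language-model training step."""
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape) < mask_prob
    labels[~mask] = -100            # unmasked positions are ignored in the loss
    inputs = token_ids.clone()
    inputs[mask] = mask_id          # replace selected tokens with the [MASK] id
    return inputs, labels

# The pre-training loss is a cross-entropy over the token vocabulary at the masked
# positions only; no solubility or solvation labels are consumed in this stage.
```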
“…This benefit of shared pretraining was also observed in one of our previous studies for the multitask prediction of solubility and solvation free energy. 42 In addition, the unsupervised pre-training of PorphyBERT would benefit from future expansion of the pre-training database. Researchers can add more MpP structures to the pretraining database regardless of the design purpose of the MpP and the availability of property data, as more data typically enhance the ability of BERT-based models in clustering molecular structures.…”
Section: Discussion (mentioning)
confidence: 99%
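The statement refers back to a multitask setup in which solubility and solvation free energy share the same pre-trained encoder. The sketch below shows the general pattern (an assumption, not the published architecture): two task heads attached to one shared encoder, so both fine-tuning targets benefit from the same unsupervised pre-training.

```python
# Illustrative multitask wrapper; the encoder and hidden size are placeholders.
import torch
import torch.nn as nn

class SharedEncoderMultiTask(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int = 256):
        super().__init__()
        self.encoder = encoder                          # shared, pre-trained unsupervised
        self.solvation_head = nn.Linear(hidden_dim, 1)  # solvation free energy target
        self.solubility_head = nn.Linear(hidden_dim, 1) # solubility target

    def forward(self, token_ids: torch.Tensor) -> dict[str, torch.Tensor]:
        pooled = self.encoder(token_ids)[:, 0, :]       # one shared representation
        return {
            "solvation_free_energy": self.solvation_head(pooled).squeeze(-1),
            "solubility": self.solubility_head(pooled).squeeze(-1),
        }

# During fine-tuning the two regression losses are summed, so gradients from
# both tasks update the shared encoder.
```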