2022
DOI: 10.48550/arxiv.2210.16484
Preprint

A Systematic Survey of Molecular Pre-trained Models

Abstract: Obtaining effective molecular representations is at the core of a series of important chemical tasks ranging from property prediction to drug design. So far, deep learning has achieved remarkable success in learning molecular representations through automated, data-driven feature learning. However, training deep neural networks from scratch often requires a large number of labeled molecules, which are expensive to acquire in real-world scenarios. To alleviate this issue, inspired by the success of the …

Cited by 4 publications (6 citation statements) | References 40 publications
“…Considering that 3D geometric information plays a vital role in predicting molecular properties, several recent works (Stärk et al., 2021; Fang et al., 2022a; Zhu et al., 2022) pre-train the GNN encoders on molecular datasets with 3D geometric information. We recommend that readers refer to a recent survey (Xia et al., 2022f) for more relevant literature. Many of the above-mentioned works adopt AttrMask (Hu et al., 2020) as a fundamental pre-training sub-task.…”
Section: Pre-training on Molecules
confidence: 99%
“…Deep learning has been successful in many domains, including computer vision (Zhou et al., 2020; Wang et al., 2022a, 2023), time series analysis (Xie et al., 2022; Meng Liu, 2021; Liu et al., 2022a), bioinformatics (Xia et al., 2022b; Gao et al., 2022), and graph data mining (Wang et al., 2020, 2021b; Zeng et al., 2022, 2023; Wu et al., 2022; Duan et al., 2022; Yang et al., 2022b; Liang et al., 2022b). Among these directions, deep graph clustering, which aims to encode nodes with neural networks and divide them into disjoint clusters, has attracted great attention in recent years.…”
Section: Deep Graph Clustering
confidence: 99%
“…The pretraining dataset primarily consists of unlabeled molecular data from extensive public databases such as ChEMBL, PubChem, and ZINC. Popular pretraining strategies fall under self-supervised learning (SSL), including masked component modeling, context prediction, replaced component detection, and contrastive learning. SSL methods start from the molecular structure itself to reveal inherent patterns. Given the close link between molecular structures and physicochemical properties, these methods play a crucial role in predicting molecular properties. They are frequently employed to establish a versatile pretraining model for various downstream tasks.…”
Section: Introduction
confidence: 99%
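The masked component modeling strategy quoted above can be illustrated with a minimal sketch of the corruption step: a fraction of atom types is replaced by a mask token, and the original types become reconstruction targets. The function name, the `[MASK]` token, and the 15% ratio are illustrative assumptions (BERT-style masking), not details from the surveyed methods.

```python
import random

MASK = "[MASK]"

def mask_atoms(atom_types, ratio=0.15, seed=0):
    """Randomly mask a fraction of atom types for masked-component pretraining.

    Returns the corrupted sequence and a dict {position: original type}
    that a model would be trained to reconstruct.
    """
    rng = random.Random(seed)
    n = max(1, int(len(atom_types) * ratio))  # mask at least one atom
    positions = rng.sample(range(len(atom_types)), n)
    corrupted = list(atom_types)
    targets = {}
    for p in positions:
        targets[p] = corrupted[p]
        corrupted[p] = MASK
    return corrupted, targets

# Example: atom types of an ethanol-like fragment (hypothetical input)
atoms = ["C", "C", "O", "H", "H", "H", "H", "H", "H"]
corrupted, targets = mask_atoms(atoms)
assert all(corrupted[p] == MASK for p in targets)
assert all(atoms[p] == t for p, t in targets.items())
```

A pretraining loop would then feed `corrupted` to an encoder and score its predictions at the masked positions against `targets`; the encoder and loss are omitted here.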