2023
DOI: 10.48550/arxiv.2301.12586
Preprint

Unifying Molecular and Textual Representations via Multi-task Language Modelling

Abstract: The recent advances in neural language models have also been successfully applied to the field of chemistry, offering generative solutions for classical problems in molecular design and synthesis planning. These new methods have the potential to optimize laboratory operations and fuel a new era of data-driven automation in scientific discovery. However, specialized models are still typically required for each task, leading to the need for problem-specific fine-tuning and neglecting task interrelations. The mai…

Cited by 2 publications (2 citation statements)
References: 31 publications
“…CLAMP [29] introduced a fusion approach, combining a molecule encoder and a text encoder for property prediction tasks. Christofidellis et al. [30] presented a unified model capable of handling various text-to-text, text-to-molecule, molecule-to-text, and molecule-to-molecule tasks. MolReGPT [31] implemented tasks such as molecule captioning and text-based molecule generation by assigning ChatGPT a role as a biochemist, facilitating in-context learning.…”
Section: Data Generation Framework (mentioning)
Confidence: 99%
“…To enable higher-level control over molecular design, multi-modal models (Edwards et al., 2021; Vall et al., 2021; Zeng et al., 2022; Xu and Wang, 2022; Su et al., 2022; Seidl et al., 2023; Xu et al., 2023; Zhao et al., 2023; Liu et al., 2023b) have been proposed. Existing work focuses on cross-modal retrieval (Edwards et al., 2021; Zeng et al., 2022), translation (Edwards et al., 2022; Liu et al., 2023c; Christofidellis et al., 2023), and editing (Liu et al., 2022).…”
Section: B1 Multi-modal Models for Chemistry (mentioning)
Confidence: 99%