2022
DOI: 10.48550/arxiv.2207.03928
Preprint

Accelerating Material Design with the Generative Toolkit for Scientific Discovery

Abstract: With the growing availability of data within various scientific domains, generative models hold enormous potential to accelerate scientific discovery at every step of the scientific method. Perhaps their most valuable application lies in speeding up what has traditionally been the slowest and most challenging step: coming up with a hypothesis. Powerful representations are now being learned from large volumes of data to generate novel hypotheses, which is making a big impact on scientific discovery app…


Cited by 4 publications (4 citation statements: 0 supporting, 4 mentioning, 0 contrasting)
References 19 publications
“…Moreover, in the Generative Toolkit for Scientific Discovery (GT4SD), we provide an example of leveraging the affinity predictor as a reward function in a protein-driven molecular generative model: .…”
Section: Data and Software Availability (mentioning, confidence: 99%)
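The statement above refers to GT4SD's protein-driven generator, in which a binding-affinity predictor supplies the reward signal. Below is a minimal sampling sketch following the usage pattern in GT4SD's README, assuming the PaccMannRL algorithm is the model in question; the target amino-acid sequence is an arbitrary illustration, not one from the cited work.

```python
from gt4sd.algorithms.conditional_generation.paccmann_rl.core import (
    PaccMannRL,
    PaccMannRLProteinBasedGenerator,
)

# Protein target given as an amino-acid sequence (arbitrary example).
target = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF"
algorithm = PaccMannRL(
    configuration=PaccMannRLProteinBasedGenerator(), target=target
)
# Sample candidate SMILES strings conditioned on the target protein.
print(list(algorithm.sample(10)))
```

Note that the affinity predictor acts as the reward during RL training; the snippet only samples from an already-trained configuration.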
“…Additional training epochs did not improve the performance (see Appendix A: Table A7). To train these models, we relied on the Generative Toolkit for Scientific Discovery (GT4SD) library [24] and its LM trainer.…”
Section: Results (mentioning, confidence: 99%)
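The next statement confirms that this LM trainer is built on Hugging Face transformers and PyTorch Lightning. The following is a minimal sketch of that underlying stack rather than GT4SD's own trainer API; the model name, toy SMILES corpus, and hyperparameters are illustrative assumptions, not the settings used in the cited papers.

```python
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

class LMModule(pl.LightningModule):
    def __init__(self, model_name: str = "gpt2", lr: float = 5e-5):
        super().__init__()
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.lr = lr

    def training_step(self, batch, batch_idx):
        # Using the inputs as labels yields the standard causal-LM loss.
        loss = self.model(**batch, labels=batch["input_ids"]).loss
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
texts = ["CCO", "c1ccccc1"]  # toy SMILES corpus, purely illustrative
encoded = tokenizer(texts, padding=True, return_tensors="pt")
dataset = [{k: v[i] for k, v in encoded.items()} for i in range(len(texts))]
pl.Trainer(max_epochs=1, logger=False).fit(LMModule(), DataLoader(dataset))
```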
“…We evaluate the model's performance on five tasks: forward and backward reaction prediction in chemistry, text-conditional de novo molecule generation and molecule captioning across domains, and paragraph-to-action conversion in the language domain. The training process is carried out using the language modeling trainer based on Hugging Face transformers (Wolf et al., 2020) and PyTorch Lightning (Falcon and The PyTorch Lightning team, 2019) from the GT4SD library (Manica et al., 2022). To initialize our transformer model, we choose to use the natural language domain, as it has the most available data.…”
Section: Methods (mentioning, confidence: 99%)
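On the initialization choice described in the statement above, here is a hedged sketch of starting a multi-task text-to-text model from a natural-language checkpoint; the "t5-small" checkpoint and the task prefix are illustrative assumptions, not the cited work's exact configuration.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Start from a checkpoint pretrained on natural language only; the
# chemistry tasks are then learned on top of this initialization.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Each task is phrased as text-to-text with a task prefix, so a single
# model can cover e.g. reaction prediction and molecule captioning.
inputs = tokenizer("Predict product: CCO.CC(=O)O", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```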