Experimental procedures for chemical synthesis are commonly reported in prose in patents or in the scientific literature. The extraction of the details necessary to reproduce and validate a synthesis in a chemical laboratory is often a tedious task requiring extensive human intervention. We present a method to convert unstructured experimental procedures written in English to structured synthetic steps (action sequences) reflecting all the operations needed to successfully conduct the corresponding chemical reactions. To achieve this, we design a set of synthesis actions with predefined properties and a deep-learning sequence to sequence model based on the transformer architecture to convert experimental procedures to action sequences. The model is pretrained on vast amounts of data generated automatically with a custom rule-based natural language processing approach and refined on manually annotated samples. Predictions on our test set result in a perfect (100%) match of the action sequence for 60.8% of sentences, a 90% match for 71.3% of sentences, and a 75% match for 82.4% of sentences.
<div> <div> <div> <p>Experimental procedures for chemical synthesis are commonly reported in prose in patents or in the scientific literature. The automatic extraction of the details necessary to reproduce and validate a synthesis in a chemical laboratory is quite often a tedious task, requiring extensive human intervention. We present a method to convert unstructured experimental procedures written in English to structured synthetic steps (action sequences) reflecting all the operations needed to successfully conduct the corresponding chemical reactions. To achieve this, we design a set of synthesis actions with predefined properties and a deep-learning sequence to sequence model based on the transformer architecture to convert experimental procedures to action sequences. The model is pretrained on vast amounts of data generated automatically with a custom rule-based natural language processing approach and refined on a smaller set of manually annotated samples. Predictions on our test set resulted in a perfect (100%) match of the action sequence for 60.8% of sentences, a 90% match for 71.3% of sentences, and a 75% match for 82.4% of sentences. </p> </div> </div> </div>
Existing deep learning models applied to reaction prediction in organic chemistry can reach high levels of accuracy (> 90% for Natural Language Processing-based ones).With no chemical knowledge embedded than the information learnt from reaction data, the quality of the data sets plays a crucial role in the performance of the prediction models. While human curation is prohibitively expensive, the need for unaided approaches to remove chemically incorrect entries from existing data sets is essential to improve artificial intelligence models' performance in synthetic chemistry tasks. Here we propose a machine learning-based, unassisted approach to remove chemically wrong entries from chemical reaction collections. We applied this method to the collection of chemical reactions Pistachio and to an open data set, both extracted from USPTO (United States Patent Office) patents. Our results show an improved prediction quality for models trained on the cleaned and balanced data sets. For the retrosynthetic models, the round-trip accuracy metric grows by 13 percentage points and the value of the cumulative Jensen Shannon divergence decreases by 30% compared to its original record. The coverage remains high with 97%, and the value of the class-diversity is not affected by the cleaning. The proposed strategy is the first unassisted rule-free technique to address automatic noise reduction in chemical data sets.
The experimental execution of chemical reactions is a context-dependent and time-consuming process, often solved using the experience collected over multiple decades of laboratory work or searching similar, already executed, experimental protocols. Although data-driven schemes, such as retrosynthetic models, are becoming established technologies in synthetic organic chemistry, the conversion of proposed synthetic routes to experimental procedures remains a burden on the shoulder of domain experts. In this work, we present data-driven models for predicting the entire sequence of synthesis steps starting from a textual representation of a chemical equation, for application in batch organic chemistry. We generated a data set of 693,517 chemical equations and associated action sequences by extracting and processing experimental procedure text from patents, using state-of-the-art natural language models. We used the attained data set to train three different models: a nearest-neighbor model based on recently-introduced reaction fingerprints, and two deep-learning sequence-to-sequence models based on the Transformer and BART architectures. An analysis by a trained chemist revealed that the predicted action sequences are adequate for execution without human intervention in more than 50% of the cases.
The experimental execution of chemical reactions is a context-dependent and time-consuming process, often solved using the experience collected over multiple decades of laboratory work or searching similar, already executed, experimental protocols. Although data-driven schemes, such as retrosynthetic models, are becoming established technologies in synthetic organic chemistry, the conversion of proposed synthetic routes to experimental procedures remains a burden on the shoulder of domain experts. In this work, we present, for the first time, data-driven models for predicting the entire sequence of synthesis steps starting from a textual representation of a chemical equation. We generated a data set of 693,517 chemical equations and associated action sequences by extracting and processing experimental procedure text from patents, using state-of-the-art natural language models. We used the attained data set to train three different models: a nearest-neighbor model based on recently-introduced reaction fingerprints, and two deep-learning sequence-to-sequence models based on the Transformer and BART architectures. When evaluated on the ground truth data, the best performing model (transformer) achieves an accuracy of 72.7% for single action predictions, and a 100% match of the full action sequence for 3.6% of experimental procedures. An analysis by a trained chemist revealed that the predicted action sequences are adequate for execution without human intervention in more than 50% of the cases.<br>
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.