“…The increase in the number of model parameters and different training strategies have improved the performance of these models on natural language tasks such as question answering,[2,3] text summarization,[4,5] sentiment analysis,[1,3] machine translation,[6] conversational abilities,[7-9] and code generation.[10] In the materials science domain, existing datasets are mainly related to tasks like named entity recognition (NER),[11,12] text classification,[13-15] synthesis-process and relation classification,[16] and composition extraction from tables,[17] which researchers use to benchmark the performance of materials-domain language models such as MatSciBERT[14] (the first materials-domain language model), MatBERT,[18] MaterialsBERT,[19] OpticalBERT,[20] and BatteryBERT.[15] Recently, Song et al. (2023) reported better performance of materials-science domain-specific language models compared to BERT and SciBERT on seven materials-domain datasets related to named entity recognition, relation classification, and text classification.…”
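As context for the benchmarking described above, the sketch below shows one common way such a domain-specific encoder is loaded and fine-tuned for a token-classification (NER) task via the Hugging Face transformers API. This is a minimal illustration, not the cited authors' setup; the model id "m3rg-iitd/matscibert" and the five-label scheme are assumptions for illustration only.

```python
# Minimal sketch: loading a materials-domain BERT encoder with an
# untrained token-classification (NER) head for later fine-tuning.
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_ID = "m3rg-iitd/matscibert"  # assumed Hugging Face id for MatSciBERT

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_ID,
    num_labels=5,  # hypothetical label count; depends on the NER tag scheme
)

# Tokenize a materials-science sentence and run a forward pass.
text = "LiFePO4 cathodes were synthesized by a sol-gel method."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Per-token label ids; meaningless until the head is fine-tuned on an
# annotated NER dataset such as those referenced above.
predictions = outputs.logits.argmax(dim=-1)
```

In practice, the randomly initialized classification head would be fine-tuned on one of the annotated materials NER datasets before the predictions carry any meaning.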