2023
DOI: 10.1038/s41524-023-01003-w

A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing

Abstract: The ever-increasing number of materials science articles makes it hard to infer chemistry-structure-property relations from literature. We used natural language processing methods to automatically extract material property data from the abstracts of polymer literature. As a component of our pipeline, we trained MaterialsBERT, a language model, using 2.4 million materials science abstracts, which outperforms other baseline models in three out of five named entity recognition datasets. Using this pipeline, we ob…

citations
Cited by 34 publications
(11 citation statements)
references
References 59 publications
“…Capturing experimental data from the scientific literature is generally nontrivial, requiring significant time and human effort. Thus, in order to significantly reduce the time required to curate a comprehensive Δ H expt ROP data set, a natural language processing (NLP) based information extraction (IE) technique to get Δ H expt ROP data from literature was employed, building on recent work. Starting from millions of HTML/XML formatted articles, the procedure then occurred in four steps, including (1) document parsing, converting original documents to a format that is suitable for NLP, (2) coarse-grained filtering, where appropriate keywords were used to downselect several to thousands of articles from the initial set, (3) extracting useful information from the downselected papers, and (4) validating the extracted data by domain experts.…”
Section: Methods (mentioning)
confidence: 99%
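The four steps named in the statement above can be sketched roughly as follows. This is a minimal illustration and not the citing authors' implementation: the function names, the keyword list, and the regular expression for enthalpy values are all hypothetical, and step 4 (expert validation) is represented only by an unvalidated flag on each record.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects all text nodes (script/style content is not filtered in this sketch)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def parse_document(html):
    """Step 1: convert HTML/XML markup to whitespace-normalized plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def passes_filter(text, keywords):
    """Step 2: coarse-grained, case-insensitive keyword filter."""
    lowered = text.lower()
    return any(k.lower() in lowered for k in keywords)


# Hypothetical pattern: a signed number followed by kJ/mol or kcal/mol.
VALUE_RE = re.compile(r"([-+\u2212]?\d+(?:\.\d+)?)\s*(kJ/mol|kcal/mol)")


def extract_values(text):
    """Step 3: pull (value, unit) candidates out of the text."""
    results = []
    for number, unit in VALUE_RE.findall(text):
        # Normalize a Unicode minus sign before converting to float.
        results.append((float(number.replace("\u2212", "-")), unit))
    return results


def run_pipeline(documents, keywords):
    """Steps 1-3; records stay unvalidated until step 4 (expert review)."""
    records = []
    for doc_id, html in documents.items():
        text = parse_document(html)
        if not passes_filter(text, keywords):
            continue
        for value, unit in extract_values(text):
            records.append(
                {"doc": doc_id, "value": value, "unit": unit, "validated": False}
            )
    return records
```

A real pipeline of this kind would likely replace the regular expression in step 3 with a trained named entity recognition model; the regex here only stands in for that stage.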
“…With new AI technologies, datasets can be automatically extracted from text and figures, even from complex structures such as metal organics frameworks,[82] catalysts,[83] and chemical reaction schemes.[84][85][86] The rapid rise of generative models can also be used to aggregate molecule data from public resources.[87] Although not yet widely used in the field, deep learning has significant potential to accelerate polymer design in drug delivery.…”
Section: Building Digital Infrastructure For Machine Actionable Data (mentioning)
confidence: 99%
“…The increase in the number of model parameters and different training strategies have improved the performance of these models on natural language tasks such as question answering,[2,3] text summarization,[4,5] sentiment analysis,[1,3] machine translation,[6] conversational abilities,[7][8][9] and code generation.[10] In the materials science domain, existing datasets are mainly related to tasks like named entity recognition (NER),[11,12] text classification,[13][14][15] synthesis process and relation classification,[16] and composition extraction from tables,[17] which are used by researchers to benchmark the performance of materials domain language models like MatSciBERT[14] (the first materials-domain language model), MatBERT,[18] MaterialsBERT,[19] OpticalBERT,[20] and BatteryBERT.[15] Recently, Song et al. (2023) reported better performance of materials science domain specific language models compared to BERT and SciBERT on seven materials domain datasets related to named entity recognition, relation classification, and text classification.…”
Section: Introduction (mentioning)
confidence: 99%
“…This information is further essential to understand the lacunae of the understanding of such LLMs, which are being proposed to be used for several domains such as manufacturing, planning, material synthesis, and materials discovery.[14,19] To this end, we collected questions that require students to have an undergraduate-level understanding of materials science topics to solve them. These questions and answers are carefully curated from the original questions in the graduate aptitude test in engineering (GATE) exam, a national-level examination for graduate admission in India.…”
Section: Introduction (mentioning)
confidence: 99%