2019
DOI: 10.1021/acsomega.9b02060

Representing Multiword Chemical Terms through Phrase-Level Preprocessing and Word Embedding

Abstract: In recent years, data-driven methods and artificial intelligence have been widely used in chemoinformatic and material informatics domains, for which the success is critically determined by the availability of training data with good quality and large quantity. A potential approach to break this bottleneck is by leveraging the chemical literature such as papers and patents as alternative data resources to high throughput experiments and simulation. Compared to other domains where natural language processing te…

Cited by 9 publications (10 citation statements)
References 56 publications
“…text mining tools have been developed to harvest information from materials literature following the general overflow of acquiring text content, recognizing entities of interest, collecting and storing the entity information and performing post analysis and modeling (Fig. 2a) [81][82][83]. The usage of these tools generates libraries of information to explore, which forms the foundation for the designing and performing next phase research.…”
Section: Collect Unstructured Data From Literature | Citation type: mentioning
confidence: 99%
“…This significantly modifies the meaning of the tokens and usually results in lowered accuracy of the named entity recognition (see below). Currently, this problem is solved case-by-case by creating task-specific wrappers for existing tokenizers and named entity recognition models (Huang and Ling, 2019; Alperin et al., 2016; He et al., 2020). Building a robust approach for chemistry-specific sentence tokenization and data extraction requires a thorough development of standard nomenclature for complex chemical terms and materials names.…”
Section: Text Mining Of Scientific Literature | Citation type: mentioning
confidence: 99%
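
The "task-specific wrapper" idea in the statement above can be illustrated with a minimal sketch. It is not taken from any of the cited papers: the function names `naive_tokenize` and `chemistry_aware_tokenize` and the lexicon of protected terms are hypothetical, and the base tokenizer is a deliberately simple stand-in for a general-purpose one. The wrapper keeps listed chemical names whole and passes everything else to the base tokenizer.

```python
import re

def naive_tokenize(text):
    """Stand-in for a general-purpose tokenizer: splits on every punctuation mark."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)

def chemistry_aware_tokenize(text, protected_terms):
    """Hypothetical wrapper: keep each protected chemical term as a single token."""
    if not protected_terms:
        return naive_tokenize(text)
    # Try longer terms first so a short term never splits a longer one.
    ordered = sorted(protected_terms, key=len, reverse=True)
    pattern = "|".join(re.escape(term) for term in ordered)
    tokens, pos = [], 0
    for match in re.finditer(pattern, text):
        # Text between protected terms goes through the base tokenizer.
        tokens.extend(naive_tokenize(text[pos:match.start()]))
        tokens.append(match.group())
        pos = match.end()
    tokens.extend(naive_tokenize(text[pos:]))
    return tokens

sentence = "1,2-dichloroethane reacts with Fe(OH)3."
lexicon = {"1,2-dichloroethane", "Fe(OH)3"}  # assumed, hand-written lexicon
tokens = chemistry_aware_tokenize(sentence, lexicon)
```

Without the wrapper, the base tokenizer breaks "1,2-dichloroethane" into five pieces, which is the kind of token damage the statement describes. A fixed lexicon is only one possible choice; rule-based or learned detectors would fill the same role.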
“…Besides being intuitively simple, the main advantage of word embedding models is their ability to capture similarity and relations between words based on mutual associations. Word embeddings are applied ubiquitously in materials science TM and NLP to engineer words features that are used as an input in various named entity recognition tasks (Kononova et al., 2019; Kim et al., 2020a; Huang and Ling, 2019; Weston et al., 2019). Moreover, they also seem to be a promising tool to discover properties of materials through words association (Tshitoyan et al., 2019).…”
Section: Text Mining Of Scientific Literature | Citation type: mentioning
confidence: 99%
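
How embeddings "capture similarity" is usually measured with cosine similarity between word vectors. The sketch below uses three hand-made, three-dimensional vectors that are purely illustrative assumptions; real embeddings are learned from a corpus and have far more dimensions. The helper names `cosine_similarity` and `most_similar` are likewise hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors):
    """Return the other word whose vector is closest to `word` by cosine similarity."""
    candidates = (w for w in vectors if w != word)
    return max(candidates, key=lambda w: cosine_similarity(vectors[word], vectors[w]))

# Toy vectors, invented for illustration only.
toy_vectors = {
    "anode": [0.9, 0.1, 0.0],
    "cathode": [0.8, 0.2, 0.1],
    "solvent": [0.1, 0.9, 0.3],
}
nearest = most_similar("anode", toy_vectors)
```

In this toy space "anode" lands nearest to "cathode" because their vectors point in almost the same direction, while "solvent" points elsewhere.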
“…This involves recognizing the multi-word phrases in the chemical literature through unsupervised methods and then representing the phrases in the vocabulary. [73] Typically, word embedding is performed after tokenization with phrase representation obtained based on a post-vector addition. In this method, a new step is incorporated to identify multi-word phrases and add the detected terms to the vocabulary.…”
Section: Named Entity Recognition (NER) | Citation type: mentioning
confidence: 99%
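
The phrase-level preprocessing step described above can be sketched as follows. This is a minimal illustration, not the paper's actual procedure: it assumes a common count-based bigram score, (count(a b) - delta) / (count(a) * count(b)), and the corpus, the `delta` and `threshold` values, and the function names are all hypothetical. Detected bigrams are merged into single tokens so that an embedding model trained afterwards would give each phrase its own vocabulary entry, rather than building a phrase vector by adding word vectors afterwards.

```python
from collections import Counter

def detect_phrases(sentences, delta=1, threshold=0.2):
    """Return bigrams whose co-occurrence score exceeds `threshold`."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    phrases = set()
    for (a, b), n_ab in bigrams.items():
        # delta discounts rare pairs that co-occur only by chance.
        score = (n_ab - delta) / (unigrams[a] * unigrams[b])
        if score > threshold:
            phrases.add((a, b))
    return phrases

def merge_phrases(sentence, phrases, sep="_"):
    """Rewrite a token list so that each detected bigram becomes one token."""
    merged, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in phrases:
            merged.append(sentence[i] + sep + sentence[i + 1])
            i += 2
        else:
            merged.append(sentence[i])
            i += 1
    return merged

# Tiny, invented corpus of pre-tokenized sentences.
corpus = [
    ["sodium", "chloride", "dissolves", "in", "water"],
    ["sodium", "chloride", "is", "a", "salt"],
    ["sodium", "chloride", "in", "water"],
    ["the", "salt", "is", "in", "ethanol"],
    ["ethanol", "dissolves", "in", "water"],
]
phrases = detect_phrases(corpus)
merged_corpus = [merge_phrases(sent, phrases) for sent in corpus]
```

On this corpus only "sodium chloride" clears the threshold, so it becomes the single token `sodium_chloride`, while frequent but loosely bound pairs such as "in water" stay split. Running the two functions repeatedly over their own output would extend the same idea to terms longer than two words.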