2020
DOI: 10.1007/978-3-030-50417-5_23

SciNER: Extracting Named Entities from Scientific Literature

Abstract: The automated extraction of claims from scientific papers is difficult due to the ambiguity and variability inherent in natural language. Even apparently simple tasks, such as isolating reported values for physical quantities (e.g., "the melting point of X is Y"), can be complicated by such factors as domain-specific conventions about how named entities (the X in the example) are referenced. Although there are domain-specific toolkits that can handle such complications in certain areas, a generaliz…
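To make the difficulty the abstract describes concrete, here is a minimal sketch of the kind of naive pattern matching it alludes to, with an entirely hypothetical regular expression. It captures the simple phrasing but misses domain-specific entity references (abbreviations, chemical formulas, polymer nomenclature), which is the gap a trained NER model targets.

import re

# Naive pattern for "the melting point of X is Y" (hypothetical illustration).
# It assumes the entity X is a single capitalized word and the value Y is a
# number with a unit, so it misses forms like "PMMA melts at ~160 C" or
# multi-token names like "poly(methyl methacrylate)" entirely.
PATTERN = re.compile(
    r"the melting point of (?P<entity>[A-Z][\w()-]*) is "
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>°?C|K)"
)

match = PATTERN.search("We found that the melting point of Polystyrene is 240 °C.")
if match:
    print(match.group("entity"), match.group("value"), match.group("unit"))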

Cited by 11 publications (7 citation statements)
References 27 publications (28 reference statements)
Citation types: 0 supporting, 7 mentioning, 0 contrasting
“…In another study, using a domain-specific ontology for the Hotels domain, the authors have shown accurate extraction of named entities [2]. Hong et al [5] discussed the implementation of scientific named entity extraction. This applies only to entities that are scientific names in a given text; it could be extended to generalized entity extraction that does not require a restricted entity set.…”
Section: Domain-specific Approaches For Entity Extraction (mentioning)
confidence: 99%
“…While SpaCy is easy to use, it lacks flexibility: its end-to-end encapsulation does not expose many tunable parameters. Thus we also explore the use of a Keras-LSTM model that we developed in previous work for identification of polymers in materials science literature (Hong et al, 2020b). This model is based on the Bidirectional LSTM network with a conditional random field (CRF) layer added on top.…”
Section: Spacy and Keras-long-short Term Memory Modelsmentioning
confidence: 99%
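For readers unfamiliar with the architecture this statement references, the following is a minimal sketch of a Keras BiLSTM token tagger, not the authors' actual model: the vocabulary size, tag set, and layer widths are hypothetical, and the cited work adds a conditional random field layer (for example, tensorflow_addons.layers.CRF) in place of the per-token softmax used here.

import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 20000  # assumed vocabulary size
EMBED_DIM = 100     # assumed embedding dimension
NUM_TAGS = 5        # hypothetical IOB tag set, e.g. B-POLYMER, I-POLYMER, O
MAX_LEN = 128       # assumed maximum sentence length

inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(inputs)  # pad id 0 is masked
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)  # one output per token
outputs = layers.Dense(NUM_TAGS, activation="softmax")(x)  # the cited model uses a CRF here

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")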
“…The Keras LSTM model requires external word vectors since, unlike SpaCy, it does not include a word embedding model. To explore the effect of different word embedding models, we trained both BERT (Devlin et al, 2018), a top-performing language model developed by Google, and FastText (Bojanowski et al, 2016), a model shown to have outperformed traditional Word2Vec models such as CBOW and Skip-gram in our previous work (Hong et al, 2020b). While Google has released pre-trained BERT models, and researchers often build upon these models by "fine-tuning" them with additional training on small external datasets, this approach is not suitable for our problem, as the vocabulary used in CORD-19 is very different from the datasets used to train these models.…”
Section: Word Embedding Models (mentioning)
confidence: 99%
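As an illustration of the FastText alternative mentioned above, here is a minimal sketch of training subword embeddings on a domain corpus with gensim's FastText implementation; the toy sentences and hyperparameters are hypothetical, not those of the cited work.

from gensim.models import FastText

# Tokenized sentences from the domain corpus (toy stand-ins).
sentences = [
    ["the", "melting", "point", "of", "polyethylene", "is", "unknown"],
    ["polystyrene", "is", "a", "common", "thermoplastic"],
]

model = FastText(
    sentences,
    vector_size=100,  # embedding dimension (assumed)
    window=5,
    min_count=1,
    sg=1,             # skip-gram objective
    epochs=10,
)

# Character n-grams let FastText produce vectors even for tokens unseen
# during training, which helps with long chemical and polymer names.
vector = model.wv["polypropylene"]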
“…The lack of large data sets is currently being tackled by several efforts to compile data, including through natural language processing and the generation of large computational data sets. It is also being tackled by transfer learning, a growing technique in ML where knowledge is transferred between tasks (e.g., prediction of properties), domains (e.g., scientific literature or English literature), or both, as detailed in an excellent review. Most commonly, knowledge is transferred from a task or domain where data is plentiful to a task or domain where data is limited.…”
(mentioning)
confidence: 99%
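To make the transfer-learning idea in this statement concrete, here is a minimal Keras sketch with entirely hypothetical shapes and tasks: a base network standing in for a model pretrained on a data-rich source task is frozen, and only a small new head is trained on the data-limited target task.

import tensorflow as tf
from tensorflow.keras import layers

# Stand-in for a network pretrained on a data-plentiful source task.
base = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(128, activation="relu"),
])
base.trainable = False  # freeze the transferred knowledge

# Small new head trained on the data-limited target task,
# e.g. regression on a material property (hypothetical).
model = tf.keras.Sequential([
    base,
    layers.Dense(32, activation="relu"),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")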