Auto-CORPus: A Natural Language Processing Tool for Standardizing and Reusing Biomedical Literature

Beck, Tim; Shorter, Tom; Hu, Yan; Li, Zhuoyu; Sun, Shujian; Popovici, Casiana M.; McQuibban, Nicholas A. R.; Makraduli, Filip; Yeung, Cheng S.; Rowlands, Thomas; Posma, Joram M.

doi:10.3389/fdgth.2022.788124

Cited by 9 publications

(17 citation statements)

References 12 publications

“…We are currently working on addressing a limitation of the Auto-CORPus package (25) that we used to process the fulltext. In investigating our results, we found that any abbreviations that cannot be mapped to full names are not annotated, this includes for example cases where Greek letters are abbreviated by letters of Latin alphabet (α as ‘a’).…”

Section: Discussion (mentioning)

confidence: 99%

“…For the TABoLiSTM model (BioBERT embedding achieving higher precision by 4% compared to the annotation pipeline) this may be to learning contexts rather than learning the rules and regular structures designated to the annotation pipeline. The algorithms were (trained and) evaluated on the full text output from Auto-CORPus (25), however Auto-CORPus also provides separate JSON output files for table data and abbreviations and these files mostly contain single terms without context. Empirically, we found that although the DL models are context sensitive by construction (BiLSTM network and BioBERT embedding) they detect entities in tables and abbreviation lists with high accuracy comparable to the full text results.…”

Section: Discussion (mentioning)

confidence: 99%

“…The metabolomics articles were stored in HyperText Markup Language (HTML) format, which were then processed by the Auto-CORPus package (25), and standardised into machine-readable JavaScript Object Notation (JSON) documents. Auto-CORPus outputs three JSON files based on the input of an HTML file of an article: ‘maintext’, ‘table’ and ‘abbreviation’.…”

Section: Methods (mentioning)

confidence: 99%
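
The three outputs named in the citation statement above ('maintext', 'table' and 'abbreviation') could be collected per article roughly as follows. This is a minimal sketch, not Auto-CORPus code: the function name `load_article_outputs` and the file naming scheme `<stem>_<kind>.json` are assumptions made for illustration, and the real package may name and lay out its files differently.

```python
import json
from pathlib import Path

# Output kinds named in the citation statement above.
OUTPUT_KINDS = ("maintext", "table", "abbreviation")


def load_article_outputs(directory, stem):
    """Load whichever of the three JSON outputs exist for one article.

    Assumes a hypothetical naming scheme '<stem>_<kind>.json'; the real
    Auto-CORPus file names may differ.
    """
    outputs = {}
    for kind in OUTPUT_KINDS:
        path = Path(directory) / f"{stem}_{kind}.json"
        if path.exists():
            with path.open(encoding="utf-8") as handle:
                outputs[kind] = json.load(handle)
    return outputs
```

Returning only the kinds that are present keeps the caller from assuming that every article has tables or an abbreviation list.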

“…These abbreviations were also not annotated in the (semi-automatically annotated) metabolomics corpus, but could easily be added as a set of rules to the algorithm. Lastly, superscripts and subscripts in the corpus are encoded differently from normal text by Auto-CORPus (25) and this causes negative impact to the annotation algorithms. We therefore anticipate periodically updating the metabolomics corpus based on user feedback and further research.…”

Section: Metabolomics Corpus (mentioning)

confidence: 99%

“…Hence, new algorithms are needed that focus specifically on recognising metabolites as well as not constrain these algorithm to abstracts, but on full-text paragraphs. Here, we describe the development of a standardised, machine-readable metabolomics corpus of full-text Open Access (OA) PMC articles by using the Auto-CORPus (Automated and Consistent Outputs from Research Publications) package (25) for text standardisation in conjunction with semi-automatic annotation of metabolites in full texts using a combination of dictionary searching (using HMDB to stay in line with prior work (10)), regular expression matching and rule-based approaches. We then use this corpus to train two DL-based algorithms to perform metabolite NER with the aim to obtain a generalisable model that can be used to speed up metabolomics literature review.…”

Section: Introduction (mentioning)

confidence: 99%

MetaboListem and TABoLiSTM: Two Deep Learning Algorithms for Metabolite Named Entity Recognition

Yeung, Beck, Posma. 2022. Preprint. Self Cite.

Reviewing the metabolomics literature is becoming increasingly difficult because of the rapid expansion of relevant journal literature. Text-mining technologies are therefore needed to facilitate more efficient literature review. Here we contribute a standardised corpus of full-text publications from metabolomics studies and describe the development of two new metabolite named entity recognition (NER) methods. We introduce two deep learning methods for metabolite NER based on Bidirectional Long Short-Term Memory (BiLSTM) networks incorporating different transfer learning techniques. Our first model (MetaboListem) follows prior methodology using GloVe word embeddings. Our second model exploits BERT and BioBERT for embedding and is named TABoLiSTM (Transformer-Affixed BiLSTM). The methods are trained on a novel corpus annotated using rule-based methods, and evaluated on manually annotated metabolomics articles. MetaboListem (F1 score 0.890, precision 0.892, recall 0.888) and TABoLiSTM (BioBERT version: F1 score 0.909, precision 0.926, recall 0.893) have achieved state-of-the-art performance on metabolite NER. A corpus with >1,200 full-text Open Access metabolomics publications and >116,000 annotated metabolites was created. This work demonstrates that deep learning algorithms are capable of identifying metabolite names accurately and efficiently in text. The proposed corpus and NER algorithms can be used for metabolomics text-mining tasks such as information retrieval, document classification and literature-based discovery.

Availability: The corpus and NER algorithms are freely available with detailed instructions from Github at https://github.com/omicsNLP/MetaboliteNER.
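
The F1 scores quoted in the abstract above can be checked against the reported precision and recall using the standard definition of F1 as their harmonic mean. The sketch below is illustrative only; the function name `f1_score` and the `reported` dictionary are introduced here and are not taken from the MetaboliteNER repository.

```python
def f1_score(precision, recall):
    """Return the harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as stated in the abstract above.
reported = {
    "MetaboListem": (0.892, 0.888, 0.890),
    "TABoLiSTM (BioBERT)": (0.926, 0.893, 0.909),
}

for model, (precision, recall, stated_f1) in reported.items():
    computed = f1_score(precision, recall)
    print(f"{model}: computed F1 = {computed:.3f}, reported F1 = {stated_f1:.3f}")
```

Both computed values agree with the reported ones to three decimal places.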

Vocabulary Matters: An Annotation Pipeline and Four Deep Learning Algorithms for Enzyme Named Entity Recognition

Wang, Vijayaraghavan, Beck, et al. 2024. J. Proteome Res. Self Cite.

Enzymes are indispensable in many biological processes, and with biomedical literature growing exponentially, effective literature review becomes increasingly challenging. Natural language processing methods offer solutions to streamline this process. This study aims to develop an annotated enzyme corpus for training and evaluating enzyme named entity recognition (NER) models. A novel pipeline, combining dictionary matching and rule-based keyword searching, automatically annotated enzyme entities in >4800 full-text publications. Four deep learning NER models were created with different vocabularies (BioBERT/SciBERT) and architectures (BiLSTM/transformer) and evaluated on 526 manually annotated full-text publications. The annotation pipeline achieved an F1-score of 0.86 (precision = 1.00, recall = 0.76), surpassed by fine-tuned transformers for F1-score (BioBERT: 0.89, SciBERT: 0.88) and recall (0.86) with BiLSTM models having higher precision (0.94) than transformers (0.92). The annotation pipeline runs in seconds on standard laptops with almost perfect precision, but was outperformed by fine-tuned transformers in terms of F1-score and recall, demonstrating generalizability beyond the training data. In comparison, SciBERT-based models exhibited higher precision, and BioBERT-based models exhibited higher recall, highlighting the importance of vocabulary and architecture. These models, representing the first enzyme NER algorithms, enable more effective enzyme text mining and information extraction. Codes for automated annotation and model generation are available from and .

A survey on clinical natural language processing in the United Kingdom from 2007 to 2022

Wang et al. 2022. npj Digit. Med.

Much of the knowledge and information needed for enabling high-quality clinical research is stored in free-text format. Natural language processing (NLP) has been used to extract information from these sources at scale for several decades. This paper aims to present a comprehensive review of clinical NLP for the past 15 years in the UK to identify the community, depict its evolution, analyse methodologies and applications, and identify the main barriers. We collect a dataset of clinical NLP projects (n = 94; £ = 41.97 m) funded by UK funders or the European Union’s funding programmes. Additionally, we extract details on 9 funders, 137 organisations, 139 persons and 431 research papers. Networks are created from timestamped data interlinking all entities, and network analysis is subsequently applied to generate insights. 431 publications are identified as part of a literature review, of which 107 are eligible for final analysis. Results show, not surprisingly, clinical NLP in the UK has increased substantially in the last 15 years: the total budget in the period of 2019–2022 was 80 times that of 2007–2010. However, the effort is required to deepen areas such as disease (sub-)phenotyping and broaden application domains. There is also a need to improve links between academia and industry and enable deployments in real-world settings for the realisation of clinical NLP’s great potential in care delivery. The major barriers include research and development access to hospital data, lack of capable computational resources in the right places, the scarcity of labelled data and barriers to sharing of pretrained models.
