2022
DOI: 10.3389/fdgth.2022.788124

Auto-CORPus: A Natural Language Processing Tool for Standardizing and Reusing Biomedical Literature

Abstract: To analyse large corpora using machine learning and other Natural Language Processing (NLP) algorithms, the corpora need to be standardized. The BioC format is a community-driven simple data structure for sharing text and annotations; however, there is limited access to biomedical literature in BioC format and a lack of bioinformatics tools to convert online publication HTML formats to BioC. We present Auto-CORPus (Automated pipeline for Consistent Outputs from Research Publications), a novel NLP tool for the s…

Cited by 9 publications (17 citation statements)
References 12 publications
“…We are currently working on addressing a limitation of the Auto-CORPus package (25) that we used to process the fulltext. In investigating our results, we found that any abbreviations that cannot be mapped to full names are not annotated, this includes for example cases where Greek letters are abbreviated by letters of Latin alphabet ( α as ‘a’).…”
Section: Discussion (mentioning)
confidence: 99%
“…For the TABoLiSTM model (BioBERT embedding achieving higher precision by 4% compared to the annotation pipeline) this may be to learning contexts rather than learning the rules and regular structures designated to the annotation pipeline. The algorithms were (trained and) evaluated on the full text output from Auto-CORPus (25), however Auto-CORPus also provides separate JSON output files for table data and abbreviations and these files mostly contain single terms without context. Empirically, we found that although the DL models are context sensitive by construction (BiLSTM network and BioBERT embedding) they detect entities in tables and abbreviation lists with high accuracy comparable to the full text results.…”
Section: Discussion (mentioning)
confidence: 99%