Background: Existing biomedical literature repositories do not commonly provide users with specific means to locate and remotely access biomedical databases.

Objective: To address this issue, we developed the Biomedical Database Inventory (BiDI), a repository linking to biomedical databases automatically extracted from the scientific literature. BiDI provides an index of data resources and a seamless path to access them.

Methods: We designed an ensemble of deep learning methods to extract database mentions. To train the system, we annotated a set of 1242 articles that included mentions of database publications. This data set was used along with transfer learning techniques to train an ensemble of deep learning natural language processing models targeted at database publication detection.

Results: The system obtained an F1 score of 0.929 on database detection, showing high precision and recall values. When applying this model to the PubMed and PubMed Central databases, we identified over 10,000 unique databases. The ensemble model also extracted the weblinks to the reported databases and discarded irrelevant links. For the extraction of weblinks, the model achieved a cross-validated F1 score of 0.908. We show two use cases: one related to "omics" and the other related to the COVID-19 pandemic.

Conclusions: BiDI enables access to biomedical resources over the internet and facilitates data-driven research and other scientific initiatives. The repository is openly available at http://gib.fi.upm.es/bidi/ and will be regularly updated with an automatic text processing pipeline. The approach can be reused to create repositories of different types (biomedical and others).