UPCLASS: a deep learning-based classifier for UniProtKB entry publications

Teodoro, Douglas; Knafou, Julien; Naderi, Nona; Pasche, Emilie; Gobeill, Julien; Arighi, Cecilia N.; Ruch, Patrick

doi:10.1093/database/baaa026

Cited by 7 publications

(5 citation statements)

References 11 publications


“…Deep learning approaches like Convolutional Neural Networks (CNN) (Lecun et al, 1998; Teodoro et al, 2020), Recurrent Neural Networks (RNN) (Rumelhart et al, 1986), Long Short-Term Memory Networks (LSTM) (Hochreiter and Schmidhuber, 1997), and Transformer-based architectures (Vaswani et al, 2017), including pretrained language models such as BERT (Devlin et al, 2018), RoBERTa (Liu et al, 2019), and XL-Net (Yang et al, 2019), have demonstrated state-of-the-art efficacy in a diverse range of domains. Leveraging the hierarchical structure of documents, graph neural networks (GNNs) have also been effectively proposed to assign categories to biomedical documents (Ferdowsi et al, 2023, 2022, 2021).…”

Section: Related Work (mentioning)

confidence: 99%

DS4DH at MEDIQA-Chat 2023: Leveraging SVM and GPT-3 Prompt Engineering for Medical Dialogue Classification and Summarization

Zhang

Mishra

Teodoro

2023

Preprint

Self Cite


This paper presents the results of the Data Science for Digital Health (DS4DH) group in the MEDIQA-Chat Tasks at ACL-ClinicalNLP 2023. Our study combines the power of a classical machine learning method, Support Vector Machine, for classifying medical dialogues, along with the implementation of one-shot prompts using GPT-3.5. We employ dialogues and summaries from the same category as prompts to generate summaries for novel dialogues. Our findings exceed the average benchmark score, offering a robust reference for assessing performance in this field.



“…Automatic text classification appears as an essential methodology to ensure high quality of living evidence updates. Text classification consists of assigning categorical labels to a given text passage (e.g., an abstract) based on its similarity to the existing labeled examples [23–25]. Classical text classifiers use statistical document representations, in which the relevance of a word to a document is proportional to its frequency in the document and inversely proportional to its frequency in the collection (the so-called term frequency-inverse document frequency (tf-idf) framework), to create vectorial representations of the documents [26].…”

Section: Introduction (mentioning)

confidence: 99%
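
The tf-idf weighting described in the citation statement above can be illustrated with a minimal sketch. The function name `tf_idf`, the toy documents, and the unsmoothed formula (relative term frequency multiplied by the log of inverse document frequency) are assumptions made for illustration; they are not taken from the cited papers, and library implementations typically use smoothed variants.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Illustrative, unsmoothed variant: weight = (count / doc length) * ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Hypothetical toy collection, already tokenized.
docs = [
    "protein kinase binds protein".split(),
    "kinase inhibitor trial".split(),
    "clinical trial report".split(),
]
weights = tf_idf(docs)
```

In this toy collection, "protein" is frequent in the first document and absent from the others, so it receives a higher weight there than "kinase", which also occurs in the second document.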

Ensemble of deep learning language models to support the creation of living systematic reviews for the COVID-19 literature

Knafou

Haas

Borissov

et al. 2023

Syst Rev

Self Cite


Background The COVID-19 pandemic has led to an unprecedented amount of scientific publications, growing at a pace never seen before. Multiple living systematic reviews have been developed to assist professionals with up-to-date and trustworthy health information, but it is increasingly challenging for systematic reviewers to keep up with the evidence in electronic databases. We aimed to investigate deep learning-based machine learning algorithms to classify COVID-19-related publications to help scale up the epidemiological curation process. Methods In this retrospective study, five different pre-trained deep learning-based language models were fine-tuned on a dataset of 6365 publications manually classified into two classes, three subclasses, and 22 sub-subclasses relevant for epidemiological triage purposes. In a k-fold cross-validation setting, each standalone model was assessed on a classification task and compared against an ensemble, which takes the standalone model predictions as input and uses different strategies to infer the optimal article class. A ranking task was also considered, in which the model outputs a ranked list of sub-subclasses associated with the article. Results The ensemble model significantly outperformed the standalone classifiers, achieving an F1-score of 89.2 at the class level of the classification task. The difference between the standalone and ensemble models increases at the sub-subclass level, where the ensemble reaches a micro F1-score of 70% against 67% for the best-performing standalone model. For the ranking task, the ensemble obtained the highest recall@3, with a performance of 89%. Using a unanimity voting rule, the ensemble can provide predictions with higher confidence on a subset of the data, achieving detection of original papers with an F1-score of up to 97% on a subset of 80% of the collection instead of 93% on the whole dataset.
Conclusion This study shows the potential of using deep learning language models to perform triage of COVID-19 references efficiently and support epidemiological curation and review. The ensemble consistently and significantly outperforms any standalone model. Fine-tuning the voting strategy thresholds is an interesting alternative to annotate a subset with higher predictive confidence.
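
The abstract above mentions an ensemble that combines standalone model predictions, including a unanimity voting rule that yields predictions on a higher-confidence subset. A minimal sketch of that idea follows. The function name `ensemble_vote`, the plain majority-vote fallback, and the choice to return `None` when the models disagree are assumptions for illustration only; the abstract does not specify the exact strategies the authors used.

```python
from collections import Counter


def ensemble_vote(predictions, unanimous=False):
    """Combine the labels predicted by several models for one article.

    Returns the most frequent label. When `unanimous` is True, returns
    None (abstains) unless every model predicted the same label.
    """
    counts = Counter(predictions)
    label, votes = counts.most_common(1)[0]
    if unanimous and votes < len(predictions):
        return None  # Models disagree: leave the article for manual review.
    return label


# Hypothetical labels from three standalone models for one article.
mixed = ["original", "original", "review"]
agreed = ["original", "original", "original"]

majority_label = ensemble_vote(mixed)
strict_mixed = ensemble_vote(mixed, unanimous=True)
strict_agreed = ensemble_vote(agreed, unanimous=True)
```

Under the unanimity rule, the ensemble only labels articles on which all models agree, which trades coverage for confidence in the way the abstract describes.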


“…To address these issues, automated and augmented curation systems for extracting protein functional data from scientific literature are becoming increasingly desired. In particular, Machine Learning and Natural Language Processing techniques are beginning to be employed for biocuration efforts [1, 2] for extracting and organising unstructured biological information into a structured form that is accessible to biologists. Central to these automated systems is the process of unambiguously extracting semantic relationships between two or more biological entities in the literature [3].…”

Section: Introduction (mentioning)

confidence: 99%

Identifying protein subcellular localisation in scientific literature using bidirectional deep recurrent neural network

David

Menezes

Klerk

et al. 2021

Sci Rep


The increased diversity and scale of published biological data has led to a growing appreciation for the applications of machine learning and statistical methodologies to gain new insights. Key to achieving this aim is solving the Relationship Extraction problem, which specifies the semantic interaction between two or more biological entities in a published study. Here, we employed two deep neural network natural language processing (NLP) methods, namely: the continuous bag of words (CBOW), and the bi-directional long short-term memory (bi-LSTM). These methods were employed to predict relations between entities that describe protein subcellular localisation in plants. We applied our system to 1700 published Arabidopsis protein subcellular studies from the SUBA manually curated dataset. The system combines pre-processing of full-text articles in a machine-readable format with relevant sentence extraction for downstream NLP analysis. Using the SUBA corpus, the neural network classifier predicted interactions between protein name, subcellular localisation and experimental methodology with an average precision, recall rate, accuracy and F1 scores of 95.1%, 82.8%, 89.3% and 88.4% respectively (n = 30). Comparable scoring metrics were obtained using the CropPAL database as an independent testing dataset that stores protein subcellular localisation in crop species, demonstrating wide applicability of the prediction model. We provide a framework for extracting protein functional features from unstructured text in the literature with high accuracy, improving data dissemination and unlocking the potential of big data text analytics for generating new hypotheses.
