Proceedings of the Sixth Workshop on Noisy User-Generated Text (W-NUT 2020), 2020
DOI: 10.18653/v1/2020.wnut-1.40

BiTeM at WNUT 2020 Shared Task-1: Named Entity Recognition over Wet Lab Protocols using an Ensemble of Contextual Language Models

Abstract: Recent improvements in machine-reading technologies have attracted much attention to automation problems and their possibilities. In this context, WNUT 2020 introduced a Named Entity Recognition (NER) task based on wet-laboratory procedures. In this paper, we present a 3-step method based on deep neural language models that reported the best overall exact-match F1-score (77.99%) of the competition. By fine-tuning 10 different pretrained language models 10 times each, this work shows the advantage of having more models…
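As a rough illustration of the approach summarized above (a sketch only, not the authors' released code: the checkpoint list, label set, and helper name are placeholder assumptions), fine-tuning several pretrained contextual language models for token classification with the Hugging Face transformers API could look like this:

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder label set for wet-lab protocol entities (BIO scheme).
LABELS = ["O", "B-Action", "I-Action", "B-Reagent", "I-Reagent"]

# A few contextual LMs one might fine-tune; the paper ensembles 10 of
# them, each fine-tuned 10 times. These checkpoint names are illustrative.
CHECKPOINTS = ["bert-base-cased", "roberta-base", "dmis-lab/biobert-v1.1"]

def fine_tune(checkpoint, train_dataset, run_id):
    """Fine-tune one pretrained LM for NER; varying the seed per run
    makes repeated runs of one checkpoint distinct ensemble members."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(
        checkpoint, num_labels=len(LABELS))
    args = TrainingArguments(
        output_dir=f"runs/{checkpoint.replace('/', '_')}-{run_id}",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        seed=run_id,
    )
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset,
                      tokenizer=tokenizer)
    trainer.train()
    return trainer
```

The resulting fine-tuned models are then combined at prediction time, as the citation statements below describe.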

Cited by 15 publications (14 citation statements) · References 11 publications
“…Particularly, the ensemble of masked language models brought the highest performance gain to the search pipeline. Indeed, ensembles of language models have proved to be a robust methodology to improve predictive performance [53-55].…”
Section: Discussion (mentioning)
confidence: 99%
“…Second, identifying textual evidence in publications to re-rank publications might have a positive impact on the literature triage [25, 26]. Finally, pre-trained language and ensemble learning models [27] could be opportunely used to provide the curator with a more focused evidence passage to support the curation work of mutation databases [18]. To conclude, the system we developed has the potential to significantly propel variant curation. It is, however, to be noted that such a system is neither intended to replace human curators nor clinical expertise, but rather to support these professionals by cutting down the cost of the manual triage of the literature.…”
Section: Discussion (mentioning)
confidence: 99%
“…Conversely, for the clinical NER, for which a token could be assigned to more than one entity, we used a sigmoid function to provide a multi-class classifier. More information about the fine-tuning of the models and the hyper-parameter settings can be found in [Copara et al., 2020b,a, Knafou et al., 2020].…”
Section: Methods (mentioning)
confidence: 99%
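To make the quoted design choice concrete (a minimal sketch under assumed dimensions; the class name, layer shapes, and decision threshold are illustrative, not taken from the cited work), a token-level head that can assign several entity types to one token scores each label independently instead of using a softmax:

```python
import torch
import torch.nn as nn

class MultiLabelTokenHead(nn.Module):
    """Token classification head in which each label is an independent
    Bernoulli decision, so one token may receive several entity types."""

    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from the encoder.
        return self.classifier(hidden_states)  # raw per-label logits

# Training pairs the logits with nn.BCEWithLogitsLoss; at inference,
# torch.sigmoid(logits) > 0.5 assigns every label whose independent
# probability clears the threshold, unlike a softmax, which would
# force exactly one label per token.
```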
“…Our ensemble method is based on a voting strategy, where each model votes with its predictions and a simple majority of votes is necessary to assign the predictions [Copara et al., 2020b,a, Knafou et al., 2020]. In other words, for a given document, our models infer their predictions independently for each entity.…”
Section: Methods (mentioning)
confidence: 99%
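A minimal sketch of this majority vote over aligned per-token predictions (the label strings and the fallback to "O" when no label reaches a majority are assumptions for illustration, not details from the cited papers):

```python
from collections import Counter

def majority_vote(model_predictions):
    """Combine token-level NER predictions from several models.

    model_predictions: one label sequence per model, all aligned to the
    same tokens, e.g. [["B-Reagent", "O"], ["B-Reagent", "O"], ["O", "O"]].
    A label is assigned only when a simple majority of models agrees on
    it; otherwise the token falls back to the outside tag "O".
    """
    n_models = len(model_predictions)
    ensembled = []
    for token_labels in zip(*model_predictions):
        label, count = Counter(token_labels).most_common(1)[0]
        ensembled.append(label if count > n_models / 2 else "O")
    return ensembled

# Two of three models agree on B-Reagent for the first token:
print(majority_vote([["B-Reagent", "O"], ["B-Reagent", "O"], ["O", "O"]]))
# -> ['B-Reagent', 'O']
```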