Proceedings of the 3rd Clinical Natural Language Processing Workshop 2020
DOI: 10.18653/v1/2020.clinicalnlp-1.7

BioBERTpt - A Portuguese Neural Language Model for Clinical Named Entity Recognition

Abstract: With the growing volume of electronic health record data, clinical NLP tasks have become increasingly relevant to unlock valuable information from unstructured clinical text. Although the performance of downstream NLP tasks, such as named-entity recognition (NER), on English corpora has recently been improved by contextualised language models, less research is available for clinical texts in low-resource languages. Our goal is to assess a deep contextual embedding model for Portuguese, called BioBERTpt, to suppor…
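The released BioBERTpt checkpoints are BERT-style encoders, so they plug into standard token-classification tooling. The sketch below is not the authors' pipeline: it assumes the Hugging Face Transformers library, a Hub identifier of the form "pucpr/biobertpt-all", and a hypothetical tag set; the classification head it adds is randomly initialised and would still need fine-tuning on annotated clinical NER data before producing meaningful labels.

```python
# Minimal sketch: wrap a BioBERTpt encoder with a token-classification head.
# "pucpr/biobertpt-all" and the label set are assumptions for illustration only.
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "pucpr/biobertpt-all"          # assumed Hub identifier
LABELS = ["O", "B-Disorder", "I-Disorder"]  # hypothetical BIO tag set

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)),
    label2id={label: i for i, label in enumerate(LABELS)},
)

# Tokenise a Portuguese clinical sentence and run a forward pass; after fine-tuning,
# the argmax over the logits gives one BIO tag per word piece.
inputs = tokenizer("Paciente com diabetes mellitus em uso de metformina.",
                   return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # (batch, sequence_length, num_labels)
```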

Cited by 46 publications (33 citation statements). References 19 publications.

“…Alsentzer et al. (2019) trained BERT and BioBERT (Lee et al., 2019) on MIMIC notes, and showed that Bio + Clinical BERT performed better than BERT and BioBERT when fine-tuned on the MedNLI and i2b2 2010 datasets. Similarly, Schneider et al. (2020) demonstrated that BERT fine-tuned on Portuguese clinical notes outperformed BERT trained on general corpora.…”
Section: Clinical Named Entity Recognition
confidence: 90%
“…Various NER challenges and shared tasks, such as the i2b2 and n2c2 NLP challenges (Uzuner et al., 2010; Suominen et al., 2013; Kelly et al., 2014; Bethard et al., 2015; Névéol et al., 2015; Henry et al., 2020), fostered the development of NER methods (De Bruijn et al., 2011; Jiang et al., 2011; Kim et al., 2015; Van Mulligen et al., 2016; El Boukkouri et al., 2019) for the clinical domain in different languages (Lopes et al., 2019; Sun and Yang, 2019; Andrioli de Souza et al., 2020; Schneider et al., 2020). The DEFT challenge proposed an information extraction task for the French clinical corpus, with entities distributed across four categories: anatomy, clinical practices, treatments, and time (Cardon et al., 2020).…”
Section: Clinical Named Entity Recognition
confidence: 99%
“…Various NER challenges and shared tasks [Uzuner et al., 2010, Kelly et al., 2014, Névéol et al., 2015, Suominen et al., 2013, Bethard et al., 2015] fostered the development of NER methods [Van Mulligen et al., 2016, Kim et al., 2015, Jiang et al., 2011, De Bruijn et al., 2011, El Boukkouri et al., 2019] for the clinical domain in different languages [Lopes et al., 2019, Schneider et al., 2020, Sun and Yang, 2019]. Performance varies greatly across the different methods and corpora, with more modern methods achieving F1-scores as high as 95%.…”
Section: Related Work
confidence: 99%
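The F1 figures quoted above are conventionally entity-level scores computed over BIO-tagged spans rather than individual tokens. A minimal illustration of that convention using the seqeval library, with placeholder tags rather than data from any of the cited studies:

```python
# Entity-level F1 over BIO tag sequences with seqeval: a predicted entity counts as
# correct only if both its span boundaries and its type match the gold annotation.
from seqeval.metrics import classification_report, f1_score

y_true = [["O", "B-Disorder", "I-Disorder", "O", "B-Treatment"]]
y_pred = [["O", "B-Disorder", "I-Disorder", "O", "O"]]  # misses the Treatment span

print(f1_score(y_true, y_pred))           # ~0.67: one of two gold entities recovered
print(classification_report(y_true, y_pred))
```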
“…As input features to the classic classifiers, we used the TF-IDF vector representations of the texts to be classified. As state-of-the-art approaches, we fine-tuned BERT-based models such as its multilingual version [Devlin et al., 2019]; BERTimbau [Souza et al., 2020], its Brazilian Portuguese version; and BioBERTpt [Schneider et al., 2020], a Brazilian Portuguese version of BERT focused on the clinical domain.…”
Section: Models
confidence: 99%
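A rough sketch of the kind of TF-IDF baseline described in that excerpt, assuming scikit-learn; the texts and labels are placeholders, and logistic regression stands in for whichever classic classifier the cited work actually used:

```python
# TF-IDF features feeding a classic classifier, as a baseline against fine-tuned
# BERT-style models. Corpus and labels below are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["paciente com dor torácica", "exame sem alterações"]  # placeholder corpus
labels = [1, 0]                                                 # placeholder classes

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # TF-IDF vectors as input features
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(["paciente refere dor"]))
```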