SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks

Oliveira, Lucas Emanuel Silva e; Peters, Ana Carolina; Silva, Adalniza Moura Pucca da; Gebeluca, Caroline P.; Gumiel, Yohan Bonescki; Cintho, Lilian Mie Mukai; Carvalho, Déborah Ribeiro; Hasan, Sadid A.; Moro, Cláudia Maria Cabral

doi:10.1186/s13326-022-00269-1

Cited by 12 publications

(11 citation statements)

References 52 publications

“…We first analyzed the F1 score and accuracy calculated by the test set of the Mac-Morpho corpus to verify how accurate the model performed in texts from the same corpus of the training. Also, we evaluated the trained models on a set of clinical notes taken from SemClinBr [15], a corpus containing clinical narratives from Brazilian hospitals. We randomly selected 50 sentences containing between 6 and 15 tokens, which were manually POS-annotated by a human linguist, referred to in this paper as human annotation.…”

Section: Discussion (mentioning)

confidence: 99%

Developing a Transformer-based Clinical Part-of-Speech Tagger for Brazilian Portuguese

Schneider, Gumiel, Oliveira et al. 2023

J Health Inform

Electronic Health Records are a valuable source of information to be extracted by means of natural language processing (NLP) tasks, such as morphosyntactic word tagging. Although there have been significant advances in health NLP, such as the Transformer architecture, languages such as Portuguese are still underrepresented. This paper presents taggers developed for Portuguese texts, fine-tuned using BioBERTpt (clinical/biomedical) and BERTimbau (generic) models on a POS-tagged corpus. We achieved an accuracy of 0.9826, state-of-the-art for the corpus used. In addition, we performed a human-based evaluation of the trained models and others in the literature, using authentic clinical narratives. Our clinical model achieved 0.8145 in accuracy compared to 0.7656 for the generic model. It also showed competitive results compared to models trained specifically with clinical texts, evidencing domain impact on the base model in NLP tasks.

“…The checkpoints (intermediate saved versions of a pre-trained language model during the training process) involved the BERT-based models available for Portuguese, both generic domain and specialized in the clinical area. For each pre-trained model, we fine-tuned them to the NER task with two corpora in the clinical domain, TempClinBr [11], and SemClinBr [12].…”

Section: Methods (mentioning)

confidence: 99%

CardioBERTpt: Transformer-based Models for Cardiology Language Representation in Portuguese

Schneider, Gumiel, de Souza et al. 2023

2023 IEEE 36th International Symposium on Computer-Based Medical Systems (CBMS)

Self Cite

Contextual word embeddings and the Transformers architecture have reached state-of-the-art results in many natural language processing (NLP) tasks and improved the adaptation of models for multiple domains. Despite the improvement in the reuse and construction of models, few resources are still developed for the Portuguese language, especially in the health domain. Furthermore, the clinical models available for the language are not representative enough for all medical specialties. This work explores deep contextual embedding models for the Portuguese language to support clinical NLP tasks. We transferred learned information from electronic health records of a Brazilian tertiary hospital specialized in cardiology diseases and pre-trained multiple clinical BERT-based models. We evaluated the performance of these models in named entity recognition experiments, fine-tuning them in two annotated corpora containing clinical narratives. Our pre-trained models outperformed previous multilingual and Portuguese BERT-based models for cardiology and multi-specialty environments, reaching the state-of-the-art for analyzed corpora, with 5.5% F1 score improvement in TempClinBr (all entities) and 1.7% in SemClinBr (Disorder entity) corpora. Hence, we demonstrate that data representativeness and a high volume of training data can improve the results for clinical tasks, aligned with results for other languages.

“…Despite the advancement of transfer learning for negation detection [76,79,80], rule-based [27] and supervised machine learning approaches [76,77,[81][82][83] for LoE continue to be researched and employed. One paper presented a corpus-free approach, which is an attractive prospect in a scenario where there is no annotated data [84].…”

Section: Recent Advances in Negation Resolution for LoE (mentioning)

confidence: 99%

Exploring the Latest Highlights in Medical Natural Language Processing across Multiple Languages: A Survey

Shaitarova, Zaghir, Lavelli et al. 2023

Yearb Med Inform

Objectives: This survey aims to provide an overview of the current state of biomedical and clinical Natural Language Processing (NLP) research and practice in Languages other than English (LoE). We pay special attention to data resources, language models, and popular NLP downstream tasks. Methods: We explore the literature on clinical and biomedical NLP from the years 2020-2022, focusing on the challenges of multilinguality and LoE. We query online databases and manually select relevant publications. We also use recent NLP review papers to identify the possible information lacunae. Results: Our work confirms the recent trend towards the use of transformer-based language models for a variety of NLP tasks in medical domains. In addition, there has been an increase in the availability of annotated datasets for clinical NLP in LoE, particularly in European languages such as Spanish, German and French. Common NLP tasks addressed in medical NLP research in LoE include information extraction, named entity recognition, normalization, linking, and negation detection. However, there is still a need for the development of annotated datasets and models specifically tailored to the unique characteristics and challenges of medical text in some of these languages, especially low-resources ones. Lastly, this survey highlights the progress of medical NLP in LoE, and helps at identifying opportunities for future research and development in this field.
