Proceedings of the 4th Clinical Natural Language Processing Workshop 2022
DOI: 10.18653/v1/2022.clinicalnlp-1.9
Clinical Flair: A Pre-Trained Language Model for Spanish Clinical Natural Language Processing

Abstract: Word embeddings have been widely used in Natural Language Processing (NLP) tasks. Although these representations can capture the semantic information of words, they cannot learn sequence-level semantics. This problem can be handled using contextual word embeddings derived from pre-trained language models, which have contributed to significant improvements in several NLP tasks. Further improvements are achieved when pre-training these models on domain-specific corpora. In this paper, we introduce Clinical Flair…

Cited by 8 publications (8 citation statements). References 11 publications.
“…Nowadays, there is strong development of contextualized word embeddings, which assign dynamic representations to words based on their contexts and achieve state-of-the-art performance in multiple tasks. For the clinical domain in Spanish, relevant works include (Akhtyamova et al., 2020; Carrino et al., 2022; Rojas et al., 2022). These contextualized word embeddings are challenging to compute and deploy in production environments due to their demanding infrastructure needs…”
Section: Discussion (mentioning, confidence: 99%)
“…These embeddings, however, were neither intrinsically evaluated nor compared performance-wise with other embeddings, and they were not made available for use. In another work, Akhtyamova et al. (2020) used the Flair (Akbik et al., 2019) and BERT (Devlin et al., 2018) models to compute word embeddings for the Spanish clinical domain as part of a named entity recognition (NER) task, and Rojas et al. (2022) computed another Flair language model from clinical narratives in Spanish. These models use contextualized word embeddings that take the word's context into account when the embedding is calculated…”
Section: Related Work (mentioning, confidence: 99%)
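The statement above describes computing a Flair character-level language model from Spanish clinical narratives. Below is a minimal sketch, using the Flair framework's standard language-model training API, of how such a model can be pre-trained; the corpus path, output folder, and all hyperparameters are illustrative assumptions, not the settings reported by Rojas et al. (2022).

```python
# Minimal sketch: pre-training a character-level Flair language model on a
# plain-text clinical corpus. Paths and hyperparameters are assumptions.
from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# Character dictionary bundled with Flair.
dictionary = Dictionary.load("chars")

# The folder is assumed to hold raw clinical text split into train/ parts,
# plus valid.txt and test.txt, as Flair's TextCorpus expects.
is_forward_lm = True
corpus = TextCorpus("resources/clinical_corpus_es",  # hypothetical path
                    dictionary,
                    is_forward_lm,
                    character_level=True)

# One-layer LSTM character LM; hidden size chosen for illustration only.
language_model = LanguageModel(dictionary, is_forward_lm, hidden_size=1024, nlayers=1)

trainer = LanguageModelTrainer(language_model, corpus)
trainer.train("resources/clinical-flair-forward",  # hypothetical output folder
              sequence_length=250,
              mini_batch_size=100,
              max_epochs=10)
```

A backward model can be trained the same way with `is_forward_lm = False`; forward and backward models are typically used together as contextualized embeddings.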
“…[50] Specifically, they used Bi-LSTM and CRF layers to recognize each entity type and incorporated pre-trained embeddings trained on the Chilean Waiting List corpus [44, 51] and character-level contextualized embeddings [52]. The code is freely available [53]…”
Section: Methods (mentioning, confidence: 99%)
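As a concrete illustration of the setup described in the statement above, the following sketch builds a Flair sequence tagger that stacks classic word embeddings with character-level contextualized (Flair) embeddings and decodes with a CRF layer on top of a Bi-LSTM. The corpus layout, embedding file paths, and hidden size are assumptions for illustration only.

```python
# Sketch of a Bi-LSTM-CRF NER tagger in the Flair framework, stacking classic
# word embeddings with character-level contextualized (Flair) embeddings.
# All file paths are hypothetical placeholders.
from flair.datasets import ColumnCorpus
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger

# CoNLL-style column corpus: token in column 0, NER tag in column 1.
corpus = ColumnCorpus("resources/ner_corpus_es", {0: "text", 1: "ner"},
                      train_file="train.txt", dev_file="dev.txt", test_file="test.txt")
tag_dictionary = corpus.make_label_dictionary(label_type="ner")

embeddings = StackedEmbeddings([
    WordEmbeddings("resources/waiting_list_300d.gensim"),    # hypothetical 300-d word vectors
    FlairEmbeddings("resources/clinical-flair-forward.pt"),  # hypothetical forward clinical LM
    FlairEmbeddings("resources/clinical-flair-backward.pt"), # hypothetical backward clinical LM
])

# Bi-LSTM encoder with a CRF decoding layer (use_crf=True).
tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type="ner",
                        use_crf=True)
```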
“…Regarding the experimental setup, the disease model was trained for 150 epochs using an SGD optimizer with mini-batches of size 32 and a learning rate of 0.1. As mentioned, to encode sentences we used two types of representations: a 300-dimensional word embedding model trained on the Chilean Waiting List corpus [4] and character-level contextualized embeddings retrieved from the Clinical Flair model (Rojas et al., 2022b). To implement the model and perform our experiments, we used the Flair framework, which is widely used by the NLP research community (Akbik et al., 2019)…”
Section: NER Model (mentioning, confidence: 99%)
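To make the quoted hyperparameters concrete, here is a hedged sketch of the corresponding Flair training call, reusing the `tagger` and `corpus` objects from the previous sketch; the output folder is a hypothetical placeholder, and Flair's `ModelTrainer` uses SGD as its default optimizer.

```python
# Training loop with the hyperparameters quoted above: 150 epochs, SGD,
# mini-batch size 32, learning rate 0.1. Assumes `tagger` and `corpus`
# from the previous sketch; the output path is hypothetical.
from flair.trainers import ModelTrainer

trainer = ModelTrainer(tagger, corpus)
trainer.train("resources/taggers/disease-ner",
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=150)
```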