We present CoNTACT: a Dutch language model adapted to the domain of COVID-19 tweets. The model was developed by continuing the pre-training of RobBERT [3] on 2.8M Dutch COVID-19-related tweets posted in 2021. To compare its performance with that of RobBERT, both models were evaluated on two tasks: (1) binary vaccine hesitancy detection and (2) detection of arguments for vaccine hesitancy. For both tasks, not only Twitter but also Facebook data was used to show cross-genre performance. CoNTACT showed statistically significant gains over RobBERT in all experiments for task 1. For task 2, we observed substantial improvements in virtually all classes in all experiments. An error analysis indicated that the domain adaptation yielded better representations of domain-specific terminology, causing CoNTACT to make more accurate classification decisions.
BACKGROUND Electronic Medical Records (EMRs) have opened up opportunities to analyze clinical practice at large scale. Structured registries and coding procedures such as the International Classification of Primary Care (ICPC) further improved these opportunities. However, a large part of the information about the state of the patient and the doctor's observations is still entered in free text fields. The main function of those fields is to report the doctor's line of thought, to remind oneself and colleagues of follow-up actions, and to provide later accountability for clinical decisions. These fields contain rich information that is complementary to that in coded fields, yet today they are hardly used for analysis. OBJECTIVE This study aimed to develop a prediction model approach to convert the free text information on COVID-related symptoms from out-of-hours care EMRs into usable symptom-based data that can be analysed at large scale. The design was a feasibility study, in which we examined the content of the raw data, the steps and methods for modelling, and the precision and accuracy of the models. METHODS A data prediction model for 27 pre-identified COVID-relevant symptoms was developed for a dataset derived from the database of primary-care out-of-hours consultations in Flanders. A multi-class, multi-label categorization classifier was developed. We tested two approaches: a classical machine learning based text categorization approach, Binary Relevance, and a deep neural network learning approach with BERTje, including a domain-adapted version. RESULTS The normal BERTje model performed the best on the data, reaching a macro F1 score of 0.58, which combines precision and recall, and an accuracy score of 0.38. As for the individual codes themselves, the domain-adapted version of BERTje performs better on several of the less common objective codes, while BERTje reaches higher F1 scores for the least common labels especially and for most other codes in general.
CONCLUSIONS The AI model BERTje can reliably predict COVID-related information from medical records using text mining from the free text fields generated in primary care settings. This feasibility study invites researchers to examine further possibilities to use primary care routine data.
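The Binary Relevance approach named in the methods splits a multi-label problem into one independent yes/no classifier per label. The following is a minimal sketch of that decomposition only, not the study's implementation: the base learner (a tiny bag-of-words perceptron), the English toy notes, and the three symptom labels are all assumptions made for illustration, whereas the study worked with Dutch free text and 27 symptom codes.

```python
def tokenize(text):
    """Lowercase whitespace tokenizer (an illustrative simplification)."""
    return text.lower().split()


class Perceptron:
    """Tiny binary perceptron over bag-of-words tokens (hypothetical base learner)."""

    def __init__(self, epochs=10):
        self.epochs = epochs
        self.weights = {}
        self.bias = 0.0

    def score(self, tokens):
        return self.bias + sum(self.weights.get(t, 0.0) for t in tokens)

    def predict(self, tokens):
        return 1 if self.score(tokens) > 0 else 0

    def fit(self, docs, targets):
        for _ in range(self.epochs):
            for tokens, y in zip(docs, targets):
                if self.predict(tokens) != y:
                    # Move the weights toward the correct side on a mistake.
                    step = 1.0 if y == 1 else -1.0
                    self.bias += step
                    for t in tokens:
                        self.weights[t] = self.weights.get(t, 0.0) + step
        return self


class BinaryRelevance:
    """Trains one independent binary classifier per label."""

    def __init__(self, labels):
        self.labels = list(labels)
        self.models = {}

    def fit(self, texts, label_sets):
        docs = [tokenize(t) for t in texts]
        for label in self.labels:
            # Binary target: does this note carry this label or not?
            targets = [1 if label in s else 0 for s in label_sets]
            self.models[label] = Perceptron().fit(docs, targets)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        return {l for l in self.labels if self.models[l].predict(tokens)}


# Invented toy notes and labels, for illustration only.
texts = [
    "fever and cough",
    "dry cough",
    "high fever",
    "sore throat",
    "cough and sore throat",
    "no complaints",
]
label_sets = [
    {"fever", "cough"},
    {"cough"},
    {"fever"},
    {"sore_throat"},
    {"cough", "sore_throat"},
    set(),
]

model = BinaryRelevance(["fever", "cough", "sore_throat"]).fit(texts, label_sets)
print(sorted(model.predict("high fever and cough")))  # ['cough', 'fever']
```

Because each label is learned in isolation, this decomposition ignores correlations between symptoms, which is one reason it is usually treated as a baseline for multi-label text categorization.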
Background Electronic medical records have opened opportunities to analyze clinical practice at large scale. Structured registries and coding procedures such as the International Classification of Primary Care further improved these opportunities. However, a large part of the information about the state of the patient and the doctors’ observations is still entered in free text fields. The main function of those fields is to report the doctor’s line of thought, to remind the doctor and his or her colleagues of follow-up actions, and to be accountable for clinical decisions. These fields contain rich information that can be complementary to that in coded fields, and until now, they have hardly been used for analysis. Objective This study aims to develop a prediction model to convert the free text information on COVID-19–related symptoms from out-of-hours care electronic medical records into usable symptom-based data that can be analyzed at large scale. Methods The design was a feasibility study in which we examined the content of the raw data, the steps and methods for modeling, and the precision and accuracy of the models. A data prediction model for 27 preidentified COVID-19–relevant symptoms was developed for a data set derived from the database of primary-care out-of-hours consultations in Flanders. A multiclass, multilabel categorization classifier was developed. We tested two approaches: (1) a classical machine learning–based text categorization approach, Binary Relevance, and (2) a deep neural network learning approach with BERTje, including a domain-adapted version. Ethical approval was acquired through the Institutional Review Board of the Institute of Tropical Medicine and the ethics committee of the University Hospital of Antwerpen (ref 20/50/693). Results The sample set comprised 3957 fields. After cleaning, 2313 could be used for the experiments. Of the 2313 fields, 85% (n=1966) were used to train the model, and 15% (n=347) for testing.
The normal BERTje model performed the best on the data. It reached a weighted F1 score of 0.70 and an exact match ratio (accuracy score) of 0.38, that is, the proportion of instances for which the model identified all correct codes. The other models achieved respectable results as well, ranging from 0.59 to 0.70 weighted F1. The Binary Relevance method performed the best on the data without a frequency threshold. As for the individual codes, the domain-adapted version of BERTje performs better on several of the less common objective codes, while BERTje reaches higher F1 scores for the least common labels especially, and for most other codes in general. Conclusions The artificial intelligence model BERTje can reliably predict COVID-19–related information from medical records using text mining from the free text fields generated in primary care settings. This feasibility study invites researchers to examine further possibilities to use primary...
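The abstracts report three multi-label metrics: macro F1, weighted F1, and exact match ratio. The sketch below shows how the three can diverge on the same predictions. The five gold and predicted label sets are invented for illustration and are not drawn from the study; the function names are likewise hypothetical.

```python
def exact_match_ratio(y_true, y_pred):
    """Share of instances whose predicted label set equals the gold set exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def label_f1(y_true, y_pred, label):
    """F1 for a single label, treating it as a binary problem."""
    tp = sum(label in t and label in p for t, p in zip(y_true, y_pred))
    fp = sum(label not in t and label in p for t, p in zip(y_true, y_pred))
    fn = sum(label in t and label not in p for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-label F1: rare labels count as much as common ones."""
    return sum(label_f1(y_true, y_pred, l) for l in labels) / len(labels)


def weighted_f1(y_true, y_pred, labels):
    """Per-label F1 averaged with weights proportional to each label's support."""
    support = {l: sum(l in t for t in y_true) for l in labels}
    total = sum(support.values())
    return sum(support[l] * label_f1(y_true, y_pred, l) for l in labels) / total


# Invented toy data: "dyspnea" is the rare label and is never predicted.
labels = ["fever", "cough", "dyspnea"]
y_true = [{"fever", "cough"}, {"cough"}, {"fever"}, {"fever", "dyspnea"}, {"cough"}]
y_pred = [{"fever", "cough"}, {"cough"}, {"fever", "cough"}, {"fever"}, {"cough"}]

print(round(exact_match_ratio(y_true, y_pred), 3))    # 0.6
print(round(macro_f1(y_true, y_pred, labels), 3))     # 0.619
print(round(weighted_f1(y_true, y_pred, labels), 3))  # 0.796
```

In this toy case the missed rare label pulls macro F1 well below weighted F1, and a single wrong or missing code in a note costs the whole instance under exact match. Both effects are at least consistent with a model scoring 0.70 weighted F1 alongside an exact match ratio of 0.38, although the abstracts do not state that this is the explanation.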