Traditional Machine Learning Models and Bidirectional Encoder Representations From Transformer (BERT)–Based Automatic Classification of Tweets About Eating Disorders: Algorithm Development and Validation Study

Benítez-Andrades, José Alberto; Alija-Pérez, José-Manuel; Vidal, María-Esther; Vargas, Rafael; García-Ordás, María Teresa

doi:10.2196/34492

Cited by 25 publications (21 citation statements)

References 55 publications (87 reference statements)

“…The size of the data set is quite similar to those of Kummervold et al [17] (1633 tweets for training and 544 for testing) and Benítez-Andrades et al [26] (n=1400 for training and n=600 for testing). Furthermore, the benefit of using a pretrained model such as the CamemBERT is that a large data set is not required to obtain good results.…”

Section: Methods (mentioning)

confidence: 91%

“…This accuracy is slightly higher than that obtained by BERT for the same topic (vaccines) [17] and in the same range as previous findings [16,29]. However, CamemBERT obtained a better accuracy (78.7%-87.8%) in a study using dichotomous labels for tweets about eating disorders and using a preprocessing step, reducing the initial number of tweets by 2 [26]. However, by limiting the analysis to long tweets (170 or more characters, in accordance with the statistical analysis conducted on the performance of the model), the accuracy of classification model 2 improved significantly (from 62.9% to 72.4% for the F1-score).…”

Section: Discussion (mentioning)

confidence: 99%

An Analysis of French-Language Tweets About COVID-19 Vaccines: Supervised Learning Approach

Sauvayre, Vernier, Chauvière

2022

JMIR Med Inform

Background As the COVID-19 pandemic progressed, disinformation, fake news, and conspiracy theories spread through many parts of society. However, the disinformation spreading through social media is, according to the literature, one of the causes of increased COVID-19 vaccine hesitancy. In this context, the analysis of social media posts is particularly important, but the large amount of data exchanged on social media platforms requires specific methods. This is why machine learning and natural language processing models are increasingly applied to social media data.
Objective The aim of this study is to examine the capability of the CamemBERT French-language model to faithfully predict the elaborated categories, with the knowledge that tweets about vaccination are often ambiguous, sarcastic, or irrelevant to the studied topic.
Methods A total of 901,908 unique French-language tweets related to vaccination published between July 12, 2021, and August 11, 2021, were extracted using Twitter’s application programming interface (version 2; Twitter Inc). Approximately 2000 randomly selected tweets were labeled with 2 types of categorizations: (1) arguments for (pros) or against (cons) vaccination (health measures included) and (2) type of content (scientific, political, social, or vaccination status). The CamemBERT model was fine-tuned and tested for the classification of French-language tweets. The model’s performance was assessed by computing the F1-score, and confusion matrices were obtained.
Results The accuracy of the applied machine learning reached up to 70.6% for the first classification (pro and con tweets) and up to 90% for the second classification (scientific and political tweets). Furthermore, a tweet was 1.86 times more likely to be incorrectly classified by the model if it contained fewer than 170 characters (odds ratio 1.86; 95% CI 1.20-2.86).
Conclusions The accuracy of the model is affected by the classification chosen and the topic of the message examined. When the vaccine debate is jostled by contested political decisions, tweet content becomes so heterogeneous that the accuracy of the model drops for less differentiated classes. However, our tests showed that it is possible to improve the accuracy by selecting tweets using a new method based on tweet length.
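
The abstract reports accuracy and F1-score alongside confusion matrices. As a minimal sketch of how these metrics relate, the Python below derives them from a confusion matrix. It assumes rows are true classes and columns are predicted classes; the function names and the counts are hypothetical illustrations and are not taken from the study.

```python
from typing import List


def per_class_f1(cm: List[List[int]]) -> List[float]:
    """Return the F1-score of each class.

    Assumes cm[i][j] is the number of items whose true class is i
    and whose predicted class is j.
    """
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]                               # correctly predicted as k
        fp = sum(cm[i][k] for i in range(n)) - tp   # predicted k, true class differs
        fn = sum(cm[k]) - tp                        # true k, predicted otherwise
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def macro_f1(cm: List[List[int]]) -> float:
    """Unweighted mean of the per-class F1-scores."""
    scores = per_class_f1(cm)
    return sum(scores) / len(scores)


def accuracy(cm: List[List[int]]) -> float:
    """Share of items on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total if total else 0.0


# Hypothetical two-class counts (for example pro vs. con), for illustration only.
example_cm = [
    [40, 10],  # true class 0: 40 correct, 10 misclassified as class 1
    [20, 30],  # true class 1: 20 misclassified as class 0, 30 correct
]

if __name__ == "__main__":
    print("accuracy:", accuracy(example_cm))
    print("per-class F1:", per_class_f1(example_cm))
    print("macro F1:", macro_f1(example_cm))
```

Accuracy and F1 can diverge when classes are imbalanced or when errors concentrate in one class, which is one reason a report may quote both.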

“…We note that prior to using BERT, we have attempted K-means clustering technique that failed at clustering the documents into interpretable topics. Also, in previous studies (Benitez-Andrades et al, 2022;Bilal and Almazroi, 2022) the authors observed that BERT-based classifiers outperform bag-of-words approaches. Though BERT-based models can be computationally expensive (Bhattacharjee et al, 2020), we utilized BERTopic to have a better accuracy in classifying documents into interpretable topics.…”

Section: Topic Modeling To Identify Subfields Within the Covid-vaccin... (mentioning)

confidence: 85%

The research foundation for COVID-19 vaccine development

Messan, Sulima, Ghosh, et al.

2023

Front. Res. Metr. Anal.

The development of effective vaccines in <1 year to combat the spread of coronavirus disease 19 (COVID-19) is an example of particularly rapid progress in biomedicine. However, this was only made possible by decades of investment in scientific research. Many important research commentaries and reviews have been provided to describe the various contributions and scientific breakthroughs that led to the development of COVID-19 vaccines. In this work, we sought to complement those efforts by adding a systematic and quantitative study of the research foundations that led to these vaccines. Here, we analyzed citations from COVID-19 vaccine research articles to determine which scientific areas of study contributed the most to this research. Our findings revealed that coronavirus research was cited most often, and by a large margin. However, significant contributions were also seen from a diverse set of fields such as cancer, diabetes, and HIV/AIDS. In addition, we examined the publication history of the most prolific authors of COVID-19 vaccine research to determine their research expertise prior to the pandemic. Interestingly, although COVID-19 vaccine research relied most heavily on previous coronavirus work, we find that the most prolific authors on these publications most often had expertise in other areas including influenza, cancer, and HIV/AIDS. Finally, we used machine learning to identify and group together publications based on their major topic areas. This allowed us to elucidate the differences in citations between research areas. These findings highlight and quantify the relevance of prior research from a variety of scientific fields to the rapid development of a COVID-19 vaccine. This study also illustrates the importance of funding and sustaining a diverse research enterprise to facilitate a rapid response to future pandemics.

“…• when other sources of information are not freely available, such as in languages other than English [7,9,19,20,22,38,43,47]; • when researchers investigate questions related to patients and population, while these questions are not discussed with medical doctors or require large population samples. We can mention for instance sentiment analysis on medication and vaccines [7,20,33,38,43,47], and adverse drug effects [76,77]; • when mental health of patients is concerned in cases like depression [9, 4, 78], eating disorders [79], suicide detection and prevention [19,80], quality of life of patients [31,81], and drug misuse [82][83][84].…”

Section: Social Media As the Preferred Source Of Information (mentioning)

confidence: 99%

Year 2022 in Medical Natural Language Processing: Availability of Language Models as a Step in the Democratization of NLP in the Biomedical Area

Grouin, Grabar

2023

Yearb Med Inform

Objectives: To analyse the content of publications within the medical Natural Language Processing (NLP) domain in 2022. Methods: Automatic and manual preselection of publications to be reviewed, and selection of the best NLP papers of the year. Analysis of the important issues. Results: Three best papers have been selected. We also propose an analysis of the content of the NLP publications in 2022, stressing on some of the topics. Conclusion: The main trend in 2022 is certainly related to the availability of large language models, especially those based on Transformers, and to their use by non-NLP researchers. This leads to the democratization of the NLP methods. We also observe the renewal of interest to languages other than English, the continuation of research on information extraction and prediction, the massive use of data from social media, and the consideration of needs and interests of patients.
