BACKGROUND
Rare diseases affect roughly 400 million people worldwide, including 30 million in Europe. Intangible costs and personal aspects for patients and their families are rarely accounted for, as most studies focus on easy-to-measure metrics. Social media, meanwhile, play an important role for these people, who can easily feel isolated and who seek both support and advice online.
OBJECTIVE
We aim to examine how social media have been used in the peer-reviewed academic literature to generate new knowledge, and which types of research questions have been answered through social data mining. We also explore which methods, data sources, and data types are used.
METHODS
We reviewed studies that focused on rare diseases, were based on user-generated data, and were published before May 2023. For the included publications, a list of pertinent variables was established to cover data sources, data processing, and objectives. These variables were subsequently analysed quantitatively and qualitatively.
RESULTS
Eighty-seven studies were included. The vast majority of publications (94.3%) focused on a single rare disease or on a family of rare diseases. Overall, fewer than one hundred rare diseases were studied in the included publications. Moreover, 93.1% of the studies analysed content in English. Surprisingly, automated methods were used in only seven studies, all published after 2020. These seven publications analysed a mean of 33,201 posts, compared with 1,405 for publications that analysed posts manually. Three of them covered a temporal range of five years or more, which is half of all publications with such a range; the majority of publications covered less than two years. Among the seven publications using AI methods, the two main AI-assisted tasks were sentiment analysis and topic identification.
CONCLUSIONS
This work provides a picture of how user-generated social media data were actually being used for rare disease research in 2023. The opportunities offered by current AI research on natural language processing (NLP) are still underexploited in this very specific field, and online data are underexploited as a result. In contrast with the high expectations of the rare disease research community, this review shows that social-media-based studies in this field are still at an early stage: only a small fraction of rare diseases and only a few languages have been studied, and very few studies exploit current NLP progress to extract knowledge from social media data.