Suicide is among the leading causes of death in China. However, technical approaches toward preventing suicide are challenging and remaining under development. Recently, several actual suicidal cases were preceded by users who posted microblogs with suicidal ideation to Sina Weibo, a Chinese social media network akin to Twitter. It would therefore be desirable to detect suicidal ideations from microblogs in real-time, and immediately alert appropriate support groups, which may lead to successful prevention. In this paper, we propose a real-time suicidal ideation detection system deployed over Weibo, using machine learning and known psychological techniques. Currently, we have identified 53 known suicidal cases who posted suicide notes on Weibo prior to their deaths. We explore linguistic features of these known cases using a psychological lexicon dictionary, and train an effective suicidal Weibo post detection model. 6714 tagged posts and several classifiers are used to verify the model. By combining both machine learning and psychological knowledge, SVM classifier has the best performance of different classifiers, yielding an F-measure of 68.3%, a Precision of 78.9%, and a Recall of 60.3%.
Abstract. If people with high risk of suicide can be identified through social media like microblog, it is possible to implement an active intervention system to save their lives. Based on this motivation, the current study administered the Suicide Probability Scale(SPS) to 1041 weibo users at Sina Weibo, which is a leading microblog service provider in China. Two NLP (Natural Language Processing) methods, the Chinese edition of Linguistic Inquiry and Word Count (LIWC) lexicon and Latent Dirichlet Allocation (LDA), are used to extract linguistic features from the Sina Weibo data. We trained predicting models by machine learning algorithm based on these two types of features, to estimate suicide probability based on linguistic features. The experiment results indicate that LDA can find topics that relate to suicide probability, and improve the performance of prediction. Our study adds value in prediction of suicidal probability of social network users with their behaviors.
Many corpora span broad periods of time. Language processing models trained during one time period may not work well in future time periods, and the best model may depend on specific times of year (e.g., people might describe hotels differently in reviews during the winter versus the summer). This study investigates how document classifiers trained on documents from certain time intervals perform on documents from other time intervals, considering both seasonal intervals (intervals that repeat across years, e.g., winter) and non-seasonal intervals (e.g., specific years). We show experimentally that classification performance varies over time, and that performance can be improved by using a standard domain adaptation approach to adjust for changes in time.
Language usage can change across periods of time, but document classifiers models are usually trained and tested on corpora spanning multiple years without considering temporal variations. This paper describes two complementary ways to adapt classifiers to shifts across time. First, we show that diachronic word embeddings, which were originally developed to study language change, can also improve document classification, and we show a simple method for constructing this type of embedding. Second, we propose a time-driven neural classification model inspired by methods for domain adaptation. Experiments on six corpora show how these methods can make classifiers more robust over time.
IntroductionThe Centers for Disease Control and Prevention (CDC) spend significant time and resources to track influenza vaccination coverage each influenza season using national surveys. Emerging data from social media provide an alternative solution to surveillance at both national and local levels of influenza vaccination coverage in near real time.ObjectivesThis study aimed to characterise and analyse the vaccinated population from temporal, demographical and geographical perspectives using automatic classification of vaccination-related Twitter data.MethodsIn this cross-sectional study, we continuously collected tweets containing both influenza-related terms and vaccine-related terms covering four consecutive influenza seasons from 2013 to 2017. We created a machine learning classifier to identify relevant tweets, then evaluated the approach by comparing to data from the CDC’s FluVaxView. We limited our analysis to tweets geolocated within the USA.ResultsWe assessed 1 124 839 tweets. We found strong correlations of 0.799 between monthly Twitter estimates and CDC, with correlations as high as 0.950 in individual influenza seasons. We also found that our approach obtained geographical correlations of 0.387 at the US state level and 0.467 at the regional level. Finally, we found a higher level of influenza vaccine tweets among female users than male users, also consistent with the results of CDC surveys on vaccine uptake.ConclusionSignificant correlations between Twitter data and CDC data show the potential of using social media for vaccination surveillance. Temporal variability is captured better than geographical and demographical variability. We discuss potential paths forward for leveraging this approach.
The COVID-19 pandemic brought widespread attention to an “infodemic” of potential health misinformation. This claim has not been assessed based on evidence. We evaluated if health misinformation became more common during the pandemic. We gathered about 325 million posts sharing URLs from Twitter and Facebook during the beginning of the pandemic (March 8-May 1, 2020) compared to the same period in 2019. We relied on source credibility as an accepted proxy for misinformation across this database. Human annotators also coded a subsample of 3000 posts with URLs for misinformation. Posts about COVID-19 were 0.37 times as likely to link to “not credible” sources and 1.13 times more likely to link to “more credible” sources than prior to the pandemic. Posts linking to “not credible” sources were 3.67 times more likely to include misinformation compared to posts from “more credible” sources. Thus, during the earliest stages of the pandemic, when claims of an infodemic emerged, social media contained proportionally less misinformation than expected based on the prior year. Our results suggest that widespread health misinformation is not unique to COVID-19. Rather, it is a systemic feature of online health communication that can adversely impact public health behaviors and must therefore be addressed.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.