In this work we address the problem of generic, automated disease incidence monitoring on Twitter. We employ an ontology of disease-related concepts and use it to obtain a conceptual representation of tweets. Unlike previous keyword-based systems and topic-modeling approaches, our ontological approach allows us to apply more stringent relevance criteria, such as spatial and temporal characteristics, while giving a stronger guarantee that the resulting models will perform well on new data that may be lexically divergent. We achieve this by training learners on concepts rather than individual words. For training we use a dataset containing mentions of influenza and Listeria, and we use the learned models to classify datasets containing mentions of an arbitrary selection of other diseases. We show that our ontological approach achieves good performance on this task across a variety of natural language processing techniques. We also show that word vectors can be learned directly from our concepts to achieve even better results.
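The core idea of training learners on ontology concepts rather than surface words can be illustrated with a minimal sketch. Everything below is assumed for illustration: the term-to-concept lexicon, the concept IDs, and the toy tweets and labels are hypothetical stand-ins for the paper's actual ontology and training data, not its method.

```python
# Minimal sketch of concept-based tweet classification, assuming a
# hypothetical lookup table mapping surface terms to ontology concept IDs.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical term-to-concept lexicon; a real system would derive this
# from the ontology (synonyms and lexical variants per concept).
LEXICON = {
    "flu": "C_INFLUENZA", "influenza": "C_INFLUENZA",
    "listeria": "C_LISTERIA", "fever": "C_SYMPTOM_FEVER",
}

def to_concepts(tweet: str) -> str:
    # Replace each token with its ontology concept where one exists;
    # out-of-lexicon tokens are dropped, so the model sees only concepts.
    return " ".join(LEXICON[t] for t in tweet.lower().split() if t in LEXICON)

# Toy training data (labels: 1 = reports disease incidence, 0 = does not).
tweets = [
    "down with the flu and a bad fever",
    "flu shots available at the clinic today",
    "listeria outbreak traced to cut fruit",
    "feeling great at the beach",
]
labels = [1, 0, 1, 0]

model = Pipeline([
    ("vec", CountVectorizer()),      # counts over concept IDs, not words
    ("clf", LogisticRegression()),
])
model.fit([to_concepts(t) for t in tweets], labels)

# A lexically divergent mention still maps to the same concept features.
print(model.predict([to_concepts("my kid caught influenza")]))
```

Because "influenza" and "flu" collapse to the same concept, a model trained on one surface form generalizes to the other, which is the intuition behind the lexical-divergence claim above.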
Twitter, and social media as a whole, have great potential as a source of disease surveillance data; however, the general messiness of tweets presents several challenges for standard information extraction methods. Most deployed systems rely on simple keyword matching and do not distinguish between relevant and irrelevant keyword mentions, making them susceptible to false positives, since keyword volume can be influenced by social phenomena unrelated to disease occurrence. Furthermore, most solutions are intended for a single language, and those meant for multilingual scenarios do not incorporate semantic context. In this paper we experimentally examine different approaches for classifying text for epidemiological surveillance on the social web, and we offer a systematic comparison of the impact of different input representations on performance. Specifically, we compare continuous representations against one-hot encodings for word-based, class-based (ontology-based), and subword units in the form of byte-pair encodings. We also establish the desirable performance characteristics for multilingual semantic filtering approaches and offer an in-depth discussion of the implications for end-to-end surveillance.
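As a rough illustration of the one-hot versus continuous contrast studied here, the sketch below builds both representations for the same token sequence. The toy vocabulary, the embedding dimension, and the random initialization are all assumptions made for the example; the same contrast applies whether the units are words, ontology classes, or byte-pair-encoded subwords.

```python
# Illustrative sketch contrasting one-hot and continuous (dense embedding)
# input representations for the same token sequence.
import numpy as np

vocab = {"fever": 0, "flu": 1, "outbreak": 2, "beach": 3}  # toy vocabulary
tokens = ["flu", "fever"]

# One-hot: each unit is a sparse indicator vector of size |V|.
one_hot = np.zeros((len(tokens), len(vocab)))
for i, t in enumerate(tokens):
    one_hot[i, vocab[t]] = 1.0

# Continuous: each unit indexes a row of a dense |V| x d embedding matrix,
# which in practice would be pretrained or learned end-to-end rather than
# randomly initialized as it is here.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 8))       # d = 8 for the sketch
dense = embeddings[[vocab[t] for t in tokens]]

print(one_hot.shape, dense.shape)  # (2, 4) vs (2, 8)
```

One-hot vectors treat every unit as equidistant from every other, whereas continuous vectors can place related units (e.g. "flu" and "fever") close together, which is what makes the choice of representation consequential for downstream classification.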
Posts on these platforms often reflect activity on the ground and can therefore be employed as a basis for so-called internet-based syndromic surveillance. The integration of these systems into the public health surveillance process confers several advantages. First, these systems have global reach: Facebook, the largest service, has 1.47 billion daily active users who generate unsolicited information at a rate of 293,000 status updates, 510,000 comments and 136,000 photo uploads per minute [9], and Twitter, which is the subject of this paper, has about 330 million monthly active users who generate more than 500 million tweets a day [10]. A system capable of sifting through this deluge of real-time user information for mentions of disease occurrence has a very high chance of detecting outbreaks far more quickly than traditional surveillance approaches, and at a fraction of the cost.
The social web has emerged as a dominant information architecture, accelerating technology innovation on an unprecedented scale. The utility of these developments for public health use cases such as disease surveillance, information dissemination, and outbreak prediction has been widely investigated and variously demonstrated in work spanning several published experimental studies and deployed systems. In this paper we provide an overview of automated disease surveillance efforts based on the social web, characterized by their high-level design choices regarding functional aspects such as user participation and language parsing. We briefly discuss the technical rationale and practical implications of these choices, as well as the key limitations of these systems within the context of operable disease surveillance. We hope this offers technical guidance to multi-disciplinary teams on how best to implement, interpret and evaluate disease surveillance programs based on the social web.