Over the past two decades, we have entered what can be described as the big data era, in which digital data is becoming increasingly important in many domains such as healthcare, science, technology and society. A large amount of data has been captured and generated from multiple areas and multiple sources, such as streaming machines, high-throughput instruments,
The growing number of patients with chronic diseases and the centralization of medical resources have a significant economic impact, resulting in hospital visits, hospital readmissions, and other healthcare costs. This paper proposes a scalable, real-time system for disease prediction from medical data streams, built by integrating Twitter, Apache Kafka, Apache Spark and Apache Cassandra. Twitter users tweet health-related attributes; Kafka streaming receives the desired tweet attributes and ingests them into Spark Streaming, where a machine learning algorithm is applied to predict health status, and a response message is sent back through Kafka. The heart disease dataset from the UCI repository was used for the experiments. To enhance prediction accuracy, the Relief algorithm is used for feature selection. We compared six relevant machine learning algorithms implemented in Spark MLlib, namely Random Forest (RF), Naive Bayes, Support Vector Machine, Multilayer Perceptron, Decision Tree and Logistic Regression, using both the full feature set and the selected features. The highest classification accuracy, 92.05%, was obtained using RF with the selected features. The scalability of RF in Spark MLlib and in the WEKA framework was measured for both the training and application stages. The results show that Spark performs significantly better in terms of scalability and computing time.
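The feature selection step mentioned above can be illustrated with a minimal sketch of the basic Relief weighting scheme for two classes and numeric features. The abstract does not state which Relief variant or parameters the authors used, so everything here is an assumption made for illustration: the function name `relief_weights`, the toy data, the choice to visit every instance rather than a random sample, and the rule of keeping features with positive weight.

```python
def relief_weights(X, y):
    """Basic Relief for two classes and numeric features (illustrative sketch).

    For each instance, find its nearest neighbour of the same class (hit)
    and of the other class (miss). A feature gains weight when it differs
    on the miss and loses weight when it differs on the hit.
    """
    n, d = len(X), len(X[0])

    # Per-feature value range, used to scale differences into [0, 1].
    span = []
    for j in range(d):
        column = [row[j] for row in X]
        span.append((max(column) - min(column)) or 1.0)

    def diff(j, a, b):
        return abs(a[j] - b[j]) / span[j]

    def dist(a, b):
        return sum(diff(j, a, b) for j in range(d))

    weights = [0.0] * d
    # Assumption: every instance is visited once; the original algorithm
    # draws a random sample of instances instead.
    for i in range(n):
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        if not hits or not misses:
            continue
        hit = min(hits, key=lambda k: dist(X[i], X[k]))
        miss = min(misses, key=lambda k: dist(X[i], X[k]))
        for j in range(d):
            weights[j] += (diff(j, X[i], X[miss]) - diff(j, X[i], X[hit])) / n
    return weights


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.5], [0.9, 0.4], [1.0, 0.8], [0.8, 0.1]]
y = [0, 0, 0, 1, 1, 1]

weights = relief_weights(X, y)
# Assumed selection rule: keep features whose weight is positive.
selected = [j for j, w in enumerate(weights) if w > 0]
print(weights, selected)
```

On this toy data the class-separating feature receives a positive weight and the noise feature a negative one, so only the first feature is kept. In a pipeline like the one described, the retained columns would then be passed to the classifier in place of the full feature set.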