A Dynamic Spark-based Classification Framework for Imbalanced Big Data

Abdel-Hamid, Nahla B.; Elghamrawy, Sally M.; Ali, Hesham; Arafat, Hesham

doi:10.1007/s10723-018-9465-z

Cited by 25 publications

(10 citation statements)

References 22 publications

“…This consisted of 23.7% tweets that were labelled as relevant and 76.3% labelled as irrelevant. Imbalanced data causes well known problems to classification models [48]. We initially tried both oversampling and undersampling techniques to create a balanced training dataset as well as just using the unbalanced data.…”

Section: Methods (mentioning)

confidence: 99%

Twitter mining using semi-supervised classification for relevance filtering in syndromic surveillance

et al. 2019

We investigate the use of Twitter data to deliver signals for syndromic surveillance in order to assess its ability to augment existing syndromic surveillance efforts and give a better understanding of symptomatic people who do not seek healthcare advice directly. We focus on a specific syndrome: asthma/difficulty breathing. We outline data collection using the Twitter streaming API as well as analysis and pre-processing of the collected data. Even with keyword-based data collection, many of the tweets collected are not relevant because they represent chatter, or talk of awareness instead of an individual suffering a particular condition. In light of this, we set out to identify relevant tweets to collect a strong and reliable signal. For this, we investigate text classification techniques, and in particular we focus on semi-supervised classification techniques since they enable us to use more of the Twitter data collected while doing only minimal labelling. In this paper, we propose a semi-supervised approach to symptomatic tweet classification and relevance filtering. We also propose alternative techniques to popular deep learning approaches. Additionally, we highlight the use of emojis and other special features capturing the tweet’s tone to improve the classification performance. Our results show that negative emojis and those that denote laughter provide the best classification performance in conjunction with a simple word-level n-gram approach. We obtain good performance in classifying symptomatic tweets with both supervised and semi-supervised algorithms and find that the proposed semi-supervised algorithms preserve more of the relevant tweets and may be advantageous in the context of a weak signal. Finally, we found some correlation (r = 0.414, p = 0.0004) between the Twitter signal generated with the semi-supervised system and data from consultations for related health conditions.
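The citation statement for this entry mentions trying oversampling and undersampling to balance a 23.7% / 76.3% label split. As a rough illustration of what those two resampling strategies do, here is a minimal sketch in plain Python. It is not the citing authors' implementation; the function names `random_oversample` and `random_undersample` and the toy rows are hypothetical, and the sketch assumes simple random resampling rather than any synthetic-sample method.

```python
import random
from collections import Counter


def _group_by_class(rows, labels):
    """Collect rows into a dict keyed by class label."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    return by_class


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes
    have as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, labels)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample with replacement to fill the gap to the majority size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of larger classes so all classes have as many
    rows as the smallest class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, labels)
    target = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample without replacement down to the minority size.
        out_rows.extend(rng.sample(members, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Toy data (assumed): 1000 rows mirroring the 23.7% / 76.3% split.
labels = ["relevant"] * 237 + ["irrelevant"] * 763
rows = [f"tweet {i}" for i in range(len(labels))]

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
print(Counter(over_labels))
print(Counter(under_labels))
```

With this toy split, oversampling yields 763 rows per class and undersampling yields 237 rows per class. Resampling would normally be applied to the training portion only, so that evaluation data keeps the original class distribution.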

“…Regardless of the specific type of machine learning method being used, a significant challenge is the unequal distribution of classes within a data set, referred to as imbalanced learning [32], [50]-[53]. In our problem of interest with skewed class proportions, many observations belong to the safe category (related to as the majority class), and much fewer samples fit in the structural failure group (referred to as the minority class).…”

Section: Proposed Two-stage Framework For Predictive Modeling An (mentioning)

confidence: 99%

Neural Networks and Imbalanced Learning for Data-Driven Scientific Computing With Uncertainties

Pourkamali-Anaraki, Hariri-Ardebili. 2021. IEEE Access

Uncertainty quantification in complex engineering problems is challenging because it requires large numbers of expensive model evaluations. This paper proposes a two-stage framework for developing accurate machine learning-based surrogate models in structural engineering. The studied numerical model considers aleatory and epistemic uncertainties, i.e., ground motion features and material properties. Our framework's first step trains classification algorithms on the collected data from our numerical model with a disproportionate ratio of observations from two categories, i.e., failed and safe simulations. We investigate the performance of imbalanced learning strategies along with artificial neural networks to achieve high classification accuracy. The second step of our framework aims to estimate three quantities of interest using the same network architecture, comparing our approach with regularized linear regression models. Moreover, we present a new approach to reducing the number of numerical simulations for developing machine learning-based surrogate models with limited training data. This approach employs Gaussian processes as a powerful probabilistic technique, providing an inherent uncertainty measure to determine the quality of estimated response values. Extensive numerical experiments demonstrate the superior performance of neural networks with three hidden layers compared to traditional machine learning algorithms for both classification and regression tasks. Also, empirical investigations corroborate that Gaussian processes enable us to predict the values of missing simulations for reducing the computational cost associated with numerical models. To conclude this work, we present several applications and future research directions.

Index terms: uncertainty, Gaussian processes, neural networks, imbalanced classification, regression.

“…CQNS is a proposed framework for improving and estimating complex queries for relational databases and other types of NoSQL data stores. For this purpose, a unified data model is proposed that uses a suitable environment such as Apache Spark with MongoDB [40–42] to optimize the qualification of the data ingestion process. The CQNS framework transforms each query process received from any dataset to the matched Engine after using Hadoop/HDFS and Hadoop/MapReduce with parallel k-means clustering for processing data without physical transformation data.…”

Section: Related Work (mentioning)

confidence: 99%

An adaptive spark-based framework for querying large-scale NoSQL and relational databases

Khashan, El-Desouky, Elghamrawy. 2021. PLoS ONE

Self Cite

The growing popularity of big data analysis and cloud computing has created new big data management standards. Sometimes, programmers may interact with a number of heterogeneous data stores depending on the information they are responsible for: SQL and NoSQL data stores. Interacting with heterogeneous data models via numerous APIs and query languages imposes challenging tasks on multi-data processing developers. Indeed, complex queries concerning homogenous data structures cannot currently be performed in a declarative manner when found in single data storage applications and therefore require additional development efforts. Many models have been presented to address complex queries via multistore applications. Some of these models implemented a complex, unified and fast model, while others are not efficient enough to handle this type of complex database query. This paper provides an automated, fast and easy unified architecture to solve simple and complex SQL and NoSQL queries over heterogeneous data stores (CQNS). The proposed framework can be used in cloud environments or in any big data application to automatically help developers manage basic and complicated database queries. CQNS consists of three layers: the matching selector layer, the processing layer, and the query execution layer. The matching selector layer is the heart of this architecture, in which five of the user queries are checked for a match against another five queries stored in a single engine in the architecture library. This is achieved through a proposed algorithm that directs the query to the right SQL or NoSQL database engine. Furthermore, CQNS deals with many NoSQL databases such as MongoDB, Cassandra, Riak, CouchDB, and Neo4j. This paper presents a Spark framework that can handle both SQL and NoSQL databases.
Benchmark datasets from four scenarios are used to evaluate the proposed CQNS for querying different NoSQL databases in terms of optimization process performance and query execution time. The results show that CQNS achieves the best latency and throughput in less time among the compared systems.
