2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)
DOI: 10.1109/icmla.2015.22
The Effect of Dataset Size on Training Tweet Sentiment Classifiers

Cited by 52 publications (30 citation statements)
References 6 publications
“…• The first dataset is a Twitter sentiment labeled dataset with emojis. It has 6,600 positive and negative Tweets each, a large enough dataset for accurate sentiment analysis (Prusa, Khoshgoftaar, and Seliya 2015). • The second dataset is a Twitter sentiment labeled dataset without emojis.…”
Section: Dataset (mentioning)
confidence: 99%
“…The dataset size is considered a critical property in determining the performance of a machine learning model. Typically, large datasets lead to better classification performance, while small datasets may trigger over-fitting [1][2][3]. In practice, however, collecting medical data faces many challenges due to patients' privacy, a lack of cases for rare conditions [4], as well as organizational and legal challenges [5,6].…”
Section: Introduction (mentioning)
confidence: 99%
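The relationship this excerpt describes, where larger training sets generally yield better classification accuracy while very small ones underperform, can be illustrated with a minimal learning-curve sketch. Everything here is illustrative and hypothetical: the synthetic "tweets", the cue-word vocabularies, and the hand-rolled naive Bayes classifier are stand-ins, not the paper's actual data or method.

```python
import math
import random
from collections import Counter

random.seed(0)

# Hypothetical sentiment cue words and neutral filler words.
POS_WORDS = ["love", "great", "happy", "awesome", "nice"]
NEG_WORDS = ["hate", "bad", "sad", "awful", "terrible"]
NOISE = ["the", "a", "today", "really", "just", "so"]

def make_tweet(label):
    """Build a synthetic 'tweet': one sentiment cue word plus filler."""
    cue = random.choice(POS_WORDS if label == 1 else NEG_WORDS)
    words = [cue] + random.choices(NOISE, k=4)
    random.shuffle(words)
    return " ".join(words), label

def train_nb(data):
    """Collect per-class word counts and class priors."""
    counts = {0: Counter(), 1: Counter()}
    priors = Counter()
    for text, label in data:
        priors[label] += 1
        counts[label].update(text.split())
    return counts, priors

def predict(model, text):
    """Naive Bayes with add-one smoothing over the training vocabulary."""
    counts, priors = model
    vocab = set(counts[0]) | set(counts[1])
    best, best_lp = None, float("-inf")
    for label in (0, 1):
        total = sum(counts[label].values())
        lp = math.log(priors[label] / sum(priors.values()))
        for w in text.split():
            lp += math.log((counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Fixed test set; training sets of increasing size.
test = [make_tweet(i % 2) for i in range(200)]
accs = []
for n in (10, 100, 1000):
    train = [make_tweet(i % 2) for i in range(n)]
    model = train_nb(train)
    acc = sum(predict(model, t) == y for t, y in test) / len(test)
    accs.append(acc)
    print(f"train size {n:4d}: accuracy {acc:.2f}")
```

With only 10 training tweets, some cue words never appear in training, so those test tweets are classified essentially at random; with 1000, every cue word is well represented and accuracy approaches its ceiling. This is the overfitting-on-small-data effect the excerpt attributes to [1][2][3], reduced to a toy setting.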
“…Establishing a method to find the trend in small datasets is not only of scientific interest but also of practical importance, and it requires special care when developing machine learning models. Unfortunately, classification algorithms may perform worse when trained on limited-size datasets [2]. This is because small datasets typically contain fewer details, so the classification model cannot generalize patterns from the training data.…”
Section: Introduction (mentioning)
confidence: 99%
“…Moreover, the limited number of samples makes it affordable to test different IDS solutions on the complete set without the need to select a small random partition. In fact, even though the complexity of a dataset is important for faithfully emulating a real industrial plant, an overly large dataset is not handled well by machine learning algorithms, which reduces its usability [30]. Thus, the evaluation results of different papers can be compared effectively to identify the best algorithms, without any influence from randomly selected data partitions.…”
Section: Introduction (mentioning)
confidence: 99%