An Investigation and Evaluation of N-Gram, TF-IDF and Ensemble Methods in Sentiment Classification

Rahman, Sheikh Shah Mohammad Motiur; Biplob, Khalid Been Md. Badruzzaman; Rahman, Md. Habibur; Sarker, Kaushik; Islam, Takia

doi:10.1007/978-3-030-52856-0_31

Cited by 15 publications

(7 citation statements)

References 16 publications

“…N represents the number of adjacent words considered as a sequence. In the unigram, each word is considered as a single sequence, whereas in bigram every two words are a sequence [27, 28].…”

Section: Methods (mentioning)

confidence: 99%

Computational Intelligence-Based Model for Exploring Individual Perception on SARS-CoV-2 Vaccine in Saudi Arabia

Khan, Aslam, Chrouf, et al. 2022

Computational Intelligence and Neuroscience

Countries around the world face many challenges in slowing the spread of the SARS-CoV-2 virus. Vaccination is an effective way to combat the virus and prevent its spread among individuals. Currently, there are more than 50 SARS-CoV-2 vaccine candidates in trials; only a few of them are already in use. The primary objective of this study is to analyse public awareness and opinion toward the vaccination process and to develop a model that predicts the awareness and acceptability of SARS-CoV-2 vaccines in Saudi Arabia by analysing a dataset of Arabic tweets related to vaccination. To this end, several machine learning models, namely Support Vector Machine (SVM), Naïve Bayes (NB), and Logistic Regression (LR), were used alongside the N-gram and Term Frequency-Inverse Document Frequency (TF-IDF) techniques for feature extraction, and a Long Short-Term Memory (LSTM) model was used with word embedding. Among the SVM, NB, and LR models, LR with unigram feature extraction achieved the best accuracy, recall, and F1 score, with scores of 0.76, 0.69, and 0.72, respectively, while the best precision value of 0.80 was achieved using SVM with unigram and NB with bigram TF-IDF. However, the LSTM model outperformed the other models with an accuracy of 0.95, a precision of 0.96, a recall of 0.95, and an F1 score of 0.95. This model will help in gaining a fuller picture of how receptive people are to the vaccine, so that the government can find new ways and run more campaigns to raise awareness of the importance of the vaccine.
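To make the unigram and bigram terminology in the citation statement for this entry concrete, here is a minimal Python sketch. The helper name `word_ngrams` and the plain whitespace tokenization are illustrative assumptions, not the cited authors' implementation; Arabic tweets would normally need more careful tokenization and normalization.

```python
def word_ngrams(text, n):
    """Return every run of n adjacent words in text as a tuple.

    Hypothetical helper: splits on whitespace only, which is an
    assumption made for illustration.
    """
    tokens = text.split()
    # Slide a window of width n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "the vaccine is safe"
unigrams = word_ngrams(sentence, 1)  # each word is its own sequence
bigrams = word_ngrams(sentence, 2)   # every two adjacent words form a sequence
```

With N = 1 the four-word sentence yields four sequences; with N = 2 it yields three overlapping pairs.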

“…Therefore, a features extraction approach is implemented to convert text data into numerical vectors that the algorithms can process and work with. N-gram and the Term Frequency/Inverse Document Frequency (TF-IDF) are the most used feature extraction approaches [18].…”

Section: Methods (mentioning)

confidence: 99%

Sentiment Analysis of Arabic Tweets Regarding Distance Learning in Saudi Arabia during the COVID-19 Pandemic

Aljabri, Chrouf, Alzahrani, et al. 2021

Sensors

The COVID-19 pandemic has greatly impacted the normal life of people worldwide. One of the most noticeable impacts is the enforcement of social distancing to reduce the spread of the virus. The Ministry of Education in Saudi Arabia implemented social distancing measures by enforcing distance learning at all educational stages. This measure brought about new experiences and challenges to students, parents, and teachers. This research measures the acceptance rate of this way of learning by analysing people’s tweets regarding distance learning in Saudi Arabia. All the tweets analysed were written in Arabic and collected within the boundary of Saudi Arabia. They date back to the day that the distance learning announcement was made. The tweets were pre-processed and labelled as positive or negative. Machine learning classifiers with different feature extraction techniques were then built to analyse the sentiment, and the accuracy of the different models was compared. The best accuracy achieved (0.899) resulted from the Logistic Regression classifier with unigram and Term Frequency-Inverse Document Frequency as the feature extraction approach. This model was then applied to a new unlabelled dataset, and the tweets were grouped by educational stage; results demonstrated generally positive opinions regarding distance learning for general education stages (kindergarten, intermediate, and high schools), and negative opinions for the university stage. Further analysis was applied to identify the main topics related to the positive and negative sentiment. This result can be used by the Ministry of Education to further improve the distance learning educational system.
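The conversion of text into numerical vectors mentioned in the citation statement for this entry can be illustrated with a small TF-IDF sketch. It assumes one common textbook weighting, term frequency (count divided by document length) times the unsmoothed inverse document frequency ln(N / df); libraries often use smoothed or normalized variants, and the cited paper's exact formula is not given here. The function name `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocab, vectors) using tf = count / len(doc), idf = ln(N / df).

    Illustrative assumption: lowercase whitespace tokenization and
    unsmoothed idf, not the weighting of any particular library.
    """
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        length = len(doc) or 1  # guard against empty documents
        vectors.append([counts[t] / length * idf[t] for t in vocab])
    return vocab, vectors


vocab, vectors = tfidf_vectors(["good class", "bad class"])
```

In this toy corpus the word "class" appears in every document, so its idf is ln(2/2) = 0 and it contributes nothing to either vector, while "good" and "bad" each keep a positive weight in the document that contains them.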

“…The N-gram technique represents the text as an N-words sequence; it can be simple or complex, based on the value of N. In unigrams, it considers each word a sequence, while in bigrams it considers each pair of words a sequence. Then, the vectorizer calculates the occurrences of each sequence to generate the sentences' vectors [23].…”

Section: Feature Extraction (mentioning)

confidence: 99%

Machine Learning Model for Sentiment Analysis of COVID-19 Tweets

Aljabri, Aljameel, Khan, et al. 2022

International Journal on Advanced Science, Engineering and Information Technology

The COVID-19 pandemic presents unprecedented challenges and enormously affects different aspects of individuals' lives worldwide. The implementation of different prevention measures, the economic and social disruption, and the significant rise in the mortality rate greatly affect people's spectrum of emotions. Sentiment analysis, an important branch of artificial intelligence, uses machine learning techniques to understand public perspectives and gain more insight into how people think and feel. During the pandemic, sentiment analysis has increasingly contributed to making appropriate decisions. This research aims to analyze public sentiment related to COVID-19 by exploring social perceptions shared on Twitter, one of the most ubiquitous social networks. This goal was achieved by building a machine learning model using a dataset of COVID-19 related English tweets. Different combinations of machine learning classification algorithms (Support Vector Machine (SVM), Random Forest (RF), and XGBoost (XGB)) and feature extraction techniques (Term Frequency-Inverse Document Frequency (TF-IDF) and N-gram) were built and applied to the dataset for binary (positive, negative) and ternary (positive, negative, and neutral) classification. A comparative study of the performance of the different models was then conducted, and the results showed that the XGB classification algorithm with unigram and bigram for binary classification achieved the highest accuracy of 90%. This sentiment analysis model can assist countries and governments in measuring the impact of the pandemic and the applied prevention measures on people's emotional and mental health, and in taking early action to reduce that impact or prevent it from becoming severe.
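The counting step described in the citation statement for this entry, where a vectorizer tallies the occurrences of each N-word sequence to build sentence vectors, can be sketched in a few lines of Python. The names `ngram_strings` and `count_vectorize`, the lowercase whitespace tokenization, and the example sentences are assumptions made for illustration; this is not the cited authors' pipeline.

```python
from collections import Counter


def ngram_strings(tokens, n):
    """Join each run of n adjacent tokens into one space-separated string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_vectorize(sentences, n=1):
    """Return (vocab, vectors): one count per vocabulary n-gram per sentence.

    Hypothetical helper for illustration only.
    """
    grams_per_sentence = [ngram_strings(s.lower().split(), n) for s in sentences]
    # The vocabulary is every distinct n-gram seen, in sorted order.
    vocab = sorted({g for grams in grams_per_sentence for g in grams})
    vectors = []
    for grams in grams_per_sentence:
        counts = Counter(grams)  # missing n-grams count as 0
        vectors.append([counts[g] for g in vocab])
    return vocab, vectors


sentences = ["stay home stay safe", "stay safe"]
uni_vocab, uni_vectors = count_vectorize(sentences, n=1)
bi_vocab, bi_vectors = count_vectorize(sentences, n=2)
```

Each sentence becomes a vector whose length equals the vocabulary size, so moving from unigrams to bigrams changes which columns exist rather than how the counting works.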
