Credibility Detection in Twitter Using Word N-gram Analysis and Supervised Machine Learning Techniques

Hassan, Noha Y.; Gomaa, Wael H.; Khoriba, Ghada; Haggag, Mohammed H.

doi:10.22266/ijies2020.0229.27

Cited by 33 publications

(24 citation statements)

References 21 publications


“…We examined frequent word pairs (bigrams) for each facet to qualitatively validate that tweets were related to the facet they were assigned to as has been done in other Twitter-based studies as they offer more insight into the sentiment of tweets than examining unigrams on their own [18], [19], [20]. A representative sample is shown in Appendix B.…”

Section: Methods (mentioning)

confidence: 99%

Defining facets of social distancing during the COVID-19 pandemic: Twitter analysis

Kwon, Grady, Feliciano et al. 2020

Journal of Biomedical Informatics


“…In this section, we compare the performance of the proposed model with three models existing in the literature. The first model was introduced by Noha Hassan et al [42] who introduces a classification model based on supervised machine learning techniques and word-based N-gram analysis to automatically classify Twitter messages into credible and non-credible. The results in Table 5 show that this model has an accuracy of 84.9%.…”

Section: Comparison (Vs Syntax-based Methods) (mentioning)

confidence: 99%

“…in Facebook also , Fong et al [35] have taken into consideration the analysis of the avatar on a profile, sex, age, and the name. Noha et al [42], introduces a classification model based on supervised machine learning techniques and word-based N-gram analysis to classify Twitter messages automatically into credible and not credible. The best performance is achieved using a combination of both unigrams and bigrams, LSVM as a classifier and TF-IDF as a feature extraction technique.…”

Section: Related Work (mentioning)

confidence: 99%

A Framework for Spam Detection in Twitter Based on Recommendation System

Elmendili, Idrissi 2020

IJIES


The rapidly growing online social networking sites (OSNs) have been infiltrated by a large amount of spam. Spammers are ill-intentioned users who degrade the quality of OSN information by misusing the services that OSNs provide. Social spammers spread many intensive posts/tweets to lure legitimate users to malicious or commercial sites containing malware downloads, phishing, and drug sales. Because Twitter is not immune to the social spam problem, researchers have designed various detection methods that inspect individual tweets or accounts for spam content. Today, social networks are exposed to various threats that exploit their vulnerabilities. However, despite their high detection rates, account-based spam detection methods are not suitable for filtering tweets in real time because they need information from Twitter's servers. At the tweet level, many lightweight features have been proposed for real-time filtering; however, the existing classification models classify each tweet separately, without considering the state of previously handled tweets associated with a topic. First, they propose identifying spam tweets with a security approach based on social honeypots; they then propose a method based on a "content filtering" algorithm to detect tweets similar to the spam tweets that the honeypot approach detected. Our approach has greatly improved the quality of abstraction in terms of performance and design. The algorithm is also fast and simple to implement. Experimental results show the stability of our approach, with accuracy over 99% and an F-measure of 98%.


“…Using a Kaggle data collection, the authors investigated to label each of the tweets as positive, negative or neutral sentiment. A classification algorithm focused on supervised ML techniques and word-based N-gram processing to automatically divide Twitter messages into credible and not credible ones introduced by (Hassan et al, 2020). Five different supervised ML classification techniques were applied and the research examines two interpretations of features (TF and TF-IDF) and separate sets of N-gram terms.…”

Section: Literature Review (mentioning)

confidence: 99%

Influence of Pre-Processing Strategies on the Performance of ML Classifiers Exploiting TF-IDF and BOW Features

Pimpalkar, Raj 2020

ADCAIJ


Data analytics and its associated applications have recently become important fields of study. The subject of concern for researchers nowadays is the massive amount of data produced every minute and second as people constantly share thoughts and opinions about things that are associated with them. Social media data, however, is still unstructured, scattered and hard to handle, and a strong foundation needs to be developed so that it can be used as valuable information on a particular topic. Processing such unstructured data in terms of noise, co-relevance, emoticons, folksonomies and slang is quite challenging and therefore requires proper data pre-processing before the right sentiments can be obtained. The dataset is extracted from Kaggle and Twitter, pre-processing is performed using NLTK and Scikit-learn, and feature selection and extraction are done for the Bag of Words (BOW), Term Frequency (TF) and Inverse Document Frequency (IDF) schemes. For polarity identification, we evaluated five different Machine Learning (ML) algorithms, viz. Multinomial Naive Bayes (MNB), Logistic Regression (LR), Decision Trees (DT), XGBoost (XGB) and Support Vector Machines (SVM). We performed a comparative analysis of these algorithms to decide which works best for the given dataset in terms of recall, accuracy, F1-score and precision. We assess the effects of various pre-processing techniques on two datasets, one with a domain and the other without. It is demonstrated that the SVM classifier outperformed the other classifiers, with 73.12% accuracy and 94.91% precision. It is also highlighted in this research that the selection and representation of features, along with various pre-processing techniques, have a positive impact on the performance of the classification. The ultimate outcome indicates an improvement in sentiment classification, and we noted that pre-processing approaches clearly improve the efficiency of the classifiers.
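The citation statements and abstracts above refer repeatedly to word n-grams (unigrams and bigrams) and to TF versus TF-IDF feature representations. As a rough illustration only, the sketch below shows how such features could be computed with the Python standard library. It is not the pipeline of any of the cited papers: the whitespace tokenizer, the unsmoothed `log(N / df)` inverse document frequency, the function names, and the toy tweets are all assumptions made for this example.

```python
import math
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Return word n-grams of length n_min..n_max.

    Assumption: tokens are lowercased and split on whitespace only.
    """
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tf_vectors(docs, n_min=1, n_max=2):
    """Raw term-frequency (count) features, one Counter per document."""
    return [Counter(word_ngrams(doc, n_min, n_max)) for doc in docs]


def tfidf_vectors(docs, n_min=1, n_max=2):
    """TF-IDF features using the plain idf = log(N / df).

    Assumption: no smoothing or normalization; libraries often apply both.
    """
    tfs = tf_vectors(docs, n_min, n_max)
    n_docs = len(docs)
    doc_freq = Counter()
    for tf in tfs:
        doc_freq.update(tf.keys())  # each document counts once per term
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    return [{term: count * idf[term] for term, count in tf.items()} for tf in tfs]


# Hypothetical toy tweets, invented for illustration.
tweets = [
    "breaking news flood hits city",
    "flood warning issued for city",
    "win a free phone click here",
]

tf = tf_vectors(tweets)
tfidf = tfidf_vectors(tweets)

# Terms shared by several tweets get a lower weight than terms unique to one.
print(tfidf[0]["flood"], tfidf[2]["free phone"])
```

Under these assumptions, a bigram that occurs in only one tweet (such as "free phone") receives a higher TF-IDF weight than a unigram shared by two tweets (such as "flood"), whereas the raw TF counts for both are 1. The resulting vectors could then be passed to any supervised classifier.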

