2017
DOI: 10.1002/asi.23836

An analysis of 14 Million tweets on hashtag‐oriented spamming*

Abstract: Over the years, Twitter has become a popular platform for information dissemination and information gathering. However, the popularity of Twitter has attracted not only legitimate users but also spammers who exploit social graphs, popular keywords, and hashtags for malicious purposes. In this paper, we present a detailed analysis of the HSpam14 dataset, which contains 14 million tweets with spam and ham (i.e., nonspam) labels, to understand spamming activities on Twitter. The primary focus of this paper is to …

Citations: Cited by 45 publications (64 citation statements)
References: References 45 publications
“…We combine MinHash clustering and incremental clustering to group tweets collected by keywords TC within a time window into clusters. It has been reported that the MinHash algorithm is effective in grouping near-duplicate tweets (Sedhai & Sun, 2015). However, tweets with a different minimum hash value could also be similar.…”
Section: Event-related Tweets Identification (mentioning)
confidence: 99%
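
The MinHash grouping described in the statement above can be sketched as follows. This is a minimal illustration under stated assumptions, not the citing authors' implementation: the function names (`shingles`, `minhash_value`, `group_by_minhash`, `jaccard`), the word 3-gram shingling, and the MD5-based hash are all hypothetical choices made for the example.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of word k-grams of a tweet (assumed tokenization: whitespace)."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def stable_hash(shingle):
    """Deterministic 64-bit hash of a shingle (MD5 is an arbitrary choice here)."""
    digest = hashlib.md5(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_value(text, k=3):
    """Minimum hash value over all shingles of the tweet."""
    return min(stable_hash(s) for s in shingles(text, k))


def group_by_minhash(tweets, k=3):
    """Bucket tweets that share the same minimum hash value."""
    buckets = defaultdict(list)
    for tweet in tweets:
        buckets[minhash_value(tweet, k)].append(tweet)
    return list(buckets.values())


def jaccard(a, b, k=3):
    """Jaccard similarity of two tweets' shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)
```

Tweets that share a minimum hash value fall into one bucket. As the statement notes, two tweets with a high `jaccard` similarity can still receive different minimum hash values, which is why a second clustering pass over the buckets may be needed.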
“…Features used by machine learning methods for Twitter spam detection are varying and differ in terms of their level (e.g., account, tweet, and campaign), formulations, powerfulness, ease of manipulation, and their suitability for real-time detection. Authors in Sedhai and Sun (2017a) and Yang, Harkreader, and Gu (2013) provide a comprehensive study and analysis regarding hashtag, tweet, account, graph, and timing features with respect to their performance in Twitter spam detection.…”
Section: Machine Learning Approach (mentioning)
confidence: 99%
“…The most challenging task in creating a large dataset is the annotation process. Currently, researchers are using four ways to generate ground truth, including: manual inspection, blacklists, suspended accounts, and clustering (Hu et al., 2014; Hu et al., 2013; Sedhai & Sun, 2017a; Wu, Liu et al., 2017). Manual inspection is costly, time-consuming, and sometimes subjective.…”
Section: Dataset Description and Ground Truth (mentioning)
confidence: 99%
“…One possible reason for tri-gram features not in the list may be due to the sparsity of the tri-gram vocabulary in the dataset. It is interesting to note that most of the top words based on Gini-impurity score are the same as the list of hashtags having the highest spammy-index reported in [17].
(1) FoT having more than 2 hashtags (5) Contains spammy hashtag (2) Hashtag per tweet (12) Contains categorical hashtag (7) FoT having spammy hashtag (14) Fraction of capitalized tweets (2) Contains URL (11) FoT that are retweet (6) Is retweet…”
Section: Feature Analysis (mentioning)
confidence: 99%
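
The Gini-impurity ranking mentioned in the last statement can be illustrated with a small sketch. The citing paper does not say how its score was computed, so the formulation below (impurity decrease of a single binary feature against spam/ham labels) and the names `gini` and `gini_decrease` are assumptions made for illustration only.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (1 = spam, 0 = ham)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    # For two classes, 1 - p^2 - (1 - p)^2 simplifies to 2p(1 - p).
    return 2.0 * p * (1.0 - p)


def gini_decrease(feature, labels):
    """Impurity decrease when splitting tweets on a binary feature.

    `feature` holds truthy/falsy values, e.g. whether a tweet contains
    a given word or hashtag (a hypothetical feature for this example).
    """
    present = [y for x, y in zip(feature, labels) if x]
    absent = [y for x, y in zip(feature, labels) if not x]
    n = len(labels)
    weighted = (len(present) / n) * gini(present) + (len(absent) / n) * gini(absent)
    return gini(labels) - weighted
```

Under this assumed formulation, a feature that separates spam from ham perfectly yields the largest decrease, while a feature distributed equally across both classes yields zero; ranking words by this value would give a "top words" list of the kind the statement describes.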