2018
DOI: 10.1007/978-3-319-93372-6_40

Using Random String Classification to Filter and Annotate Automated Accounts

Cited by 16 publications (22 citation statements)
References 13 publications
“…Given these results, we used Logistic Regression for our production model, given that it is simpler and faster. Note that this result entails significantly more training data than we used in earlier research (see [6]), where SVM performed better. Before predicting whether or not a string was random, we first applied several heuristic filters.…”
Section: Feature Engineering
confidence: 75%
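The pre-filtering step described in the excerpt above can be sketched in Python. The specific rules here (minimum length, repeated characters, a simple word-plus-digits pattern) are illustrative assumptions, not the heuristic filters actually used in the paper:

```python
import re

def heuristic_prefilter(screen_name: str) -> bool:
    """Return True if the string should be passed to the classifier,
    False if a heuristic rule already decides it is not random.
    All three rules below are illustrative assumptions."""
    # Very short names carry too little signal to classify reliably.
    if len(screen_name) < 6:
        return False
    # Names made of a single repeated character are handled separately.
    if len(set(screen_name.lower())) == 1:
        return False
    # Short names matching a word/word_word(+digits) pattern look human-chosen.
    if re.fullmatch(r"[a-z]+[._]?[a-z]*\d{0,4}", screen_name, re.I) \
            and len(screen_name) <= 10:
        return False
    return True

print(heuristic_prefilter("jk7x9q2mzp"))  # mixed letters/digits -> classify
print(heuristic_prefilter("john_smith"))  # word pattern -> skip
```

Filtering cheap, obvious cases first keeps the classifier's workload (and its error surface) limited to genuinely ambiguous strings.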
“…When a higher value of n was used, the model was more accurate for detecting spamming IDs but it took a longer time. Beskow and Carley [22] proposed a randomly generated user ID detection method based on its randomness of strings in user IDs. Many fake user accounts are randomly generated and such randomly generated IDs are likely to have rare combinations of characters.…”
Section: B. Feature-Based Approaches
confidence: 99%
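The "rare combinations of characters" signal mentioned above can be illustrated with a minimal character-bigram model. The tiny corpus of human-chosen names is an assumption for demonstration; a name whose bigrams are rare under the model scores higher, i.e. looks more likely to be randomly generated:

```python
import math
from collections import Counter

def bigrams(s: str) -> list:
    s = s.lower()
    return [s[i:i + 2] for i in range(len(s) - 1)]

def train_bigram_model(human_names):
    """Count character bigrams over a corpus of human-chosen names."""
    counts = Counter()
    for name in human_names:
        counts.update(bigrams(name))
    return counts, sum(counts.values())

def randomness_score(name, counts, total, vocab=26 * 26):
    """Average negative log-probability of the name's bigrams with
    add-one smoothing; higher = rarer combinations = more random-looking."""
    grams = bigrams(name)
    if not grams:
        return 0.0
    return sum(-math.log((counts[g] + 1) / (total + vocab))
               for g in grams) / len(grams)

# Illustrative hand-made corpus (assumption, not real training data).
human = ["johnsmith", "maryjones", "davidlee", "sarahkhan"]
counts, total = train_bigram_model(human)
print(randomness_score("johnsmith", counts, total))   # familiar bigrams: lower
print(randomness_score("qzxkvjwq", counts, total))    # unseen bigrams: higher
```

A production system (as in the cited work) would learn such features at scale, e.g. via TF-IDF-weighted n-grams fed to a classifier, rather than this toy scoring rule.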
“…The other is feature-based methods which extract features to detect suspicious IDs. One of the latest methods is the n-gram-based randomly generated ID detection method using term frequency-inverse document frequency (TF-IDF), proposed by Beskow and Carley [22]. However, their method suffers from the curse of dimensionality because the feature dimension increases exponentially with increasing n in n-gram-based approaches.…”
Section: Introduction
confidence: 99%
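The dimensionality point can be made concrete. Assuming, for illustration, a 36-character alphabet (a-z plus digits), the number of possible n-grams, and hence the worst-case feature dimension of an n-gram TF-IDF representation, grows exponentially with n:

```python
# Worst-case n-gram feature dimension for a 36-character alphabet.
alphabet = 36
for n in range(1, 5):
    print(n, alphabet ** n)
# 36, 1296, 46656, 1679616 for n = 1..4
```

In practice only observed n-grams get a feature, but the combinatorial ceiling is why large n quickly becomes costly.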
“…Supervised models include traditional machine learning with SVM (Lee and Kim 2014), Naïve Bayes (Chen, Guan, and Su 2014), and Random Forest (Ferrara et al 2016) models trained on features extracted from Twitter's tweet and user objects. Other methods have attempted to classify accounts based only on their text (Kudugunta and Ferrara 2018) or their screen name (Beskow and Carley 2018c). Several of the available models like Botometer (Davis et al 2016) and Bot-Hunter (Beskow and Carley 2018b) are classic supervised machine learning models.…”
Section: Previous Work in Bot Detection
confidence: 99%