Classification of Short Text Using Various Preprocessing Techniques: An Empirical Evaluation

Kumar, H. M. Keerthi; Harish, B. S.

doi:10.1007/978-981-10-8633-5_3

Cited by 15 publications

(2 citation statements)

References 18 publications


“…For some specific applications, such as detecting spamming accounts [16], even more structured user features, such as the URL rate and the interaction rate, are believed to be highly informative. Interestingly, a recent study [14] has reversed the prediction logic and based the analysis on replies, but this approach struggled to predict the popularity of the original source tweet.…”

Section: High or Low Number Of Replies (mentioning)

confidence: 99%

Predicting the Volume of Response to Tweets Posted by a Single Twitter Account

et al. 2020


Social media users, including organizations, often struggle to acquire the maximum number of responses from other users, but predicting the responses that a post will receive before publication is highly desirable. Previous studies have analyzed why a given tweet may become more popular than others, and have used a variety of models trained to predict the response that a given tweet will receive. The present research addresses the prediction of response measures available on Twitter, including likes, replies and retweets. Data from a single publisher, the official US Navy Twitter account, were used to develop a feature-based model derived from structured tweet-related data. Most importantly, a deep learning feature extraction approach for analyzing unstructured tweet text was applied. A classification task with three classes, representing low, moderate and high responses to tweets, was defined and addressed using four machine learning classifiers. All proposed models were symmetrically trained in a fivefold cross-validation regime using various feature configurations, which allowed for the methodically sound comparison of prediction approaches. The best models achieved F1 scores of 0.655. Our study also used SHapley Additive exPlanations (SHAP) to demonstrate limitations in the research on explainable AI methods involving Deep Learning Language Modeling in NLP. We conclude that model performance can be significantly improved by leveraging additional information from the images and links included in tweets.
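The evaluation setup in this abstract (three response classes, fivefold cross-validation, F1 score) can be illustrated with a minimal sketch. The abstract does not say which F1 averaging was used or how folds were drawn, so the macro averaging and the contiguous, unshuffled folds below are assumptions, and the function names `macro_f1` and `k_fold_indices` are hypothetical.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (macro averaging is an assumption)."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) for k contiguous folds (no shuffling)."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


if __name__ == "__main__":
    labels = ["low", "moderate", "high"]
    y_true = ["low", "low", "moderate", "high", "high"]
    y_pred = ["low", "moderate", "moderate", "high", "low"]
    print(round(macro_f1(y_true, y_pred, labels), 3))
    for train_idx, test_idx in k_fold_indices(len(y_true), k=5):
        print(train_idx, test_idx)
```

In practice each fold's classifier would be trained on `train_idx` and scored on `test_idx`, with the per-fold F1 values averaged.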

“…The raw comment data obtained from RTV SLO contained a lot of unnecessary information, such as text formatting tags (for italics and bold text), hyperlinks, and metadata from other cited comments. As leaving the text formatting tags in the comments may dilute the information in the comments and therefore potentially worsen the resulting models, we decided to remove them from the texts [25,26].…”

Section: Data Preprocessing (mentioning)

confidence: 99%

Authorship Attribution on Short Texts in the Slovenian Language

Gabrovšek, Peer, Emeršič et al. 2023

Applied Sciences


The study investigates the task of authorship attribution on short texts in Slovenian using the BERT language model. Authorship attribution is the task of attributing a written text to its author, frequently using stylometry or computational techniques. We create five custom datasets for different numbers of included text authors and fine-tune two BERT models, SloBERTa and BERT Multilingual (mBERT), to evaluate their performance in closed-class and open-class problems with varying numbers of authors. Our models achieved an F1 score of approximately 0.95 when using the dataset with the comments of the top five users by the number of written comments. Training on datasets that include comments written by an increasing number of people results in models with a gradually decreasing F1 score. Including out-of-class comments in the evaluation decreases the F1 score by approximately 0.05. The study demonstrates the feasibility of using BERT models for authorship attribution in short texts in the Slovenian language.
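The preprocessing step quoted in the citation statement for this paper (removing text formatting tags and hyperlinks from raw comments) might look roughly like the sketch below. It assumes the tags are HTML-style (`<i>`, `<b>`, and so on), which the quoted passage does not confirm; the function name `clean_comment` is hypothetical, and the removal of metadata from cited comments is not covered because its format is not described.

```python
import html
import re

# Assumption: formatting tags are HTML-style, e.g. <i>...</i> or <b>...</b>.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")
# Hyperlinks written out in the text, starting with http(s):// or www.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
WS_RE = re.compile(r"\s+")


def clean_comment(raw):
    """Strip formatting tags and hyperlinks, then normalize whitespace."""
    text = html.unescape(raw)      # decode entities such as &amp;
    text = TAG_RE.sub(" ", text)   # drop tags, keep the text they wrap
    text = URL_RE.sub(" ", text)   # drop bare hyperlinks
    return WS_RE.sub(" ", text).strip()


if __name__ == "__main__":
    sample = "<i>Hello</i> <b>world</b> see https://example.com/page now"
    print(clean_comment(sample))
```

Tags are removed before bare URLs so that a link inside an `href` attribute disappears together with its tag.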
