SMS spam detection is an important task where spam SMS messages are identified and filtered. As greater numbers of SMS messages are communicated every day, it is very difficult for a user to remember and correlate the newer SMS messages received in context to previously received SMS. SMS threads provide a solution to this problem. In this work the problem of SMS spam detection and thread identification is discussed and a state of the art clustering-based algorithm is presented. The work is planned in two stages. In the first stage the binary classification technique is applied to categorize SMS messages into two categories namely, spam and non-spam SMS; then, in the second stage, SMS clusters are created for non-spam SMS messages using non-negative matrix factorization and K-means clustering techniques. A threading-based similarity feature, that is, time between consecutive communications, is described for the identification of SMS threads, and the impact of the time threshold in thread identification is also analysed experimentally. Performance parameters like accuracy, precision, recall and F-measure are also evaluated. The SMS threads identified in this proposed work can be used in applications like SMS thread summarization, SMS folder classification and other SMS management-related tasks.
Emails are the most popular and effective way of communicating over the internet. A number of applications are available today for computers and mobile devices for email messaging. Email messaging is constantly getting more popular and, as a result, numbers of sent and received emails are also increasing. It is very difficult for a user to remember emails and relate newer incoming emails to previous communications made on similar topics. Email threads provide a mechanism using which a user can obtain sequences of emails for a particular set of communication in a time frame and provides a number of benefits to users. In this work two email thread identification algorithms based on a nested textual clustering approach are presented. The work is planned in two stages; in the first stage two popular text clustering approaches, latent Dirichlet allocation and non-negative matrix factorization, are applied over the email messages to form the email clusters. Then in the second stage, clustering is again performed over the created email clusters to identify the email threads using threading features. Performance parameters like accuracy, precision, recall and F-measure are evaluated for the presented thread identification algorithms.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.