Abstract. In this article we propose a supervised method for expanding tweet contents to improve the recall of the tweet filtering task in online reputation management systems. Our method does not use any external resources. It consists of creating a K-NN classifier in three steps. In these steps, the tweets labeled related and unrelated in the training set are expanded by extracting and adding the most discriminative terms, calculating and adding the most frequent terms, and re-weighting the original tweet terms from the training set. Our experiments on the RepLab 2013 data set show that our method improves the performance of the filtering task, in terms of the F criterion, by up to 13% over state-of-the-art classifiers such as SVM. This data set consists of 61 entities from four domains: automotive, banking, universities, and music.
Introduction
Twitter is one of the most widely used social networks in the world. According to reports 1, as of February 2015 Twitter had 288 million users. This large user base has made Twitter one of the most studied social networks in computer science [1][2][3]. On Twitter, users can post messages of no more than 140 characters; their followers can then read and re-tweet these messages. A huge amount of information spreads through Twitter and other social networks every day, which has led to the emergence of Online Reputation Management (ORM) systems. ORM is about monitoring Internet users' opinions regarding organizations, products, or celebrities [4]. The main tasks of ORM systems are retrieving the messages posted by users, analyzing the messages, and visualizing the results [3]. An important step in ORM is detecting the messages that are related to a specific entity; in other words, classifying messages based on their context. This step is known as the filtering task. If it is carried out properly, it reduces noise, and one can expect higher-quality results. The task is quite challenging due to the ambiguity of entity names and the short length of messages. For instance, if an ORM system wants to analyze users' impressions of the BMW company, it must be able to recognize the tweets that contain this name (or other related names). However, this is not an easy task, because users may also abbreviate other phrases to BMW. For example, the 90s TV series "Boy Meets World" is also abbreviated to BMW in tweets due to the constraint on message length. Therefore, methods more sophisticated than simple keyword matching are required to carry out this step correctly.
1 http://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users/
56 P. Karisani et al.
The short length of messages is the main challenge in applying regular classification and disambiguation techniques to tweet filtering [3].
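To make the ambiguity problem concrete, the following is a minimal sketch of the simple keyword matching mentioned above. It is not part of the proposed method. The function name `keyword_filter` and the three example tweets are hypothetical, invented purely for illustration.

```python
import re


def keyword_filter(tweets, keyword):
    """Return the tweets containing `keyword` as a whole word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [tweet for tweet in tweets if pattern.search(tweet)]


# Hypothetical tweets, invented for illustration only.
tweets = [
    "Test drove the new BMW 3 series today, great handling",
    "Rewatching BMW tonight, Cory and Topanga forever",
    "Traffic was terrible this morning",
]

# Both the car tweet and the TV-series tweet match the keyword,
# so the filter cannot tell the two senses of "BMW" apart.
matched = keyword_filter(tweets, "BMW")
for tweet in matched:
    print(tweet)
```

In this sketch the filter returns the tweet about the car and the tweet about the TV series alike, since the surface form of the keyword is the only evidence it uses.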
In this research, we propose a supervised method to address this problem through tweet expansion. We expand the content of each tweet with more related words in order to increase the accuracy of matching tweets with keywords. Although we onl...