Abstract-This paper describes the principles used to collect open English and Japanese Twitter corpora for emotion analysis. We have defined a set of eight emotions, based on the Ekman and Plutchik categories and applicable to both English-speaking and Japanese cultures, and ensured that each tweet in our subset of the TREC'2011 collection is coded independently by three individuals. We analyse the emotions contained in the resulting corpora and briefly discuss the results obtained. This work will provide valuable insights for researchers interested in emotion analysis of the micro-blogosphere and in comparative studies of English and Japanese tweets.

Index Terms-Emotion, corpus, microblogs, Twitter.
I. INTRODUCTION

The analysis of emotions as depicted in the blogosphere has a number of practical applications, ranging from social studies and forensics to business analytics and marketing. The rise of microblog platforms, such as Tumblr or Twitter, has posed new challenges for sentiment analysis. Microblogs require separate treatment since they differ significantly from blogs in terms of length, lexico-grammar, style, and content. For instance, the most popular microblogging platform, Twitter, limits each message to 140 characters, thus effectively forcing users to formulate what they wish to express in a very concise way. Researchers report that such tweetspeak differs from other written English genres in many respects, and is characterised by an extensive use of acronyms, abbreviations, misspellings, and slang words [1]. Furthermore, as noted by [2], the informal nature of microblogging encourages users to write frequently, expressing their daily thoughts and emotions, which results in less polished text that is likely to be more emotionally charged than other writings.

The study of emotions in text typically relies on the analysis of annotated corpora, which provide samples of texts containing traces of emotional manifestations previously identified by human coders. However, to our knowledge, there have been few research activities aimed at creating such corpora of microblog texts. Notable exceptions include a collection of tweets about people and/or film reviews classified as positive, negative, neutral, or objective [3]; the Sanders Corpus of tweets containing the words Apple, Google, Microsoft or Twitter, classified as positive, neutral, negative, and irrelevant [4]; and the Empa Tweet corpus, containing microblog messages related to certain predefined topics and classified according to seven emotional categories [2].

Emoticons are widely used in microblog texts. 
They tend to emphasise the emotion expressed, although they may also be used habitually, in which case they can appear to contradict it. To our knowledge, the analysis of Eastern-style emoticons in microblogs, such as tweets, has not attracted much academic interest so far. As part of our sentiment analysis of microblogs, we therefore also focus on the relationship between emotions and emoticons.

In the present paper, we discuss our research efforts to create a corpus for aut...