Unlike languages such as English or French, written Chinese has no spaces between words, so a text must be segmented before any higher-level processing. New word identification is an important problem in segmentation, especially when the targets are social network texts, which contain more abbreviations and other non-standard forms. Several methods have been proposed to detect Chinese new words. Most of them treat the corpus as a static set and do not consider time domain information. In contrast, we regard our social network corpus as a text series spread along the timeline and design a new kind of feature, named dynamic features, which reflects the temporal variation of a string's statistical features. Experimental results on a dataset crawled from the biggest microblogging application in China show that this method can significantly improve the performance of Chinese new word identification.
Keywords-new word identification; time domain; social network

In recent years, web users have generated more and more text in widely used social network applications such as microblogging websites and question answering (QA) systems. Compared with traditional corpora, the texts posted by users are shorter and much closer to spoken language. In Chinese in particular, social network texts contain more new words. Because most current Chinese word segmentation algorithms are dictionary-based, it is hard to segment social network texts accurately when they contain many new words that are not in the dictionary.

To solve this problem, most existing work tries to extract rules or features that detect Chinese new words automatically. These studies can be divided into three categories. The first is the rule-based method, in which researchers try to extract explicit rules for new words from a linguistic perspective. The features in this kind of method are mainly the part of speech of strings, the collocation of strings and so on. The second is the statistical method, in which researchers try to identify new words through statistical features such as MI (Mutual Information) [3], PLU (Phrase-like Unit) [4], PLR (PLU-based Likelihood Ratio) [4], IWP (In-word Probability of a Character) [5] and AVC (Accessor Variety Criteria) [6]. The third is the combination of rule-based and statistical methods.

In most of the methods above, the corpora are normal texts such as news web pages, and time domain information is not considered when the features are computed. This is reasonable for traditional texts because they are long enough for feature extraction, even without any time domain information.
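
For concreteness, the statistical features listed above are typically computed once over the whole corpus, with no reference to posting time. A minimal sketch of one such feature, mutual information between adjacent characters, might look as follows. It assumes the common pointwise formulation log2(p(xy) / (p(x) p(y))), which may differ from the exact definition used in [3]; the function name `char_bigram_pmi` and the toy corpus are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def char_bigram_pmi(texts):
    """Return pointwise mutual information for each adjacent character pair.

    Assumption: PMI(xy) = log2(p(xy) / (p(x) * p(y))), with probabilities
    estimated from raw counts over the whole (static) corpus.
    """
    unigrams = Counter()
    bigrams = Counter()
    for text in texts:
        unigrams.update(text)  # count single characters
        bigrams.update(text[i:i + 2] for i in range(len(text) - 1))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    pmi = {}
    for pair, count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[pair[0]] / n_uni
        p_y = unigrams[pair[1]] / n_uni
        pmi[pair] = math.log2(p_xy / (p_x * p_y))
    return pmi


# Hypothetical toy corpus of short posts; real input would be microblog texts.
toy_corpus = ["今天真给力", "这个太给力了", "给力"]
scores = char_bigram_pmi(toy_corpus)
# Character pairs that co-occur more often than chance get higher scores.
top_pairs = sorted(scores, key=scores.get, reverse=True)[:3]
```

A high score suggests that two characters stick together more often than their individual frequencies would predict, which is why such measures are used as evidence that a string is a word candidate. Note that every post contributes to the same counts regardless of when it was published.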
Different from the normal texts, most of the social network texts are posted by users instead of being edited by professional editors, so they are much shorter th...