Chinese New Word Identification: A Latent Discriminative Model with Global Features

Lin

Proceedings of the 2014 IEEE 18th International Conference on Computer Supported Cooperative Work in Design (CSCWD)

2014

Unlike languages such as English or French, written Chinese has no spaces between words, so text must be segmented before any further processing. New word identification is an important problem in segmentation, especially when the targets are social network texts, which contain more abbreviations and other non-standard forms. Several methods have been proposed to detect Chinese new words. Most of them treat the corpus as a static set and ignore time domain information. In contrast, we regard our social network corpus as a text series spread along a timeline and design a new kind of feature, named dynamic features, that reflects how a string's statistical features vary over time. Experimental results on a dataset crawled from the biggest microblogging application in China show that this method can significantly improve the performance of Chinese new word identification.

Keywords: new word identification; time domain; social network

In recent years, web users have generated more and more text in widely used social network applications such as microblogging websites and question answering (QA) systems. Compared with traditional corpora, the texts posted by users are shorter and much closer to spoken language. In Chinese in particular, social network texts contain more new words. Because most current Chinese word segmentation algorithms are dictionary-based, it is hard to segment social network texts accurately when they contain many new words that are not in the dictionary.

To solve this problem, most existing work tries to extract rules or features to detect Chinese new words automatically. These studies can be divided into three categories. The first is the rule-based method, in which researchers try to extract explicit rules for new words from a linguistic perspective. The features in such methods are mainly the part of speech of strings, the collocation of strings, and so on. The second is the statistical-based method, in which researchers try to identify new words through statistical features such as MI (Mutual Information) [3], PLU (Phrase-like Unit) [4], PLR (PLU-based Likelihood Ratio) [4], IWP (In-word Probability of a Character) [5] and AVC (Accessor Variety Criteria) [6]. The third combines rule-based and statistical-based methods.

In most of the methods above, the corpora are ordinary texts such as news web pages, and time domain information is not considered when computing features. This is reasonable for traditional texts because they are long enough for feature extraction, even without any time domain information. Unlike ordinary texts, most social network texts are posted by users rather than edited by professional editors, so they are much shorter th...
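As a rough illustration of the statistical features listed above, the following minimal sketch computes pointwise mutual information for adjacent character pairs. It is not the method of any paper on this page: the function name `bigram_pmi`, the unsmoothed frequency estimates, and the base-2 logarithm are all assumptions made for the example.

```python
import math
from collections import Counter


def bigram_pmi(texts):
    """Return {bigram: PMI} for every adjacent character pair in texts.

    Hypothetical helper: probabilities are plain relative frequencies,
    with no smoothing and no minimum-count threshold.
    """
    char_counts = Counter()
    bigram_counts = Counter()
    for text in texts:
        char_counts.update(text)
        bigram_counts.update(text[i:i + 2] for i in range(len(text) - 1))

    total_chars = sum(char_counts.values())
    total_bigrams = sum(bigram_counts.values())

    scores = {}
    for bigram, count in bigram_counts.items():
        p_xy = count / total_bigrams
        p_x = char_counts[bigram[0]] / total_chars
        p_y = char_counts[bigram[1]] / total_chars
        # PMI is high when the pair co-occurs more often than chance predicts.
        scores[bigram] = math.log2(p_xy / (p_x * p_y))
    return scores


scores = bigram_pmi(["abab", "abcd"])
```

A pair whose PMI is well above that of its neighbors is a candidate word-internal boundary; in practice such a score would be only one of several features used to judge a candidate string.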

Section: Combination Of the Rule-based Methods And The Statistical-

mentioning

confidence: 99%

“…We regard the new word extraction process as a tagging process, as other researchers do in [10][11][12][13], and since the CRF model has achieved excellent results in [10][11][12][13], we also choose it as the tagging model.…”

Section: A Overall Framework

mentioning

confidence: 99%

New word identification in social network text based on time series information

Lin

Proceedings of the 2014 IEEE 18th International Conference on Computer Supported Cooperative Work in Design (CSCWD)

2014

Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Confere

“…There are also many hybrid methods that combine statistical metrics with linguistic knowledge and machine learning algorithms, such as Part-of-Speech filters (Smadja, 1994; Asanee, 1997), roles tagging (Zhang et al, 2002), syntactic discriminators (Chen & Ma 2002), max-margin Markov networks (Qiao and Sun, 2010; Li and Chang, 2010), Unsupervised Learning Strategy (Sun et al, 2004), Latent Discriminative Model (Sun et al, 2011), and boosting-based ensemble learning (TeCho et al, 2012). But POS filters, roles tagging, and machine learning algorithms do not work for Tibetan UWI.…”

Section: Related Work

mentioning

confidence: 99%

Tibetan Unknown Word Identification from News Corpora for Supporting Lexicon-based Tibetan Word Segmentation

Nuo

Liu

Long

et al. 2015

In Tibetan, words are written consecutively without delimiters, so finding unknown word boundaries is difficult. This paper presents a hybrid approach to Tibetan unknown word identification for offline corpus processing. First, Tibetan named entities are preprocessed based on natural annotation. Second, other Tibetan unknown words are extracted from word segmentation fragments using MTC, a combination of a statistical metric and a set of context-sensitive rules. Preliminary experimental results on the Tibetan News Corpus are also reported. The lexicon-based Tibetan word segmentation system SegT, extended with the proposed unknown word mechanism, improves the performance of Tibetan word segmentation: it increases the F-score by 4.15% on a randomly selected test set. Our unknown word identification scheme can find new words of any length and in any field.

Journal of Intelligent & Fuzzy Systems

“…Each domain category contains positive and negative documents. We use our Chinese lexical analysis tools [13], [14] to extract from "ChnSentiCorp" all the two-word Multiword Expressions that match the three patterns we predefined: a noun (subject) and an adjective (predicate), a verb and a noun, and an adverb and an adjective. Then all the Multiword Expressions are manually labeled with the corresponding semantic polarity labels (negative, neutral or positive).…”

Section: Experimental Settings

mentioning

confidence: 99%

Semantic polarity detection of Chinese multiword expression in microblogging based on discriminative latent model

Sun

2014

Self Cite

Extracting the semantic polarity of Chinese Multiword Expressions, especially newly generated ones from the internet (such as weibo or microblog), is an important task for sentiment analysis of web texts or other real-world text, as some Multiword Expressions can express more integrated sentiments than word units. This paper proposes a method built on a novel latent discriminative algorithm, which attacks this problem by integrating a discriminative model and a latent value model. Although Chinese Multiword Expressions consist of multiple words, the semantic polarity of a Multiword Expression is not a simple combination of the polarities of its component words: some words can invert the affective polarity, so the Multiword Expression can have the opposite semantic polarity, as in ironic texts. To capture this property, a hidden semi-CRF is adopted, which includes a latent variable layer and can address dual-sequence labeling tasks synchronously. The method is tested on a manually labeled set of positive and negative Multiword Expressions from microblogs and other internet resources, and the experiments show very promising results, comparable to the best values reported.