Segmenting Chinese Unknown Words by Heuristic Method

Yang, Christopher C.; Li, Kar Wing

doi:10.1007/978-3-540-24594-0_52

Cited by 8 publications

(6 citation statements)

References 6 publications


“…In our previous work, we developed the boundary detection (Yang, Luk, Yung, & Yen, 2000) and the heuristic techniques to segment Chinese sentences based on mutual information and significant estimation (Chien, 1997). Our accuracy is over 90% (Yang & Li, 2003c).…”

Section: A Corpus‐based Approach: Automatic Crosslingual Concept Spac (mentioning)

confidence: 90%

Automatic crosslingual thesaurus generated from the Hong Kong SAR Police Department Web corpus for crime analysis

Yang

2005

J. Am. Soc. Inf. Sci.

Self Cite


based approach to align English/Chinese Hong Kong Police press release documents from the Web is first presented. We also introduce an algorithmic approach to generate a robust knowledge base based on statistical correlation analysis of the semantics (knowledge) embedded in the bilingual press release corpus. The research output consisted of a thesaurus-like, semantic network knowledge base, which can aid in semantics-based crosslingual information management and retrieval.


“…Thresholding and abrupt changes of the values of mutual information are utilized for the detection of segmentation points. The heuristic method utilizes five rules to segment Chinese text based on the mutual information of bi-grams and significance estimation of tri-grams [4].…”

Section: Boundary Detection and Heuristic Methods (mentioning)

confidence: 99%
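The boundary detection idea quoted above (cutting where bi-gram mutual information is low or falls abruptly) can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the pointwise mutual information estimate, the function names, and the `threshold` and `drop` parameters are all assumptions chosen for the example, and "abrupt change" is interpreted here as a fall relative to the preceding character pair.

```python
import math
from collections import Counter


def bigram_mutual_information(corpus):
    """Estimate pointwise mutual information for every adjacent character pair.

    `corpus` is an iterable of unsegmented strings (hypothetical toy input).
    """
    chars = Counter()
    pairs = Counter()
    for sentence in corpus:
        chars.update(sentence)
        pairs.update(zip(sentence, sentence[1:]))
    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())
    mi = {}
    for (a, b), f_ab in pairs.items():
        p_ab = f_ab / n_pairs
        p_a = chars[a] / n_chars
        p_b = chars[b] / n_chars
        mi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return mi


def segment(sentence, mi, threshold=1.0, drop=2.0):
    """Split `sentence` at low-MI points.

    A boundary is placed between two characters when their MI is below
    `threshold`, or when it falls by more than `drop` relative to the
    previous pair. Both values are illustrative assumptions.
    """
    if not sentence:
        return []
    words = []
    current = sentence[0]
    previous = None
    for a, b in zip(sentence, sentence[1:]):
        value = mi.get((a, b), float("-inf"))  # unseen pairs count as weakest
        low = value < threshold
        abrupt = previous is not None and previous - value > drop
        if low or abrupt:
            words.append(current)
            current = b
        else:
            current += b
        previous = value
    words.append(current)
    return words


toy_corpus = ["香港", "香港", "警察", "警察", "香港警察"]
toy_mi = bigram_mutual_information(toy_corpus)
print(segment("香港警察", toy_mi, threshold=2.0))
```

On a corpus this small the thresholds carry no meaning beyond the demonstration; in practice they would be tuned against a segmented reference sample.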

“…Two statistical based Chinese text segmentation techniques have been developed by Yang et al, namely, boundary detection [3] and heuristic method [4]. Due to the limitation of the lexical statistics collected from the Chinese corpus, errors may occur in segmentation.…”

Section: Introduction (mentioning)

confidence: 99%

Error analysis of Chinese text segmentation using statistical approach

Yang

2004

Proceedings of the 4th ACM/IEEE-CS Joint Conference on Digital Libraries

Self Cite


The Chinese text segmentation is important for the indexing of Chinese documents, which has significant impact on the performance of Chinese information retrieval. The statistical approach overcomes the limitations of the dictionary based approach. The statistical approach is developed by utilizing the statistical information about the association of adjacent characters in Chinese text collected from the Chinese corpus. Both known words and unknown words can be segmented by the statistical approach. However, errors may occur due to the limitation of the corpus. In this work, we have conducted the error analysis of two Chinese text segmentation techniques using statistical approach, namely, boundary detection and heuristic method. Such error analysis is useful for the future development of the automatic text segmentation of Chinese text or other text in oriental languages. It is also helpful to understand the impact of these errors on the information retrieval system in digital libraries.
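The heuristic method analysed in this abstract is described in the citation statements as combining bi-gram mutual information with a significance estimate for tri-grams. The statements do not give the formula, so the sketch below assumes one formulation commonly attributed to Chien (1997): the frequency of the tri-gram divided by the combined frequency of its two overlapping bi-grams, with the tri-gram itself counted once. The function names and the toy corpus are hypothetical.

```python
from collections import Counter


def ngram_counts(corpus, n):
    """Count every character n-gram in an iterable of unsegmented strings."""
    counts = Counter()
    for sentence in corpus:
        counts.update(sentence[i:i + n] for i in range(len(sentence) - n + 1))
    return counts


def significance(trigram, bigrams, trigrams):
    """Assumed significance estimate: f(c) / (f(a) + f(b) - f(c)).

    c is the tri-gram; a and b are its leading and trailing bi-grams.
    The value lies in [0, 1]; values near 1 mean the tri-gram almost
    always appears whenever either of its bi-grams appears.
    """
    f_c = trigrams[trigram]
    f_a = bigrams[trigram[:2]]
    f_b = bigrams[trigram[1:]]
    denominator = f_a + f_b - f_c
    return f_c / denominator if denominator > 0 else 0.0


toy_corpus = ["圖書館", "圖書館", "圖書"]
toy_bigrams = ngram_counts(toy_corpus, 2)
toy_trigrams = ngram_counts(toy_corpus, 3)
print(significance("圖書館", toy_bigrams, toy_trigrams))
```

A high score suggests keeping the three characters together as one word, while a low score suggests that one of the bi-grams is the better unit. How such a score feeds into the five rules is not stated in the text above and is not modelled here.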


“…Statistics‐based approaches or hybrid approaches are proposed to solve the problem of unknown words (Banko et al, 2002; Li et al, 2004; Sproat & Shih, 1990; Yang & Li, 2003; Yang & Li, 2005). Given a large corpus of Chinese texts, the statistics‐based approaches measure the statistical association of characters in the corpus.…”

Section: Traditional Segmentation (mentioning)

confidence: 99%

Mining Web data for Chinese segmentation

Wang, Yang

2007

J. Am. Soc. Inf. Sci.

Self Cite


Modern information retrieval systems use keywords within documents as indexing terms for search of relevant documents. As Chinese is an ideographic character-based language, the words in the texts are not delimited by white spaces. Indexing of Chinese documents is impossible without a proper segmentation algorithm. Many Chinese segmentation algorithms have been proposed in the past. Traditional segmentation algorithms cannot operate without a large dictionary or a large corpus of training data. Nowadays, the Web has become the largest corpus that is ideal for Chinese segmentation. Although most search engines have problems in segmenting texts into proper words, they maintain huge databases of documents and frequencies of character sequences in the documents. Their databases are important potential resources for segmentation. In this paper, we propose a segmentation algorithm by mining Web data with the help of search engines. On the other hand, the Romanized pinyin of Chinese language indicates boundaries of words in the text. Our algorithm is the first to utilize the Romanized pinyin to segmentation. It is the first unified segmentation algorithm for the Chinese language from different geographical areas, and it is also domain independent because of the nature of the Web. Experiments have been conducted on the datasets of a recent Chinese segmentation competition. The results show that our algorithm outperforms the traditional algorithms in terms of precision and recall. Moreover, our algorithm can effectively deal with the problems of segmentation ambiguity, new word (unknown word) detection, and stop words.

