1998
DOI: 10.1017/s1351324998002058
Splitting-merging model of Chinese word tokenization and segmentation

Abstract: Currently, word tokenization and segmentation are still a hot topic in natural language processing, especially for languages like Chinese in which there is no blank space for word delimitation. Three major problems are faced: (1) tokenizing direction and efficiency; (2) insufficient tokenization dictionary and new words; and (3) ambiguity of tokenization and segmentation. Most existing tokenization and segmentation methods have not dealt with the above problems together. To tackle the three problems in on…
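The direction and ambiguity problems named in the abstract can be illustrated with a plain dictionary-based maximum-matching segmenter. This is only a minimal sketch of that baseline approach, not the paper's splitting-merging model; the toy dictionary, function names, and example sentence below are hypothetical.

# Minimal sketch (assumed baseline, not the splitting-merging model):
# greedy dictionary-based maximum matching, showing how the tokenizing
# direction alone can change the segmentation of an ambiguous string.

TOY_DICT = {"研究", "研究生", "生命", "命", "起源"}   # hypothetical toy dictionary
MAX_LEN = max(len(w) for w in TOY_DICT)               # longest dictionary entry

def forward_max_match(text):
    """Greedy left-to-right longest-match segmentation."""
    tokens, i = [], 0
    while i < len(text):
        # try the longest candidate first; fall back to a single character
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            if text[i:j] in TOY_DICT or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

def backward_max_match(text):
    """Greedy right-to-left longest-match segmentation."""
    tokens, j = [], len(text)
    while j > 0:
        # smaller i means a longer candidate, so the longest word is tried first
        for i in range(max(0, j - MAX_LEN), j):
            if text[i:j] in TOY_DICT or i == j - 1:
                tokens.append(text[i:j])
                j = i
                break
    return list(reversed(tokens))

sentence = "研究生命起源"                      # "study the origin of life"
print(forward_max_match(sentence))            # ['研究生', '命', '起源']
print(backward_max_match(sentence))           # ['研究', '生命', '起源']

On the toy sentence the two directions disagree, which is exactly the kind of tokenization ambiguity the abstract lists as problem (3).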

Cited by 4 publications (2 citation statements)
References 5 publications
“…Unlike Latin-based or Germanic-based languages, the ones without blank space will lead to problems on word delimitation [21]. Major problems faced during tokenization and segmentation include tokenizing direction, efficiency, and ambiguity [22] [23]. CKIP word segmentation system was developed to resolve these problems.…”
Section: Error Analyses
confidence: 99%
“…A prerequisite to information extraction that is peculiar to Chinese language texts is a fundamental pre-processing task, namely word segmentation, since Chinese natural language texts do not encode word boundaries. Approaches to segmentation have been both symbolic (rule-based), for example, Yeh and Lee (1991), and statistical, for example, Chen and Liu (1992), Yao and Lua (1998) and Peng (2001). Apart from this, a major focus of Chinese IE has been the recognition and classification of named entities, a task motivated by the significantly high distribution of proper nouns in newspaper texts.…”
Section: Chinese Term Formation
confidence: 99%