1998
DOI: 10.1017/s1351324998002058
Splitting-merging model of Chinese word tokenization and segmentation

Abstract: Currently, word tokenization and segmentation are still a hot topic in natural language processing, especially for languages like Chinese in which there is no blank space for word delimitation. Three major problems are faced: (1) tokenizing direction and efficiency; (2) insufficient tokenization dictionary and new words; and (3) ambiguity of tokenization and segmentation. Most existing tokenization and segmentation methods have not dealt with the above problems together. To tackle the three problems in on…
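The direction and ambiguity problems named in the abstract can be illustrated with a plain dictionary-based maximum-matching segmenter. This is only a minimal sketch of that baseline approach, not the paper's splitting-merging model; the toy dictionary, function names, and example sentence below are hypothetical.

# Minimal sketch (assumed baseline, not the splitting-merging model):
# greedy dictionary-based maximum matching, showing how the tokenizing
# direction alone can change the segmentation of an ambiguous string.

TOY_DICT = {"研究", "研究生", "生命", "命", "起源"}   # hypothetical toy dictionary
MAX_LEN = max(len(w) for w in TOY_DICT)               # longest dictionary entry

def forward_max_match(text):
    """Greedy left-to-right longest-match segmentation."""
    tokens, i = [], 0
    while i < len(text):
        # try the longest candidate first; fall back to a single character
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            if text[i:j] in TOY_DICT or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

def backward_max_match(text):
    """Greedy right-to-left longest-match segmentation."""
    tokens, j = [], len(text)
    while j > 0:
        # smaller i means a longer candidate, so the longest word is tried first
        for i in range(max(0, j - MAX_LEN), j):
            if text[i:j] in TOY_DICT or i == j - 1:
                tokens.append(text[i:j])
                j = i
                break
    return list(reversed(tokens))

sentence = "研究生命起源"                      # "study the origin of life"
print(forward_max_match(sentence))            # ['研究生', '命', '起源']
print(backward_max_match(sentence))           # ['研究', '生命', '起源']

On the toy sentence the two directions disagree, which is exactly the kind of tokenization ambiguity the abstract lists as problem (3).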

Cited by 4 publications (2 citation statements)
References 5 publications
“…Unlike Latin-based or Germanic-based languages, the ones without blank space will lead to problems on word delimitation [21]. Major problems faced during tokenization and segmentation include tokenizing direction, efficiency, and ambiguity [22] [23]. CKIP word segmentation system was developed to resolve these problems.…”
Section: Error Analyses
confidence: 99%
“…A prerequisite to information extraction that is peculiar to Chinese language texts is a fundamental pre-processing task, namely word segmentation, since Chinese natural language texts do not encode word boundaries. Approaches to segmentation have been both symbolic (rule-based), for example, Yeh and Lee (1991), and statistical, for example, Chen and Liu (1992), Yao and Lua (1998) and Peng (2001). Apart from this, a major focus of Chinese IE has been the recognition and classification of named entities, a task motivated by the significantly high distribution of proper nouns in newspaper texts.…”
Section: Chinese Term Formation
confidence: 99%