Broad coverage automatic morphological segmentation of German words

Pachunke, Thomas; Mertineit, Oliver; Wothke, Klaus; Schmidt, Rudolf

doi:10.3115/992424.992468

Cited by 5 publications

(5 citation statements)

References 2 publications

“…The latest version of annotated data when the article is prepared: Burmese: http://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/my-nova-170405.zip Khmer: http://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/km-nova-180803.zip 11. The modified short- and long-tags show similar tendency as those of the basic tags shown in Figure 7.…”

supporting

confidence: 60%

“…Compared with the non-Zipfian series of short-tag, it suggests that the bracketing annotation covers plenty of phenomena in a heavy-tailed distribution, rather than a simple classification of tokens. 11 Detailed distributions of tags and patterns are listed in Tables 2, 3, and 4. In Table 2 of short tags, it is obvious that n, v, and o tags nearly cover 90% of the tokens, which is in accordance with the…”

Section: Statistics and Examples On Annotated Data

mentioning

confidence: 99%

“…For most Indo-European languages, the tokenization process is relatively trivial because their writing systems already use spaces to separate words, which can be used directly as tokens. 1 Thus, the tokenization process for these languages mainly handles punctuation marks, compounds, abbreviations, and capitalizations [11,14]. For these languages, the POS tagging process is usually regarded as a separate task from tokenization.…”

Section: Introduction

mentioning

confidence: 99%

Nova

Ding, Utiyama, Sumita. 2018. ACM Trans. Asian Low-Resour. Lang. Inf. Process.

A feasible and flexible annotation system is designed for joint tokenization and part-of-speech (POS) tagging to annotate those languages without natural definitions of words. This design was motivated by the fact that word separators are not used in many highly analytic East and Southeast Asian languages. Although several of the languages are well-studied, e.g., Chinese and Japanese, many are understudied with low resources, e.g., Burmese (Myanmar) and Khmer. In the first part of the article, the proposed annotation system, named nova, is introduced. nova contains only four basic tags (n, v, a, and o); these tags can be further modified and combined to adapt to complex linguistic phenomena in tokenization and POS tagging. In the second part of the article, the feasibility and flexibility of nova are illustrated from the annotation practice on Burmese and Khmer. The relation between nova and two universal POS tagsets is discussed in the final part of the article.

“…Teahan et al (2000) state that interpreting a text as a sequence of words is beneficial for some information retrieval and storage tasks: for example, full-text searches, word-based compression, and key-phrase extraction. According to Guo (1997), words and tokens are the primary building blocks in almost all linguistic theories and language-processing systems, including Japanese (Kobayasi, Tokumaga, and Tanaka 1994), Korean (Yun, Lee, and Rim 1995), German (Pachunke et al 1992), and English (Garside, Leech, and Sampson 1987), in various media, such as continuous speech and cursive handwriting, and in numerous applications, such as translation, recognition, indexing, and proofreading. The identification of words in natural language is nontrivial since, as observed by Chao (1968), linguistic words often represent a different set than do sociological words.…”

Section: Introduction

mentioning

confidence: 99%

Accessor Variety Criteria for Chinese Word Extraction

Chen, Deng, et al. 2004. Computational Linguistics

We are interested in the problem of word extraction from Chinese text collections. We define a word to be a meaningful string composed of several Chinese characters. For example, ‘percent’, and, ‘more and more’, are not recognized as traditional Chinese words from the viewpoint of some people. However, in our work, they are words because they are very widely used and have specific meanings. We start with the viewpoint that a word is a distinguished linguistic entity that can be used in many different language environments. We consider the characters that are directly before a string (predecessors) and the characters that are directly after a string (successors) as important factors for determining the independence of the string. We call such characters accessors of the string, consider the number of distinct predecessors and successors of a string in a large corpus (TREC 5 and TREC 6 documents), and use them as the measurement of the context independency of a string from the rest of the sentences in the document. Our experiments confirm our hypothesis and show that this simple rule gives quite good results for Chinese word extraction and is comparable to, and for long words outperforms, other iterative methods.
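
The counting step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `max_len` limit, and the toy Latin-letter corpus are hypothetical, and combining the two counts by taking their minimum is an assumption that the abstract itself does not state.

```python
from collections import defaultdict


def accessor_counts(corpus, max_len=4):
    """Collect, for every substring of up to max_len characters, the set of
    distinct characters seen directly before it (predecessors) and directly
    after it (successors) across all sentences in the corpus."""
    preds = defaultdict(set)
    succs = defaultdict(set)
    for sentence in corpus:
        n = len(sentence)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                s = sentence[i:j]
                if i > 0:
                    preds[s].add(sentence[i - 1])
                if j < n:
                    succs[s].add(sentence[j])
    return preds, succs


def accessor_variety(s, preds, succs):
    """Score a candidate string by its number of distinct predecessors and
    successors. Using the minimum of the two is an assumption of this sketch."""
    return min(len(preds.get(s, ())), len(succs.get(s, ())))


# Toy corpus (hypothetical): "bc" appears in three different contexts.
corpus = ["abcd", "xbcy", "zbcd"]
preds, succs = accessor_counts(corpus)
print(accessor_variety("bc", preds, succs))
```

A string that occurs in many different contexts gets a high score and is a better word candidate; a string seen only at a sentence edge, or never seen at all, scores zero in this sketch.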

“…Currently, word tokenization and segmentation problems exist in almost all natural languages such as Chinese (Chen and Liu 1992; Bai, 1995), Japanese (Yosiyuki, Takenobu and Hozumi 1992), Korean (Yun, Lee and Rim 1995), German (Pachunke, Mertineit, Wothke and Schmidt 1992) and English (Garside, Leech and Sampson 1987), in diverse media forms such as continuous speech recognition and handwriting recognition, and in numerous applications such as translation, recognition, indexing and proof-reading. Depending on the resources applied, word tokenization and segmentation solutions can be broadly categorized as either orthography-oriented or dictionary-based.…”

Section: Introduction

mentioning

confidence: 99%

Splitting-merging model of Chinese word tokenization and segmentation

Yao, Lua. 1998. Nat. Lang. Eng.

Currently, word tokenization and segmentation are still a hot topic in natural language processing, especially for languages like Chinese in which there is no blank space for word delimitation. Three major problems are faced: (1) tokenizing direction and efficiency; (2) insufficient tokenization dictionary and new words; and (3) ambiguity of tokenization and segmentation. Most existing tokenization and segmentation methods have not dealt with the above problems together. To tackle the three problems in one basket, this paper presents a novel dictionary-based method called the Splitting-Merging Model (SMM) for Chinese word tokenization and segmentation. It uses the mutual information of Chinese characters to find the boundaries and the non-boundaries of Chinese words, and finally leads to a word segmentation by resolving ambiguities and detecting new words.
