A stochastic finite-state word-segmentation algorithm for Chinese

Sproat, Richard; Shih, Chilin; Gale, William A.; Chang, Nancy

doi:10.3115/981732.981742

Cited by 135 publications (164 citation statements)

References 23 publications (29 reference statements)

Mentioning: 161

“…Such ambiguity in the definition of what constitutes a word makes it difficult to evaluate segmentation algorithms that follow different conventions, as it is nearly impossible to construct a "gold standard" against which to directly compare results [7]. As shown in [23], the rate of agreement between two human judges on this task is less than 80%. The performance of word segmentation is usually measured using precision and recall, where recall is defined as the percent of words in the manually segmented text identified by the segmentation algorithm, and precision is defined as the percentage of words returned by the algorithm that also occurred in the hand-segmented text in the same position.…”

Section: Evaluation and Experimental Results (mentioning)

confidence: 99%

A Generalized Approach to Word Segmentation Using Maximum Length Descending Frequency and Entropy Rate

Islam, Inkpen, Kiringa (2007). Computational Linguistics and Intelligent Text Processing.

Abstract. In this paper, we formulate a generalized method of automatic word segmentation. The method uses corpus type frequency information to choose the type with maximum length and frequency from "desegmented" text. It also uses a modified forward-backward matching technique using maximum length frequency and entropy rate if any non-matching portions of the text exist. The method is also extendible to a dictionary-based or hybrid method with some additions to the algorithms. Evaluation results show that our method outperforms several competing methods.
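The precision and recall definitions quoted in the citation statement above can be illustrated with a small sketch. It assumes both segmentations cover the same underlying character string, and it treats a word as correct only when it occupies the same character span in both. The function names `spans` and `precision_recall` and the toy data are hypothetical and do not come from either paper.

```python
def spans(words):
    """Map a list of words to the set of (start, end) character offsets they occupy."""
    result, pos = set(), 0
    for word in words:
        result.add((pos, pos + len(word)))
        pos += len(word)
    return result


def precision_recall(predicted, gold):
    """Position-sensitive precision and recall of a predicted segmentation."""
    pred_spans, gold_spans = spans(predicted), spans(gold)
    correct = len(pred_spans & gold_spans)  # words found in the same position
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    return precision, recall


# Toy example: the same string "thecatsatdown" segmented two ways.
gold = ["the", "cat", "sat", "down"]
predicted = ["the", "catsat", "down"]
precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Here two of the three predicted words match the hand segmentation in position, and two of the four hand-segmented words are recovered, so precision is 2/3 and recall is 1/2.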

“…Sproat et al, 1996). As an example, consider the Chinese character sequence which forms a complete noun in the sentence…”

Section: Languages Without Word Separation (mentioning)

confidence: 99%

“…E.g., Sproat et al (1996) give a good overview of the problems text analysis for Chinese is confronted with.…”

Section: Language-dependent Syntactic Structure Analysis (mentioning)

confidence: 99%

Text analysis and language identification for polyglot text-to-speech synthesis

Romsdorfer, Pfister (2007). Speech Communication.

In multilingual countries, text-to-speech synthesis systems often have to deal with texts containing inclusions of multiple other languages in form of phrases, words, or even parts of words. In such multilingual cultural settings, listeners expect a high-quality text-to-speech synthesis system to read such texts in a way that the origin of the inclusions is heard, i.e., with correct language-specific pronunciation and prosody. The challenge for a text analysis component of a text-to-speech synthesis system is to derive from mixed-lingual sentences the correct polyglot phone sequence and all information necessary to generate natural sounding polyglot prosody. This article presents a new approach to analyze mixed-lingual sentences. This approach centers around a modular, mixed-lingual morphological and syntactic analyzer, which additionally provides accurate language identification on morpheme level and word and sentence boundary identification in mixed-lingual texts. This approach can also be applied to word identification in languages without a designated word boundary symbol like Chinese or Japanese. To date, this mixed-lingual text analysis supports any mixture of English, French, German, Italian, and Spanish. Because of its modular design it is easily extensible to additional languages.

“…A brief sampling of areas where various automata show up as the underlying formalism include natural language processing (speech recognition, morphological analysis), computational linguistics, robotics and control systems, computational biology (phylogeny, structural pattern recognition), data mining, time series and music (Koskenniemi, 1983; de la Higuera, 2005; Mohri, 1996; Mohri et al, 2002; Mohri, 1997; Mohri et al, 2010; Rambow et al, 2002; Sproat et al, 1996). Thus, developing efficient formal language learning techniques and understanding their limitations is of a broad and direct relevance in the digital realm.…”

Section: Introduction (mentioning)

confidence: 99%

On the Learnability of Shuffle Ideals

Angluin, Aspnes, Kontorovich (2012). Lecture Notes in Computer Science.

PAC learning of unrestricted regular languages is long known to be a difficult problem. The class of shuffle ideals is a very restricted subclass of regular languages, where the shuffle ideal generated by a string u is the collection of all strings containing u as a subsequence. This fundamental language family is of theoretical interest in its own right and provides the building blocks for other important language families. Despite its apparent simplicity, the class of shuffle ideals appears quite difficult to learn. In particular, just as for unrestricted regular languages, the class is not properly PAC learnable in polynomial time if RP ≠ NP, and PAC learning the class improperly in polynomial time would imply polynomial time algorithms for certain fundamental problems in cryptography. In the positive direction, we give an efficient algorithm for properly learning shuffle ideals in the statistical query (and therefore also PAC) model under the uniform distribution.
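The definition of a shuffle ideal given in this abstract amounts to a subsequence test, which the following minimal sketch illustrates. It only checks membership and says nothing about the learning algorithm in the paper; the function name `in_shuffle_ideal` is hypothetical.

```python
def in_shuffle_ideal(u: str, s: str) -> bool:
    """Return True if s belongs to the shuffle ideal generated by u,
    i.e. if u occurs in s as a (not necessarily contiguous) subsequence."""
    remaining = iter(s)
    # Each `ch in remaining` consumes the iterator up to the first match,
    # so the characters of u must be found in order.
    return all(ch in remaining for ch in u)


print(in_shuffle_ideal("abc", "xaybzc"))  # characters a, b, c appear in order
print(in_shuffle_ideal("abc", "acb"))     # b never appears before a later c
```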
