Mathias Creutz scite author profile

Unsupervised discovery of morphemes

We present two methods for unsupervised segmentation of words into morpheme-like units. The model utilized is especially suited for languages with a rich morphology, such as Finnish. The first method is based on the Minimum Description Length (MDL) principle and works online. In the second method, Maximum Likelihood (ML) optimization is used. The quality of the segmentations is measured using an evaluation method that compares the segmentations produced to an existing morphological analysis. Experiments on both Finnish and English corpora show that the presented methods perform well compared to a current state-of-the-art system.
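To make the MDL idea concrete, the following is a minimal sketch of a two-part description length: the cost of spelling out the morph lexicon plus the cost of encoding the corpus as a sequence of morph tokens. It is not the authors' exact model. The function name `description_length`, the flat `bits_per_letter` cost, and the toy Finnish word forms are assumptions made for illustration, and the sketch only scores given segmentations rather than searching for one online.

```python
import math
from collections import Counter


def description_length(segmented_corpus, bits_per_letter=5.0):
    """Two-part code length in bits: lexicon cost + corpus cost.

    segmented_corpus: list of words, each given as a list of morph strings.
    bits_per_letter: assumed flat cost of spelling out one letter.
    """
    counts = Counter(m for word in segmented_corpus for m in word)
    total = sum(counts.values())
    # Lexicon: spell every distinct morph once, plus one end-of-morph marker.
    lexicon_bits = sum((len(m) + 1) * bits_per_letter for m in counts)
    # Corpus: each morph token costs -log2 of its maximum-likelihood probability.
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_bits + corpus_bits


# Toy data (hypothetical): four Finnish word forms, unsegmented vs. segmented.
whole = [["talossa"], ["talossani"], ["autossa"], ["autossani"]]
split = [["talo", "ssa"], ["talo", "ssa", "ni"],
         ["auto", "ssa"], ["auto", "ssa", "ni"]]

print(description_length(whole) > description_length(split))
```

Under these assumptions the segmented corpus is cheaper to describe, because the shared morphs are stored once in the lexicon instead of being spelled out inside every word form.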

Unsupervised models for morpheme segmentation and morphology learning

Creutz, Lagus. 2007. TSLP.

Unlimited vocabulary speech recognition with morph language models applied to Finnish

Hirsimäki, Creutz, Siivola et al. 2006. Computer Speech & Language.

Morph-based speech recognition and modeling of out-of-vocabulary words across languages

Creutz, Hirsimäki, Kurimo et al. 2007. ACM Trans. Speech Lang. Process.

We explore the use of morph-based language models in large-vocabulary continuous-speech recognition systems across four so-called morphologically rich languages: Finnish, Estonian, Turkish, and Egyptian Colloquial Arabic. The morphs are subword units discovered in an unsupervised, data-driven way using the Morfessor algorithm. By estimating n-gram language models over sequences of morphs instead of words, the quality of the language model is improved through better vocabulary coverage and reduced data sparsity. Standard word models suffer from high out-of-vocabulary (OOV) rates, whereas the morph models can recognize previously unseen word forms by concatenating morphs. It is shown that the morph models do perform fairly well on OOVs without compromising the recognition accuracy on in-vocabulary words. The Arabic experiment constitutes the only exception since here the standard word model outperforms the morph model. Differences in the datasets and the amount of data are discussed as a plausible explanation.
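The coverage argument can be illustrated with a small sketch that compares the OOV rate of a word vocabulary with that of a morph vocabulary whose entries may be concatenated. This is not the recognizer or the Morfessor algorithm described in the paper. The helper names `can_compose` and `oov_rate` and the toy vocabularies are hypothetical, and the sketch only checks whether a word can be built from known units, without any n-gram probabilities.

```python
def can_compose(word, morphs):
    """Return True if `word` is a concatenation of one or more entries in `morphs`."""
    # reachable[i] is True when word[:i] can be built from known morphs.
    reachable = [True] + [False] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            if reachable[start] and word[start:end] in morphs:
                reachable[end] = True
                break
    return bool(word) and reachable[-1]


def oov_rate(test_words, vocabulary, compose=False):
    """Fraction of test words not covered by the vocabulary.

    With compose=True, a word counts as covered if it can be assembled
    by concatenating vocabulary entries.
    """
    misses = 0
    for w in test_words:
        known = can_compose(w, vocabulary) if compose else w in vocabulary
        misses += not known
    return misses / len(test_words)


# Toy vocabularies (hypothetical), as if collected from the same training text.
word_vocab = {"talossa", "autossa", "talossani"}
morph_vocab = {"talo", "auto", "ssa", "ni"}
test_words = ["talossa", "autossani", "autossa", "kissa"]

print(oov_rate(test_words, word_vocab))                 # 0.5
print(oov_rate(test_words, morph_vocab, compose=True))  # 0.25
```

In this toy setting the unseen form "autossani" is out of vocabulary for the word model but can be assembled from three known morphs, which is the effect the abstract attributes to morph models.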

Unsupervised segmentation of words using prior distributions of morph length and frequency

Creutz. 2003.

We present a language-independent and unsupervised algorithm for the segmentation of words into morphs. The algorithm is based on a new generative probabilistic model, which makes use of relevant prior information on the length and frequency distributions of morphs in a language. Our algorithm is shown to outperform two competing algorithms, when evaluated on data from a language with agglutinative morphology (Finnish), and to perform well also on English data.
