Unsupervised learning of morphology for building lexicon for a highly inflectional language

Sharma, Utsav; Kalita, Jugal; Das, Rajib

doi:10.3115/1118647.1118648

Cited by 15 publications

(14 citation statements)

References 3 publications


“…Here the restriction on stem-length first proposed by Gaussier is upheld. Sharma's (2006) work deals with neutral suffix only and does not capture nonneutral suffixes. These studies are limited to suffix identification and do not generate paradigms.…”

Section: Related Work (mentioning)

confidence: 99%

“…One promising methodology for unsupervised segmentation which does not make any suffix frequency assumptions is p-similar technique for morpheme segmentation first proposed by Gaussier (1999). Researchers have used this method for suffix identification and not for segmentation (Gaussier, 1999; Sharma, 2006). We extended this less studied technique to segment words by introducing the concept of suffix association matrix, thus giving us an unsupervised method which correctly identifies suffixes irrespective of their frequency of occurrence in the corpus and also segments short stem words.…”

Section: Introduction (mentioning)

confidence: 99%

“…We extended this less studied technique to segment words by introducing the concept of suffix association matrix, thus giving us an unsupervised method which correctly identifies suffixes irrespective of their frequency of occurrence in the corpus and also segments short stem words. To the best of our knowledge, most reported work which uses p-similar technique for suffix identification (Gaussier, 1999; Sharma, 2006) enforce a restriction on stem-length that it should be at least five. This restriction works well for suffix identification but not for segmentation.…”

Section: Introduction (mentioning)

confidence: 99%


A Framework for Learning Morphology using Suffix Association Matrix

Desai

Pawar

Bhattacharyya

2014

Proceedings of the Fifth Workshop on South and Southeast Asian Natural Language Processing


Unsupervised learning of morphology is used for automatic affix identification, morphological segmentation of words and generating paradigms which give a list of all affixes that can be combined with a list of stems. Various unsupervised approaches are used to segment words into stem and suffix. Most unsupervised methods used to learn morphology assume that suffixes occur frequently in a corpus. We have observed that for morphologically rich Indian Languages like Konkani, 31 percent of suffixes are not frequent. In this paper we report our framework for Unsupervised Morphology Learner which works for less frequent suffixes. Less frequent suffixes can be identified using p-similar technique which has been used for suffix identification, but cannot be used for segmentation of short stem words. Using proposed Suffix Association Matrix, our Unsupervised Morphology Learner can also do segmentation of short stem words correctly. We tested our framework to learn derivational morphology for English and two Indian languages, namely Hindi and Konkani. Compared to other similar techniques used for segmentation, there was an improvement in the precision and recall.
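The stem-length restriction mentioned in the abstract and in the citation statements above can be illustrated with a small sketch. This is not the implementation of Gaussier (1999), Sharma et al., or Desai et al.; it assumes a simplified reading of the p-similar idea in which two words are paired when they share at least their first p characters, and each pair is split at its longest common prefix. The function name `p_similar_suffix_pairs` and the toy word list are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def p_similar_suffix_pairs(words, p=5):
    """Count suffix pairs among words that share at least their first p characters.

    Simplifying assumption (not taken from the cited papers): each pair of
    words is split at its longest common prefix, and the two remainders are
    treated as a candidate suffix pair.
    """
    # Group distinct words by their first p characters; shorter words are skipped.
    by_prefix = defaultdict(set)
    for w in set(words):
        if len(w) >= p:
            by_prefix[w[:p]].add(w)

    counts = defaultdict(int)
    for group in by_prefix.values():
        for a, b in combinations(sorted(group), 2):
            # Find the longest common prefix; it is at least p characters long.
            k = 0
            while k < min(len(a), len(b)) and a[k] == b[k]:
                k += 1
            pair = tuple(sorted((a[k:], b[k:])))
            counts[pair] += 1
    return dict(counts)


toy_words = ["walking", "walked", "walks", "talking", "talked", "jump"]
pairs_p4 = p_similar_suffix_pairs(toy_words, p=4)
pairs_p5 = p_similar_suffix_pairs(toy_words, p=5)
```

On this toy list, `p=4` pairs up the forms of "walk" and "talk", while `p=5` finds no pairs at all because the shared stems are only four characters long. That is the short-stem limitation the citing paper describes for a minimum stem length of five.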


“…Many publications (Ćavar et al, 2004; Brent et al, 1995; Déjean, 1998; Argamon et al, 2004; Goldsmith, 2001; Creutz and Lagus, 2005; Neuvel and Fulop, 2002; Baroni, 2003; Gaussier, 1999; Sharma et al, 2002; Wicentowski, 2002; Oliver, 2004), and various other works by the same authors, describe strategies that use frequencies, probabilities, and optimization criteria, often Minimum Description Length (MDL), in various combinations. So far, all these are unsatisfactory on two main accounts; on the theoretical side, they still owe an explanation of why compression or MDL should give birth to segmentations coinciding with morphemes as linguistically defined.…”

Section: Related Work (mentioning)

confidence: 99%

“…Secondly, segmentation algorithms may have different purposes and it might not make good sense to study segmentation in isolation from induction of paradigms. Lastly, and most importantly, all of the reviewed techniques (Wicentowski, 2004; Wicentowski, 2002; Baroni et al, 2002; Andreev, 1965; Ćavar et al, 2004; Snover and Brent, 2003; Snover and Brent, 2001; Schone and Jurafsky, 2001; Jacquemin, 1997; Goldsmith and Hu, 2004; Sharma et al, 2002; Clark, 2001; Kazakov and Manandhar, 1998; Déjean, 1998; Oliver, 2004; Creutz and Lagus, 2003; Creutz and Lagus, 2004; Hirsimäki et al, 2003; Creutz and Lagus, 2005; Argamon et al, 2004; Gaussier, 1999; Lehmann, 1973; Langer, 1991; Flenner, 1995; Klenk and Langer, 1989; Goldsmith, 2001; Goldsmith, 2000; Hu et al, 2005b; Hu et al, 2005a; Brent et al, 1995), as they are described, have threshold-parameters of some sort, explicitly claim not to work well for an open set of languages, or require noise-free all-form input (Albright, 2002; Manning, 1998; Borin, 1991). Therefore it is not possible to even design a fair test.…”

Section: Related Work (mentioning)

confidence: 99%

A naive theory of affixation and an algorithm for extraction

Technology

2006

Proceedings of the Eighth Meeting of the ACL Special Interest Group on Computational Phonology and Morphology - SIGPHON '06


We present a novel approach to the unsupervised detection of affixes, that is, to extract a set of salient prefixes and suffixes from an unlabeled corpus of a language. The underlying theory makes no assumptions on whether the language uses a lot of morphology or not, whether it is prefixing or suffixing, or whether affixes are long or short. It does however make the assumption that 1. salient affixes have to be frequent, i.e., occur much more often than random segments of the same length, and that 2. words essentially are variable length sequences of random characters, e.g., a character should not occur in far more words than random without a reason, such as being part of a very frequent affix. The affix extraction algorithm uses only information from fluctuation of frequencies, runs in linear time, and is free from thresholds and untransparent iterations. We demonstrate the usefulness of the approach with example case studies on typologically distant languages.
