Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2016
DOI: 10.18653/v1/p16-1108

Leveraging Inflection Tables for Stemming and Lemmatization.

Abstract: We present several methods for stemming and lemmatization based on discriminative string transduction. We exploit the paradigmatic regularity of semi-structured inflection tables to identify stems in an unsupervised manner with over 85% accuracy. Experiments on English, Dutch and German show that our stemmers substantially outperform Snowball and Morfessor, and approach the accuracy of a supervised model. Furthermore, the generated stems are more consistent than those annotated by experts. Our direct lemmatiza…

Cited by 18 publications (13 citation statements)
References 12 publications
“…Again, the later methods can be further classified depending on whether context of the current word is considered or not. Lemmatization without context (Cotterell et al, 2016;Nicolai and Kondrak, 2016) is closer to stemming and not the focus of the present work. It is noteworthy here that the supervised lemmatization methods do not try to classify the lemma of a given word form as it is infeasible due to having a large number of lemmas in a language.…”
Section: Related Work (mentioning)
confidence: 99%
“…Efforts on developing lemmatizers can be divided into two principle categories (i) rule/heuristics based approaches (Koskenniemi, 1984;Plisson et al, 2004) which are usually not portable to different languages and (ii) learning based methods (Chrupala et al, 2008;Toutanova and Cherry, 2009;Gesmundo and Samardzic, 2012;Müller et al, 2015;Nicolai and Kondrak, 2016) requiring prior training dataset to learn the morphological patterns. Again, the later methods can be further classified depending on whether context of the current word is considered or not.…”
Section: Related Work (mentioning)
confidence: 99%
“…In DirecTL+ (Jiampojamarn et al, 2010), the feature set was augmented with joint n-grams defined on both source and target substrings. The system was applied to related tasks such as transliteration (Jiampojamarn et al, 2009), morphological inflection (Nicolai et al, 2015), stemming (Nicolai and Kondrak, 2016), and cognate projection (Hauer et al, 2019), proving to be particularly competitive in low-resource settings. DTLM (Nicolai et al, 2018), our principal tool in this work, is a successor of DirecTL+, which incorporates target-side language models and a high-precision alignment.…”
Section: Prior Work (mentioning)
confidence: 99%
“…Previous work on lemmatization has investigated both neural (Bergmanis and Goldwater, 2019) and non-neural (Chrupała, 2008;Müller et al, 2015;Nicolai and Kondrak, 2016;Cotterell et al, 2017) methods. We compare our approach against recent competing methods that report results on UD datasets.…”
Section: Baselines (And Related Work) (mentioning)
confidence: 99%