2009
DOI: 10.1016/j.ipm.2009.06.001

Indexing and stemming approaches for the Czech language

Abstract: This paper describes and evaluates various stemming and indexing strategies for the Czech language. Based on a Czech test collection, we have designed and evaluated two stemming approaches, a light and a more aggressive one. We have compared them with a no-stemming scheme as well as a language-independent approach (n-gram). To evaluate the suggested solutions we used various IR models, including Okapi, Divergence from Randomness (DFR), a statistical language model (LM) as well as the classical tf …
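The language-independent n-gram approach mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the n-gram length or the tokenization used, so the function name `char_ngrams`, the default `n = 4`, and the whitespace tokenization below are assumptions for illustration only, not the authors' configuration.

```python
def char_ngrams(text: str, n: int = 4) -> list[str]:
    """Split each whitespace-separated word into overlapping character n-grams.

    Words of length n or less are kept whole. Both choices (n=4 and
    whitespace tokenization) are illustrative assumptions, not taken
    from the paper.
    """
    grams = []
    for word in text.lower().split():
        if len(word) <= n:
            grams.append(word)
        else:
            # Slide a window of width n across the word.
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


# Two inflected forms of the Czech noun "hrad" (castle).
print(char_ngrams("hradu"))   # ['hrad', 'radu']
print(char_ngrams("hradem"))  # ['hrad', 'rade', 'adem']
```

Both forms produce the n-gram `hrad`, so a query containing one form can match a document containing the other without any language-specific suffix rules, which is the appeal of this kind of indexing for a highly inflected language.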

Cited by 27 publications (7 citation statements)
References: 19 publications
“…Based on this data, we found that a more aggressive stemmer tended to result in better MAP while for some languages (e.g., Bengali) the performance difference between a light and an aggressive stemmer was not significant. Moreover, when compared to MAP found for certain European languages, these relative improvements after stemming for these three Indian languages were quite large (e.g., 4% for English, 4.1% for Dutch, 7% for Spanish, 9% for French, 15% for Italian, 19% for German, 29% for Swedish, 40% for Finnish [Tomlinson 2004], and 45% for Czech [Dolamic and Savoy 2010]). …”
Section: Stemming Evaluation | Citation type: mentioning
confidence: 77%
“…A noun's inflectional termination depends on its case, number, and gender, thus resulting in the complex morpho-syntaxical construction often found in other Indo-European languages, such as Czech [Dolamic and Savoy 2010].…”
Section: Key Features Of Marathi Morphology | Citation type: mentioning
confidence: 99%
“…In many NLP applications, a very popular preprocessing technique is stemming. We tested the Czech light stemmer (Dolamic & Savoy, 2009) and High Precision Stemmer. Another widely-used method for reducing the vocabulary size, and thus the feature space, is lemmatization.…”
Section: Preprocessing | Citation type: mentioning
confidence: 99%
“…For some languages (e.g., Chinese, Japanese [1]), word segmentation is not an easy task, while for others (e.g., German), the use of different compound constructions to express the same concept or idea may hurt the retrieval quality [2]. The presence of numerous inflectional suffixes (e.g., Hungarian [3], Finnish), even for names (e.g., Czech [4], Russian [5]) as well as numerous derivational suffixes must be taken into account for an effective retrieval.…”
Section: Introduction | Citation type: mentioning
confidence: 99%