2018
DOI: 10.1007/978-3-319-73706-5_8
Developing a Stemmer for German Based on a Comparative Analysis of Publicly Available Stemmers

Abstract: Stemmers, which reduce words to their stems, are important components of many natural language processing systems. In this paper, we conduct a systematic evaluation of several stemmers for German using two gold standards we have created and will release to the community. We then present our own stemmer, which achieves state-of-the-art results, is easy to understand and extend, and will be made publicly available both for use by programmers and as a benchmark for further stemmer development.

Cited by 12 publications (9 citation statements)
References 6 publications
“…The distribution between Android and iOS in the reviews (77% Android) roughly matches that of the distribution of Android and iOS market shares in Germany (64% Android) 5 . Looking at all ratings, including those without a review, for Android, 36% of the ratings were 5-star ratings, and 35% were 1-star ratings.…”
Section: A App Reviews (supporting)
confidence: 62%
“…Then, we tokenized tweets using the regular expression, re, Python package [17]. When tokenizing, we changed tweets to lowercase and stemmed tweets with Porter stemmer for English [18] and with Cistem for German [19]. For English tweets, we employed a basic tokenizer to tokenize tweets without stemming for POS tagging.…”
Section: Pre-processing (mentioning)
confidence: 99%
“…Dedicated methods that tackle rich target-side morphology have also shown good results in phrase-based translation systems previously (Huck et al, 2017c). Future work on neural machine translation could for instance follow a two-step prediction paradigm (Conforti et al, 2018), or improve over our current version of linguistically informed word segmentation by means of a better linguistic analysis (Weissweiler and Fraser, 2017).…”
Section: Preprocessing (mentioning)
confidence: 99%