Corpus-based stemming using cooccurrence of word variants

Xu, Jinxi; Croft, W. Bruce

doi:10.1145/267954.267957

Cited by 229 publications

(134 citation statements)

References 4 publications

Supporting

Mentioning

125

Contrasting

Unclassified


“…Xu and Croft introduce the use of co-occurrence data to improve stemming algorithms (Xu and Croft, 1998). The premise of the system described in this paper is to use contextual (e.g., co-occurrence) information to improve the equivalence classes produced by an aggressive stemmer, such as the Porter stemmer.…”

Section: Discussion (mentioning)

confidence: 99%

A framework for understanding Latent Semantic Indexing (LSI) performance

Kontostathis

Pottenger

2006

Information Processing & Management

147
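The premise in the statement quoted above, using co-occurrence to refine the equivalence classes of an aggressive stemmer, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from Xu and Croft's paper: the function name `split_class`, the score n_ab / (n_a + n_b) computed over whole documents, the threshold value, and the toy documents. Words in one class stay together only if they are linked by pairs whose score clears the threshold.

```python
from collections import Counter
from itertools import combinations


def split_class(words, docs, threshold=0.2):
    """Split one stemmer equivalence class by document-level co-occurrence.

    Hypothetical helper: `docs` is a list of sets of tokens, and the score
    below is a simplified stand-in, not the metric from the original paper.
    """
    words = sorted(set(words))
    df = Counter()  # n_a: number of documents containing word a
    co = Counter()  # n_ab: number of documents containing both a and b
    for doc in docs:
        present = [w for w in words if w in doc]
        df.update(present)
        co.update(combinations(present, 2))

    # Union-find: link every pair whose score reaches the threshold.
    parent = {w: w for w in words}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for (a, b), n_ab in co.items():
        score = n_ab / (df[a] + df[b])  # at most 0.5 for a pair that always co-occurs
        if score >= threshold:
            parent[find(a)] = find(b)

    # Collect the connected components as the refined classes.
    groups = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)
    return sorted(groups.values())


# Toy data (assumed): one over-merged class and five tiny "documents".
docs = [
    {"general", "generally", "rule"},
    {"general", "generally"},
    {"generous", "generously", "gift"},
    {"generous", "generously"},
    {"general", "generous"},
]
merged = ["general", "generally", "generous", "generously"]
print(split_class(merged, docs, threshold=0.2))
# [['general', 'generally'], ['generous', 'generously']]
```

Because the groups are connected components, a single strong pair is enough to merge two groups, so lowering the threshold (for example to 0.1 on this toy data) collapses everything back into one class.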


“…were not separated out. This was referred to as the stringing effect in [13]. The proposed method split all of them into separate classes.…”

Section: Results (mentioning)

confidence: 99%

Distribution Based Stemmer Refinement

Narayan

Pal

2005

Lecture Notes in Computer Science


Abstract. Stemming is a common preprocessing task applied to text corpora. Errors in this process may be refined either manually or based on a corpus. We describe a novel corpus-based stemming technique which models the given words as being generated from a multinomial distribution over the topics available in the corpus. A sequential hypothesis testing like procedure helps us group together distributionally similar words. This stemmer refines any given stemmer and its strength can be controlled with the help of two thresholds. A refinement based on the 20 Newsgroups data set shows that the proposed method splits equivalence classes appropriately.


“…While stemming schemes are normally designed to work with general texts, some may also be especially designed for a specific domain (e.g., in medicine) or a given document collection, such as that developed by Xu and Croft (1998), which used a corpus-based approach. This more closely reflects language usage (including word frequencies and other co-occurrence statistics), instead of a set of morphological rules in which the frequency of each rule (and therefore its underlying importance) is not precisely known.…”

Section: Related Work (mentioning)

confidence: 99%

Indexing and stemming approaches for the Czech language

Dolamic

Savoy

2009

Information Processing & Management


Abstract. This paper describes and evaluates various stemming and indexing strategies for the Czech language. Based on Czech test-collection, we have designed and evaluated two stemming approaches, a light and a more aggressive one. We have compared them with a no stemming scheme as well as a language-independent approach (n-gram). To evaluate the suggested solutions we used various IR models, including Okapi, Divergence from Randomness (DFR), a statistical language model (LM) as well as the classical tf idf vector-space approach. We found that the Divergence from Randomness paradigm tend to propose better retrieval effectiveness than the Okapi, LM or tf idf models, the performance differences were however statistically significant only with the last two IR approaches. Ignoring the stemming reduces generally the MAP by more than 40%, and these differences are always significant. Finally, if our more aggressive stemmer tends to show the best performance, the differences in performance with a light stemmer are not statistically significant.
