Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval 2011
DOI: 10.1145/2009916.2010031

A novel corpus-based stemming algorithm using co-occurrence statistics

Abstract: We present a stemming algorithm for text retrieval. The algorithm uses the statistics collected on the basis of certain corpus analysis based on the co-occurrence between two word variants. We use a very simple co-occurrence measure that reflects how often a pair of word variants occurs in a document as well as in the whole corpus. A graph is formed where the word variants are the nodes and two word variants form an edge if they co-occur. On the basis of the co-occurrence measure, a certain edge strength is de…
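The graph construction described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. Two things are assumptions made only for the example: words count as candidate variants when they share a prefix of `prefix_len` characters, and the co-occurrence weight is the sum over documents of the smaller of the two term frequencies. The function name `build_cooccurrence_graph` is hypothetical. The edge-strength step is left out because the abstract is truncated before it is defined.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence_graph(docs, prefix_len=3):
    """Return {(u, v): weight} for candidate variants that co-occur in a document.

    Assumptions (not taken from the paper):
    - two words are candidate variants if they share their first `prefix_len` letters;
    - the weight of a pair is the sum, over documents, of min(tf(u), tf(v)).
    """
    edges = defaultdict(int)
    for text in docs:
        # Term frequencies of lowercase alphabetic tokens in this document.
        tf = Counter(re.findall(r"[a-z]+", text.lower()))

        # Group this document's distinct words by their shared prefix.
        groups = defaultdict(list)
        for word in tf:
            if len(word) >= prefix_len:
                groups[word[:prefix_len]].append(word)

        # Every pair inside a group co-occurs in this document: add to its edge.
        for words in groups.values():
            for u, v in combinations(sorted(words), 2):
                edges[(u, v)] += min(tf[u], tf[v])
    return dict(edges)


if __name__ == "__main__":
    corpus = [
        "connect connected connect connection",
        "connected connection connection",
        "apple banana",
    ]
    for pair, weight in sorted(build_cooccurrence_graph(corpus).items()):
        print(pair, weight)
```

Nodes are implicit here: any word that appears in an edge key is a node. Pairs that never appear together in a document get no edge, which matches the abstract's condition that two variants form an edge only if they co-occur.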

Cited by 31 publications (5 citation statements)
References: 24 publications

“…The stemmer used for our experiments is able to increase mean average precision (MAP) by 21.82% (0.2250 to 0.2741) on the monolingual title-only Bengali queries of the FIRE-2010 test collection. Although there are more complex corpus-based approaches reported for Bengali stemming (Majumder et al, 2007; Paik et al, 2011), the focus of this paper is not on improving stemming, but rather to improve on cross-lingual retrieval performance from English to Bengali. We thus applied a simple rule based approach as a stemmer which does not require computationally intensive pre-processing over the vocabulary of the corpus.…”
Section: Methods (mentioning)
confidence: 99%
“…In the character n-gram based method, adjacent characters in a length of n from the words in a corpus are considered to have less frequency whereas the variants have higher frequencies (McNamee & Mayfield, 2004; Ahmed & Nürnberger, 2009; Pande, Tamta & Dhami, 2018). Also, various studies on corpus-based stemming using co-occurrence analysis and machine learning techniques are presented (Paik, Pal & Parui, 2011; Paik et al, 2013; Brychcín & Konopík, 2015). These methods analyze the co-occurrence or context of the basis form of the words in a corpus.…”
Section: Related Work (mentioning)
confidence: 99%
“…The proposed technique is implemented using Python3 employing several necessary packages, such as PorterStemmer [31,32,47], Sent tokenize, Word tokenize of Natural Language Tool Kit [4,62], Regular Expression [13,38], and so on. Note that all the words are stemmed initially before passing them to the processing phase employing porter-Stemmer.…”
Section: Implementation Details (mentioning)
confidence: 99%