A novel corpus-based stemming algorithm using co-occurrence statistics

Paik, Jiaul H.; Pal, Dipasree; Parui, Susanta Kumar

doi:10.1145/2009916.2010031

Cited by 31 publications

(5 citation statements)

References 24 publications

“…The stemmer used for our experiments is able to increase mean average precision (MAP) by 21.82% (0.2250 to 0.2741) on the monolingual title-only Bengali queries of the FIRE-2010 test collection. Although there are more complex corpus-based approaches reported for Bengali stemming (Majumder et al, 2007; Paik et al, 2011), the focus of this paper is not on improving stemming, but rather to improve on cross-lingual retrieval performance from English to Bengali. We thus applied a simple rule based approach as a stemmer which does not require computationally intensive pre-processing over the vocabulary of the corpus.…”

Section: Methods (mentioning)

confidence: 99%
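
The relative gain quoted in the statement above can be checked directly from the two MAP values it reports. The following is a minimal sketch; the function name `relative_improvement` is a hypothetical helper introduced here for illustration and does not come from the cited paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative change from baseline to improved, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline * 100.0


# MAP values quoted in the citation statement (FIRE-2010, title-only Bengali queries).
baseline_map = 0.2250
stemmed_map = 0.2741

gain = relative_improvement(baseline_map, stemmed_map)
print(f"{gain:.2f}%")  # prints 21.82%
```

The result agrees with the 21.82% figure in the statement, which is a relative rather than an absolute change in MAP.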

Topical Relevance Model

Ganguly, Leveling, Jones (2012), Information Retrieval Technology

Cross-lingual relevance modelling (CLRLM) is a state-of-the-art technique for cross-lingual information retrieval (CLIR) which integrates query term disambiguation and expansion in a unified framework, to directly estimate a model of relevant documents in the target language starting with a query in the source language. However, CLRLM involves integrating a translation model either on the document side if a parallel corpus is available, or on the query side if a bilingual dictionary is available. For low resourced language pairs, large parallel corpora do not exist and the vocabulary coverage of dictionaries is small, as a result of which RLM-based CLIR fails to obtain satisfactory results. Despite the lack of parallel resources for a majority of language pairs, the availability of comparable corpora for many languages has grown considerably in the recent years. Existing CLIR techniques such as cross-lingual relevance models, cannot effectively utilize these comparable corpora, since they do not use information from documents in the source language. We overcome this limitation by using information from retrieved documents in the source language to improve the retrieval quality of the target language documents. More precisely speaking, our model involves a two step approach of first retrieving documents both in the source language and the target language (using query translation), and then improving on the retrieval quality of target language documents by expanding the query with translations of words extracted from the top ranked documents retrieved in the source language which are thematically related (i.e. share the same concept) to the words in the top ranked target language documents. Our key hypothesis is that the query in the source language and its equivalent target language translation retrieve documents which share topics. 
The overlapping topics of these top ranked documents in both languages are then used to improve the ranking of the target language documents. Since the model relies on the alignment of topics between language pairs, we call it the cross-lingual topical relevance model (CLTRLM). Experimental results show that the CLTRLM significantly outperforms the standard CLRLM by up to 37% on English-Bengali CLIR, achieving mean average precision (MAP) of up to 60.27% of the Bengali monolingual IR MAP.

“…In the character n-gram based method, adjacent characters in a length of n from the words in a corpus are considered to have less frequency whereas the variants have higher frequencies (McNamee & Mayfield, 2004; Ahmed & Nürnberger, 2009; Pande, Tamta & Dhami, 2018). Also, various studies on corpus-based stemming using co-occurrence analysis and machine learning techniques are presented (Paik, Pal & Parui, 2011; Paik et al, 2013; Brychcín & Konopík, 2015). These methods analyze the co-occurrence or context of the basis form of the words in a corpus.…”

Section: Related Work (mentioning)

confidence: 99%
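
As a rough illustration of the character n-gram idea mentioned in the statement above, the sketch below splits words into overlapping n-grams and lists those that two morphological variants share. The function names and the choice of n = 4 are assumptions made for this example; it is not the method of any of the cited papers.

```python
from __future__ import annotations


def char_ngrams(word: str, n: int) -> list[str]:
    """Return the overlapping character n-grams of word, left to right."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def shared_ngrams(a: str, b: str, n: int) -> set[str]:
    """Return the n-grams that occur in both words."""
    return set(char_ngrams(a, n)) & set(char_ngrams(b, n))


print(char_ngrams("stemming", 4))
# ['stem', 'temm', 'emmi', 'mmin', 'ming']
print(sorted(shared_ngrams("stemming", "stemmer", 4)))
# ['stem', 'temm']
```

Variants of the same word tend to share their leading n-grams, which is why n-gram indexing can stand in for stemming without language-specific rules.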

A selective approach to stemming for minimizing the risk of failure in information retrieval systems

Göksel, Arslan, Dinçer (2023), PeerJ Computer Science

Stemming is supposed to improve the average performance of an information retrieval system, but in practice, past experimental results show that this is not always the case. In this article, we propose a selective approach to stemming that decides whether stemming should be applied or not on a query basis. Our method aims at minimizing the risk of failure caused by stemming in retrieving semantically-related documents. The proposed work mainly contributes to the IR literature by proposing an application of selective stemming and a set of new features that are derived from the term frequency distributions of the systems in selection. The method based on this approach combines some of the query performance predictors, the derived features, and a machine learning technique. It is comprehensively evaluated using three rule-based stemmers and eight query sets corresponding to four document collections from the standard TREC and NTCIR datasets. The document collections, except for one, include Web documents ranging from 25 million to 733 million. The results of the experiments show that the method is capable of making accurate selections that increase the robustness of the system and minimize the risk of failure (i.e., per query performance losses) across queries. The results also show that the method attains a systematically higher average retrieval performance than the single systems for most query sets.

“…The proposed technique is implemented using Python3 employing several necessary packages, such as PorterStemmer [31,32,47], Sent tokenize, Word tokenize of Natural Language Tool Kit [4,62], Regular Expression [13,38], and so on. Note that all the words are stemmed initially before passing them to the processing phase employing porter-Stemmer.…”

Section: Implementation Details (mentioning)

confidence: 99%

TeKET: a Tree-Based Unsupervised Keyphrase Extraction Technique

et al. 2020

Automatic keyphrase extraction techniques aim to extract quality keyphrases for higher level summarization of a document. The majority of the existing techniques are mainly domain-specific: they require application domain knowledge, employ higher order statistical methods, are computationally expensive, and require large training data, which is rare for many applications. Overcoming these issues, this paper proposes a new unsupervised keyphrase extraction technique. The proposed unsupervised keyphrase extraction technique, named TeKET or Tree-based Keyphrase Extraction Technique, is a domain-independent technique that employs limited statistical knowledge and requires no training data. This technique also introduces a new variant of a binary tree, called KeyPhrase Extraction (KePhEx) tree, to extract final keyphrases from candidate keyphrases. In addition, a measure, called Cohesiveness Index or CI, is derived which denotes a given node's degree of cohesiveness with respect to the root. The CI is used in flexibly extracting final keyphrases from the KePhEx tree and is co-utilized in the ranking process. The effectiveness of the proposed technique and its domain and language independence are experimentally evaluated using available benchmark corpora, namely SemEval-2010 (a scientific articles dataset), Theses100 (a thesis dataset), and a German Research Article dataset, respectively. The acquired results are compared with other relevant unsupervised techniques belonging to both statistical and graph-based techniques. The obtained results demonstrate the improved performance of the proposed technique over other compared techniques in terms of precision, recall, and F1 scores.
