Jiaul H. Paik scite author profile

We introduce a generalised multivariate Pólya process for document language modelling. The framework outlined here generalises a number of statistical language models used in information retrieval for modelling document generation. In particular, we show that the choice of replacement matrix M ultimately defines the type of random process and therefore defines a particular type of document language model. We show that a particular variant of the general model is useful for modelling termspecific burstiness. Furthermore, via experimentation we show that this variant significantly improves retrieval effectiveness on a number of small test collections.1 such that the mass of the urn never decreases 2 a vector that is a 1 in dimension t i and 0 elsewhere

show abstract

Context-sensitive gender inference of named entities in text

Das

Paik

2021

Information Processing & Management

View full text Add to dashboard Cite

A Fixed-Point Method for Weighting Terms in Verbose Informational Queries

Paik

Oard

2014

View full text Add to dashboard Cite

The term weighting and document ranking functions used with informational queries are typically optimized for cases in which queries are short and documents are long. It is reasonable to assume that the presence of a term in a short query reflects some aspect of the topic that is important to the user, and thus rewarding documents that contain the greatest number of distinct query terms is a useful heuristic. Verbose informational queries, such as those that result from cut-and-paste of example text, or that might result from informal spoken interaction, pose a different challenge in which many extraneous (and thus potentially misleading) terms may be present in the query. Modest improvements have been reported from applying supervised methods to learn which terms in a verbose query deserve the greatest emphasis. This paper proposes a novel unsupervised method for weighting terms in verbose informational queries that relies instead on iteratively estimating which terms are most central to the query. The key idea is to use an initial set of retrieval results to define a recursion on the term weight vector that converges to a fixed point representing the vector that optimally describes the initial result set. Experiments with several TREC news and Web test collections indicate that the proposed method often statistically significantly outperforms state of the art supervised methods.

show abstract

A novel corpus-based stemming algorithm using co-occurrence statistics

Paik

Pal

Parui

2011

View full text Add to dashboard Cite

We present a stemming algorithm for text retrieval. The algorithm uses the statistics collected on the basis of certain corpus analysis based on the co-occurrence between two word variants. We use a very simple co-occurrence measure that reflects how often a pair of word variants occurs in a document as well as in the whole corpus. A graph is formed where the word variants are the nodes and two word variants form an edge if they co-occur. On the basis of the co-occurrence measure, a certain edge strength is defined for each of the edges. Finally, on the basis of the edge strengths, we propose a partition algorithm that groups the word variants based on their strongest neighbours, that is, the neighbours with largest strengths.Our stemming algorithm has two static parameters and does not use any other information except the co-occurrence statistics from the corpus. The experiments on TREC, CLEF and FIRE data consisting of four European and two Asian languages show a significant improvement over no-stem strategy on all the languages. Also, the proposed algorithm significantly outperforms a number of strong stemmers including the rule-based ones on a number of languages. For highly inflectional languages, a relative improvement of about 50% is obtained compared to un-normalized words and a relative improvement ranging from 5% to 16% is obtained compared to the rule based stemmer for the concerned language.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Jiaul H. Paik

A novel TF-IDF weighting scheme for effective ranking

A Pólya Urn Document Language Model for Improved Information Retrieval

Context-sensitive gender inference of named entities in text

A Fixed-Point Method for Weighting Terms in Verbose Informational Queries

A novel corpus-based stemming algorithm using co-occurrence statistics

Contact Info

Product

Resources

About