Query-Sensitive Similarity Measures for Information Retrieval

Tombros, Anastasios; Rijsbergen, C. J. van

doi:10.1007/s10115-003-0115-8

Cited by 37 publications

(22 citation statements)

References 26 publications

(46 reference statements)

“…Outlier-based Re-ranking Method According to the clustering hypothesis [Rijsbergen 1979] [Tombros and van 2004], the topically-relevant documents tend to cluster together, while the irrelevant ones would be scattered. By considering the scattered irrelevant documents as outlier documents, we then propose to use outlier detection methods to automatically detect the irrelevant documents.…”

Section: Three Re-ranking Methods (mentioning)

confidence: 99%

A Distribution Separation Method Using Irrelevance Feedback Data for Information Retrieval

Zhang, Qian, Hou, et al. 2017

ACM Trans. Intell. Syst. Technol.

In many research and application areas, such as information retrieval and machine learning, we often encounter dealing with a probability distribution which is mixed by one distribution that is relevant to our task in hand and the other that is irrelevant and we want to get rid of. Thus, it is an essential problem to separate the irrelevant distribution from the mixture distribution. This paper is focused on the application in Information Retrieval, where relevance feedback is a widely used technique to build a refined query model based on a set of feedback documents. However, in practice, the relevance feedback set, even provided by users explicitly or implicitly, is often a mixture of relevant and irrelevant documents. Consequently, the resultant query model (typically a term distribution) is often a mixture rather than a true relevance term distribution, leading to a negative impact on the retrieval performance. To tackle this problem, we recently proposed a Distribution Separation Method (DSM), which aims to approximate the true relevance distribution by separating a seed irrelevance distribution from the mixture one. While it achieved a promising performance in an empirical evaluation with simulated explicit irrelevance feedback data, it has not been deployed in the scenario where one should automatically obtain the irrelevance feedback data. In this article, we propose a substantial extension of the basic DSM from two perspectives: developing a further regularization framework and deploying DSM in the automatic irrelevance feedback scenario. Specifically, in order to avoid the output distribution of DSM drifting away from the true relevance distribution when the quality of seed irrelevant distribution (as the input to DSM) is not guaranteed, we propose a DSM regularization framework to constrain the estimation for the relevance distribution. 
This regularization framework includes three algorithms, each corresponding to a regularization strategy incorporated in the objective function of DSM. In addition, we exploit DSM in automatic (i.e., pseudo) irrelevance feedback, by automatically detecting the seed irrelevant documents via three different document re-ranking methods. We have carried out extensive experiments based on various TREC data sets, in order to systematically evaluate the proposed methods. The experimental results demonstrate the effectiveness of our proposed approaches in comparison with various strong baselines.

“…Recent research in inter-document similarity [20,21] has suggested that similarity measures that take the query into account are more effective than conventional measures. This class of similarity measures is called query-sensitive (QSSM).…”

Section: Structure (mentioning)

confidence: 99%

“…For each query, we retrieve the top 100 documents and use them for our study. In [12,20,21] it has been demonstrated that using relationships from among documents ranked high by an IR system in response to a query, is more effective than using relationships from entire document collections.…”

Section: Experimental Environment (mentioning)

confidence: 99%

Factors Affecting Web Page Similarity

Tombros, Ali 2005

Lecture Notes in Computer Science

Self Cite

Abstract. Tools that allow effective information organisation, access and navigation are becoming increasingly important on the Web. Similarity between web pages is a concept that is central to such tools. In this paper, we examine the effect that content and layout-related aspects of web pages have on web page similarity. We consider the textual content contained within common HTML tags, the structural layout of pages, and the query terms contained within pages. Our study shows that combinations of factors can yield more promising results than individual factors, and that different aspects of web pages affect similarities between pages in a different manner. We found a number of factors that, when taken into account, can result in effective measures of similarity between web pages. Query information in particular, proved to be important for the effective organisation of web pages.

“…For the purpose of query performance prediction, this measure is tailored further using the query-dependant extension of the dotproduct described by Tombros and van Rijsbergen in [TvR04]. Amongst the alternatives suggested, the similarity between two documents is calculated here as the product of their cosine dot product and the query-dependant component.…”

Section: Approximation of the Cox-Lewis Statistic (mentioning)

confidence: 99%
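
The product form described in the statement above can be sketched in a few lines. This is a minimal illustration, not the cited authors' implementation: documents and the query are assumed to be sparse term-weight dictionaries, and the query-dependent component is assumed to be the cosine between the query and the terms the two documents share (weighted by the mean of the two document weights). The function names `cosine`, `common_terms`, and `query_sensitive_similarity` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_u * norm_v)


def common_terms(d1, d2):
    """Terms present in both documents.

    Assumption: each shared term is weighted by the mean of its two weights.
    """
    return {t: (d1[t] + d2[t]) / 2 for t in d1 if t in d2}


def query_sensitive_similarity(d1, d2, query):
    """Product of the plain cosine and the assumed query-dependent component."""
    return cosine(d1, d2) * cosine(common_terms(d1, d2), query)


if __name__ == "__main__":
    doc_a = {"retrieval": 1.0, "cluster": 1.0}
    doc_b = {"retrieval": 1.0, "ranking": 1.0}
    print(query_sensitive_similarity(doc_a, doc_b, {"retrieval": 1.0}))
    print(query_sensitive_similarity(doc_a, doc_b, {"ranking": 1.0}))
```

Under these assumptions, two documents score highly only when they overlap on terms that also appear in the query; overlap on terms unrelated to the query contributes nothing to the score.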

Relevance Feedback for Text Retrieval

Vinay

SpringerReference

Relevance Feedback is a technique that helps an Information Retrieval system modify a query in response to relevance judgements provided by the user about individual results displayed after an initial retrieval. This thesis begins by proposing an evaluation framework for measuring the effectiveness of feedback algorithms. The simulation-based method involves a brute force exploration of the outcome of every possible user action. Starting from an initial state, each available alternative is represented as a traversal along one branch of a user decision tree. The use of the framework is illustrated in two situations - searching on devices with small displays and for web search. Three well known RF algorithms, Rocchio, Robertson/Sparck-Jones (RSJ) and Bayesian, are compared for these applications. For small display devices, the algorithms are evaluated in conjunction with two strategies for presenting search results: the top-D ranked documents and a document ranking that attempts to maximise information gain from the user's choices. Experimental results indicate that for RSJ feedback which involves an explicit feature selection policy, the greedy top-D display is more appropriate. For the other two algorithms, the exploratory display that maximises information gain produces better results. A user study was conducted to evaluate the performance of the relevance feedback methods with real users and compare the results with the findings from the tree analysis. This comparison between the simulations and real user behaviour indicates that the Bayesian algorithm, coupled with the sampled display, is the most effective. For web-search, two possible representations for web-pages are considered - the textual content of the page and the anchor text of hyperlinks into this page.
Results indicate that there is a significant variation in the upper-bound performance of the three RF algorithms and that the Bayesian algorithm approaches the best possible. The relative performance of the three algorithms differed in the two sets of experiments. All other factors being constant, this difference in effectiveness was attributed to the fact that the datasets used in the two cases were different. Also, at a more general level, a relationship was observed between the performance of the original query and benefits of subsequent relevance feedback. The remainder of the thesis looks at properties that characterise sets of documents with the particular aim of identifying measures that are predictive of future performance of statistical algorithms on these document sets. The central hypothesis is that a set of points (corresponding to documents) are difficult if they lack structure. Three properties are identified - the clustering tendency, sensitivity to perturbation and the local intrinsic dimensionality. The clustering tendency reflects the presence or absence of natural groupings within the data. Perturbation analysis looks at the sensitivity of the similarity metric to small changes in the input. The correlation present in se...
