Title language model for information retrieval

Jin, Rong; Hauptmann, Alex; Zhai, Cheng Xiang

doi:10.1145/564376.564386

Cited by 69 publications

(37 citation statements)

References 9 publications

“…Empirical studies have shown that anchor texts exhibit characteristics similar to both user queries and document titles [24]. Language models generated from document titles also can be used as an approximation of a user query language model [30]. Anchor text has been widely used in the IR field to improve search effectiveness [22,23,25,31,35,36,38].…”

Section: Query Set (mentioning)

confidence: 99%

Quantifying retrieval bias in Web archive search

et al. 2017


A Web archive usually contains multiple versions of documents crawled from the Web at different points in time. One possible way for users to access a Web archive is through full-text search systems. However, previous studies have shown that these systems can induce a bias, known as the retrievability bias, on the accessibility of documents in community-collected collections (such as TREC collections). This bias can be measured by analyzing the distribution of the retrievability scores for each document in a collection, quantifying the likelihood of a document's retrieval. We investigate the suitability of retrievability scores in retrieval systems that consider every version of a document in a Web archive as an independent document. We show that the retrievability of documents can vary for different versions of the same document and that retrieval systems induce biases to different extents. We quantify this bias for a retrieval system which is adapted to handle multiple versions of the same document. The retrieval system indexes each version of a document independently, and we refine the search results using two techniques to aggregate similar versions. The first approach is to collapse similar versions of a document based on content similarity. The second approach is to collapse all versions of the same document based on their URLs. In both cases, we found that the degree of bias is related to the aggregation level of versions of the same document. Finally, we study the effect of bias across time using the retrievability measure. Specifically, we investigate whether the number of documents crawled in a particular year correlates with the number of documents in the search results from that year. Assuming queries are not inherently temporal in nature, the analysis is based on the timestamps of documents in the search results returned using the retrieval model for all queries. 
The results show a relation between the number of documents per year and the number of documents retrieved by the retrieval system from that year. We further investigated the relation between the queries' timestamps and the documents' timestamps. First, we split the queries into different time frames using a 1-year granularity. Then, we issued the queries against the retrieval system. The results show that temporal queries indeed retrieve more documents from the assumed time frame. Thus, the documents from the same time frame were preferred by the retrieval system over documents from other time frames.
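The idea of measuring bias through the distribution of retrievability scores can be illustrated with a small sketch. It assumes a cumulative scoring scheme (a document earns one point for each query that ranks it within a cutoff) and summarizes inequality with a Gini coefficient. Both choices are assumptions made for illustration, as are the names `retrievability`, `gini`, and the toy rankings; the abstract above does not specify them.

```python
from collections import Counter

def retrievability(rankings, cutoff):
    """Count, per document, how many queries rank it within the top `cutoff`.

    `rankings` maps a query id to its ranked list of document ids
    (hypothetical input format chosen for this sketch).
    """
    scores = Counter()
    for ranked_docs in rankings.values():
        for doc_id in ranked_docs[:cutoff]:
            scores[doc_id] += 1
    return scores

def gini(values):
    """Gini coefficient of non-negative values: 0 means perfectly even."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)

# Toy data: three queries over a four-document collection.
rankings = {
    "q1": ["a", "b", "c"],
    "q2": ["a", "c", "b"],
    "q3": ["a", "b", "d"],
}
collection = ["a", "b", "c", "d"]

scores = retrievability(rankings, cutoff=2)
# Include documents that were never retrieved (a Counter returns 0 for them).
bias = gini([scores[d] for d in collection])
print(dict(scores), round(bias, 3))
```

Documents that never appear within the cutoff must still be counted with a score of zero, otherwise the inequality is understated.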

“…This is the basic formulation of the HMM model proposed by Miller et al and often referred to as the simple language model which has been used as the baseline language model in several studies (Lavrenko & Croft, 2001;Liu & Croft, 2002;Jin et al, 2002). Retrieval experiments on TREC test collections show that the simple two-state system can do dramatically better than the tf-idf measure.…”

Section: Miller Et Al (1999) Use a Two State Hidden Markov Model (Hm… (mentioning)

confidence: 99%
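The two-state model quoted above is commonly written as a mixture of a document model and a general (collection) model, scoring a query by the product over its terms of a1 · P(q|D) + (1 − a1) · P(q|collection). A minimal sketch under that reading follows; the function name, the toy documents, and the mixture weight `a1=0.3` are placeholders chosen for illustration, not values taken from the quoted text.

```python
import math
from collections import Counter

def mixture_log_likelihood(query, doc, collection_counts, collection_len, a1=0.3):
    """log P(query | doc) under a two-state mixture of document and collection models."""
    doc_counts = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for term in query:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_bg = collection_counts[term] / collection_len
        p = a1 * p_doc + (1.0 - a1) * p_bg
        if p == 0.0:
            # Term unseen in the whole collection: the query cannot be generated.
            return float("-inf")
        score += math.log(p)
    return score

# Toy collection (hypothetical).
docs = {
    "d1": "language model for retrieval".split(),
    "d2": "markov model for speech".split(),
}
collection_counts = Counter(t for d in docs.values() for t in d)
collection_len = sum(collection_counts.values())

query = "language retrieval".split()
scores = {
    name: mixture_log_likelihood(query, d, collection_counts, collection_len)
    for name, d in docs.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The collection component keeps a document from receiving zero probability merely because it lacks one query term.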

“…Building upon the ideas of Berger & Lafferty (1999), Jin et al (2002) propose to construct language models of document titles and determine the relevance a document to a query by estimating the likelihood that the query would have been the title for the document. The title of a document is viewed as a translation from that document and the title language model is regarded as an approximate language model of the query.…”

Section: Miller Et Al (1999) Use a Two State Hidden Markov Model (Hm… (mentioning)

confidence: 99%

“…The title of a document is viewed as a translation from that document and the title language model is regarded as an approximate language model of the query. Jin et al (2002) first estimate a translation model by using all the document-title pairs in a collection. The translation model is then used for mapping a regular document language model to a title language model.…”

Section: Miller Et Al (1999) Use a Two State Hidden Markov Model (Hm… (mentioning)

confidence: 99%
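The pipeline described in these statements (estimate a translation model from document-title pairs, map a document language model to a title language model, then score the query as if it were a title) can be sketched roughly as follows. This is an illustration only: it assumes an IBM Model 1 style EM estimate of P(title word | document word) and a small probability floor for unseen words, neither of which is confirmed by the quoted text, and every name and data item here is hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_translation(pairs, iterations=10):
    """Estimate t[doc_word][title_word] = P(title_word | doc_word) with EM."""
    title_vocab = {tw for _, title in pairs for tw in title}
    t = defaultdict(dict)
    for doc, title in pairs:  # uniform initialization over co-occurring pairs
        for dw in doc:
            for tw in title:
                t[dw][tw] = 1.0 / len(title_vocab)
    for _ in range(iterations):
        counts = defaultdict(lambda: defaultdict(float))
        totals = defaultdict(float)
        for doc, title in pairs:  # E-step: fractional alignment counts
            for tw in title:
                norm = sum(t[dw][tw] for dw in doc)
                for dw in doc:
                    frac = t[dw][tw] / norm
                    counts[dw][tw] += frac
                    totals[dw] += frac
        for dw, row in counts.items():  # M-step: renormalize per document word
            for tw, c in row.items():
                t[dw][tw] = c / totals[dw]
    return t

def title_lm_prob(word, doc, t):
    """P(word | title model of doc) = sum over doc words of P(word | dw) * P(dw | doc)."""
    counts = Counter(doc)
    return sum(t.get(dw, {}).get(word, 0.0) * c / len(doc) for dw, c in counts.items())

def query_log_likelihood(query, doc, t, floor=1e-9):
    """Log-likelihood that `query` would be the title of `doc` (floor is an assumption)."""
    return sum(math.log(max(title_lm_prob(q, doc, t), floor)) for q in query)

# Toy document-title pairs (hypothetical).
pairs = [
    ("stocks fell as markets closed lower".split(), "market report".split()),
    ("the team won the final match".split(), "sports report".split()),
    ("markets rallied and stocks rose".split(), "market news".split()),
]
t = train_translation(pairs)

query = ["market"]
score_a = query_log_likelihood(query, "stocks and markets".split(), t)
score_b = query_log_likelihood(query, "team match final".split(), t)
print(score_a > score_b)
```

In this toy run the query word "market" never occurs literally in either candidate document; the first document still scores higher because its words translate to "market" in the training pairs.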


Statistical language modeling for information retrieval

Liu, Croft 2005

Annual Review Info Sci & Tec


“…Ponte and Croft originally proposed LM for IR [10], then Song put emphasis on data smoothing techniques in LM [12]. Recently, many variations of traditional LM have been developed to improve IR performance, such as relevance-based language model [13], time-based language model [14] and title language model [15]. In this paper, we extend the traditional document LM to the author LM and the category LM according to the nature of BBS articles.…”

Section: Related Work (mentioning)

confidence: 99%

An Article Language Model for BBS Search

Zhu

2005

Lecture Notes in Computer Science


Abstract. Bulletin Board Systems (BBS), similar to blogs, newsgroups, online forums, etc., are online broadcasting spaces where people can exchange ideas and make announcements. As BBS are becoming valuable repositories of knowledge and information, effective BBS search engines are required to make the information universally accessible and useful. However, the techniques that have been proven successful for web search are not suitable for searching BBS articles due to the nature of BBS. In this paper, we propose a novel article language model (LM) to build an effective BBS search engine. We investigate the differences between BBS articles and web pages, then extend the traditional LM to author LM and category LM. The article LM is powerful in the sense that it can combine the three LMs into a single framework. Experimental results show that our article LM substantially outperforms both the INQUERY algorithm and the traditional LM.
