Year: 2000
DOI: 10.1007/s007999900025
A probabilistic justification for using tf×idf term weighting in information retrieval

Abstract: This paper presents a new probabilistic model of information retrieval. The most important modeling assumption made is that documents and queries are defined by an ordered sequence of single terms. This assumption is not made in well-known existing models of information retrieval, but is essential in the field of statistical natural language processing. Advances already made in statistical natural language processing will be used in this paper to formulate a probabilistic justification for using tf×idf term weighting.
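The weighting scheme the paper justifies can be sketched concretely. Below is a minimal tf×idf computation in Python; the toy corpus and function name are illustrative, not taken from the paper, and real systems typically add smoothing and length normalization.

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """Classic tf×idf: raw term frequency in the document times
    the log inverse document frequency over the corpus."""
    tf = doc_tokens.count(term)
    df = sum(1 for doc in corpus if term in doc)  # document frequency
    return tf * math.log(len(corpus) / df) if df else 0.0

# Each document is an ordered sequence of single terms, matching
# the paper's central modeling assumption.
corpus = [
    ["probabilistic", "model", "of", "retrieval"],
    ["statistical", "language", "model"],
    ["retrieval", "model", "evaluation"],
]
print(tf_idf("retrieval", corpus[0], corpus))  # ≈ 0.405
```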

Cited by 158 publications (82 citation statements)
References 13 publications (26 reference statements)
“…This notion is for example expressed in the popular tf/idf family of formulae but is also implicit in the language modelling framework [10]. The same method can be applied to the video retrieval setting, in which each shared video corresponds to a distinct d. We assume a unigram collection model LM C comprised of all comments in C and dedicated document models LM d based on the comment thread of document d. Subsequently, we assume good descriptors of d can be determined by the termwise KL-divergence between both models (LM C and LM d ), identifying locally densely occurring terms w (those that display a high negative value of KL(w)).…”
Section: Related Work (citation type: mentioning)
confidence: 99%
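The termwise KL-divergence scoring described in the excerpt above can be sketched as follows. The unigram models LM_C and LM_d follow the quote; the Jelinek-Mercer smoothing, parameter values, and toy data are assumptions, and the quoted paper's sign convention is flipped relative to this sketch (the resulting ranking is the same).

```python
import math
from collections import Counter

def termwise_kl(doc_tokens, collection_tokens, alpha=0.9):
    """Per-term contribution to KL(LM_d || LM_C), where LM_d models
    one document's comment thread and LM_C models all comments.
    LM_d is smoothed with LM_C so every ratio is well defined;
    terms with large contributions occur locally densely in d."""
    lm_c = Counter(collection_tokens)
    lm_d = Counter(doc_tokens)
    n_c, n_d = len(collection_tokens), len(doc_tokens)
    scores = {}
    for w, freq in lm_d.items():
        p_c = lm_c[w] / n_c                            # collection model
        p_d = alpha * freq / n_d + (1 - alpha) * p_c   # smoothed doc model
        scores[w] = p_d * math.log(p_d / p_c)
    return scores

collection = "great video great song nice video".split()
doc = "great great great song".split()
ranked = sorted(termwise_kl(doc, collection).items(), key=lambda kv: -kv[1])
print(ranked[0][0])  # 'great' occurs locally densely in this thread
```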
“…One common scoring method that has been used for visual place recognition is known as TF-IDF (Term Frequency - Inverse Document Frequency (Hiemstra, 2000; Manning et al., 2008)), which creates vectors for each location where each element is the ratio between how common a word is within that location and how common the word is within the entire set of locations (Sivic and Zisserman, 2003). Locations can then be compared by finding the distance between their corresponding TF-IDF vectors.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
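A minimal sketch of the comparison described above, assuming a bag-of-visual-words representation per location; the cosine similarity, vocabulary, and toy data are illustrative, not taken from the cited papers.

```python
import math
from collections import Counter

def tfidf_vectors(locations):
    """locations: one list of visual words per location.
    Returns a sparse tf-idf weight dict for each location."""
    n = len(locations)
    df = Counter(w for loc in locations for w in set(loc))
    return [{w: (c / len(loc)) * math.log(n / df[w])
             for w, c in Counter(loc).items()}
            for loc in locations]

def cosine(u, v):
    """Cosine similarity between two sparse tf-idf vectors."""
    dot = sum(wt * v.get(w, 0.0) for w, wt in u.items())
    norm = lambda x: math.sqrt(sum(t * t for t in x.values())) or 1.0
    return dot / (norm(u) * norm(v))

# Visual words (e.g. quantized local descriptors) seen at each location.
locs = [["w1", "w2", "w2", "w3"], ["w2", "w2", "w3"], ["w4", "w5"]]
vecs = tfidf_vectors(locs)
print(cosine(vecs[0], vecs[1]))  # high: locations 0 and 1 share words
print(cosine(vecs[0], vecs[2]))  # 0.0: no shared visual words
```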
“…1, "Toy Story 3." In this third possibility, the trending topic is associated with a "Promoted" tweet -a hybrid tweet-advertisement which is displayed at the top of search results on relevant topics 5 . While the classification of a trending topic as consisting of spikes or chatter is helpful for the understanding of the nature of trending topics, it is not directly useful in the identification or classification of terms as trending topics.…”
Section: Problem Definitionmentioning
confidence: 99%
“…Put simply, the weight of a word in a document will be higher if the number of times the word occurs in the document is higher, or if the number of documents containing that word is lower; similarly, the weight will be lower if the number of times the word occurs in the document is lower, or if the number of documents containing that word is higher [5].…”
Section: B. TF-IDF (citation type: mentioning)
confidence: 99%
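The behaviour described in this excerpt follows directly from the standard weight w(t, d) = tf(t, d) × log(N / df(t)). A short worked example with made-up counts, assuming a corpus of N = 1000 documents:

```python
import math

N = 1000  # assumed corpus size (made-up for illustration)
weight = lambda tf, df: tf * math.log(N / df)

print(weight(10, 5))    # word frequent in doc, rare in corpus   -> ~52.98
print(weight(10, 900))  # word frequent in doc, common in corpus -> ~1.05
print(weight(1, 5))     # word rare in doc, rare in corpus       -> ~5.30
```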