2011
DOI: 10.1007/s10791-011-9163-y

Improving document clustering using Okapi BM25 feature weighting

Abstract: We investigate the effect of feature weighting on document clustering, including a novel investigation of Okapi BM25 feature weighting. Using eight document datasets and 17 well-established clustering algorithms we show that the benefit of tf-idf weighting over tf weighting is heavily dependent on both the dataset being clustered and the algorithm used. In addition, binary weighting is shown to be consistently inferior to both tf-idf weighting and tf weighting. We investigate clustering using both BM25 term sa…

Cited by 52 publications (44 citation statements)
References 23 publications
“…Whissell et al [19] have investigated the effect of different feature weighting approaches on the document clustering performance. They concluded that BM25 outperforms other feature weighting approaches and suggested to use BM25 for clustering tasks.…”
Section: Document Representation (mentioning)
confidence: 99%
“…In case of the BoW model, we used BM25 with the parameters k1 = 20 and b = 1 based on a study of Whissell et al [19]. For the paragraph vector model, we used a model that was trained on a dump of all English Wikipedia articles from December 2017 using the gensim library [18].…”
Section: Experiments Setup (mentioning)
confidence: 99%
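The BM25 feature weighting referred to in the statement above could be sketched roughly as follows. This is a minimal illustration, not the implementation of Whissell et al or of the citing paper: the function name `bm25_weights`, the plain `log(N / df)` idf component, and the pre-tokenized input are assumptions made for the example; only the parameter values k1 = 20 and b = 1 come from the quoted text.

```python
import math
from collections import Counter


def bm25_weights(docs, k1=20.0, b=1.0):
    """Return one {term: weight} dict per tokenized document.

    Hypothetical helper: combines a saturating, length-normalized
    term-frequency component with a simple log(N / df) idf
    (the idf form is an assumption, not taken from the source).
    """
    n_docs = len(docs)
    if n_docs == 0:
        return []
    # Average document length; fall back to 1.0 if every document is empty.
    avg_len = (sum(len(d) for d in docs) / n_docs) or 1.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))

    weighted = []
    for d in docs:
        tf = Counter(d)
        # Length normalization; with b = 1 it is fully applied.
        norm = k1 * (1.0 - b + b * len(d) / avg_len)
        weighted.append({
            t: math.log(n_docs / df[t]) * f * (k1 + 1.0) / (f + norm)
            for t, f in tf.items()
        })
    return weighted


# Toy corpus, already tokenized (illustrative data only).
docs = [["okapi", "bm25", "okapi"], ["bm25", "clustering"]]
vectors = bm25_weights(docs)
```

The resulting dictionaries can be treated as sparse feature vectors and passed to a clustering algorithm. With this idf choice, a term that occurs in every document receives weight zero.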
“…Given a document corpus D and a query phrase Q, the diversified query expansion (DQE) problem requires that we generate an ordered (i.e., ranked) list of expansion terms E. Each of the terms in E may be appended to Q to create an extended query phrase that could be processed by a search engine operating over D using a relevance function such as BM25 [35] or PageRank [23]. The relevance function itself is external to the DQE task.…”
Section: Problem Statement (mentioning)
confidence: 99%
“…Given a document corpus D and a query phrase Q, the diversified query expansion (DQE) problem requires that we generate an ordered (i.e., ranked) list of expansion terms E. Each of the terms in E may be appended to Q to create an extended query phrase that could be processed by a search engine operating over D using a relevance function such as BM25 [27] or PageRank [18]. The ideal E is that ordering of terms such that the separate extended queries formed using the top few terms in E are capable of eliciting documents relevant to most aspects of Q from the search engine.…”
Section: Problem Formulation (mentioning)
confidence: 99%