Abstract: We investigate the effect of feature weighting on document clustering, including a novel investigation of Okapi BM25 feature weighting. Using eight document datasets and 17 well-established clustering algorithms, we show that the benefit of tf-idf weighting over tf weighting is heavily dependent on both the dataset being clustered and the algorithm used. In addition, binary weighting is shown to be consistently inferior to both tf-idf weighting and tf weighting. We investigate clustering using both BM25 term sa…
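The three weighting schemes compared in the abstract above (binary, tf, and tf-idf) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tokenisation and the exact idf form (log of N over document frequency) are assumptions.

```python
import math
from collections import Counter

def weight_vectors(docs):
    """Return (binary, tf, tf-idf) sparse vectors for each tokenised doc.

    docs: list of documents, each a list of token strings.
    """
    n = len(docs)
    # Document frequency: number of docs each term appears in.
    df = Counter(t for d in docs for t in set(d))
    out = []
    for d in docs:
        tf = Counter(d)
        binary = {t: 1.0 for t in tf}                       # presence/absence
        tfidf = {t: f * math.log(n / df[t]) for t, f in tf.items()}
        out.append((binary, dict(tf), tfidf))
    return out
```

Note that a term occurring in every document gets a tf-idf weight of zero, which is one reason the benefit of tf-idf over raw tf can vary by dataset.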
“…Whissell et al [19] have investigated the effect of different feature weighting approaches on the document clustering performance. They concluded that BM25 outperforms other feature weighting approaches and suggested to use BM25 for clustering tasks.…”
Section: Document Representation
confidence: 99%
“…In case of the BoW model, we used BM25 with the parameters k1 = 20 and b = 1 based on a study of Whissell et al [19]. For the paragraph vector model, we used a model that was trained on a dump of all English Wikipedia articles from December 2017 using the gensim library [18].…”
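A minimal sketch of BM25 term weighting for a bag-of-words representation, using the cited parameters k1 = 20 and b = 1. The idf variant (Robertson/Sparck Jones with +1 smoothing) is an assumption, since the quote does not specify one.

```python
import math

def bm25_weight(tf, df, n_docs, doc_len, avg_len, k1=20.0, b=1.0):
    """BM25 weight for one term in one document.

    tf: term frequency in the document; df: number of docs containing
    the term; doc_len / avg_len: document length and corpus average.
    """
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    # With b = 1, length normalisation is fully proportional to doc_len.
    norm = tf + k1 * (1.0 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1.0) / norm
```

A document vector is then built by computing this weight for every term in the document; with the large k1 = 20, the weight stays nearly linear in tf over typical frequencies rather than saturating quickly.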
Abstract. The vast amount of scientific literature poses a challenge when one is trying to understand a previously unknown topic. Selecting a representative subset of documents that covers most of the desired content can solve this challenge by presenting the user a small subset of documents. We build on existing research on representative subset extraction and apply it in an information retrieval setting. Our document selection process consists of three steps: computation of the document representations, clustering, and selection of documents. We implement and compare two different document representations, two different clustering algorithms, and three different selection methods using a coverage and a redundancy metric. We execute our 36 experiments on two datasets, with 10 sample queries each, from different domains. The results show that there is no clear favorite and that we need to ask the question whether coverage and redundancy are sufficient for evaluating representative subsets.
“…Given a document corpus D and a query phrase Q, the diversified query expansion (DQE) problem requires that we generate an ordered (i.e., ranked) list of expansion terms E. Each of the terms in E may be appended to Q to create an extended query phrase that could be processed by a search engine operating over D using a relevance function such as BM25 [35] or PageRank [23]. The relevance function itself is external to the DQE task.…”
A search query, being a very concise grounding of user intent, could potentially have many possible interpretations. Search engines hedge their bets by diversifying top results to cover multiple such possibilities so that the user is likely to be satisfied, whatever her intended interpretation may be. Diversified Query Expansion is the problem of diversifying query expansion suggestions, so that the user can specialize the query to better suit her intent, even before perusing search results. In this paper, we consider the usage of semantic resources and tools to arrive at improved methods for diversified query expansion. In particular, we develop two methods, those that leverage Wikipedia and pre-learnt distributional word embeddings respectively. Both the approaches operate on a common three-phase framework: first taking a set of informative terms from the search results of the initial query, then building a graph, followed by using a diversity-conscious node ranking to prioritize candidate terms for diversified query expansion. Our methods differ in the second phase, with the first method Select-Link-Rank (SLR) linking terms with Wikipedia entities to accomplish graph construction; on the other hand, our second method, SelectEmbed-Rank (SER), constructs the graph using similarities between distributional word embeddings. Through an empirical analysis and user study, we show that SLR outperforms state-of-the-art diversified query expansion methods, thus establishing that Wikipedia is an effective resource to aid diversified query expansion. Our empirical analysis also illustrates that SER outperforms the baselines convincingly, asserting that it is the best available method for those cases where SLR is not applicable; these include narrow-focus search systems where a relevant knowledge base is unavailable. Our SLR method is also seen to outperform a state-of-the-art method in the task of diversified entity ranking.
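The third phase above, a diversity-conscious node ranking, can be illustrated with a greedy MMR-style selection over candidate terms. This is an illustrative stand-in, not the paper's actual SLR/SER ranking: the relevance scores, similarity function, and the lambda trade-off are assumptions.

```python
def diversify(candidates, sim, relevance, k=5, lam=0.7):
    """Greedy diversity-conscious selection of k candidates.

    Each step picks the candidate maximising
    lam * relevance - (1 - lam) * max-similarity-to-chosen,
    trading off relevance against redundancy (MMR-style).
    """
    chosen = []
    pool = list(candidates)
    while pool and len(chosen) < k:
        best = max(
            pool,
            key=lambda c: lam * relevance[c]
            - (1.0 - lam) * max((sim(c, s) for s in chosen), default=0.0),
        )
        chosen.append(best)
        pool.remove(best)
    return chosen
```

A candidate very similar to an already-selected term is penalised, so near-duplicates are pushed down the ranking even when individually relevant.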
“…Given a document corpus D and a query phrase Q, the diversified query expansion (DQE) problem requires that we generate an ordered (i.e., ranked) list of expansion terms E. Each of the terms in E may be appended to Q to create an extended query phrase that could be processed by a search engine operating over D using a relevance function such as BM25 [27] or PageRank [18]. The ideal E is that ordering of terms such that the separate extended queries formed using the top few terms in E are capable of eliciting documents relevant to most aspects of Q from the search engine.…”
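The DQE contract quoted above, where each ranked expansion term may be appended to Q to form an extended query for a downstream relevance function, can be sketched as a one-liner. The function name and the simple space-join are assumptions for illustration.

```python
def extended_queries(query, expansion_terms, k=3):
    """Append each of the top-k ranked expansion terms to the query,
    yielding one extended query phrase per term. Scoring the results
    is left to an external relevance function such as BM25."""
    return [f"{query} {term}" for term in expansion_terms[:k]]
```

Each extended query is then processed independently by the search engine, so a diverse top of E elicits documents relevant to different aspects of Q.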
Abstract. A search query, being a very concise grounding of user intent, could potentially have many possible interpretations. Search engines hedge their bets by diversifying top results to cover multiple such possibilities so that the user is likely to be satisfied, whatever her intended interpretation may be. Diversified Query Expansion is the problem of diversifying query expansion suggestions, so that the user can specialize the query to better suit her intent, even before perusing search results. We propose a method, Select-Link-Rank (SLR), that exploits semantic information from Wikipedia to generate diversified query expansions. SLR does collective processing of terms and Wikipedia entities in an integrated framework, simultaneously diversifying query expansions and entity recommendations. SLR starts with selecting informative terms from search results of the initial query, links them to Wikipedia entities, performs a diversity-conscious entity scoring and transfers such scoring to the term space to arrive at query expansion suggestions. Through an extensive empirical analysis and user study, we show that our method outperforms the state-of-the-art diversified query expansion and diversified entity recommendation techniques.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.