Web clustering engines organize search results by topic, thus offering a complementary view to the flat-ranked list returned by conventional search engines. In this survey, we discuss the issues that must be addressed in the development of a Web clustering engine, including acquisition and preprocessing of search results, their clustering and visualization. Search results clustering, the core of the system, has specific requirements that cannot be addressed by classical clustering algorithms. We emphasize the role played by the quality of the cluster labels as opposed to optimizing only the clustering structure. We highlight the main characteristics of a number of existing Web clustering engines and also discuss how to evaluate their retrieval performance. Some directions for future research are finally presented.
Search engines rock! Right? Without search engines, the Internet would be an enormous amount of disorganized information that would certainly be interesting but perhaps not very useful. Search engines help us in all kinds of tasks and are constantly improving result relevance. So, does even the tiniest scratch exist on this perfect image? We're afraid so.

Contrary to popular belief, search engines don't answer questions. They merely provide fast access to the information people put on the Web. In fact, popular search engines, as opposed to query-answering systems, return Web pages matching the user's question rather than the question's answer. Luckily, this works most of the time, because people tend to place questions together with answers. The real problem arises when the information need expressed by the query is vague, too broad, or simply ill-defined. Then the range of covered topics becomes unmanageably large, confusing, and of doubtful value to the user.

Search-results clustering aims to present information about the matching documents (see the "Related Work in Text Clustering" sidebar). It's like taking a step backward to grasp a bigger picture: we no longer care about individual documents, but about some underlying semantic structure capable of explaining why these documents constitute a good result to the query. To find this structure, we set a few goals:
• identify groups of similar documents,
• discover a textual description of the property making the documents similar, and
• present these descriptions to the user in document clusters.

Our approach reverses the traditional order of cluster discovery. Instead of calculating proximity between documents and then labeling the discovered groups, we first attempt to find good, conceptually varied cluster labels and then assign documents to the labels to form groups.
We believe that only the commercial search engine Vivisimo (www.vivisimo.com) uses a similar order of cluster discovery, but the details of that algorithm are unknown.

The Lingo algorithm

According to the Collins English Dictionary, lingo is "a range of words or a style of language which is used in a particular situation or by a particular group of people." Each time a user issues a query on the Web, a new language is created, with its own characteristic vocabulary, phrases, and expressions. A successful Web-search-results clustering algorithm should speak its users' lingoes, that is, create thematic groups whose descriptions are easy to read and understand. Users will likely disregard groups with overly long or ambiguous descriptions, even though their content might be valuable. A Web search clustering algorithm must therefore aim to generate only clusters possessing meaningful, concise, and accurate labels.

In conventional approaches, which determine group labels after discovering the actual cluster content, this task proves fairly difficult to accomplish. Numerical cluster representations might "know" that certain documents are similar, but they can't describe the actual relationship.

In the Lingo de...
The search results clustering problem is defined as the automatic, on-line grouping of similar documents in a list of search results returned by a search engine. In this paper we present Lingo, a novel algorithm for clustering search results that emphasizes cluster description quality. We describe the methods used in the algorithm: algebraic transformations of the term-document matrix and frequent phrase extraction using suffix arrays. Finally, we discuss the results of an empirical evaluation of the algorithm.
Knowledge is of two kinds: we know a subject ourselves, or we know where we can find information about it.
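The frequent phrase extraction step mentioned in the abstract above can be illustrated with a minimal sketch. This is not the Lingo implementation: the function name `frequent_phrases`, the parameters `min_freq` and `max_len`, and the whitespace tokenization are all assumptions made for illustration, and the suffix array is built by naive sorting rather than a linear-time construction. It also does not filter for complete (maximal) phrases.

```python
from typing import Dict, List, Tuple


def frequent_phrases(
    docs: List[str], min_freq: int = 2, max_len: int = 4
) -> Dict[Tuple[str, ...], int]:
    """Count word n-grams (n <= max_len) that occur at least min_freq times.

    Hypothetical helper for illustration only, not part of any library.
    """
    # Tokenize on whitespace. A unique sentinel after each document keeps
    # phrases from spanning a document boundary.
    tokens: List[str] = []
    for i, doc in enumerate(docs):
        tokens.extend(doc.lower().split())
        tokens.append(f"\x00{i}")

    # Word-level suffix array: start positions sorted by the tokens that
    # follow. Comparing only the first max_len tokens is enough here,
    # because no longer phrase is ever reported.
    sa = sorted(range(len(tokens)), key=lambda i: tokens[i:i + max_len])

    counts: Dict[Tuple[str, ...], int] = {}
    for n in range(1, max_len + 1):
        # Suffixes sharing their first n tokens are adjacent in the suffix
        # array, so each run of equal n-token prefixes is one phrase.
        run_start = 0
        for k in range(1, len(sa) + 1):
            same = (
                k < len(sa)
                and tokens[sa[k]:sa[k] + n]
                == tokens[sa[run_start]:sa[run_start] + n]
            )
            if same:
                continue
            phrase = tuple(tokens[sa[run_start]:sa[run_start] + n])
            freq = k - run_start
            has_sentinel = any(t.startswith("\x00") for t in phrase)
            if len(phrase) == n and freq >= min_freq and not has_sentinel:
                counts[phrase] = freq
            run_start = k
    return counts


if __name__ == "__main__":
    snippets = [
        "web search results clustering",
        "clustering of web search results",
        "search engines",
    ]
    for phrase, freq in sorted(frequent_phrases(snippets).items()):
        print(" ".join(phrase), freq)
```

In a description-first approach such as the one the abstract outlines, phrases found this way would serve as candidate cluster labels; selecting among them with the term-document matrix is a separate step that this sketch does not cover.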
Abstract. In this paper we present the design goals and implementation outline of Carrot2, an open source framework for rapid development of applications dealing with Web Information Retrieval and Web Mining. The framework has been written from scratch keeping in mind flexibility and efficiency of processing. We show two software architectures that meet the requirements of these two aspects and provide evidence of their use in clustering of search results. We also discuss the importance and advantages of contributing and integrating the results of scientific projects with the open source community.