Abstract. The information on the World Wide Web is growing without bound. Users may have very diversified preferences in the pages they target through a search engine. It is therefore a challenging task to adapt a search engine to suit the needs of a particular community of users who share similar interests. In this paper, we propose a new algorithm, Ranking SVM in a Co-training Framework (RSCF). Essentially, the RSCF algorithm takes the clickthrough data containing the items in the search result that have been clicked on by a user as an input, and generates adaptive rankers as an output. By analyzing the clickthrough data, RSCF first categorizes the data as the labelled data set, which contains the items that have been scanned already, and the unlabelled data set, which contains the items that have not yet been scanned. The labelled data is then augmented with unlabelled data to obtain a larger data set for training the rankers. We demonstrate that the RSCF algorithm produces better ranking results than the standard Ranking SVM algorithm. Based on RSCF we develop a metasearch engine that comprises MSNSearch, Wisenut, and Overture, and carry out an online experiment to show that our metasearch engine outperforms Google.
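As an illustration of the labelled/unlabelled split described above, the following is a minimal sketch and not the authors' implementation. It assumes a user scans the result list from the top and stops at the last clicked item; the abstract does not state how "scanned" items are determined, so this rule and the name `split_clickthrough` are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def split_clickthrough(
    ranked_results: Sequence[str], clicked: Set[str]
) -> Tuple[List[str], List[str]]:
    """Split one result list into (labelled, unlabelled) items.

    Assumption (not from the abstract): every item ranked at or above
    the last clicked item counts as scanned, everything below as unscanned.
    """
    clicked_positions = [i for i, item in enumerate(ranked_results) if item in clicked]
    if not clicked_positions:
        # No clicks: nothing can be treated as scanned under this rule.
        return [], list(ranked_results)
    last = max(clicked_positions)
    return list(ranked_results[: last + 1]), list(ranked_results[last + 1 :])


labelled, unlabelled = split_clickthrough(["a", "b", "c", "d", "e"], {"b", "d"})
```

Under this assumption, the labelled set would feed the ranker directly, while the unlabelled set is the pool from which additional training data is drawn.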
When crawling resources, e.g., number of machines, crawl-time, etc., are limited, a crawler has to decide an optimal order in which to crawl and re-crawl webpages. Ideally, crawlers should request only those webpages that have changed since the last crawl; in practice, a crawler may not know whether a webpage has changed before downloading it. In this paper, we identify features of webpages that are correlated with their change frequency. We design a crawling algorithm that clusters webpages based on features that correlate with their change frequencies obtained by examining past history. The crawler downloads a sample of webpages from each cluster and, depending upon whether a significant number of these webpages have changed in the last crawl cycle, decides whether to recrawl the entire cluster. To evaluate the performance of our incremental crawler, we develop an evaluation framework that measures which crawling policy results in the best search results for the end-user. We run experiments on a real Web data set of about 300,000 distinct URLs distributed among 210 websites. The results demonstrate that the clustering-based sampling algorithm effectively clusters the pages with similar change patterns, and our clustering-based crawling algorithm outperforms existing algorithms in that it can improve the quality of the user experience for those who query the search engine.
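The sampling step described above can be sketched roughly as follows. This is a simplified illustration, not the paper's algorithm: the function name `clusters_to_recrawl`, the `sample_size` and `threshold` parameters, and the use of a fixed changed-fraction cutoff to stand in for "a significant number" are all assumptions.

```python
import random
from typing import Callable, Dict, List


def clusters_to_recrawl(
    clusters: Dict[str, List[str]],
    has_changed: Callable[[str], bool],
    sample_size: int = 5,
    threshold: float = 0.5,
    seed: int = 0,
) -> List[str]:
    """Return the ids of clusters whose sampled pages changed often enough.

    `has_changed` is a hypothetical callback that downloads a page and
    reports whether it differs from the copy stored in the last crawl cycle.
    """
    rng = random.Random(seed)
    selected = []
    for cluster_id, urls in clusters.items():
        sample = rng.sample(urls, min(sample_size, len(urls)))
        if not sample:
            continue  # empty cluster, nothing to test
        changed = sum(1 for url in sample if has_changed(url))
        # Assumed decision rule: recrawl when the changed fraction
        # of the sample reaches the threshold.
        if changed / len(sample) >= threshold:
            selected.append(cluster_id)
    return selected


clusters = {"news": ["n1", "n2", "n3"], "archive": ["a1", "a2", "a3"]}
to_recrawl = clusters_to_recrawl(clusters, lambda url: url.startswith("n"))
```

Only the sampled pages are downloaded up front, so a cluster of rarely changing pages costs at most `sample_size` requests per cycle.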
Abstract. Skyline queries have received a lot of attention from the database and information retrieval research communities. A skyline query returns the set of data objects that are not dominated by any other data object in a given dataset. However, most existing studies focus on skyline query processing in centralized systems. Only recently have skyline queries been considered in a distributed computing environment. Acknowledging the trend toward peer-to-peer (P2P) systems in distributed computing, we examine the problem of skyline query processing in P2P systems and propose innovative solutions. We exploit the data semantics embedded in semantically structured P2P overlay networks to efficiently prune the search space without compromising the quality of the query result. In addition, we propose approximate algorithms to support skyline queries where exact answers are too costly to obtain. These approximate algorithms produce high-quality answers using heuristics based on the local semantics of peer nodes. Extensive experiments validate that our algorithms provide high efficiency and scalability for skyline query processing in P2P systems.
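To make the dominance definition concrete, here is a minimal centralized sketch. It is a naive quadratic scan, not the P2P algorithms the abstract proposes, and it assumes that smaller values are preferred in every dimension; the hotel-style data (price, distance) is invented for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b in every dimension and strictly
    better in at least one (assuming smaller is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points: Sequence[Point]) -> List[Point]:
    """Return the points not dominated by any other point, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (price, distance) pairs.
hotels = [(50, 8), (80, 2), (60, 9), (90, 3), (65, 5)]
result = skyline(hotels)
```

Here `(60, 9)` is dominated by `(50, 8)` and `(90, 3)` by `(80, 2)`, so the skyline keeps the remaining three points, none of which is beaten on both price and distance at once.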