Abstract. The pages and hyperlinks of the World-Wide Web may be viewed as nodes and edges in a directed graph. This graph is a fascinating object of study: it has several hundred million nodes today, over a billion links, and appears to grow exponentially with time. There are many reasons (mathematical, sociological, and commercial) for studying the evolution of this graph. In this paper we begin by describing two algorithms that operate on the Web graph, addressing problems from Web search and automatic community discovery. We then report a number of measurements and properties of this graph that manifested themselves as we ran these algorithms on the Web. Finally, we observe that traditional random graph models do not explain these observations, and we propose a new family of random graph models. These models point to a rich new sub-field of the study of random graphs, and raise questions about the analysis of graph algorithms on the Web.
Overview

Few events in the history of computing have wrought as profound an influence on society as the advent and growth of the World-Wide Web. For the first time, millions (soon to be billions) of individuals are creating, annotating and exploiting hyperlinked content in a distributed fashion. A particular Web page might be authored in any language, dialect, or style by an individual with any background, culture, motivation, interest, and education; might range from a few characters to a few hundred thousand; might contain truth, falsehood, lies, propaganda, wisdom, or sheer nonsense; and might point to none, few, or several other Web pages. The hyperlinks of the Web endow it with additional structure, and the network of these links is rich in latent information content. Our focus in this paper is on the directed graph induced by the hyperlinks between Web pages; we refer to this as the Web graph. For our purposes, nodes represent static HTML pages and hyperlinks represent directed edges. Recent estimates [4] suggest that there are several hundred million nodes in the Web graph; this quantity is growing by a few percent a month. The average node has roughly seven hyperlinks (directed edges) to other pages, making for a total of several billion hyperlinks in all.
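To make the graph abstraction concrete, the sketch below is one minimal way to hold such a directed graph as adjacency lists, with one node per static page and one edge per hyperlink; it is not code from the paper, and the class name and URLs are hypothetical placeholders. It also computes the average out-degree that the estimates above put at roughly seven.

```python
# Minimal sketch (not from the paper): the Web graph as adjacency lists.
# Nodes are static HTML pages (placeholder URLs here); hyperlinks are
# directed edges.
from collections import defaultdict

class WebGraph:
    def __init__(self):
        self.nodes = set()
        self.out_links = defaultdict(set)   # page -> pages it links to
        self.in_links = defaultdict(set)    # page -> pages that link to it

    def add_link(self, src, dst):
        """Record a directed hyperlink from page src to page dst."""
        self.nodes.update((src, dst))
        self.out_links[src].add(dst)
        self.in_links[dst].add(src)

    def average_out_degree(self):
        """Average number of hyperlinks per page (about seven on the Web of the time)."""
        if not self.nodes:
            return 0.0
        return sum(len(v) for v in self.out_links.values()) / len(self.nodes)

# Hypothetical usage:
g = WebGraph()
g.add_link("http://example.edu/home", "http://example.org/paper")
g.add_link("http://example.org/paper", "http://example.com/index")
print(g.average_out_degree())
```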
Search obstacles

As we consider the types of pages we hope to discover, and to do so automatically, we quickly confront some difficult problems. Sifting through the growing mountain of Web data demands an increasingly discerning search engine, one that can reliably assess the quality of sites, not just their relevance. First, it is insufficient to apply purely text-based methods to collect many potentially relevant pages.
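To make the contrast with purely text-based ranking concrete, the small sketch below uses link structure as a crude quality signal, ordering textually relevant pages by how many other pages point to them. This is only an illustration, not the algorithm developed in the paper, and the function names, pages, and data are hypothetical.

```python
# Illustrative sketch only: rank pages that already match a query's text by
# in-degree, i.e., how many other pages cite them via hyperlinks.
def rank_by_in_degree(candidate_pages, in_links):
    """Order text-relevant pages by the number of pages linking to them."""
    return sorted(candidate_pages,
                  key=lambda page: len(in_links.get(page, set())),
                  reverse=True)

# Hypothetical usage: two pages judged textually relevant to some query.
candidates = ["http://a.example/cars", "http://b.example/cars"]
in_links = {
    "http://a.example/cars": {"http://x.example", "http://y.example"},
    "http://b.example/cars": {"http://z.example"},
}
print(rank_by_in_degree(candidates, in_links))
```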