2001
DOI: 10.1145/383034.383035

Searching the Web

Abstract: We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance. The most common design and implementation techniques for each of these components are presented. For this presentation we draw from the literature and from our own experimental search engine testbed. Emphasis is on introducing the fundamental concepts…
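
The abstract mentions link analysis as a way to boost search performance. As a rough illustration of the kind of link-based scoring the paper surveys, here is a toy PageRank-style power iteration in Python; the example graph, damping factor, and iteration count are illustrative assumptions, not values taken from the paper.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy power-iteration PageRank over a dict {page: [outgoing links]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank uniformly over all pages
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Example: three pages linking to one another
print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```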

Cited by 437 publications (221 citation statements)
References 39 publications
“…A popular approach for focused resource discovery on the web is the best-first search (BSFS) algorithm, where unvisited pages are stored in a priority queue, known as the frontier, and are reordered periodically based on a criterion. So, a typical topic-oriented crawler keeps two queues of URLs: one containing the already visited links (from here on AF) and another holding the references of the first queue, also called the crawl frontier (from here on CF) [5]. The challenging task is ordering the links in the CF efficiently.…”
Section: Web Information Retrieval
Citation type: mentioning, confidence: 99%
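
The two-queue, best-first frontier described in this excerpt can be sketched roughly as follows. This is a minimal illustration, assuming a caller-supplied score_fn for ordering the crawl frontier (CF) and a plain set for the already-visited links (AF); neither detail comes from the cited work.

```python
import heapq

class CrawlFrontier:
    """Minimal best-first search (BSFS) frontier sketch.

    Keeps two collections of URLs: the already-fetched set (AF) and a
    priority queue of unvisited links (CF), ordered by a relevance score.
    """

    def __init__(self, score_fn):
        self.score_fn = score_fn      # placeholder, e.g. topic similarity of the linking page
        self.visited = set()          # AF: already-fetched URLs
        self.heap = []                # CF: (negated score, url) min-heap

    def add(self, url, context):
        if url not in self.visited:
            # heapq is a min-heap, so negate the score to pop best-first
            heapq.heappush(self.heap, (-self.score_fn(url, context), url))

    def next_url(self):
        while self.heap:
            _, url = heapq.heappop(self.heap)
            if url not in self.visited:
                self.visited.add(url)
                return url
        return None
```

A real focused crawler would also periodically re-score and rebuild the heap, which corresponds to the "reordered periodically" step mentioned in the excerpt.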
“…Search engines use many of the techniques developed over the last decades for full-text document retrieval, but are also quite different in many aspects [12]. Users interact with these systems in a very different way: queries tend to be much shorter, and only the first or second results pages are examined in most cases.…”
Section: Web Search Evaluation
Citation type: mentioning, confidence: 99%
“…Perfectly and adeptly determined near duplicates are relied on by different web mining applications, for example, document clustering [3], collaborative filtering [25], detection of replicated web collections [26], discovering large dense graphs [34], detecting plagiarism [31] and community mining in a social network site [32]. The removal of near-duplicate pages [33] helps in reducing storage costs and improving the quality of search indexes, in addition to considerable bandwidth conservation. Above all, the crawled web pages are preprocessed using document parsing, which eliminates the HTML tags and JavaScript present in the web documents; this is followed by the removal of common words or stop words from the crawled pages.…”
Section: Introduction
Citation type: mentioning, confidence: 99%
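
As a rough sketch of the preprocessing step described in this excerpt (removal of tags and scripts followed by stop-word removal), the following Python uses only the standard library; the stop-word list is a small illustrative assumption rather than the one used in the cited work.

```python
import re
from html.parser import HTMLParser

# Small illustrative stop-word list; real systems use much larger ones.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to"}

class TextExtractor(HTMLParser):
    """Collects text content while skipping <script> and <style> blocks."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)

def preprocess(html_page):
    """Strip tags and scripts, then drop stop words from the remaining text."""
    parser = TextExtractor()
    parser.feed(html_page)
    tokens = re.findall(r"[a-z0-9]+", " ".join(parser.chunks).lower())
    return [t for t in tokens if t not in STOP_WORDS]

# Example: preprocess("<html><script>var x=1;</script><p>The Web is large</p></html>")
# returns ['web', 'large']
```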