2002
DOI: 10.1007/3-540-45747-x_7

Focused Crawls, Tunneling, and Digital Libraries

Abstract: Crawling the Web to build collections of documents related to pre-specified topics became an active area of research during the late 1990's, crawler technology having been developed for use by search engines. Now, Web crawling is being seriously considered as an important strategy for building large scale digital libraries. This paper covers some of the crawl technologies that might be exploited for collection building. For example, to make such collection-building crawls more effective, focused crawling was d…

Cited by 69 publications (49 citation statements)
References: 22 publications
“…Web surfing is feasible due to the fact that most pages link to similar pages. Some recent work by Menczer [3,4] provides interesting insights into the relationship between content similarities and relatedness among Web pages. He finds that both content and links provide a weak yet significant signal about the (semantic) relatedness of Web pages [18][19].…”
Section: Literature Survey (citation type: mentioning)
confidence: 99%
“…On other occasions we find very valuable and accurate information. Hence, the large size, dynamism, and uncontrolled nature of the Web offer new challenges for information handling, retrieval, and usage [4][5][6].…”
Section: Introduction (citation type: mentioning)
confidence: 99%
“…Quadrant II (connected indirectly and in a forward direction search) contains relevant documents, which have indirectly connected characteristic, i.e. connected through one or several irrelevant documents [9], [10], [11]. Relevant documents in quadrant III connected directly through in-links of downloaded documents.…”
Section: Figure 4 Four WWW Characteristics Quadrants (citation type: mentioning)
confidence: 99%
“…In this algorithm, the heuristics (based on previous search results) are employed in the search ranking and queue order. Non-promising Universal Resource Locators (URLs) are placed in the back of the queue, where they rarely get a chance to be visited (Bergmark, 2002; Bergmark et al., 2002; Chakrabarti et al., 2007). Obviously, this type of search algorithm is more common than the breadth-first search algorithm since it examines the relevant page locations and avoids retrieving non-related pages.…”
Section: Best-first Search (citation type: mentioning)
confidence: 99%
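
The queue ordering described in the best-first search statement above can be illustrated with a minimal sketch. It is not taken from the cited papers: the function name `best_first_crawl`, the toy link graph, and the relevance scores are all hypothetical, and an in-memory dictionary stands in for actually fetching pages. The frontier is a priority queue keyed on a heuristic score, so low-scoring URLs sink to the back and may never be visited before the page budget runs out.

```python
import heapq

def best_first_crawl(graph, scores, seeds, max_pages):
    """Visit pages in descending order of heuristic score.

    graph  : dict mapping URL -> list of outgoing URLs (toy stand-in for fetching)
    scores : dict mapping URL -> heuristic relevance estimate (hypothetical values)
    """
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-scores.get(url, 0.0), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-scores.get(link, 0.0), link))
    return visited

# Toy link graph and made-up scores, for illustration only.
graph = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["d"],
    "c": [],
    "d": [],
}
scores = {"seed": 1.0, "a": 0.2, "b": 0.9, "c": 0.8, "d": 0.1}
order = best_first_crawl(graph, scores, ["seed"], max_pages=4)
print(order)  # ['seed', 'b', 'a', 'c']
```

With a budget of four pages, the lowest-scoring URL `d` stays at the back of the queue and is never fetched, which is the behaviour the statement attributes to non-promising URLs.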