Focused Crawls, Tunneling, and Digital Libraries

Bergmark, Donna; Lagoze, Carl; Sbityakov, Alex

doi:10.1007/3-540-45747-x_7

Cited by 69 publications

(49 citation statements)

References 22 publications


“…Web surfing is feasible due to the fact that most pages link to similar pages. Some recent work by Menczer [3,4] provides interesting insights into the relationship between content similarities and relatedness among Web pages. He finds that both content and links provide a weak yet significant signal about the (semantic) relatedness of Web pages [18][19].…”

Section: Literature Survey (mentioning)

confidence: 99%


A Frame Work for Topical Collections Make with Focused and Accelerated Focused Crawlers

Saturi, Raju, Kumar, et al. 2015. IJCA


The rapid growth of the World-Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines. In the personalized search domain, an alternative to the general-purpose crawler, the focused crawler, is receiving increasing attention. The goal of these crawlers is to selectively seek out pages that are relevant to a pre-defined set of topics or themes. Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, these crawlers analyze their crawl boundary to find the links that are likely to be most relevant for the crawl, and avoid irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. This paper presents and compares two focused crawlers, called the traditional focused crawler and the accelerated focused crawler. The accelerated focused crawler takes offline lessons from the traditional focused crawler. It emulates a human surfer by trying to predict the relevance of an "HREF" target page based on the words around the link on the source page. The topics are specified using exemplary documents in these experiments. A Naive Bayesian classifier is used to guide the crawlers. The crawlers were evaluated for different numbers of pages crawled, for different numbers of features gathered from different distances from the link, and with different feature selection methods.
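The link-relevance idea in this abstract (predicting a target page's relevance from the words around the link, with a Naive Bayes classifier trained on exemplary documents) can be illustrated with a minimal sketch. This is not the authors' implementation: the class `NaiveBayes`, the helpers `tokenize` and `link_context`, the window size, and the use of Laplace smoothing are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def link_context(words, link_index, window):
    """Return the words within `window` positions of the link (hypothetical helper)."""
    lo = max(0, link_index - window)
    return words[lo:link_index] + words[link_index + 1:link_index + 1 + window]


class NaiveBayes:
    """Tiny multinomial Naive Bayes with Laplace smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()

    def fit(self, docs, labels):
        for doc, label in zip(docs, labels):
            tokens = tokenize(doc)
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def log_posterior(self, tokens, label):
        counts = self.word_counts[label]
        total = sum(counts.values())
        # Start from the log prior, then add smoothed log likelihoods.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for token in tokens:
            if token in self.vocab:  # words never seen in training are skipped
                score += math.log((counts[token] + 1) / (total + len(self.vocab)))
        return score

    def predict(self, tokens):
        return max(self.word_counts, key=lambda label: self.log_posterior(tokens, label))
```

In use, one would train on exemplary documents for the topic and for off-topic material, locate the link's position in the tokenized source page, and pass `link_context(words, link_index, window)` to `predict`. Varying `window` corresponds to gathering features from different distances from the link, as the abstract describes.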



“…On other occasions we find very valuable and accurate information. Hence, the large size, dynamism, and uncontrolled nature of the Web offer new challenges for information handling, retrieval, and usage [4][5][6].…”

Section: Introduction (mentioning)

confidence: 99%

A Frame Work for Topical Collections Make with Focused and Accelerated Focused Crawlers

Saturi, Raju, Kumar, et al. 2015. IJCA


“…Quadrant II (connected indirectly and in a forward direction search) contains relevant documents, which have indirectly connected characteristic, i.e. connected through one or several irrelevant documents [9], [10], [11]. Relevant documents in quadrant III connected directly through in-links of downloaded documents.…”

Section: Figure 4 Four Www Characteristics Quadrants (mentioning)

confidence: 99%

CT-FC: more Comprehensive Traversal Focused Crawler

et al. 2012


“…In this algorithm, the heuristics (based on previous search results) are employed in the search ranking and queue order. Non-promising Universal Resource Locators (URLs) are placed in the back of the queue, where they rarely get a chance to be visited (Bergmark, 2002; Bergmark et al, 2002; Chakrabarti et al, 2007). Obviously, this type of search algorithm is more common than the breadth-first search algorithm since it examines the relevant page locations and avoids retrieving non-related pages.…”

Section: Best-first Search (mentioning)

confidence: 99%
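The queue ordering described in the best-first search statement above can be sketched with a priority queue, so that low-scoring URLs sink to the back and are visited last, if at all. This is a minimal sketch under stated assumptions, not the method of any cited paper: `BestFirstFrontier`, `crawl`, `fetch_links`, and `score_url` are hypothetical names, and the scoring function is supplied by the caller.

```python
import heapq
from itertools import count


class BestFirstFrontier:
    """Crawl frontier that always yields the highest-scoring unvisited URL."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._tie = count()  # insertion counter breaks ties between equal scores

    def push(self, url, score):
        if url in self._seen:
            return  # each URL is queued at most once
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the best URL first.
        heapq.heappush(self._heap, (-score, next(self._tie), url))

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


def crawl(seeds, fetch_links, score_url, max_pages):
    """Visit up to `max_pages` URLs in best-first order.

    `fetch_links(url)` returns the outgoing links of a page and
    `score_url(url)` returns a relevance estimate; both are assumed
    to be provided by the caller.
    """
    frontier = BestFirstFrontier()
    for seed in seeds:
        frontier.push(seed, 1.0)
    visited = []
    while len(frontier) and len(visited) < max_pages:
        url, _ = frontier.pop()
        visited.append(url)
        for link in fetch_links(url):
            frontier.push(link, score_url(link))
    return visited
```

With a small page budget, URLs that score poorly may never be popped, which matches the statement that non-promising URLs "rarely get a chance to be visited".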

Enhanced Search Scheme Precision and Performance using a GA Approach with Application to Arabic Content

Ghwanmeh, 2012. JAC


Literature examination shows that information search engines in Arabic are few compared to those available in English and other languages. Additionally, search engines face many problems when programmed in the Arabic language, including difficulty and uncertainty. Employing a Genetic Algorithm within the search scheme to improve performance and exactness, and to tackle the inaccuracy of search systems in which Arabic content is used, can be considered an advancement. An enhanced search scheme that provides exactness, precision, and performance by applying the Genetic Algorithm technique to Arabic content is presented in this paper. Based on the user's starting page selection, the system employs its dynamic characteristics to search related pages on the Web. A series of experiments was conducted to test the quality and effectiveness of the proposed system using well-known test collections (namely CISI, CACM, and NPL) and 242 Arabic-content sites. General results revealed that the proposed system retrieved the largest number of appropriate documents and the fewest non-related documents with respect to user requests in high-performance information retrieval systems that use the Genetic Algorithm.

