Proceedings of the 2008 C3S2E Conference (C3S2E '08), 2008
DOI: 10.1145/1370256.1370278

An enhanced web robot for the CINDI system

Abstract: Focused web crawlers are receiving increasing attention as an effective approach to digital library construction. CINDI Robot is a focused web crawler that discovers and collects online academic and scientific documents in the computer science and software engineering fields for the CINDI system. In this paper, we present the basic design of CINDI Robot and describe the interactions among the CINDI components.

Cited by 5 publications (4 citation statements)
References 4 publications
“…Concordia Indexing and DIscovering system (CINDI) uses revised context graph and multilevel inspection scheme to discover relevant webpages. It explores relevant resources that are many links away from seed URLs.…”
Section: Current Status Of Web Crawler
confidence: 99%
“…Although the page content, hierarchy patterns and anchor texts are satisfactory leads, a focused crawler inevitably needs a multi-level inspection infrastructure to compensate their drawbacks. Unfortunately the current papers overlook the power of such comprehensiveness [23]. Considering these shortcomings, our proposed Treasure-Crawler utilized a significant approach in crawling and indexing Web pages that complied with its predefined topic of interest.…”
Section: Discussion
confidence: 99%
“…The idea of using the context of a given topic to guide the crawling process could significantly increase both precision and recall. Tunneling (2001) is the phenomenon where a crawler reaches some relevant regions (or pages) while traversing a path which does not solely consist of relevant pages [10,11]. The major task of focused Web crawlers is to unveil as many bridges among relevant regions as possible.…”
Section: Introduction
confidence: 99%
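
The tunneling behaviour described in the statement above can be illustrated with a short sketch. The following Python example is an illustrative assumption, not the CINDI Robot's actual algorithm: the relevance test, the tiny in-memory link graph, and the MAX_TUNNEL_DEPTH bound are hypothetical stand-ins for whatever classifier and frontier policy a real focused crawler would use.

from collections import deque

# Hypothetical bound on how many consecutive off-topic pages the crawler
# may "tunnel" through before the path is pruned.
MAX_TUNNEL_DEPTH = 3

# Tiny in-memory "web" of (page text, outgoing links), used only so the
# sketch runs end to end without network access.
WEB = {
    "seed":   ("department course list", ["deptA", "deptB"]),
    "deptA":  ("news and events", ["paper1"]),            # off-topic bridge page
    "deptB":  ("software engineering group", ["paper2"]),
    "paper1": ("software engineering thesis", []),
    "paper2": ("software engineering report", []),
}

def fetch(url):
    # Stand-in for an HTTP fetch and parse step.
    return WEB.get(url, ("", []))

def is_relevant(text):
    # Stand-in for a topic classifier (e.g. a context-graph or keyword model).
    return "software engineering" in text

def focused_crawl(seeds):
    frontier = deque((url, 0) for url in seeds)   # (url, consecutive off-topic hops)
    seen, harvested = set(seeds), []
    while frontier:
        url, streak = frontier.popleft()
        text, links = fetch(url)
        if is_relevant(text):
            harvested.append(url)
            streak = 0                            # back in a relevant region
        else:
            streak += 1                           # tunneling through an off-topic page
            if streak > MAX_TUNNEL_DEPTH:
                continue                          # prune paths that stay off topic
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append((link, streak))
    return harvested

print(focused_crawl(["seed"]))   # -> ['deptB', 'paper1', 'paper2']

In this toy graph the crawler reaches "paper1" only by passing through the off-topic "deptA" page, which is exactly the bridge-building effect the quoted statement attributes to tunneling.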