Smart Approach to Reduce the Web Crawling Traffic of Existing System using HTML based Update File at Web Server

Mishra, Shekhar; Jain, Anurag; Sachan, Amit

doi:10.5120/1593-2140

Cited by 3 publications

(4 citation statements)

References 9 publications

“…Third, up to our knowledge, most of the crawling techniques require communication between the running crawlers which increases the crawling processing time and requires high-quality networks (Mukhopadhyay et al, 2006; Wu and Lai, 2010; Kumar and Neelima, 2011; Agarwal et al, 2012; Amolochitis et al, 2013; Uzun et al, 2013). The fourth crawling problem is that the conventional crawlers work is based on the URLs, and download only the pages that are allocated on the web site server, and therefore, they are inefficient when dealing with AJAX pages, as they cannot index the web sites dynamic information (Mishra et al, 2010; Nath and Bal, 2011; Bhushan et al, 2012). In addition to the above static crawling problems, AJAX crawling techniques are still suffering from several challenging problems such as the following: first, identifying web page's states: in some cases, in order to identify the page states the AJAX events need to be triggered, and this may lead to change the content of the corresponding page without changing the page URL, and such page will be recognized as one of the page's states.…”

Section: Crawlers Challenging Problems (mentioning)

confidence: 99%

“…In addition, the conventional crawler techniques require downloading all web site pages to find the updated ones, and this will increase the internet traffic and the bandwidth consumption. It has been found that approximately 40 percent of the current internet traffic, bandwidth consumption and web requests are due to search engine crawlers (Mishra et al, 2010; Nath and Bal, 2011). To solve this issue, mobile crawlers (Mishra et al, 2010; Nath and Bal, 2011) and sitemaps-based crawlers (Schonfeld and Shivakumar, 2009; Bhushan et al, 2012; Brawer et al, 2013) were introduced.…”

Section: Related Work (mentioning)

confidence: 99%

“…It has been found that approximately 40 percent of the current internet traffic, bandwidth consumption and web requests are due to search engine crawlers (Mishra et al, 2010; Nath and Bal, 2011). To solve this issue, mobile crawlers (Mishra et al, 2010; Nath and Bal, 2011) and sitemaps-based crawlers (Schonfeld and Shivakumar, 2009; Bhushan et al, 2012; Brawer et al, 2013) were introduced. The details of these two techniques are explained below:…”

Section: Related Work (mentioning)

confidence: 99%

“…However, this mechanism suffers from the following problems: first, many web sites do not have sitemap, and therefore, the web crawler has to follow the traditional way of crawling; second, recently, a group of web sites and software have been developed to create the web site's sitemaps automatically. However, when any of the site's pages is updated, the system administrator has to update the page's metadata such as lastmod manually (Mishra et al, 2010; Bhushan et al, 2012). This makes the process of updating the sitemap, especially for larger web sites, difficult and time consuming.…”

Section: Related Work (mentioning)

confidence: 99%

Efficient watcher based web crawler design

Alqaraleh; Ramadan; Salamah

2015

Aslib Journal of Information Management

Purpose – The purpose of this paper is to design a watcher-based crawler (WBC) that can crawl both static and dynamic web sites and download only the updated and newly added web pages.

Design/methodology/approach – In the proposed WBC, a watcher file, which can be uploaded to the web sites' servers, prepares a report that contains the addresses of the updated and newly added web pages. In addition, the WBC is split into five units, each responsible for a specific crawling process.

Findings – Several experiments were conducted, and the proposed WBC was observed to increase the number of uniquely visited static and dynamic web sites compared with existing crawling techniques. In addition, the proposed watcher file not only allows crawlers to visit the updated and newly added web pages, but also solves the crawler overlapping and communication problems.

Originality/value – The proposed WBC performs all crawling processes in the sense that it detects all updated and newly added pages automatically, without explicit human intervention and without downloading entire web sites.
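
As a rough illustration of the watcher-file idea described in the abstract, the following Python sketch lists the pages changed since a crawler's last visit, so that only those need to be downloaded. It is not the authors' implementation: the function name `changed_pages`, the use of file modification times to detect updates, and the `example.com` base URL are all assumptions made for this example.

```python
import os
import tempfile
from urllib.parse import quote


def changed_pages(doc_root, base_url, since):
    """Return URLs of files under doc_root modified after `since` (epoch seconds).

    Hypothetical helper: a server-side script could publish this list so a
    crawler fetches only updated or newly added pages.
    """
    urls = []
    for dirpath, _dirnames, filenames in os.walk(doc_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > since:
                # Map the file path to a URL path relative to the site root.
                rel = os.path.relpath(path, doc_root).replace(os.sep, "/")
                urls.append(base_url.rstrip("/") + "/" + quote(rel))
    return sorted(urls)


# Demo on a throwaway directory: one page older than the last crawl, one newer.
with tempfile.TemporaryDirectory() as root:
    for name, mtime in (("old.html", 1000), ("new.html", 3000)):
        path = os.path.join(root, name)
        with open(path, "w") as fh:
            fh.write("<html></html>")
        os.utime(path, (mtime, mtime))  # set access and modification times
    report = changed_pages(root, "http://example.com/", since=2000)

print(report)  # ['http://example.com/new.html']
```

Modification times are only one possible change signal; a real watcher might instead compare content hashes, which would avoid reporting files that were touched but not actually changed.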