Proceedings of the 16th International Conference on World Wide Web (WWW 2007)
DOI: 10.1145/1242572.1242726

A large-scale study of robots.txt

Abstract: Search engines rely largely on Web robots to collect information from the Web. Due to the unregulated, open-access nature of the Web, robot activities are extremely diverse. Such crawling activity can be regulated from the server side by deploying the Robots Exclusion Protocol in a file called robots.txt. Although it is not an enforced standard, ethical robots (including many commercial ones) follow the rules specified in robots.txt. With our focused crawler, we investigate 7,593 websites from education, governm…
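
As a concrete illustration of how a compliant crawler consults these rules, the sketch below uses Python's standard urllib.robotparser module. The robots.txt content, robot name, and URLs are hypothetical examples, not drawn from the study.

from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it
# from http://example.com/robots.txt before crawling the site.
rules = """
User-agent: *
Disallow: /private/
Disallow: /tmp/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# An ethical robot checks each URL against the rules before fetching.
print(parser.can_fetch("ExampleBot", "http://example.com/index.html"))  # True
print(parser.can_fetch("ExampleBot", "http://example.com/private/a"))   # False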

Cited by 33 publications (33 citation statements), published 2008–2018. References 1 publication.

“…In [4,5], the bias was measured by counting the number of directories disallowed, i.e., the crawler with the highest (lowest) such count was regarded as having the most unfavorable (favorable) bias. The drawback of this approach is that the number of directories disallowed may not correlate well with the amount of content or the number of URLs disallowed.…”
Section: Experiments and Results (citation type: mentioning; confidence: 99%)
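
To make that directory-counting measure concrete, here is a simplified sketch, assuming each robots.txt has already been fetched as text. It ignores Allow directives, wildcards, and other extensions, so it illustrates the metric rather than providing a full parser.

from collections import defaultdict

def disallow_counts(robots_txt: str) -> dict:
    """Count Disallow directives per user-agent: the crude bias
    proxy described above (more disallows = less favorable)."""
    counts = defaultdict(int)
    agents = []        # user-agents of the record being read
    in_rules = False   # True once the record's rule lines begin
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_rules:                      # a new record starts
                agents, in_rules = [], False
            agents.append(value)
        elif field == "disallow":
            in_rules = True
            if value:                         # an empty Disallow allows everything
                for agent in agents:
                    counts[agent] += 1
    return dict(counts)

print(disallow_counts("""
User-agent: googlebot
User-agent: msnbot
Disallow: /private/
Disallow: /cgi-bin/

User-agent: *
Disallow: /private/
"""))
# {'googlebot': 2, 'msnbot': 2, '*': 1}
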
“…Despite the importance of this protocol to both content providers and search engines, the first reasonably large-scale study of its usage was conducted only recently, in 2007 [4,5]. The study covered 2,925 distinct robots.txt files from 7,593 sites.…”
Section: Introduction (citation type: mentioning; confidence: 99%)
“…But it cannot identify search-engine visits, because Google Analytics tracks users by means of JavaScript, and search-engine crawlers do not execute the JavaScript embedded in web pages when they visit a site. A large-scale study of robots.txt was conducted by Sun et al. [3], and the ethics of web crawlers was studied by Giles [1]. Schwenke et al. [6] studied the relationship between JavaScript usage and web site visibility, examining whether JavaScript-based hyperlinks attract or repel crawlers and thereby increase or decrease a site's visibility. Another study compared commercial search engines to determine whether they differ significantly in their coverage of commercial web sites [7].…”
Section: Related Work (citation type: mentioning; confidence: 99%)
“…Some crawlers are ethical in their behavior, while many are not. Because of the highly automated nature of robots, rules are needed to regulate crawling activity, both to manage server workload and to deny access to confidential or private information [3]. A file called robots.txt, placed at the root of the web site directory, specifies the site's Robots Exclusion Protocol rules.…”
Section: Introduction (citation type: mentioning; confidence: 99%)
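
For illustration, a minimal robots.txt of the kind such studies collect might look as follows; the robot name and paths are invented for the example.

# Hypothetical example: GoodBot may crawl everything except /private/;
# all other robots are also barred from the admin area.
User-agent: GoodBot
Disallow: /private/

User-agent: *
Disallow: /admin/
Disallow: /private/
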
“…More recently, a study that initially covered only 8,000 sites [10,11] has been extended in the BotSeer project and now covers 13.2 million sites [12]. The 38.5% robots.txt adoption rate found in the initial 8,000-site study is somewhat lower than our average of 45.1%, which might be explained by the study's date (October 2006) and by the fact that the study did not start with the most popular domains, which probably have a higher adoption rate.…”
Section: Related Work (citation type: mentioning; confidence: 99%)