Proceedings of the 16th International Conference on World Wide Web (WWW 2007)
DOI: 10.1145/1242572.1242726

A large-scale study of robots.txt

Abstract: Search engines rely largely on Web robots to collect information from the Web. Due to the unregulated, open-access nature of the Web, robot activities are extremely diverse. Such crawling activity can be regulated from the server side by deploying the Robots Exclusion Protocol in a file called robots.txt. Although it is not an enforced standard, ethical robots (including many commercial ones) follow the rules specified in robots.txt. With our focused crawler, we investigate 7,593 websites from education, governm…
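
As a concrete illustration of how a compliant crawler consults these rules, the sketch below uses Python's standard urllib.robotparser module. The robots.txt content, robot name, and URLs are hypothetical examples, not drawn from the study.

from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it
# from http://example.com/robots.txt before crawling the site.
rules = """
User-agent: *
Disallow: /private/
Disallow: /tmp/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# An ethical robot checks each URL against the rules before fetching.
print(parser.can_fetch("ExampleBot", "http://example.com/index.html"))  # True
print(parser.can_fetch("ExampleBot", "http://example.com/private/a"))   # False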

Cited by 33 publications (33 citation statements), published 2008–2018. References 1 publication.

“…In [4,5], the bias was measured by counting the number of directories disallowed, i.e., the crawler with the highest (lowest) such count was regarded as having the most unfavorable (favorable) bias. The drawback of this approach is that the number of directories disallowed may not correlate well with the amount of content or the number of URLs disallowed.…”
Section: Experiments and Results (citation type: mentioning; confidence: 99%)
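
To make that directory-counting measure concrete, here is a simplified sketch, assuming each robots.txt has already been fetched as text. It ignores Allow directives, wildcards, and other extensions, so it illustrates the metric rather than providing a full parser.

from collections import defaultdict

def disallow_counts(robots_txt: str) -> dict:
    """Count Disallow directives per user-agent: the crude bias
    proxy described above (more disallows = less favorable)."""
    counts = defaultdict(int)
    agents = []        # user-agents of the record being read
    in_rules = False   # True once the record's rule lines begin
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_rules:                      # a new record starts
                agents, in_rules = [], False
            agents.append(value)
        elif field == "disallow":
            in_rules = True
            if value:                         # an empty Disallow allows everything
                for agent in agents:
                    counts[agent] += 1
    return dict(counts)

print(disallow_counts("""
User-agent: googlebot
User-agent: msnbot
Disallow: /private/
Disallow: /cgi-bin/

User-agent: *
Disallow: /private/
"""))
# {'googlebot': 2, 'msnbot': 2, '*': 1}
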
“…Despite the importance of this protocol to both content providers and search engines, the first reasonably large-scale study of its usage was conducted only recently, in 2007 [4,5]. The study covered 2,925 distinct robots.txt files from 7,593 sites.…”
Section: Introduction (citation type: mentioning; confidence: 99%)
“…But it cannot identify search-engine visits, because Google Analytics tracks users by means of JavaScript, and search-engine crawlers do not execute the JavaScript embedded in web pages when they visit a site. A large-scale study of robots.txt was conducted by Sun et al. [3], and the ethics of web crawlers was studied by Giles [1]. Schwenke et al. [6] studied the relationship between JavaScript usage and web site visibility, examining whether JavaScript-based hyperlinks attract or repel crawlers and thereby increase or decrease a site's visibility. Another study compared commercial search engines to determine whether they differ significantly in their coverage of commercial web sites [7].…”
Section: Related Work (citation type: mentioning; confidence: 99%)
“…Some crawlers are ethical in their behavior, while many are not. Because of the highly automated nature of robots, rules are needed to regulate crawling activity, both to manage server workload and to deny access to confidential or private information [3]. A file called robots.txt, placed at the root of the web site directory, specifies the site's Robots Exclusion Protocol rules.…”
Section: Introduction (citation type: mentioning; confidence: 99%)
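
For illustration, a minimal robots.txt of the kind such studies collect might look as follows; the robot name and paths are invented for the example.

# Hypothetical example: GoodBot may crawl everything except /private/;
# all other robots are also barred from the admin area.
User-agent: GoodBot
Disallow: /private/

User-agent: *
Disallow: /admin/
Disallow: /private/
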
“…More recently, a study that initially covered only 8,000 sites [10,11] has been extended in the BotSeer project and now covers 13.2 million sites [12]. The 38.5% robots.txt adoption rate found in the initial 8,000-site study is somewhat lower than our average of 45.1%, which might be explained by the study's date (October 2006) and by the fact that the study did not start with the most popular domains, which probably have a higher adoption rate.…”
Section: Related Work (citation type: mentioning; confidence: 99%)