Proceedings of the Web Conference 2020
DOI: 10.1145/3366423.3380104

The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing

Abstract: Large-scale Web crawls have emerged as the state of the art for studying characteristics of the Web. In particular, they are a core tool for online tracking research. Web crawling is an attractive approach to data collection, as crawls can be run at relatively low infrastructure cost and don't require handling sensitive user data such as browsing histories. However, the biases introduced by using crawls as a proxy for human browsing data have not been well studied. Crawls may fail to capture the diversity of u…

Cited by 23 publications (11 citation statements)
References 38 publications
“…Our evaluation focuses on the scope of validity of automated crawlers, real user behavior notwithstanding. To this extent, our comparison of baseline variation is similar to Zeber et al's [107]; we also reach the conclusion that automated crawler visiting sites at the same time receive similar content and thus can serve as a basis for comparison of other variables.…”
Section: Measurement Study Ecological Validity
Citation type: supporting
confidence: 81%
“…Several studies investigate the representativeness of automated crawls in comparison to real user behavior [52,107] and suggest that automated crawlers might over-approximate the amount of third-party trackers a real user would experience. This is a limitation inherent to automated crawlers that may be ameliorated by future work on informing crawler behavior with real user behavior.…”
Section: Measurement Study Ecological Validity
Citation type: mentioning
confidence: 99%
“…For retrieving Web site contents through Web crawls, Ahmad et al developed a framework to compare crawlers based on varying technologies, finding that the choice of crawler may significantly impact measurements [105]. Zeber et al compared crawlers with each other and with human user traffic, and found results to vary over time as well as across platforms [106]. We provide a similar assessment of domain classification services, as they can equally impact the results of research studies.…”
Section: Related Work
Citation type: mentioning
confidence: 96%
“…Past studies used automated crawls to observe browser fingerprinting at scale [22,1,8]. However, relying on bot crawls introduces bias in the collected data [14,11,34] as more and more websites use defenses to block bot access [33]. Automating the registration and payment processes is also a challenge because of the high variability that can be found in related forms [13].…”
Section: Web Page Acquisition
Citation type: mentioning
confidence: 99%