The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing

Zeber, David; Bird, Sarah; Oliveira, C.J.S.; Rudametkin, Walter; Segall, Ilana; Wollsén, Fredrik; Lopatka, Martin

doi:10.1145/3366423.3380104

Cited by 23 publications

(11 citation statements)

References 38 publications


“…Our evaluation focuses on the scope of validity of automated crawlers, real user behavior notwithstanding. To this extent, our comparison of baseline variation is similar to Zeber et al's [107]; we also reach the conclusion that automated crawler visiting sites at the same time receive similar content and thus can serve as a basis for comparison of other variables.…”

Section: Measurement Study Ecological Validity (supporting)

confidence: 81%

“…Several studies investigate the representativeness of automated crawls in comparison to real user behavior [52,107] and suggest that automated crawlers might over-approximate the amount of third-party trackers a real user would experience. This is a limitation inherent to automated crawlers that may be ameliorated by future work on informing crawler behavior with real user behavior.…”

Section: Measurement Study Ecological Validity (mentioning)

confidence: 99%


OmniCrawl: Comprehensive Measurement of Web Tracking With Real Desktop and Mobile Browsers

Cassel, Lin, Buraggina, et al. 2021

Proceedings on Privacy Enhancing Technologies


Over half of all visits to websites now take place in a mobile browser, yet the majority of web privacy studies take the vantage point of desktop browsers, use emulated mobile browsers, or focus on just a single mobile browser instead. In this paper, we present a comprehensive web-tracking measurement study on mobile browsers and privacy-focused mobile browsers. Our study leverages a new web measurement infrastructure, OmniCrawl, which we develop to drive browsers on desktop computers and smartphones located on two continents. We capture web tracking measurements using 42 different non-emulated browsers simultaneously. We find that the third-party advertising and tracking ecosystem of mobile browsers is more similar to that of desktop browsers than previous findings suggested. We study privacy-focused browsers and find their protections differ significantly and in general are less for lower-ranked sites. Our findings also show that common methodological choices made by web measurement studies, such as the use of emulated mobile browsers and Selenium, can lead to website behavior that deviates from what actual users experience.


“…For retrieving Web site contents through Web crawls, Ahmad et al developed a framework to compare crawlers based on varying technologies, finding that the choice of crawler may significantly impact measurements [105]. Zeber et al compared crawlers with each other and with human user traffic, and found results to vary over time as well as across platforms [106]. We provide a similar assessment of domain classification services, as they can equally impact the results of research studies.…”

Section: Related Work (mentioning)

confidence: 96%

Mis-shapes, Mistakes, Misfits

Vallina, Pochat, Feal, et al. 2020

Proceedings of the ACM Internet Measurement Conference


Domain classification services have applications in multiple areas, including cybersecurity, content blocking, and targeted advertising. Yet, these services are often a black box in terms of their methodology to classifying domains, which makes it difficult to assess their strengths, aptness for specific applications, and limitations. In this work, we perform a large-scale analysis of 13 popular domain classification services on more than 4.4M hostnames. Our study empirically explores their methodologies, scalability limitations, label constellations, and their suitability to academic research as well as other practical applications such as content filtering. We find that the coverage varies enormously across providers, ranging from over 90% to below 1%. All services deviate from their documented taxonomy, hampering sound usage for research. Further, labels are highly inconsistent across providers, who show little agreement over domains, making it difficult to compare or combine these services. We also show how the dynamics of crowd-sourced efforts may be obstructed by scalability and coverage aspects as well as subjective disagreements among human labelers. Finally, through case studies, we showcase that most services are not fit for detecting specialized content for research or content-blocking purposes. We conclude with actionable recommendations on their usage based on our empirical insights and experience. Particularly, we focus on how users should handle the significant disparities observed across services both in technical solutions and in research.

CCS Concepts: Networks → Network measurement; Information systems → Clustering and classification; Web applications; Web searching and information discovery.


“…Past studies used automated crawls to observe browser fingerprinting at scale [22,1,8]. However, relying on bot crawls introduces bias in the collected data [14,11,34] as more and more websites use defenses to block bot access [33]. Automating the registration and payment processes is also a challenge because of the high variability that can be found in related forms [13].…”

Section: Web Page Acquisition (mentioning)

confidence: 99%

FP-Redemption: Studying Browser Fingerprinting Adoption for the Sake of Web Security

Durey, Laperdrix, Rudametkin, et al. 2021

Lecture Notes in Computer Science

Self Cite


Browser fingerprinting has established itself as a stateless technique to identify users on the Web. In particular, it is a highly criticized technique to track users. However, we believe that this identification technique can serve more virtuous purposes, such as bot detection or multi-factor authentication. In this paper, we explore the adoption of browser fingerprinting for security-oriented purposes. More specifically, we study 4 types of web pages that require security mechanisms to process user data: sign-up, sign-in, basket and payment pages. We visited 1,485 pages on 446 domains and we identified the acquisition of browser fingerprints from 405 pages. By using an existing classification technique, we identified 169 distinct browser fingerprinting scripts included in these pages. By investigating the origins of the browser fingerprinting scripts, we identified 12 security-oriented organizations who collect browser fingerprints on sign-up, sign-in, and payment pages. Finally, we assess the effectiveness of browser fingerprinting against two potential attacks, namely stolen credentials and cookie hijacking. We observe browser fingerprinting being successfully used to enhance web security.
