An ongoing battle has been running for more than a decade between e-commerce websites owners and web scrapers. Whenever one party finds a new technique to prevail, the other one comes up with a solution to defeat it. Based on our industrial experience, we know this problem is far from being solved. New solutions are needed to address automated threats. In this work, we will describe the actors taking part in the battle, the weapons at their disposal, and their allies on either side. We will present a real-world setup to explain how e-commerce websites operators try to defend themselves and the open problems they seek solutions for.
Airline websites are the victims of unauthorised online travel agencies and aggregators that use armies of bots to scrape prices and flight information. These so-called Advanced Persistent Bots (APBs) are highly sophisticated. On top of the valuable information taken away, these huge quantities of requests consume a very substantial amount of resources on the airlines' websites. In this work, we propose a deceptive approach to counter scraping bots. We present a platform capable of mimicking airlines' sites changing prices at will. We provide results on the case studies we performed with it. We have lured bots for almost 2 months, fed them with indistinguishable inaccurate information. Studying the collected requests, we have found behavioural patterns that could be used as complementary bot detection. Moreover, based on the gathered empirical pieces of evidence, we propose a method to investigate the claim commonly made that proxy services used by web scraping bots have millions of residential IPs at their disposal. Our mathematical models indicate that the amount of IPs is likely 2 to 3 orders of magnitude smaller than the one claimed. This finding suggests that an IP reputation-based blocking strategy could be effective, contrary to what operators of these websites think today.
Web scraping bots are now using so-called Residential ip Proxy (resip) services to defeat state-of-the-art commercial bot countermeasures. resip providers promise their customers to give them access to tens of millions of residential ip addresses, which belong to legitimate users. They dramatically complicate the task of the existing anti-bot solutions and give the upper hand to the malicious actors. New specific detection methods are needed to identify and stop scrapers from taking advantage of these parties. This work, thanks to a 4 months-long experiment, validates the feasibility, soundness, and practicality of a detection method based on network measurements. This technique enables contacted servers to identify whether an incoming request comes directly from a client device or if it has been proxied through another device.
Identifying an attacker is a key factor to mitigate ongoing attacks. To evade localization, a single compromised machine can hide for months behind millions of available residential IP proxies. Without knowing the IP address of the machine, registration-based geolocation methods cannot be applied.Measurement-based methods have been proposed to estimate the location of a target without using its IP address. These methods use Round Trip Time (RTT) values and network speed modeling. They estimate a distance between the target and other observation points with known locations, called landmarks. However, most of these methods require additional information, whether it is on the topology of the network or the characteristics of the landmarks.In this paper, we present ImMuNE, a measurement-based technique which can estimate a location with only a few Round Trip Time measurements between a target and landmarks, even when some of these measures are inflated by temporary network congestion.Leveraging a previously made measurement campaign, we present promising results based on 11 millions TCP connections collected over a period of 4 months.
No abstract
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.