In this section, we present an initial approach we took to detect a scraping BaaS targeting hundreds of e-commerce websites. For this purpose, we registered for the BaaS, attempted to scrape websites protected by DataDome, and studied the fingerprints to characterize the BaaS behavior.
We then located other requests with the same or very similar behavior and signature as our own BaaS requests to create detection rules. Compared to the scenario where each attacker builds their own scraper, having several attackers use the same BaaS means that their fingerprints (such as timing between requests, the structure of the header values used to represent a language, user agent string structure, and JavaScript fingerprints) are concentrated. This forms a cluster of fingerprints with similar but abnormal characteristics that is distinguishable from human traffic.
DataDome’s detection leverages an encrypted cookie that holds information about the user session, as well as multiple JavaScript (JS) signals. All of this enables us to do behavioral detection per session. In the context of the BaaS, JS is executed at the very first request on the homepage of the website, just before scraping.
For the first test, we used the default mode of the BaaS, where the DataDome cookie is handled by the BaaS. After scraping and collecting our own fingerprints, we observed that:
The graph below shows a sample of the requests blocked between March 14 and 19. However, this detection pattern is quite generic and may match more bots than the BaaS we tested.
For our second test, we wanted to detect the standard mode of the BaaS. We provided a DataDome cookie and observed the behavior of the session:
The figure below shows a sample of the requests blocked between March 14 and 19.
We observed that attackers choose standard mode ~10 times more often than the default mode. In standard mode, the most common settings chosen by attackers are the geolocation (the country the attacker wants the requests to originate from) and the global language (e.g. the attacker indicates the locale fr-fr and the BaaS translates this to a complete language header like fr-FR,fr;q=0.9).
However, we observed that the origin of the IP address and the language were inconsistent 95% of the time. This discrepancy likely comes from either a bot configuration mistake or a low-quality proxy provider. Nevertheless, this discrepancy allowed us to reinforce the detection.
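The locale expansion and the consistency check described above can be sketched as follows. This is a minimal illustration, not DataDome's actual rule: the `EXPECTED_LANGUAGES` table, the function names, and the choice to ignore unknown countries are all assumptions made for the example.

```python
# Hypothetical table: languages we would consider plausible for an IP country.
EXPECTED_LANGUAGES = {
    "FR": {"fr"},
    "DE": {"de"},
    "US": {"en", "es"},
}


def expand_locale(locale: str) -> str:
    """Turn a locale such as 'fr-fr' into a header such as 'fr-FR,fr;q=0.9'."""
    language, _, region = locale.partition("-")
    language = language.lower()
    if not region:
        return language
    return f"{language}-{region.upper()},{language};q=0.9"


def primary_language(accept_language: str) -> str:
    """Return the base language of the first entry of an Accept-Language value."""
    first_entry = accept_language.split(",")[0]
    tag = first_entry.split(";")[0].strip()
    return tag.split("-")[0].lower()


def is_consistent(ip_country: str, accept_language: str) -> bool:
    """True when the header language is plausible for the IP's country."""
    expected = EXPECTED_LANGUAGES.get(ip_country.upper())
    if expected is None:
        return True  # unknown country: do not flag (assumption of this sketch)
    return primary_language(accept_language) in expected
```

A request geolocated in the US that carries the header built from `fr-fr` would come out as inconsistent here. Such a mismatch is only a supporting signal, to be combined with the other fingerprints.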
In this section, we study the proxy IPs leveraged by the BaaS: the kind of IPs used (data center or residential), the distribution of IP ranges, the not-yet-seen IPs we can still tie to the proxy provider, and the overlap with IPs used by other BaaS providers.
First, a BaaS is not a “one size fits all” solution for all attacks. The provider dictates how each URL to scrape is requested. For instance, some providers prefer to rotate the IP address as soon as they are challenged by a CAPTCHA instead of solving the challenge. Even if the attacker sets a delay between two URLs, they have no control over how each one is submitted to the target. This often results in the same URL being requested by bots with the exact same signature but from 10+ distinct IP addresses, all within a one-minute time frame and without even loading the CAPTCHA challenge.
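The rotation pattern described above (same URL, same signature, 10+ distinct IPs within one minute) can be expressed as a sliding-window check. The sketch below is illustrative only: the tuple layout, the function name, and the two thresholds are assumptions drawn from the figures in the text, not a production rule.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60      # one-minute time frame mentioned in the text
MIN_DISTINCT_IPS = 10    # "10+ distinct IP addresses"


def find_rotating_bursts(requests):
    """Return the set of (url, signature) pairs hit from many IPs in one window.

    `requests` is an iterable of (timestamp_seconds, url, signature, ip)
    tuples, assumed to be sorted by timestamp.
    """
    windows = defaultdict(deque)  # (url, signature) -> deque of (timestamp, ip)
    flagged = set()
    for timestamp, url, signature, ip in requests:
        key = (url, signature)
        window = windows[key]
        window.append((timestamp, ip))
        # Drop entries that fell out of the sliding window.
        while window and timestamp - window[0][0] > WINDOW_SECONDS:
            window.popleft()
        distinct_ips = {seen_ip for _, seen_ip in window}
        if len(distinct_ips) >= MIN_DISTINCT_IPS:
            flagged.add(key)
    return flagged
```

Keying the window on the URL and signature pair rather than on the IP is the point of the design: IP rotation defeats per-IP counters but leaves this pair unchanged.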
We used these strong signals to collect a sample of roughly one million requests identified as scraping activity from the BaaS provider, and studied the proxy networks behind the IP addresses used to make the requests. The sample contains ~330,000 unique IP addresses, as the BaaS makes an average of three requests per IP before rotating.
For our testing, we analyzed the type and quality of proxies leveraged by the BaaS provider, then divided them into two main categories: data center proxies and residential proxies.
Based on this classification, we found that 90% of the IPs in the sample of BaaS traffic came from data center proxies. In other words, when the user isn’t paying extra, the BaaS routes bot requests through low-quality proxies. This allows us to reinforce the detection, as humans tend to visit websites from an ISP rather than from a server in a data center.
Next, we used the same sample to measure the distribution of IP ranges and determine whether the IPs are concentrated in the same block or spread over multiple blocks. In the former case, we can infer that an entire block belongs to a proxy provider.
When a proxy provider builds its network, it will buy or lease entire blocks of IP addresses. Also, to take care of its IP reputation, it will refresh its network by reselling part of its blocks to buy/lease others, or use only a subset at a time.
From a network point of view, a block of IP addresses consists of contiguous addresses. For example, the block of 256 IP addresses running from 1.2.3.0 to 1.2.3.255 is denoted 1.2.3.0/24 and called a “C-class subnet”. A proxy provider might buy/lease a bigger block of 2,048 IPs like 1.2.3.0 to 1.2.10.255, and split it into eight C-class subnets: 1.2.3.0/24, 1.2.4.0/24, 1.2.5.0/24, etc. This way, the provider can use one subnet, and rotate once the reputation becomes unsatisfactory.
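The split in the example above can be reproduced with Python's standard `ipaddress` module. This is a small sketch: the helper name `split_into_24s` is hypothetical, and it assumes the first address of the range sits on a /24 boundary.

```python
import ipaddress


def split_into_24s(first: str, last: str):
    """List the /24 subnets covering the contiguous range first..last.

    Assumes `first` is aligned on a /24 boundary (its last octet is 0).
    """
    start = int(ipaddress.IPv4Address(first))
    end = int(ipaddress.IPv4Address(last))
    subnets = []
    for base in range(start, end + 1, 256):  # one /24 every 256 addresses
        subnets.append(ipaddress.ip_network(f"{ipaddress.IPv4Address(base)}/24"))
    return subnets


subnets = split_into_24s("1.2.3.0", "1.2.10.255")
print(len(subnets))                            # number of /24 subnets
print(sum(s.num_addresses for s in subnets))   # total addresses covered
```

For the range used in the text, this yields eight subnets, from 1.2.3.0/24 to 1.2.10.0/24, covering 2,048 addresses in total.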
The split into C-class subnets (blocks of 256 IPs) is special. The provider could split its block further, but this is the atomic size in the Internet world. The Internet is a mesh of networks, each of them owning a set of blocks and communicating with its neighbors. To get a request from a user (the source) to the website visited (the destination), these networks forward the request from peer to peer. This implies that each peer knows which peer comes next. To do so, networks continuously announce which IP blocks they own and which ones they are able to forward. To keep these announcements efficient, the smallest block size announced is C-class, or 256 IPs.
Therefore, we can learn how proxy providers organize their networks by studying the distribution of IPs among C-class subnets. We grouped the IP addresses in the sample by C-class subnet: the one million requests fell into 16,528 such subnets.
The figure below shows the number of subnets in the sample by occupancy, meaning how many of a subnet's 256 available IPs were seen making requests. In approximately 10,000 subnets, requests came from only one or two IPs out of the 256 available. That represents 60% of the subnets, and indicates that proxy providers have large networks and spread their traffic thinly across many subnets to protect their IP reputation.
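The grouping and the occupancy count described above can be sketched with the standard `ipaddress` module. The function names are hypothetical, and the input is assumed to be a plain list of IPv4 strings taken from the request logs.

```python
import ipaddress
from collections import Counter, defaultdict


def subnet_occupancy(ips):
    """Map each /24 subnet to the number of distinct IPs observed in it."""
    seen = defaultdict(set)
    for ip in ips:
        # strict=False masks the host bits, giving the enclosing /24.
        network = ipaddress.ip_network(f"{ip}/24", strict=False)
        seen[network].add(ip)
    return {network: len(members) for network, members in seen.items()}


def occupancy_histogram(occupancy):
    """Count how many subnets have 1 observed IP, 2 observed IPs, and so on."""
    return Counter(occupancy.values())


def sparse_share(occupancy, max_ips=2):
    """Fraction of subnets in which at most `max_ips` distinct IPs were seen."""
    if not occupancy:
        return 0.0
    sparse = sum(1 for count in occupancy.values() if count <= max_ips)
    return sparse / len(occupancy)
```

Applied to the sample, `sparse_share` is the kind of computation that would produce a figure like the 60% quoted above (about 10,000 of 16,528 subnets).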
However, for data center IPs (90% of the sample), we learned something valuable: it’s safe to assume the entire C-class subnet belongs to the proxy provider. Hence, we can anticipate and challenge requests that will come from other unseen IPs in the subnet.
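Extending a detection from observed IPs to their whole subnet could look like the sketch below. The helper names are hypothetical, and the example assumes the input IPs were already classified as data center proxies by some other means.

```python
import ipaddress


def build_flagged_subnets(datacenter_ips):
    """Collect the /24 subnets of IPs already tied to the proxy provider."""
    return {
        ipaddress.ip_network(f"{ip}/24", strict=False)
        for ip in datacenter_ips
    }


def should_challenge(ip, flagged_subnets):
    """True when `ip`, even if never seen before, sits in a flagged /24."""
    return ipaddress.ip_network(f"{ip}/24", strict=False) in flagged_subnets


flagged = build_flagged_subnets(["1.2.3.17", "1.2.4.200"])
print(should_challenge("1.2.3.250", flagged))  # unseen IP, flagged subnet
print(should_challenge("9.9.9.9", flagged))    # unrelated subnet
```

The match leads to a challenge rather than an outright block, in line with the text: the subnet inference is an assumption that holds well for data center ranges but would be risky for residential ones.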
DataDome monitors BaaS providers’ activity so we can analyze the overlap of IPs from one proxy provider to another. From the sample of ~330,000 unique IPs seen during four days from a BaaS, we observed that 98% of the IPs are tied only to this proxy provider, with no overlap—making it quite unique among BaaS providers. The BaaS tested is also its own proxy provider: it builds the IP networks itself without relying on other network providers.
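The overlap measurement boils down to set arithmetic. The following sketch uses a hypothetical function name and toy inputs; it assumes each provider's observed IPs are available as a collection of strings.

```python
def exclusive_share(provider_ips, other_providers_ips):
    """Fraction of `provider_ips` not seen with any other provider.

    `other_providers_ips` is an iterable of IP collections, one per provider.
    """
    provider_ips = set(provider_ips)
    if not provider_ips:
        return 0.0
    others = set().union(*other_providers_ips)
    return len(provider_ips - others) / len(provider_ips)


share = exclusive_share(
    {"1.2.3.4", "1.2.3.5", "1.2.4.6", "5.6.7.8"},
    [{"5.6.7.8", "9.9.9.9"}, {"8.8.8.8"}],
)
print(share)  # 0.75: three of the four IPs are exclusive to this provider
```

A share close to 1.0, such as the 98% reported above, suggests a provider that runs its own proxy network rather than renting one shared with other services.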
While BaaS services ease the IP rotation for bots, making attacks scalable even for untrained attackers, they’re not the best solution to stay undetected. BaaS attacks use behaviors and fingerprints distinguishable from humans. In addition, through these fingerprints, proxy providers disclose their networks. By default, unless users pay more, BaaS providers route requests through low-quality data center proxies—making them even easier to notice and stop.
Want to see how BaaS attacks might be affecting your website, mobile app, or APIs? Try our BotTester tool for a look at simple bots, and try DataDome for free to get a more in-depth look at the automated traffic affecting your business.