Proxies as a Service: How to Identify Proxy Providers via Bots as a Service

2024-06-26 02:08:38 Author: securityboulevard.com

Bots as a Service Detection

In this section, we present an initial approach we took to detect a scraping BaaS targeting hundreds of e-commerce websites. For this purpose, we registered for the BaaS, attempted to scrape websites protected by DataDome, and studied the fingerprints to characterize the BaaS behavior.

We then located other requests with the same, or very similar, behavior and signature as our own BaaS requests to create detection rules. Compared to the scenario where each attacker builds their own scraper, several attackers using the same BaaS means that the fingerprints (such as timing between requests, the structure of the header values used to represent a language, user agent string structure, JavaScript fingerprints, etc.) are concentrated. This forms a cluster of fingerprints with similar but abnormal characteristics that is distinguishable from human traffic.

DataDome’s detection leverages an encrypted cookie that holds information about the user session, as well as multiple JavaScript (JS) signals. All of this enables us to do behavioral detection per session. In the context of the BaaS, JS is executed at the very first request on the homepage of the website, just before scraping.

Test 1: Default Mode

For the first test, we used the default mode of the BaaS, where the DataDome cookie is handled by the BaaS. After scraping and collecting our own fingerprints, we observed that:

The BaaS sends only the DataDome cookie and no other cookies related to the targeted website.
The URL is scraped at the very first request. Contrary to certain BaaS that make several requests before making a request to the targeted URL (to appear more human), this BaaS directly targets the requested URL.
The BaaS initiates a fresh DataDome session for each URL.
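
The three observations above can be combined into a single rule. The following is a minimal sketch, not DataDome's actual implementation: the `Request` structure, the `session_request_count` field, the cookie name, and the list of entry paths are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

DATADOME_COOKIE = "datadome"  # assumed cookie name, used for illustration only


@dataclass
class Request:
    """Hypothetical, simplified view of an incoming request."""
    path: str
    cookies: dict = field(default_factory=dict)
    session_request_count: int = 1  # 1 means first request of the session


def matches_default_mode(req: Request, entry_paths=("/",)) -> bool:
    """Return True when a request shows all three default-mode traits."""
    # Trait 1: only the DataDome cookie is sent, no site cookies.
    only_dd_cookie = set(req.cookies) == {DATADOME_COOKIE}
    # Trait 2: the session is brand new for this URL.
    fresh_session = req.session_request_count == 1
    # Trait 3: the first request goes straight to a deep URL.
    deep_first_hit = req.path not in entry_paths
    return only_dd_cookie and fresh_session and deep_first_hit
```

As the article notes for this pattern, such a rule is generic and may also match bots other than the BaaS under study.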

The graph below shows a sample of the requests blocked between March 14 and 19. However, this detection pattern is quite generic and may match more bots than the BaaS we tested.

[Figure: Default Mode_BaaS]

Test 2: Standard Mode

For our second test, we wanted to detect the standard mode of the BaaS. We provided a DataDome cookie and observed the behavior of the session:

The BaaS only forwards the DataDome cookie provided by the bot.
The DataDome cookie travels among several IPs.
The DataDome cookie indicates that the session already made requests that were blocked.
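
A per-session check along these lines might look as follows. This is a sketch under stated assumptions: the `SessionState` structure, the cookie name, and the `min_ips` threshold are hypothetical and not taken from the article.

```python
from dataclasses import dataclass, field


@dataclass
class SessionState:
    """Hypothetical state decoded from, or tracked for, one session cookie."""
    ips: set = field(default_factory=set)  # distinct IPs that presented the cookie
    blocked_count: int = 0                 # earlier requests blocked in this session


def matches_standard_mode(cookie_names, session: SessionState, min_ips: int = 3) -> bool:
    """Return True when a request shows the three standard-mode traits."""
    # Only the provided DataDome cookie is forwarded (assumed name "datadome").
    only_dd_cookie = set(cookie_names) == {"datadome"}
    # The same cookie has been presented from several IPs.
    travels = len(session.ips) >= min_ips
    # The session already had blocked requests.
    previously_blocked = session.blocked_count > 0
    return only_dd_cookie and travels and previously_blocked
```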

The figure below shows a sample of the requests blocked between March 14 and 19.

[Figure: Standard Mode_BaaS]

We observed that attackers choose standard mode ~10 times more often than default mode. In standard mode, the most common options configured by attackers are the geolocation (the country the requests should originate from) and the global language (e.g., the attacker indicates the locale fr-fr and the BaaS translates this into a complete language header like fr-FR,fr;q=0.9).

However, we observed that the origin of the IP address and the language were inconsistent 95% of the time. This discrepancy likely comes from either a bot configuration mistake or a low-quality proxy provider. Nevertheless, this discrepancy allowed us to reinforce the detection.
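
One way to express this consistency check is to compare the region of the preferred language in the header with the country of the IP address. The sketch below assumes the IP country is already available as a two-letter code from some geolocation lookup (not shown); the function names are hypothetical.

```python
def primary_language_region(accept_language: str):
    """Return the region subtag (e.g. 'FR') of the highest-weighted tag, or None."""
    best_tag, best_q = None, -1.0
    for part in accept_language.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        q = 1.0  # a tag without an explicit weight has q=1
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0
        if q > best_q:
            best_tag, best_q = tag.strip(), q
    if not best_tag:
        return None
    # Pick the first two-letter alphabetic subtag after the language itself.
    for sub in best_tag.split("-")[1:]:
        if len(sub) == 2 and sub.isalpha():
            return sub.upper()
    return None


def language_ip_mismatch(accept_language: str, ip_country: str) -> bool:
    """True only when the header names a region and it differs from the IP country."""
    region = primary_language_region(accept_language)
    return region is not None and region != ip_country.upper()


print(language_ip_mismatch("fr-FR,fr;q=0.9", "US"))  # True
print(language_ip_mismatch("fr-FR,fr;q=0.9", "FR"))  # False
```

A mismatch alone is a weak signal, since legitimate users travel or browse in a foreign locale; here it is only used to reinforce the other detection traits.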

Learning About Proxy Networks

In this section, we study the proxy IPs leveraged by the BaaS—the kind of IPs used (data center or residential), the distribution of IP ranges, the not-yet-seen IPs we can still tie to the proxy provider, and the overlap of IPs used by other BaaS.

First, a BaaS is not a one-size-fits-all solution for every attack: the provider dictates how each URL is requested. For instance, some providers rotate the IP address as soon as they are challenged by a CAPTCHA instead of solving the challenge. Even if the attacker adds a delay between two URLs, they have no control over how each one is submitted to the target. This often results in the same URL being requested by bots with the exact same signature from 10+ distinct IP addresses within a one-minute time frame, without the CAPTCHA challenge ever being loaded.
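
The rotation pattern described above (same URL, same signature, 10+ distinct IPs within one minute) can be sketched as a sliding-window counter. The class name, the notion of a `signature` string, and the assumption that requests arrive in timestamp order are all illustrative choices, not details from the article.

```python
from collections import defaultdict, deque


class RotationDetector:
    """Flags a (signature, url) pair seen from many distinct IPs in a short window."""

    def __init__(self, window_seconds: float = 60, min_distinct_ips: int = 10):
        self.window = window_seconds
        self.min_ips = min_distinct_ips
        # (signature, url) -> deque of (timestamp, ip), oldest first
        self._events = defaultdict(deque)

    def observe(self, timestamp: float, signature: str, url: str, ip: str) -> bool:
        """Record one request; return True if the rotation threshold is reached.

        Assumes timestamps are non-decreasing for a given (signature, url).
        """
        events = self._events[(signature, url)]
        events.append((timestamp, ip))
        # Drop events that fell out of the window.
        while events and events[0][0] <= timestamp - self.window:
            events.popleft()
        return len({seen_ip for _, seen_ip in events}) >= self.min_ips


detector = RotationDetector()
flags = [detector.observe(i * 5, "sig-A", "/product/42", f"10.0.0.{i}") for i in range(10)]
print(flags[-1])  # True: ten distinct IPs within 60 seconds
```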

We used these strong signals to collect a sample of roughly one million requests identified as a scraping activity from the BaaS provider, and studied the proxy networks behind the IP addresses used to make the requests. The sample contains ~330,000 unique IP addresses, as the BaaS makes an average of three requests per IP before rotating.

Proxy Type

For our testing, we analyzed the type and quality of proxies leveraged by the BaaS provider, then divided them into two main categories:

Data Center Proxy: IP address originates from hosting services like OVH, GCP, AWS, etc. This type of proxy usually offers a large pool of cheaper IP addresses.
Residential Proxy: IP address originates from an internet service provider (ISP) like AT&T, Orange, Deutsche Telekom, etc. This type of proxy allows bots to be better disguised as humans, but they are more expensive and the available pool is smaller.

Based on this classification, we found that 90% of the IPs in the sample of BaaS traffic came from data center proxies. In other words, when the user isn’t paying extra, the BaaS routes bot requests through low-quality proxies. This allows us to reinforce the detection, as humans tend to visit websites from an ISP rather than from a server in a data center.

Subnet Reuse

Next, we used the same sample to measure the distribution of IP ranges, to determine whether IPs are concentrated in the same block or spread over multiple blocks. In the former case, we can infer that the entire block belongs to a proxy provider.

When a proxy provider builds its network, it buys or leases entire blocks of IP addresses. To protect its IP reputation, it also refreshes its network by reselling some of its blocks and buying/leasing others, or by using only a subset at a time.

From a network point of view, a block of IP addresses consists of contiguous addresses. For example, the block of 256 IP addresses running from 1.2.3.0 to 1.2.3.255 is denoted 1.2.3.0/24 and is informally called a “C-class subnet” (a name inherited from classful addressing and still used for any /24). A proxy provider might buy/lease a bigger range of 2,048 IPs, like 1.2.3.0 to 1.2.10.255, and split it into eight C-class subnets: 1.2.3.0/24, 1.2.4.0/24, 1.2.5.0/24, etc. This way, the provider can use one subnet and rotate to another once its reputation becomes unsatisfactory.
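
The arithmetic in this example can be checked with Python's standard `ipaddress` module. This is only an illustration of the numbers quoted above; the variable names are arbitrary.

```python
import ipaddress

# A single /24 block holds 256 contiguous addresses.
block = ipaddress.ip_network("1.2.3.0/24")
print(block.num_addresses)   # 256
print(block[0], block[-1])   # 1.2.3.0 1.2.3.255

# The larger range from the example: 1.2.3.0 to 1.2.10.255.
first = ipaddress.ip_address("1.2.3.0")
last = ipaddress.ip_address("1.2.10.255")
total = int(last) - int(first) + 1
print(total)                 # 2048

# The range is contiguous but not aligned on a single CIDR prefix,
# so it is summarized by several networks of different sizes.
cidrs = list(ipaddress.summarize_address_range(first, last))
print([str(net) for net in cidrs])
# ['1.2.3.0/24', '1.2.4.0/22', '1.2.8.0/23', '1.2.10.0/24']

# Splitting every summarized network into /24s yields the eight subnets.
slash24s = [sub for net in cidrs for sub in net.subnets(new_prefix=24)]
print(len(slash24s))         # 8
```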

The split into C-class subnets (blocks of 256 IPs) is special: the provider could split its block further, but a /24 is in practice the smallest unit routed on the public Internet. The Internet is a mesh of networks, each of them owning a set of blocks and communicating with its neighbors. To get a request from a user (the source) to the website visited (the destination), networks forward it from peer to peer. This implies that each peer knows who is next. To do so, networks continuously announce which IP blocks they own and which ones they are able to forward. To keep these announcements efficient, the smallest block size generally announced is C-class, 256 IPs.

Therefore, we can learn how proxy providers organize their networks by studying the distribution of IPs among C-class subnets. From the sample, we grouped the IP addresses into their C-class subnets: the one million requests mapped to 16,528 such subnets.

The figure below counts the number of subnets in the sample according to how many of their IPs were observed. Approximately 10,000 subnets were seen with requests from only one or two IPs out of the 256 available. That represents 60% of the subnets, and indicates that proxy providers have large networks and spread their traffic across many subnets to protect their IP reputation.

[Figure: Distinct IP Subnets_BaaS]
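
The grouping behind this distribution can be sketched in a few lines. The code assumes IPv4 addresses given as strings; the function names and the tiny sample are hypothetical and do not reproduce the article's data.

```python
import ipaddress
from collections import Counter, defaultdict


def subnet_occupancy(ips):
    """Map each /24 subnet to the number of distinct IPs observed in it."""
    per_subnet = defaultdict(set)
    for ip in ips:
        # strict=False masks the host bits, e.g. 1.2.3.77 -> 1.2.3.0/24
        subnet = ipaddress.ip_network(f"{ip}/24", strict=False)
        per_subnet[subnet].add(ip)
    return {subnet: len(members) for subnet, members in per_subnet.items()}


def occupancy_histogram(occupancy):
    """Count how many subnets have 1 observed IP, 2 observed IPs, and so on."""
    return Counter(occupancy.values())


def sparse_share(occupancy, max_ips: int = 2) -> float:
    """Fraction of subnets in which at most `max_ips` distinct IPs were seen."""
    if not occupancy:
        return 0.0
    sparse = sum(1 for count in occupancy.values() if count <= max_ips)
    return sparse / len(occupancy)


sample = ["1.2.3.4", "1.2.3.9", "1.2.3.4", "5.6.7.8"]
occupancy = subnet_occupancy(sample)
print(occupancy_histogram(occupancy))  # one subnet with 2 IPs, one with 1 IP
```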

However, for data center IPs (90% of the sample), we learned something valuable: it’s safe to assume the entire C-class subnet belongs to the proxy provider. Hence, we can anticipate and challenge requests that will come from other unseen IPs in the subnet.
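
Anticipating unseen IPs then amounts to a subnet membership test. The sketch below assumes a list of data center IPs already attributed to the provider; the helper names are hypothetical, and whether to challenge or block is a policy choice outside this example.

```python
import ipaddress


def build_flagged_subnets(datacenter_ips):
    """Collect the /24 subnets of data center IPs already tied to the provider."""
    return {ipaddress.ip_network(f"{ip}/24", strict=False) for ip in datacenter_ips}


def should_challenge(ip: str, flagged_subnets) -> bool:
    """True if the IP falls in a flagged /24, even if this exact IP was never seen."""
    return ipaddress.ip_network(f"{ip}/24", strict=False) in flagged_subnets


flagged = build_flagged_subnets(["1.2.3.4", "1.2.3.200"])
print(should_challenge("1.2.3.77", flagged))  # True: unseen IP, known subnet
print(should_challenge("1.2.4.77", flagged))  # False: neighboring subnet
```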

Overlap Between BaaS

DataDome monitors BaaS providers’ activity so we can analyze the overlap of IPs from one proxy provider to another. From the sample of ~330,000 unique IPs seen during four days from a BaaS, we observed that 98% of the IPs are tied only to this proxy provider, with no overlap—making it quite unique among BaaS providers. The BaaS tested is also its own proxy provider: it builds the IP networks itself without relying on other network providers.

[Figure: IPs Associated with Proxy Providers]
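
The overlap measurement reduces to set arithmetic on the IPs observed per provider. The following sketch uses made-up provider names and addresses purely for illustration.

```python
def exclusive_share(provider_ips, other_providers) -> float:
    """Fraction of a provider's IPs that were not seen with any other provider.

    `other_providers` maps a (hypothetical) provider name to its observed IPs.
    """
    provider_ips = set(provider_ips)
    if not provider_ips:
        return 0.0
    seen_elsewhere = set().union(*other_providers.values())
    return len(provider_ips - seen_elsewhere) / len(provider_ips)


share = exclusive_share(
    {"1.2.3.4", "1.2.3.5", "1.2.4.6", "1.2.5.7"},
    {"other-baas-1": {"1.2.3.4"}, "other-baas-2": {"8.8.4.4"}},
)
print(share)  # 0.75
```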

Conclusion

While BaaS services ease the IP rotation for bots, making attacks scalable even for untrained attackers, they’re not the best solution to stay undetected. BaaS attacks use behaviors and fingerprints distinguishable from humans. In addition, through these fingerprints, proxy providers disclose their networks. By default, unless users pay more, BaaS providers route requests through low-quality data center proxies—making them even easier to notice and stop.

Want to see how BaaS attacks might be affecting your website, mobile app, or APIs? Try our BotTester tool for a look at simple bots, and try DataDome for free to get a more in-depth look at the automated traffic affecting your business.

Source: https://securityboulevard.com/2024/06/proxies-as-a-service-how-to-identify-proxy-providers-via-bots-as-a-service/