Disclaimer: This research was conducted strictly independently of my employer (which was excluded from scope). All opinions and views in this article are my own. When citing, please call me an Independent Security Researcher.
Modern technologies like the cloud have made it easier than ever to rapidly develop scalable software. What took thousands of dollars in investment is now accessible through a free trial. We've optimized infrastructure-as-a-service (IaaS) providers to reduce friction to entry, but what happens when security clashes with productivity? What risks have we introduced for the sake of convenience?
For the last three years, I've investigated how the insecure defaults built into cloud services have led to widespread & systemic weaknesses in tens of thousands of organizations, including some of the world's largest like Samsung, CrowdStrike, NVIDIA, HP, Google, Amazon, the NY Times, and more! I focused on two categories: dangling DNS records and hardcoded secrets. The former is well-traversed and will serve as a good introduction to finding bugs using unconventional data sources. Work towards the latter, however, has often been limited in scope & diversity.
Unfortunately, both vulnerability classes run rampant in production cloud environments. Dangling DNS records occur when a website has a DNS record that points at a cloud host that is no longer in its control. This project applied a variant of historic approaches to discover 66,000+ unique apex domains that still host dangling records. Leveraging a similar "big data" approach for hardcoded secrets revealed 15,000+ unique, verified secrets for various API services.
While we will review findings later, the key idea is simple: cloud providers are not doing enough to protect customers against misconfigurations they incentivize. These vulnerabilities are created by the customer, but how platforms are designed directly controls whether such issues can exist at all.
Instead of taking accountability and enforcing secure defaults, most providers assume that a few documentation warnings, which most customers will never read, are sufficient to mitigate their liability. This research demonstrates that this is far from enough and shows the compounding risk of abuse from hardcoded secrets.
Cloud computing lets you create infrastructure on demand, including servers, websites, storage, and so on. Under the hood, it's just an abstracted version of what many internet-facing businesses had to do themselves a few decades ago. What makes it work is high margins and, more importantly, the fact that everything is shared.
A cloud resource is dangling if it is deallocated from your environment while still referenced by a DNS record. For example, let's say I create an AWS EC2 instance and an A record, project1.example.com, pointing at its public IP. I use the subdomain for some project and, a few months later, deallocate the EC2 instance while cleaning up old servers I'm not using.
Remember, you paid to borrow someone else's infrastructure. Once you're done, the IP address assigned to your EC2 instance simply goes back into the shared pool for use by any other customer. Since DNS records are not bound to their targets by default, unless you remember that project1.example.com needs to be deleted, the moment the public IP is released is the moment the record is dangling. These DNS records are problematic because if an attacker can capture the IP you released, they can now host anything at project1.example.com.
A/AAAA records are not the only type at risk; they are relevant when your cloud resource is assigned a dedicated IP address, but not all managed cloud services involve one. For example, when using a managed storage service like AWS S3 or Google Cloud (GCP) Storage, you're assigned a dedicated hostname like example.s3.amazonaws.com. Under the hood, these hostnames point at IP addresses shared across many customers.
DNS record types like CNAME, which accept hostnames, can similarly become dangling if the hostname is released (e.g., you delete a bucket, allowing an attacker to recreate it with the same name). Unlike dedicated endpoints, shared endpoints are easier to guard against attacks because the provider can prohibit the registration of a deallocated identifier. Reserving IP addresses, however, is far less feasible.
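To make the failure mode concrete, here is a minimal sketch (my own illustration, not tooling from this research) that checks whether the A records you manage still point at IPs your AWS account actually owns, using dnspython and boto3; the record list and region are placeholders.

import boto3
import dns.resolver  # pip install dnspython

# Hypothetical records we created for cloud resources at some point.
MANAGED_RECORDS = ["project1.example.com"]

ec2 = boto3.client("ec2", region_name="us-east-1")

# Collect every public IP currently allocated to this account (Elastic IPs + instances).
owned_ips = {address["PublicIp"] for address in ec2.describe_addresses()["Addresses"]}
for reservation in ec2.describe_instances()["Reservations"]:
    for instance in reservation["Instances"]:
        if "PublicIpAddress" in instance:
            owned_ips.add(instance["PublicIpAddress"])

for record in MANAGED_RECORDS:
    for answer in dns.resolver.resolve(record, "A"):
        if answer.to_text() not in owned_ips:
            # The record still resolves, but the target is no longer ours: dangling.
            print(f"{record} -> {answer} is potentially dangling")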
Why should you care about dangling DNS records? Unfortunately, if an attacker can control a trusted subdomain, there is a substantial risk of abuse. For example, if example.com does not restrict access to session cookies from subdomains, an attacker may be able to execute malicious JavaScript to impersonate a logged-in user.
According to RIPE's article, Dangling Resource Abuse on Cloud Platforms:
The main abuse (75%) of hijacked, dangling resources is to generate traffic to adversarial services. The attackers target domains with established reputation and exploit that reputation to increase the ranking of their malicious content by search engines and as a result to generate page impressions to the content they control. The content is mostly gambling and other adult content.
...
The other categories of abuse included malware distribution, cookie theft and fraudulent certificates. Overall, we find that the hacking groups successfully attacked domains in 31% of the Fortune 500 companies and 25.4% of the Global 500 companies, some over long periods of time.
Dangling DNS records are most commonly exploited en masse, but targeted attacks still exist. Fortunately, to achieve a high impact beyond trivial search engine optimization, an attacker would need to investigate your organization's relationship with the domain they've compromised. Unfortunately, while trivial abuse like search engine optimization matters less in isolated incidents, it becomes a major problem when scaled.
Cloud environments, including any API service, are managed over the Internet. Even if you use dedicated resources where possible, you're still forced to manage them through a shared gateway. How do we secure this access?
Since the inception of cloud services, one of the most common methods of authentication is using a 16 to 256 character secret key. For example, until late 2022, the recommended authentication scheme for AWS programmatic access was an access and secret key pair.
[example]
aws_access_key_id = AKIAIBJGN829BTALSORQ
aws_secret_access_key = dGhlcmUgYXJlIGltcG9zdGVycyBhbW9uZyB1cw
While AWS today warns against long-lived secret keys, you still need them to issue short-lived session tokens when running outside of an AWS instance. Also, it's still the path of least resistance.
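For instance, here is a minimal boto3 sketch (reusing the placeholder credentials from the example above) of trading a long-lived key pair for a short-lived session token via STS, the pattern AWS now recommends over passing long-lived keys around directly.

import boto3

# Long-lived credentials (placeholders) are used only to mint short-lived ones.
sts = boto3.client(
    "sts",
    aws_access_key_id="AKIAIBJGN829BTALSORQ",
    aws_secret_access_key="dGhlcmUgYXJlIGltcG9zdGVycyBhbW9uZyB1cw",
)

# Returns a temporary AccessKeyId/SecretAccessKey/SessionToken valid for one hour.
session = sts.get_session_token(DurationSeconds=3600)["Credentials"]
print(session["AccessKeyId"], session["Expiration"])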
An alternative example is Google Cloud (GCP).
While GCP also has API keys, these are only supported for benign services like Google Maps because they do not "identify a principal" (i.e., a user). The closest thing they have to a secret you can hardcode, like AWS access/secret keys, is the service account JSON file.
{
"type": "service_account",
"project_id": "project-id-REDACTED",
"private_key_id": "0dfREDACTED",
"private_key": "-----BEGIN PRIVATE KEY-----\nMIIEvwREDACTED\n-----END PRIVATE KEY-----\n",
"client_email": "[email protected]",
"client_id": "106REDACTED...",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/REDACTED%40project-id-REDACTED.iam.gserviceaccount.com",
"universe_domain": "googleapis.com"
}
Unlike a short secret, Google embeds a full RSA-2048 private key in PEM format. While you could still hardcode this, there is an important distinction in how GCP and AWS approached short-lived secrets. Google incorporated strong defaults in their design, not just in their documentation.
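As a rough illustration of that design (a sketch using the google-auth library, with a placeholder file path, not code from this research), the JSON key is never sent to GCP APIs directly; it only signs a request for a short-lived access token.

from google.auth.transport.requests import Request
from google.oauth2 import service_account

# Load the downloaded JSON key (placeholder path) and request a short-lived token.
credentials = service_account.Credentials.from_service_account_file(
    "service-account.json",
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)
credentials.refresh(Request())

# The private key stays local; only this short-lived OAuth token is sent to GCP APIs.
print(credentials.token, credentials.expiry)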
We'll dissect these differences later, but the takeaway here is not that AWS is fundamentally less secure than GCP. AWS' path of least resistance being insecure only matters at scale. If you follow AWS recommendations, your security posture is likely close to that of a GCP service account. At scale, however, it is inevitable that some customers will follow where the incentives lead them. In AWS' case, that's hardcoding long-lived access and secret keys in production code. GCP's layered approach reduces this risk of human error by making it annoying to be insecure.
Unfortunately, hardcoded secrets can lead to far worse impact than dangling DNS records because, by default, they have little restriction on who can use them.
By their very nature, secrets are "secret" for a reason. They're designed to grant privileged access to cloud services like production servers, databases, and storage buckets. Cloud provider keys could let an attacker access and modify your infrastructure, potentially exposing sensitive user data. A leaked Slack token could let an attacker read all internal communication in your organization. While impact still varies on context, secrets are much easier to abuse and often lead to an immediate security impact. We'll further demonstrate what this looks like when we review findings.
This blog is not intended to serve as a thorough review of all past work regarding dangling domains and hardcoded secrets, but let's review a few key highlights.
Dangling DNS records are a common problem that has been discussed for at least a decade.
One of the first works pivotal to our understanding of the bug class was published in 2015 by Matt Bryant, Fishing the AWS IP Pool for Dangling Domains. This project explored dangling DNS records that point at a dedicated endpoint, like an AWS EC2 public IP (vs a hostname for a shared endpoint, like *.s3.amazonaws.com).
Bryant found that he could continuously allocate and release AWS elastic IPs to enumerate the shared customer pool. Why is this important? To exploit a dangling DNS record, an attacker needs to somehow control its target, e.g., the EC2 IP that was previously allocated/released by the victim. While enumerating these IPs, Bryant searched Bing using the ip: search operator to see if any cached domains pointed at them. This effectively allowed Bryant to look for dangling DNS records without a specific target.
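A minimal boto3 sketch of that allocate-and-release loop might look like the following (my reconstruction of the general idea, not Bryant's code); as the next paragraph explains, this exact loop no longer works as-is today.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
seen_ips = set()

for _ in range(100):
    # Borrow an Elastic IP from the shared pool, record it, and give it back.
    allocation = ec2.allocate_address(Domain="vpc")
    seen_ips.add(allocation["PublicIp"])
    ec2.release_address(AllocationId=allocation["AllocationId"])

# Each captured IP can then be checked against DNS data for records pointing at it.
print(f"Sampled {len(seen_ips)} unique IPs from the shared pool")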
A few years later, AWS implemented a mitigation that restricts accounts to a small pool of IPs, instead of the entire shared address space. For example, if you allocate, free, and reallocate an elastic IP today, you'll notice that you'll keep getting the same IPs over and over again. This prevents enumeration of the AWS IP pool using standard elastic IP allocation. Google Cloud has a very similar mitigation to deter dangling record abuse, but they extend the "small account pool" to apply to any component that allocates an IP.
Besides dangling records that point at a generic cloud IP, more recent work into "hosting-based" records (e.g., CNAME to a hostname, aka "shared" endpoint) includes DareShark: Detecting and Measuring Security Risks of Hosting-Based Dangling Domains.
Researchers from Tsinghua University used DNS databases to identify records that point at "service endpoints", like example.s3.amazonaws.com. Next, they check whether the identifier is registered (i.e., does the example S3 bucket exist?). If it is not, and the managed service is "vulnerable", they can register the identifier under their attacker account and hijack the record! Whether a service is vulnerable varies; for example, some providers require proof of ownership (like a TXT record) or disallow registering identifiers that were previously held by another customer.
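For the S3 case, the registration check can be as simple as an unauthenticated request (a sketch of the general technique, not the DareShark implementation).

import requests

def bucket_is_unclaimed(name: str) -> bool:
    # S3 answers 404 for a bucket that does not exist; an existing bucket
    # returns 200 or 403 depending on its permissions.
    response = requests.head(f"https://{name}.s3.amazonaws.com", timeout=10)
    return response.status_code == 404

# If a CNAME points at example.s3.amazonaws.com and this returns True,
# an attacker may be able to register "example" and hijack the record.
print(bucket_is_unclaimed("example"))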
There are a lot of other projects against dangling DNS records we won't review for brevity. While the space is well traversed, there are some limitations to historical approaches. In general, past work usually runs into at least one of the following: a limited source of DNS data, modern provider mitigations that block straightforward enumeration, or a focus on shared "hosting-based" endpoints rather than dedicated IPs.
Later, we'll use dangling records as an example of how you can apply unconventional data sources to find vulnerabilities at scale. Additionally, as we'll soon see, I also attempted to avoid these common pitfalls. With that said, let's move on to hardcoded secrets!
Unlike dangling DNS records, work into hardcoded secrets was a lot more limited than I expected. In general, research into this category falls into two camps: self-managed scanning tools you run against your own data, and platform-level scanning built into the services that host that data.
There are dozens of trivial examples of the former, including Gitleaks, Git-secrets, and TruffleHog. These tools listen for new Git commits, or are run ad hoc against existing data, like the contents of a Git repository, S3 bucket, Docker image, and so on.
Nearly all secret scanning tools work by matching regex patterns for various secrets. Some providers include a unique identifier in their API keys that makes accurate identification trivial; for example, long-term AWS user access keys start with AKIA, so a pattern like AKIA[0-9A-Z]{16} catches them reliably.
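As a minimal sketch of that idea (my own example, not any particular tool's implementation):

import re

# Long-term AWS user access key IDs start with "AKIA" followed by 16 characters.
AWS_ACCESS_KEY_ID = re.compile(r"AKIA[0-9A-Z]{16}")

def find_candidates(text: str) -> list[str]:
    return AWS_ACCESS_KEY_ID.findall(text)

print(find_candidates("aws_access_key_id = AKIAIBJGN829BTALSORQ"))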
Moving on from self-managed tooling, some source control platforms, particularly GitHub, proactively scan for secrets in all public data/code. In fact, GitHub goes a step beyond identifying a potential secret by partnering with several large cloud providers to verify and revoke leaked secrets! This is pretty neat because it can prevent abuse before the customer can take action.
In general, all work into hardcoded secrets faces the same problem: it's against a data set limited in scope and diversity. Even GitHub's secret scanning work, one of the largest projects of its kind to date, only has visibility into a small fraction of leaked secrets. For example, most closed source applications will likely never encounter GitHub's scanning. Other tools like TruffleHog are better at accepting a diverse set of file formats, but lack a large, diverse source of those files.
How can we do better?
When we consider the conventional approaches to vulnerability discovery, be it in software or websites, we tend to confine ourselves to a specific target or platform. In the case of software, we might reverse engineer an application's attack surfaces for untrusted input, aiming to trigger edge cases. For websites, we might enumerate a domain for related assets and seek out unpatched, less defended, or occasionally abandoned resources.
To be fair, this thinking is an intuitive default. For example, when was the last time you read a blog about how to find privilege escalation vulnerabilities at scale? They exist, but unsurprisingly, the industry is biased towards hunting vulnerabilities in individual targets (myself included!). This is the result of simple incentives. It's way easier to hunt for vulnerabilities at a micro level, particularly complex types like software privilege escalation. Monetary incentives like bug bounty are also target oriented.
Granted, automatically identifying software vulnerabilities is extremely challenging. What I've noticed with cloud vulnerabilities is that they're not only easier to comprehend, but they also tend to lead to a much larger impact. Why? Remember- in the cloud, everything is shared! I can remotely execute code in a software application? Cool, I can now pop a single victim if I meet a laundry list of other requirements like network access or user interaction. I can execute code in a cloud provider? Chances are, the bug impacts more than one customer.
The industry lacks focus on finding bugs at scale- it's a common pattern across most security research. Can we shift our perspective away from a specific target?
Perspective | Dangling Resources |
---|---|
Traditional | Start with a target and capture vulnerable assets. |
At Scale | Capture first & identify impact with “big data”. |
With dangling DNS records, the "traditional" approach is to enumerate a target for subdomains and only then identify vulnerable records. Fortunately, dangling DNS records are one of the few vulnerability classes we've been able to identify at scale.
For example, we previously discussed Matt Bryant's blog about finding records that point at deallocated AWS IPs. Instead of starting with a target, Bryant enumerates potential vulnerabilities- the shared pool of available AWS IPs, many of which were likely assigned to another customer at some point. Bryant worked backwards. What DNS records point at the IP I allocated in my AWS environment? If any exist, I know for a fact that they are dangling, because I control the target!
Bryant's methodology had other problems, like a limited source of DNS data (e.g., Bing's ip: search filter), but the key takeaways are simple: enumerate potential vulnerabilities first rather than starting with a target, and pick a data source that contains relationships indicative of the targeted vulnerability class.
For example, the two perspectives for leaked secrets include:
Perspective | Leaked Secrets |
---|---|
Traditional | Scan a limited scope for secret patterns. |
At Scale | Find diverse “big data” sources with no target restriction. |
Today, most work towards identifying leaked cloud credentials focuses on individual targets, or lacks diversity (e.g., GitHub secret scanning). To identify dangling DNS vulnerabilities at scale, we start with the records, not a target. We do this by using "big data" sources of DNS intelligence, e.g., the capability to figure out what records point at an IP.
Are there similar "big data" sources that would let us identify leaked secrets at scale without starting with a limited scope?
A DNS record is dangling if it points at a deallocated resource. The ideal data source must include a large volume of DNS records. While we covered Bing as an example, are there better alternatives?
Passive DNS replication data is a fascinating source of DNS metadata I learned about a few years ago. Long story short, some DNS providers sell anonymized DNS data to threat intelligence services, who then resell it to people like me. Anonymization means they usually don't include user identifiers (client IPs), just the DNS records themselves.
Passive DNS data has many uses. Most importantly, its diversity and scope are almost always better than alternative enumeration methods, like brute-forcing subdomains. If someone has resolved a DNS record, chances are that the record's metadata is available through passive DNS data. For our purposes, we can use it to find records that point at an IP or hostname we've captured while enumerating a provider's shared pool of network identifiers.
To be clear, using passive DNS data to find dangling records is not novel, but it's a great example of the security at scale mindset in practice. We will apply this technique against modern deterrents in the implementation section.
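In practice, that lookup is usually just an authenticated REST call. The endpoint, parameters, and response shape below are purely hypothetical stand-ins for a passive DNS vendor's API, shown only to illustrate the shape of the query.

import requests

# Hypothetical passive DNS provider; the URL, parameters, and response format
# are assumptions for illustration, not a real vendor's API.
PDNS_URL = "https://pdns.example.com/api/v1/lookup"
API_KEY = "REDACTED"

def records_pointing_at(ip_address: str) -> list[str]:
    response = requests.get(
        PDNS_URL,
        params={"rdata": ip_address},
        headers={"X-API-Key": API_KEY},
        timeout=30,
    )
    response.raise_for_status()
    # Assume each result includes the record name that resolved to this IP.
    return [entry["rrname"] for entry in response.json()["results"]]

# Any record returned for an IP we just captured is, by definition, dangling.
print(records_pointing_at("203.0.113.7"))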
Going back to the drawing board- what unorthodox data sources would potentially contain leaked secrets? Well, secrets can be included in all sorts of files. For example, if you use a secret in your client-side application, it's not just your source code that has it- any compiled version will include it too. Websites, particularly JavaScript, can use secrets to access cloud services too.
Where can we find a large collection of applications, scripts, websites, and other artifacts?
What about… virus scanning platforms? "By submitting data ... you are agreeing ... to the sharing of your Sample submission with the security community ..."
Virus scanning websites like VirusTotal allow you to upload a file and inspect it for malicious content using dozens of anti-virus engines. The reason they caught my eye is that they have everything. Documents, desktop software, iOS and Android apps, text files, configuration files, etc. all frequently find their way to them. The best part? These platforms often grant privileged access to these files to help improve detection products, reduce false positives, and identify malware.
While we aren't after malware, we are after vulnerabilities, and virus scanning platforms are simply a means of identifying them. For example, a platform could block my access, but if a file containing some app's secret was uploaded to it, chances are that app is accessible off-platform too.
Even if we have a candidate data source, we're far from finished. It would be infeasible to scan every file on larger platforms directly. We need to reduce scope.
This challenge is not exclusive to our use case. If we were looking for malware, we'd run into the same feasibility problem. Fortunately, platforms that allow data access typically also provide a means of searching that data. In VirusTotal's case, this feature is called Retrohunt.
Retrohunt lets you scan files using something called "YARA rules". YARA provides a "rules-based approach to create descriptions of malware families based on regular expression, textual or binary patterns". YARA's documentation includes examples of what these rules look like, and we'll sketch one of our own shortly.
In the past work section, we discussed how secrets can sometimes have an identifiable pattern, like AKIA for AWS access keys. In fact, all secret scanning tools use regex based on these identifiers to find secrets. You know what else supports regex? YARA! While originally designed to "identify and classify malware samples", we can repurpose it to reduce petabytes of data to "just" a few million files that potentially contain credentials.
Enough theory, let's write some code!
The plan should be simple. Scan for secret patterns across several virus scanning platforms and validate potential credentials with the provider. How hard can it be? (tm)
One problem I encountered early on was that not all cloud providers use an identifiable pattern in their secrets. How are we supposed to identify candidate samples for scanning? In the overview of the security at scale mindset, note how I said the data source "must contain relationships indicative of the targeted vulnerability class".
Just because we can't identify some secrets directly doesn't mean we can't identify potential files indirectly. For example, let's say I have a Python script that makes a GET request to some API endpoint using a generic [a-zA-Z0-9]{32} key with no identifying marks. How would I identify this file for scanning? One relationship in the file is between the API key and the API endpoint.
While I can't search for the API key, what if I searched for the endpoint instead? I wrote such a rule for Linode, a cloud provider with generic secret keys. I can't process every sample uploaded to VirusTotal, but chances are I can feasibly process every sample with the string api.linode.com.
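Roughly, such a rule boils down to something like the following sketch (simplified for illustration rather than the exact rule I used, compiled here with yara-python): anchor on the endpoint string plus a generic 64-character hex token.

import yara  # pip install yara-python

# Simplified sketch of a Linode-style rule: require the API endpoint string and a
# generic 64-character hex token somewhere in the same file.
LINODE_RULE = r"""
rule potential_linode_token
{
    strings:
        $endpoint = "api.linode.com" ascii wide
        $token = /[a-fA-F0-9]{64}/
    condition:
        $endpoint and $token
}
"""

rules = yara.compile(source=LINODE_RULE)
print(rules.match(data=b"https://api.linode.com/v4/account " + b"a" * 64))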
Generic keys are also annoying because even after we cut our scope, 32 consecutive alphanumeric characters can easily match plenty of non-secret strings too. In this project, beyond identifying potential secrets, I wanted accurate identification. How do we know if a generic match is a secret? Only one way to find out- give it a go!
It would obviously be inappropriate to query customer data, but what about metadata? For example, many APIs will have benign endpoints to retrieve basic metadata like your username. We can use these authenticated endpoints as a bare minimum validity test to meet our technical requirements without going further than we have to! These keys are already publicly leaked after all.
To review: we identify candidate samples with YARA rules built from provider-specific key patterns (or, for generic keys, indirect indicators like API endpoints), extract potential secrets from those samples, and confirm each one against a benign, metadata-only endpoint.
Our Retrohunt jobs will produce a large number of samples that may contain secrets. Extracting strings and validating every possible key will take time and computational resources. It would be incredibly inefficient to scan hundreds of thousands of samples synchronously. How do we build infrastructure to support our large volume?
When I began this work in 2021, I started by scanning samples locally. It turned out to be incredibly impractical. I needed an approach that could maximize the number of samples we could scan in parallel without breaking the bank. What if we leveraged cloud computing?
A cloud execution model that has grown over the past decade is serverless computing. At a high level, serverless computing means your applications are only allocated and running when they are needed, on a cloud provider's managed infrastructure. You don't need to worry about setting up your own server or paying for idle time. For example, you could run a web server where you only pay for the time it takes your application to respond to a request.
AWS's serverless offering is called AWS Lambda. What if we created an AWS Lambda function that scans a given sample for secrets? AWS Lambda has a default concurrency limit of 1,000. This means we could scan at least 1,000 samples concurrently! As long as our scans did not take more than a few minutes, AWS Lambda is fairly cost efficient as well. I ended up using AWS Lambda, but you don't have to! Most providers have equivalents; AWS was just where I had the most experience.
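As a rough sketch (the event shape, message format, and scanning logic here are simplified stand-ins, not the production scanner), a Lambda handler for this job boils down to:

import json
import re
import urllib.request

# Stand-in for the real scanner: just collect candidate AWS access key IDs.
AKIA_PATTERN = re.compile(rb"AKIA[0-9A-Z]{16}")

def scan_for_secrets(data: bytes) -> list[bytes]:
    return AKIA_PATTERN.findall(data)

def handler(event, context):
    # Assumes an SQS-triggered invocation where each message body contains a
    # URL to download the sample from (the message format is an assumption).
    candidates = []
    for record in event.get("Records", []):
        sample_url = json.loads(record["body"])["sample_url"]
        with urllib.request.urlopen(sample_url) as response:
            candidates.extend(scan_for_secrets(response.read()))
    return {"candidate_secrets": len(candidates)}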
We need four major components for our secret scanning project: a coordinator, the scanner itself (our Lambda function), a message broker, and a database.
Our coordinator and Lambda function will require details for each type of service we are targeting.
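Something like the following per-provider map is shared between the two (the field names and layout are placeholders of my own, not the project's actual schema):

# Illustrative per-provider configuration: the coordinator needs the Retrohunt rule,
# while the scanner needs the key pattern and a benign validation endpoint.
PROVIDERS = {
    "aws": {
        "retrohunt_rule": "rules/aws.yar",
        "key_pattern": r"AKIA[0-9A-Z]{16}",
        "validation_endpoint": "https://sts.amazonaws.com/",  # e.g., GetCallerIdentity
    },
    "linode": {
        "retrohunt_rule": "rules/linode.yar",
        "key_pattern": r"[a-fA-F0-9]{64}",
        "validation_endpoint": "https://api.linode.com/v4/account",
    },
}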
When I originally started this project, tools like TruffleHog, which similarly detect and validate potential keys, had not yet been created. More importantly, they weren't designed to maximize scanning efficiency. With Serverless, you are charged for every second of execution. We needed an optimized solution.
For the Coordinator, I went with Python, but for the scanner, I went with C++. This component downloads a given sample into memory, extracts any ASCII or Unicode strings, and then scans those strings for secrets. For each supported provider, I implemented a class with two functions: 1) check if a string contains a potential secret key, and 2) validate a secret using a benign metadata endpoint (usually via REST). Generically, the scanner runs an internal version of strings and passes each extracted string to the first function. Any matches are tracked and asynchronously verified using the second.
std::vector<KeyMatch> LinodeKeyType::FindKeys(std::string String)
{
    const std::regex linodeApiKeyRegex("(?:[^a-fA-F0-9]|^)([a-fA-F0-9]{64})(?:[^a-fA-F0-9]|$)");
    std::sregex_iterator rend;
    std::vector<KeyMatch> potentialKeys;
    std::smatch currentMatch;

    //
    // Find all API keys.
    //
    for (std::sregex_iterator i(String.begin(), String.end(), linodeApiKeyRegex); i != rend; ++i)
    {
        currentMatch = *i;
        if (currentMatch.size() > 1 && currentMatch[1].matched)
        {
            potentialKeys.push_back(KeyMatch(currentMatch[1].str(), LinodeKeyCategory::LinodeApiKey));
        }
    }

    return potentialKeys;
}

bool LinodeKeyType::ValidateKeyPair(std::vector<KeyMatch> KeyPair)
{
    KeyMatch apiKey;

    //
    // Retrieve the API key from the key pair.
    //
    apiKey = KeyPair[0];

    //
    // Attempt to retrieve the account's details using the API key.
    //
    HttpResponse accountResponse = WebHelper::Get("https://api.linode.com/v4/account", {}, {{"Authorization", "Bearer " + apiKey.GetKey()}});
    if (accountResponse.GetStatusCode() == 200)
    {
        return true;
    }

    return false;
}
To detect secrets, I use Hyperscan, Intel's "high-performance regular expression matching library". I originally started with the C++ <regex> implementation seen above, but I was shocked to find that Hyperscan was faster by an order of magnitude. What took 300 seconds (e.g., scanning a large app with many strings) now took 10! Regex patterns were manually crafted, based largely on public references like the secret pattern repository we covered in Past Work.
The Coordinator, used to manage our entire architecture, was pretty straightforward. Triggered on a timer, it looks in a database for pending Retrohunt scans and monitors their completion. A cool trick: VirusTotal Retrohunt has a 10,000 sample limit per job. To avoid this, around 5% into the job, you can check whether the number of matches times 20 (extrapolating from that 5%) is greater than or equal to 10,000. If it is, you can abort the job and dispatch two new Retrohunt jobs split by time. For example, if I'm scanning the past year, I'll instead create two jobs to scan the first and second half of the year respectively. You can recursively keep splitting jobs, which was critical for secrets that lacked an identifiable pattern and thus matched many samples.
# Check if our scan reached VirusTotal's match limit.
if (scan_status == "finished" and scan.get_num_matched_samples() == 10000) or \
        (scan_status == "running" and scan.is_scan_stalled()):
    logging.warning(f"Retrohunt scan {scan_id} reached match limit. Splitting and re-queueing.")

    # Calculate the mid point date for the scan.
    scan_start_time = scan.start_time
    scan_end_time = scan.end_time
    scan_mid_time = scan_start_time + (scan_end_time - scan_start_time) / 2

    # Queue the first half scan (start to mid).
    self.add_retrohunt_scan(scan.key_type_name, scan_start_time, scan_mid_time)

    # Queue the second half scan (mid to end).
    self.add_retrohunt_scan(scan.key_type_name, scan_mid_time, scan_end_time)

    logging.info(f"Queued two new retrohunt scans for key type {scan.key_type_name}.")

    # If the scan is running, abort it.
    if scan_status == "running":
        scan.abort_scan()
        logging.info(f"Aborted running scan {scan_id} due to stall.")

    logging.info(f"Marking retrohunt scan {scan_id} as aborted.")
    self.db.add_aborted_scan(scan_id)
The message broker and database are nothing special. I started with RabbitMQ for the former, but use AWS SQS today. I use MySQL for the latter.
To recap our end-to-end workflow:
1. The Coordinator dispatches Retrohunt jobs for each provider's rules, monitors them, and splits any job that approaches VirusTotal's match limit.
2. Matched samples are queued through the message broker for scanning.
3. Each Lambda scanner downloads a sample into memory, extracts its strings, and checks them against every provider's FindKeys implementation.
4. Potential keys are verified against benign metadata endpoints via ValidateKeyPair in parallel, and confirmed secrets are recorded in the database.
This approach let me scan several million samples quickly and at a reasonable cost. While there are many optimizations I could make to the C++ scanner, or to the infrastructure by avoiding Serverless, they would not matter enough to justify the added complexity.
Moving on to dangling DNS records, I was most interested in challenging modern provider mitigations. Of note, as far as I could tell, past attempts to enumerate Google Cloud's pool of IPs have failed because of these mitigations. There has been work into hijacking dangling records for managed services, like Google Cloud DNS, just not the dedicated-endpoint equivalent.
AWS also has a few mitigations, like the small pool of IPs you can access by allocating/releasing elastic IPs, but there are known ways around this. For example, instead of allocating elastic IPs, we can restart EC2 instances with ephemeral IPs to perform enumeration. On restart, you get a brand new IP.
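A sketch of that stop/start loop with boto3 (the instance ID is a placeholder, and this is the general idea rather than the exact CloudRot code):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
INSTANCE_ID = "i-0123456789abcdef0"  # placeholder instance without an Elastic IP
seen_ips = set()

for _ in range(50):
    # A stop/start cycle returns the ephemeral public IP to the pool and
    # assigns a fresh one from the region on the next boot.
    ec2.stop_instances(InstanceIds=[INSTANCE_ID])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])
    ec2.start_instances(InstanceIds=[INSTANCE_ID])
    ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])

    instance = ec2.describe_instances(InstanceIds=[INSTANCE_ID])["Reservations"][0]["Instances"][0]
    if "PublicIpAddress" in instance:
        seen_ips.add(instance["PublicIpAddress"])

print(f"Sampled {len(seen_ips)} unique ephemeral IPs")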
Google Cloud was harder, but their mitigations are deterrents. Limitations include...
How do we get around these? Well...
For most quota limits, I figured the easiest way around them would be to create several accounts. The billing restrictions made this difficult, but for account creation, I ended up creating a fake Google Workspace tenant. Google Workspace is just Google products as a service for enterprises. The neat part about a Workspace is that you can programmatically create fake employee accounts, each with 10-15 projects!
For the billing restrictions, I used virtual credit cards. Long story short, services like Privacy let you generate random credit card numbers with custom limits to avoid exposing your own. These services still follow know-your-customer (KYC) practices; my identity was verified, but they let me get past any vendor-set limits tied to a single card number.
By creating several accounts using Google Workspace and unique virtual cards, I was able to get past Google's deterrents. Note that I didn't abuse any vulnerability- these mitigations are just that, mitigations. With the quotas out of the way, I successfully enumerated Google's public IP pools (per-region and global) by constantly recreating forwarding rules!
CloudRot, what I called my dangling-resource enumeration work, enumerated the public IP pools of Google Cloud and Amazon Web Services. In total, I've captured 1,770,495 IPs to date: 1,485,211 (~84%) for AWS and 285,284 (~16%) for GCP. The difference is largely due to Google's technical mitigations.
For every 1,000 IPs in AWS's public pool, 24.73 of them were associated with a domain. For every 1,000 IPs in Google Cloud's public pool, 35.52 of them were associated with a domain.
While purely speculative, I suspect the noticeably higher rate of impacted Google Cloud IPs is largely because this is likely the first publication to successfully enumerate the public IP pool of Google Cloud's "compute" instances at scale. There has been research into taking over managed applications in Google Cloud, but I was unable to find any publication that enumerated IPs, likely due to the extensive technical mitigations we encountered. With so little research into this IP pool, it's possible that the systemic vulnerability of dangling domains there has simply gone unnoticed.
In total, I discovered over 78,000 dangling cloud resources corresponding to 66,000 unique apex domains (excluding findings associated with dynamic DNS providers). There were thousands of notable impacted organizations, like Google, Amazon, the New York Times, Harvard, MIT, Samsung, Qualys, Hewlett-Packard, etc. Based on the Tranco ranking of popular domains, CloudRot discovered 5,434 unique dangling resources associated with a top-50,000 apex domain.
Here are a few archived examples!
The New York Times: https://web.archive.org/web/20230328201322/http://intl.prd.nytimes.com/index.html
Dior: https://web.archive.org/web/20230228194202/https://preprod-elk.dior.com/
State Government of California: https://web.archive.org/web/20230228162431/https://tableau.cdt.ca.gov/
U.S. District Court for the Western District of Texas: https://web.archive.org/web/20230226010822/https://txwd.uscourts.gov/
Even Chuck-e-Cheese! https://web.archive.org/web/20230228033155/https://qa.chuckecheese.com/
One of the first keys I encountered was for Samsung Bixby's Slack environment.
Long story short, the com.samsung.android.bixby.agent app has a "logcat" mechanism to upload the agent's log file to a dedicated Slack channel. It does this using the Slack REST API and a bot token for dumpstater.
The vulnerability was that the bot had a default bot scope, which an attacker could abuse to read from or write to every channel in Samsung's Slack!
Another early example includes CrowdStrike! An old version of CrowdInspect, a free utility CrowdStrike distributes on crowdstrike.com, contained a hardcoded API key for the VirusTotal service. The API key granted full access to both the VirusTotal account of a CrowdStrike employee and the broader CrowdStrike organization inside of VirusTotal.
An attacker can use this API key to leak information about ongoing CrowdStrike investigations and to gain significant premium access to VirusTotal, such as the ability to download samples, access to VirusTotal's intelligence hunting service, access to VirusTotal's private API, etc. For a full list of privileges, you can use the user API endpoint.
As an example of how this key could be abused to gain confidential information about ongoing CrowdStrike investigations, I queried the VirusTotal Graphs API with the filter group:crowdstrike. VirusTotal offers Graphs as a feature which allows defenders to graph out relationships between files, links, and other entities in one place. For example, a defender could use VirusTotal graphs to document the different binaries seen from a specific APT group, including the C2 servers those binaries connect to. Since our API key has full access to the CrowdStrike organization, we can query the active graphs defenders are working on and even see the content of these graphs.
Obviously, this was quickly reported and got fixed in under 24 hours!
This was a fun one. At the end of 2022, I found an interesting CSV export of an employee of Nebraska's Supreme Court with over 300 unique credentials including...
These were seriously the keys to the kingdom and then some. While metadata showed the file was uploaded from Lincoln, Nebraska, through the web interface, it's unclear whether this was an accident or the work of an automated tool. I worked directly with Nebraska's State Information Security Officer to promptly rotate these credentials.
An important goal I have with any research project is to approach problems holistically. For example, I not only like finding vulnerabilities, but thinking about how to address them too. Leaked secrets are a major problem- they are far easier to abuse than dangling DNS records. Was there anything I could do to mitigate abuse?
I wasn't the only one trying to fix leaked secrets. GitHub's secret scanning program had already addressed this. GitHub partners with many cloud providers to implement automatic revocations. Partners provide regex patterns for their secrets and an endpoint to validate/report keys. For example, if you accidentally leak your Slack token in a GitHub commit, it will be revoked in minutes. This automation is critical because otherwise, attackers could search GitHub (or VirusTotal) for credentials and abuse them before the customer has an opportunity to react.
I asked whether they could provide an endpoint for reporting keys directly. GitHub had done a great job creating relationships with vendors to automatically rotate keys. After all, I could already report a key by just posting it on GitHub, and all hosting an endpoint would do is help secure the Internet.
Unfortunately, they said no. Fortunately, I was only asking as a courtesy. I created a throwaway account and system that would use GitHub's Gist API to create a public note with a secret for only a split second. In theory, this would trigger GitHub's scans, and in turn a provider revocation.
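A simplified sketch of the idea, using GitHub's public Gist API (the token is a placeholder and the real system did more bookkeeping):

import requests

TOKEN = "ghp_placeholder"  # throwaway account's personal access token
HEADERS = {
    "Authorization": f"Bearer {TOKEN}",
    "Accept": "application/vnd.github+json",
}

def flash_secret(secret: str) -> None:
    # Publish the leaked secret in a public gist just long enough for GitHub's
    # secret scanning to pick it up, then delete it.
    response = requests.post(
        "https://api.github.com/gists",
        headers=HEADERS,
        json={"public": True, "files": {"leak.txt": {"content": secret}}},
        timeout=30,
    )
    response.raise_for_status()
    requests.delete(
        f"https://api.github.com/gists/{response.json()['id']}",
        headers=HEADERS,
        timeout=30,
    )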
This actually worked during testing, but I started hitting errors when trying it with ~500 keys.
It turns out GitHub doesn't like it when you create hundreds of Gists in a matter of minutes and suspended my account. This was a problem due to the scale of keys I was finding. Unfortunately, once a test account is flagged, most API calls fail. How can we get around this?
When debugging, I noticed something weird. For some reason, posting a Gist through the web interface, https://gist.github.com, still worked. Even stranger? When I visited a public Gist in an incognito session, I faced a 404 error. It turns out that when your account is suspended, GitHub hides it (and any derivative public content, like gists) from every other user. You're basically shadow banned! What's interesting is that they still allowed you to upload content through the UI, but not the API. Whatever the reason, this got me thinking...
Who needs an API? Using Python Selenium, a library to control a browser in test suites, I created a script that automatically logged me into GitHub and used the undocumented API to publish a Gist. While my test account was still shadow banned, this was actually a feature! How? I could now create public gists that triggered secret scanning without any risk of exposure!! (also why I don't mind sharing the account's name or ID)
I used GitHub to revoke most keys I discovered, as long as the provider had enrolled in their program, going well beyond the minimum by protecting victims! I also embedded a link to a small website in the name of each Gist, which is what shows up in the key exposure notification emails. Today, any new secrets leaked on VirusTotal and a few other platforms are automatically reported, helping mitigate abuse.
Besides GitHub, I also worked with several partners directly to report leaked key material.
To start, I'd like to give a special thanks to the vendors who worked with me to help protect their customers. OpenAI, for example, provided an endpoint for revoking leaked keys. Unfortunately, the process was not so smooth in all cases. AWS, the largest cloud provider by market share, refused to share the endpoint used to report leaked secrets, despite the fact that we could access it indirectly by posting a secret in a gist. This was nothing but politics getting in the way of customer security.
To add insult to injury, when AWS detects a leaked secret on GitHub, they do not revoke it. Instead, they restrict the key and create a support case with the customer, who might not even see it. The restriction sounds good in theory, but in practice, it limits very little. It really only prevents write access, like starting an EC2 instance or uploading files to S3 buckets. There is almost no restriction on downloading data, like a customer database backup in an S3 bucket.
Before I developed the GitHub reporting system, I used to manually email batches of keys to AWS. In late July, while preparing key metrics, I noticed a few keys in my database that I thought I had seen before. In fact, I had seen them! Many new keys my system detected were the same keys I had reported to AWS months before. What was going on?
According to my estimates, ~32% of the keys I reported to you over 2-3 months ago are unrevoked and were forgotten about by your fraud team. The ~32% figure is based on the number of unique keys I reported to you by email which again appeared active at the end of July. It appears that the support cases your fraud team created in many cases were automatically resolved, leaving the keys exposed for abuse.
~ Email to AWS Security
AWS not only fails to revoke publicly leaked secrets, but the support cases they create are horribly mismanaged. It turned out that ~32% of the keys I reported to AWS had not been addressed 4 months later. The reason? If the customer doesn't respond, the support case for the leaked secret is automatically closed! I confirmed this firsthand while investigating a previously reported secret.
I've continued to work with AWS on these concerns, but have yet to see any meaningful action. I suspect the choice to leave keys exposed for abuse is due to the risk of impacting legitimate workflows that depend on the key. This would be a fair concern, but what's worse, a short-term service disruption or the theft of sensitive user data?
How do I know for a fact AWS' approach is unreasonable? Every other vendor I tested that participates in GitHub's secret scanning program revoked leaked keys, including AWS' direct competitor, Google Cloud. The ~32% of keys still unrevoked after 4 months also worries me because I wonder if it applies to keys leaked on GitHub, which are easy to spot. I think this could end very badly for AWS, but at the end of the day, this is their call.
In general, make it as hard as possible to use secrets insecurely within your organization. The trivial approach is to use existing tooling to scan your code for leaked secrets, but that does not address the root cause. The most effective prevention is to make design decisions that disincentivize insecure usage of credentials in the first place.
Start by understanding "how is my organization currently managing secrets?"
Have a central authority for managing secrets. OWASP has an incredible cheat sheet on this. I strongly recommend following their advice.
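For example, instead of hardcoding a credential, fetch it at runtime from a central store. Below is a minimal boto3 sketch with a placeholder secret name; any equivalent secret manager works just as well.

import boto3

def get_database_password() -> str:
    # Resolve the credential at runtime from a central secret store instead of
    # embedding it in code or configuration; the secret name is a placeholder.
    client = boto3.client("secretsmanager", region_name="us-east-1")
    return client.get_secret_value(SecretId="prod/database/password")["SecretString"]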
For dangling domains, there is no one-size-fits-all solution. Generally, you should track the cloud resource associated with any DNS record you create, but the details will vary depending on your environment. I believe a better solution would come at the platform/provider level, but that requires cross-industry collaboration.
This article explored two common, yet critical vulnerability classes: dangling DNS records and leaked secrets. The former was well traversed, but served as a useful example of applying what I call the "security at scale mindset". Instead of starting with a target, we start with the vulnerability. We demonstrated how dangling records are still a prominent issue across the Internet, despite existing for over a decade.
We then targeted hardcoded secrets, an area far less explored. By leveraging unconventional "big data" sources, in this case virus scanning platforms, we were able to discover secrets at an unprecedented scale. Like dangling DNS records, the root cause of these systemic weaknesses is poor incentives. The status quo of using a short token for authentication incentivizes poor security practices, like embedding tokens directly in code, no matter how much we warn against it.
We finished the project by going end-to-end. Not only did we discover these problems, we mitigated a majority by taking advantage of GitHub's existing relationships and automating their UI to trigger secret revocation. It's rare to have an opportunity to directly protect victims, yet so important to deterring abuse.
I'd encourage you to think about how the mindset we applied in this article could work for other types of bugs. Examples of other interesting data sources include netflow data and OSINT/publicly available data you can scrape. In general, here are two key areas to focus on: finding diverse, large-scale data sources that contain relationships indicative of your target vulnerability class, and working backwards from that data to impacted organizations rather than starting from a single target.
I hope you enjoyed this research as much as I did! I'm so glad to have had the opportunity to share this work with you. Would love to hear what you think- feel free to leave a reply on the article tweet!