One vulnerability, a use-after-free in the Linux nf_tables subsystem, exploitable on all three kernelCTF targets: the latest Long-term Stable (LTS) release, the Container-optimized build used by Google Cloud, and a Mitigation build that isn’t as up to date but includes experimental mitigations to be bypassed.
The vulnerability exists in the Netfilter tables (nf_tables) subsystem of the Linux kernel. The issue occurs while processing an NFT_MSG_NEWRULE operation inside a transaction (batch); as the name implies, this operation adds a new rule to a chain. If an error happens during this processing, it can fall into the err_release_rule path, which calls into nf_tables_rule_release. That makes sense from a developer's point of view: the rule is bad, so you want to release it. However, this function calls into nft_rule_expr_deactivate, which takes a parameter for the current transaction phase. The phase is hard-coded to NFT_TRANS_RELEASE, and for that phase the call ends up unbinding the nft_set object the rule references. A reference to that set is still held elsewhere in the transaction being processed, which leads to the use-after-free.
The patch is fairly straightforward: rather than using nf_tables_rule_release, the error path now calls the two functions that nf_tables_rule_release would have called, and passes the appropriate NFT_TRANS_PREPARE phase to nft_rule_expr_deactivate.
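To make the role of the phase argument concrete, here is a minimal toy model in Python. It is not kernel code: the class and function names are made up for illustration, and it assumes (as the description above suggests) that the release phase tears the set down immediately while the prepare phase only drops a use count.

```python
from enum import Enum, auto


class Phase(Enum):
    # Names echo the phases mentioned in the text; the behavior is a toy model.
    PREPARE = auto()
    RELEASE = auto()


class ToySet:
    """Hypothetical stand-in for an nft_set that a rule references."""

    def __init__(self, name):
        self.name = name
        self.use = 1        # the failed rule holds one use of the set
        self.freed = False


def deactivate(toy_set, phase):
    """Toy version of deactivating a rule's expressions for a given phase."""
    toy_set.use -= 1
    if phase is Phase.RELEASE:
        # Assumed release-time behavior: unbind and tear the set down now.
        toy_set.freed = True
    # Assumed prepare-time behavior: only the use count changes here.


def new_rule_error_path(phase):
    """Simulate a batch in which adding a rule fails after referencing a set."""
    toy_set = ToySet("anon")
    transaction = [toy_set]        # the batch still holds a pointer to the set
    deactivate(toy_set, phase)     # error path for the failed rule
    # Later batch processing walks the transaction and touches each set.
    return [s.name for s in transaction if s.freed]


dangling_before_fix = new_rule_error_path(Phase.RELEASE)
dangling_after_fix = new_rule_error_path(Phase.PREPARE)
print(dangling_before_fix, dangling_after_fix)
```

In this model, the hard-coded release phase leaves the transaction pointing at a freed set, while the prepare phase leaves nothing dangling.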
With this vulnerability there is the initial use-after-free, but if execution keeps going, the prematurely freed nft_set structure will be freed again after everything has been processed, creating a double-free. A double-free is a much friendlier primitive for exploitation, so the authors pursued that route. They did have to introduce an extra set object to interleave the frees in order to bypass a naive double-free check (you can’t free the same pointer twice in a row).
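The interleaving trick can be sketched with a toy freelist. This is a simplified model, not the kernel allocator: ToyFreelist and try_frees are hypothetical names, and the check only compares the freed pointer against the current head of the list, which is the kind of naive check described above.

```python
class ToyFreelist:
    """LIFO freelist whose double-free check only looks at the current head."""

    def __init__(self):
        self.items = []   # the last element is the head of the freelist

    def free(self, obj):
        if self.items and self.items[-1] == obj:
            raise RuntimeError("double free detected")
        self.items.append(obj)

    def alloc(self):
        return self.items.pop()


def try_frees(order):
    """Free objects in the given order, then allocate the same number back."""
    freelist = ToyFreelist()
    try:
        for obj in order:
            freelist.free(obj)
    except RuntimeError:
        return None   # the naive check caught the double free
    return [freelist.alloc() for _ in order]


back_to_back = try_frees(["set_a", "set_a"])
interleaved = try_frees(["set_a", "set_b", "set_a"])
print(back_to_back)
print(interleaved)
```

With the extra object freed in between, set_a ends up on the freelist twice, so two later allocations receive the same memory.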
I won’t dive too far into the exploitation here because the use of the msg_msg and msg_msgseg structures has been well explored. msg_msg is a very powerful object that can be sprayed from userland with a high degree of control over its data. Ultimately they corrupt the pipe_buf_operations structure, which contains various function pointers that can be triggered by operations on the pipe from userland, and then use a ROP chain to escalate privileges.
I will call out one thing I found kind of fun. On the LTS kernelCTF box they did a standard escalation via a commit_creds call. On the Container-optimized build, while they used some different objects for their corruption, they still corrupted an operations structure and got into position for a ROP chain. Instead of calling commit_creds, though, they called set_memory_x to mark some heap memory as executable and just ran plain shellcode they had written into the heap, which did the usual escalation.
A very powerful bug in the io_uring subsystem of the Linux kernel. In this case, the vulnerability is in the handling of fixed-buffer registration via the IORING_REGISTER_BUFFERS opcode, which allows an application to ‘pin’ and register memory for long-term use, making it exempt from paging. The user passes an iovec containing an address and a length, which the kernel uses to construct a bio_vec (essentially an iovec for physical memory). The problem arises when io_uring tries to optimize the buffer for compound pages.
Background on compound pages / folio
Typically a page refers to the smallest block of physical memory that the kernel can map, which in most cases these days is 4KB. In Linux, though, a page can refer to a single page or to a compound page, which is a group of pages that are contiguous in memory. With compound pages, the first (head) page holds information about the whole group, and the tail pages point back to the head page. This creates a problem for any kernel function that handles pages, as it needs to know whether it has been handed a tail page of a compound page or not. To solve this problem, the folio type was created. Single pages and the head pages of compound pages are wrapped in the folio type to distinguish them from tail pages.
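The head/tail relationship can be illustrated with a small Python model. Everything here is a sketch: the Page class, make_compound, and is_tail are hypothetical, and page_folio is only a toy analogue of resolving a page to its folio, not the kernel's implementation.

```python
class Page:
    """Toy page descriptor identified by its page frame number (pfn)."""

    def __init__(self, pfn):
        self.pfn = pfn
        self.head = self      # a single page acts as its own head


def make_compound(first_pfn, count):
    """Build `count` contiguous pages whose tail pages point back to the head."""
    pages = [Page(first_pfn + i) for i in range(count)]
    for tail in pages[1:]:
        tail.head = pages[0]
    return pages


def page_folio(page):
    """Toy analogue: resolve any page to its folio (the head page here)."""
    return page.head


def is_tail(page):
    """A tail page is any page that is not its own head."""
    return page.head is not page


single = Page(7)
compound = make_compound(100, 4)
print(page_folio(compound[3]).pfn)   # every tail resolves to the head page
```

Code that works on the result of page_folio never has to ask whether it was handed a tail page, which is the problem the folio type addresses.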
Vulnerability
When registering a buffer that spans more than one physical page, the kernel checks whether those pages are part of a single compound page by comparing their folio pointers; if they are, it reduces the number of pages to 1 and marks the buffer as a folio buffer. The bug is that this check doesn’t make sure the pages are physically contiguous. You can map multiple virtual pages to the same physical page, which produces a buffer that is virtually contiguous but not physically contiguous. Ultimately this gives you an out-of-bounds access to adjacent physical pages, which is an insanely powerful primitive.
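The flaw can be modeled by tracking page frame numbers alone. This is a sketch under stated assumptions, not the kernel's code: folio_of groups frames into aligned toy folios of four frames, and same_folio_and_contiguous shows one plausible stricter check rather than the exact upstream fix.

```python
PAGE_SIZE = 4096


def folio_of(pfn, folio_pages=4):
    """Toy folio id: frames are grouped into aligned folios of `folio_pages`."""
    return pfn // folio_pages


def same_folio_only(pfns):
    """The check as described in the text: every page reports the same folio."""
    return all(folio_of(p) == folio_of(pfns[0]) for p in pfns)


def same_folio_and_contiguous(pfns):
    """Stricter variant that also requires consecutive page frames."""
    return same_folio_only(pfns) and all(
        b == a + 1 for a, b in zip(pfns, pfns[1:])
    )


def coalesced_frames(pfns, check):
    """Return the frames a fixed buffer would cover after optional coalescing."""
    if len(pfns) > 1 and check(pfns):
        # Collapsed to one entry: start at the first frame, keep the full length.
        length = len(pfns) * PAGE_SIZE
        return [pfns[0] + i for i in range(length // PAGE_SIZE)]
    return list(pfns)


# Eight virtual pages that all map to the same physical frame, number 8.
aliased = [8] * 8
buggy = coalesced_frames(aliased, same_folio_only)
strict = coalesced_frames(aliased, same_folio_and_contiguous)
out_of_bounds = [f for f in buggy if f not in aliased]
print(buggy, strict, out_of_bounds)
```

With the folio-only check, the buffer backed by a single frame is treated as eight consecutive frames, so seven of them were never part of the registered memory.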
Exploitation
Exploiting this was not only fairly straightforward but also super reliable. By spraying pages filled with socket objects and tagging the sockets with crafted pacing rates, they can use the fixed-buffer access to read adjacent physical memory and find their sprayed socket objects, defeat kernel ASLR via the pointers those objects contain, and write into them to replace the operations function table and gain code execution.
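The "tag and scan" step can be sketched against a fake memory buffer. All sizes, offsets, and values below are hypothetical and chosen only for illustration; they do not correspond to the real socket structure layout or to the exploit's actual constants.

```python
import struct

PAGE_SIZE = 4096
OBJ_SIZE = 1024                 # hypothetical object size
TAG_OFFSET = 0x40               # hypothetical offset of the pacing-rate field
PTR_OFFSET = 0x80               # hypothetical offset of a kernel pointer
TAG = 0x1337C0DE1337C0DE        # hypothetical crafted pacing-rate value


def build_memory(num_pages, sprayed_at, leaked_ptr):
    """Create fake physical memory holding one tagged object."""
    mem = bytearray(num_pages * PAGE_SIZE)
    struct.pack_into("<Q", mem, sprayed_at + TAG_OFFSET, TAG)
    struct.pack_into("<Q", mem, sprayed_at + PTR_OFFSET, leaked_ptr)
    return mem


def find_tagged_object(mem):
    """Scan object-sized slots for the tag and return (base, pointer)."""
    for base in range(0, len(mem) - OBJ_SIZE + 1, OBJ_SIZE):
        (value,) = struct.unpack_from("<Q", mem, base + TAG_OFFSET)
        if value == TAG:
            (ptr,) = struct.unpack_from("<Q", mem, base + PTR_OFFSET)
            return base, ptr
    return None


memory = build_memory(
    4, sprayed_at=2 * PAGE_SIZE + OBJ_SIZE, leaked_ptr=0xFFFFFFFF81234567
)
found = find_tagged_object(memory)
print(found)
```

A recognizable value that the attacker controls makes it cheap to confirm that a slot holds a sprayed object before trusting any pointer read from it.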